How to get exact text match from elasticsearch if the query is between quotes

I implemented Elasticsearch with PHP for binary documents (indexed with FSCrawler). It works fine with the default settings: I can search the documents for a word and get case-insensitive results. However, I now also want exact matches. That is, on top of the current search, if the query is enclosed in quotes, I want only results that match the query exactly, including case.

My mapping looks like this:

"settings": {
"number_of_shards": 1,
"index.mapping.total_fields.limit": 2000,
"analysis": {
  "analyzer": {
    "fscrawler_path": {
      "tokenizer": "fscrawler_path"
    }
  },
  "tokenizer": {
    "fscrawler_path": {
      "type": "path_hierarchy"
    }
  }
}
.
.
.
  "content": {
    "type": "text",
    "index": true
  },

My query for the documents looks like this:

    if ($q2 == '') {
        $params = [
            'index' => 'trial2',
            'body' => [
                'query' => [
                    'match_phrase' => [
                        'content' => $q
                    ]
                ]
            ]
        ];

        $query = $client->search($params);
        $data['q'] = $q;
    }

For exact matches (does not work):

    if ($q2 == '') {
        $params = [
            'index' => 'trial2',
            'body' => [
                'query' => [
                    'filter' =>[
                        'term' => [
                            'content' => $q
                        ]
                    ]
                ]
            ]
        ];

        $query = $client->search($params);
        $data['q'] = $q;
    }

The content field is the body of the document. How do I implement an exact match for a specific word or phrase in the content field?

Answer

As I understand it, your content field will be quite large, since many documents may be more than 2-3 MB, and that is a lot of words.

There would be no point in using a keyword field for exact matching here, as suggested in my answer to your earlier question. A keyword field indexes the whole value as a single term, so it suits exact matching only when your data is structured (short values such as IDs, tags or status codes).

Your content field, as I understand it, is unstructured. In that case you would want to use the Whitespace Analyzer on the content field.

Also, for exact phrase matching you may take a look at the Match Phrase query.

Below are a sample index, documents and queries that should cover your use case.

Mapping:

PUT mycontent_index
{
  "mappings": {
    "properties": {
      "content":{
        "type":"text",
        "analyzer": "whitespace"            <----- Note this
      }
    }
  }
}
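
If you create the index from PHP rather than from the console, the same mapping can be expressed as a parameter array. This is only a minimal sketch: `buildWhitespaceIndexParams` is a hypothetical helper, and it assumes the official elasticsearch-php client against Elasticsearch 7 or later (typeless mappings). Keep in mind that the analyzer of an existing field cannot be changed in place, so an existing index would have to be recreated and reindexed.

```php
<?php
// Hypothetical helper: parameters for creating an index whose content field
// uses the built-in whitespace analyzer, so the case of each token is preserved.
function buildWhitespaceIndexParams(string $index): array
{
    return [
        'index' => $index,
        'body'  => [
            'mappings' => [
                'properties' => [
                    'content' => [
                        'type'     => 'text',
                        'analyzer' => 'whitespace',
                    ],
                ],
            ],
        ],
    ];
}

$params = buildWhitespaceIndexParams('mycontent_index');
// $client->indices()->create($params); // $client: the client instance from the question's code
```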

Sample Documents:

POST mycontent_index/_doc/1
{
  "content": """
      There is no pain you are receding
      A distant ship smoke on the horizon
      You are only coming through in waves
      Your lips move but I can't hear what you're saying
  """
}

POST mycontent_index/_doc/2
{
  "content": """          
      there is no pain you are receding
      a distant ship smoke on the horizon
      you are only coming through in waves
      your lips move but I can't hear what you're saying
  """
}

Phrase Match (to search for a sentence with the words in order):

POST mycontent_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {                   <---- Note this for phrase match
            "content": "There is no pain"
          }
        }
      ]
    }
  }
}

Match Query:

POST mycontent_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {                          <---- Use this for token based search
            "content": "there"
          }
        }
      ]
    }
  }
}

Note the results you should get: with this mapping, the phrase query returns only document 1 (which contains "There" with a capital T), and the match query returns only document 2 (which contains lowercase "there").

For an exact match on a single word, just use a simple Match query.

Note that when you do not specify an analyzer, ES uses the Standard Analyzer by default, which converts all tokens to lower case before storing them in the inverted index. The Whitespace Analyzer, however, does not lowercase the tokens. As a result, There and there are stored as two different tokens in your ES index.

I'm assuming you are aware of the Analysis and Analyzer concepts. If not, I'd suggest going through the documentation on them, as that will help you understand what I'm talking about.
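
The difference can be illustrated without a cluster. The PHP sketch below only approximates the two analyzers: both function names are hypothetical, and the real Standard Analyzer uses Unicode text segmentation rather than this simple regex. It is meant to show why the two indexed tokens differ, not to reproduce Elasticsearch's output exactly.

```php
<?php
// Rough approximation of the whitespace analyzer: split on whitespace only,
// leaving case and punctuation untouched.
function whitespaceTokens(string $text): array
{
    return preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Rough approximation of the standard analyzer: lowercase the text, then split
// on anything that is not a letter, a digit or an apostrophe.
// strtolower() is enough for this ASCII sample.
function standardLikeTokens(string $text): array
{
    return preg_split("/[^\\p{L}\\p{N}']+/u", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

$line = 'There is no pain you are receding';

print_r(whitespaceTokens($line));   // first token keeps its capital letter
print_r(standardLikeTokens($line)); // first token is lowercased
```

Because the whitespace approach splits on whitespace only, punctuation stays attached to the word, so a token such as `horizon,` would not match a search for `horizon`.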

Updated Answer:

Now that I understand your requirements: you cannot apply multiple analyzers to a single field, so you basically have two options:

Option 1: Use multiple indexes

Option 2: Use a multi-field in your mapping, as shown below

Either way, your script or service layer would hold the logic for targeting a different index or field depending on the input value (queries wrapped in double quotes versus plain tokens).

Multi Field Mapping:

PUT <your_index_name>
{ 
   "mappings":{ 
      "properties":{ 
         "content":{ 
            "type":"text",                     <--- Field with standard analyzer
            "fields":{ 
               "whitespace":{ 
                  "type":"text",               <--- Field with whitespace
                  "analyzer":"whitespace"       
               }
            }
         }
      }
   }
}
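
With the multi-field mapping above, the routing logic in the PHP layer could look roughly like this. It is a minimal sketch, not a definitive implementation: `buildSearchParams` is a hypothetical helper, the index name `trial2` is taken from the question, and it assumes the index has been created (or reindexed) with the `content.whitespace` sub-field. A query wrapped in double quotes is sent to the case-sensitive sub-field; anything else goes to the default `content` field.

```php
<?php
// Hypothetical helper: pick the target field based on surrounding double quotes.
function buildSearchParams(string $rawQuery, string $index = 'trial2'): array
{
    $rawQuery = trim($rawQuery);

    // A query counts as "exact" when it starts and ends with a double quote.
    $isExact = strlen($rawQuery) >= 2
        && $rawQuery[0] === '"'
        && substr($rawQuery, -1) === '"';

    // Strip the quotes before sending the text to Elasticsearch.
    $text = $isExact ? substr($rawQuery, 1, -1) : $rawQuery;

    // content.whitespace preserves case; content is lowercased by the standard analyzer.
    $field = $isExact ? 'content.whitespace' : 'content';

    return [
        'index' => $index,
        'body'  => [
            'query' => [
                'match_phrase' => [$field => $text],
            ],
        ],
    ];
}

$params = buildSearchParams('"There is no pain"');
// $query = $client->search($params); // $client as in the question's code
```

Using `match_phrase` in both branches keeps the existing behaviour for unquoted input and only changes which field is searched.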

I would prefer the first solution, i.e. multiple indexes with different mappings. However, I would strongly advise you to revisit your use case, because managing queries like this does not make much sense to me. But again, it's your call.

Note: A single-node cluster is the worst possible option, especially for production.

I'd suggest asking that in a separate question, detailing your document count, the expected growth rate over the next 5 years or so, and whether your use case is read-heavy or write-intensive. Is the cluster something other teams may also want to use? I'd also suggest reading more and discussing with your team or manager to get more clarity on your scenarios.

Hope this helps.



Source: stackoverflow