Friday, March 25, 2016

Caveat with ElasticSearch nGram tokenizer

Finally got some time to blog about ElasticSearch. I've been using it extensively for the last two years, but my findings are rather lengthy. Finally I've got something small to share.

The ElasticSearch nGram tokenizer is very useful for efficient substring matching (at the cost of index size, of course). For example, I have an event message field like this:

/dev/sda1 has failed due to ...
and I would like to find all failure events for all SCSI disks. One option is to store the message field as a not-analyzed string (i.e. as one single term) and use a wildcard query:
GET /events/_search
{
  "query": {
    "wildcard": {
      "message.raw": {
        "value": "/dev/sd?? has failed*"
      }
    }
  }
}
This does the job perfectly, but to complete it, ElasticSearch has to scan every value of the message field looking for the pattern at search time. Once the number of documents gets big enough, it becomes slow.
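Conceptually, a wildcard query over a not-analyzed field behaves like a linear scan with glob matching. A minimal Python sketch of that behavior (the sample messages are made up for illustration; this models the cost, not ES internals):

```python
import fnmatch

# Hypothetical sample of raw (not-analyzed) message values.
messages = [
    "/dev/sda1 has failed due to I/O errors",
    "/dev/sdb2 has failed due to bad sectors",
    "/dev/sda1 is healthy",
]

# Every stored value is tested against the pattern, one by one --
# work that grows linearly with the number of documents.
pattern = "/dev/sd?? has failed*"
matches = [m for m in messages if fnmatch.fnmatchcase(m, pattern)]
print(matches)  # the two failure events
```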

One solution is to split the message into substrings at indexing time, with (2, 20) as (min_gram, max_gram) in our example:

# Analyzer definition in settings
"analysis": {
    "analyzer": {
        "substrings": {
            "tokenizer": "standard",
            "filter": ["lowercase", "thengram"]
        }
    },
    "filter": {
        "thengram": {
            "type": "nGram",
            "min_gram": 2,
            "max_gram": 20
        }
    }
}

# message field definition in mappings
"message": {
    "type": "string",
    "index": "analyzed",
    "analyzer": "substrings"
}
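To see what this filter does to a single token, here is a rough Python model of the nGram token filter with the settings above (a sketch of the substring expansion, not the actual ES implementation):

```python
def ngrams(token, min_gram=2, max_gram=20):
    """Rough model of the nGram token filter: emit every substring
    whose length is between min_gram and max_gram."""
    return [token[i:i + n]
            for n in range(min_gram, min(max_gram, len(token)) + 1)
            for i in range(len(token) - n + 1)]

print(ngrams("dev"))     # ['de', 'ev', 'dev']
print(ngrams("failed"))  # 'fa', 'ai', ..., up to the full word 'failed'
```

Note that every whole word no longer than max_gram appears among its own n-grams, which is what makes whole-word search against this index work later on.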
and use a phrase match query:
GET events/_search
{
  "query": {
    "match": {
      "message": {
        "query": "dev sd has failed",
        "type": "phrase"
      }
    }
  }
}

The caveat

The above query will return weirdly irrelevant results and, at first glance, it's not obvious why. The caveat is that our custom analyzer is applied both at indexing and at search time. So instead of searching for the sequence of terms "dev", "sd", "has", "failed", we are searching for the sequence "de", "ev", "dev", "sd", "ha", "as", "has", etc. To fix this we need to tell Elastic to use a different analyzer during search (and search only). This can be done either by adding "analyzer": "standard" to the query itself (which is error prone, since it can easily be forgotten) or by specifying it in the mapping definition:
"message": {
    "type": "string",
    "index": "analyzed",
    "analyzer": "substrings",
    "search_analyzer": "standard"
}
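The effect of the fix can be illustrated with the same rough Python model of the nGram filter (a sketch under the (2, 20) settings, not ES internals): without a separate search analyzer the query terms explode into n-grams, while with the standard search analyzer each query term stays a whole word, which is guaranteed to exist in the index among its own n-grams.

```python
def ngrams(token, min_gram=2, max_gram=20):
    # Rough model of the nGram token filter: all substrings in the length range.
    return [token[i:i + n]
            for n in range(min_gram, min(max_gram, len(token)) + 1)
            for i in range(len(token) - n + 1)]

query_tokens = ["dev", "sd", "has", "failed"]

# Without search_analyzer, the query itself is n-grammed, so the phrase
# becomes this much longer term sequence -- which the indexed positions
# do not match as a phrase:
ngrammed = [g for t in query_tokens for g in ngrams(t)]
print(ngrammed[:6])  # ['de', 'ev', 'dev', 'sd', 'ha', 'as']

# With "search_analyzer": "standard", the query stays as whole words,
# and every whole word (up to max_gram chars) is present in the index,
# because the full token is among its own n-grams:
assert all(t in ngrams(t) for t in query_tokens)
```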

Worth it?

I took a sample of 1,000,000 events and ran both the wildcard and the phrase query, each matching a 1,000-document subset of it. While for such a small data set both are fast, the difference is quite striking nevertheless:
  • wildcard query - 30ms
  • phrase query - 5ms

A 6x speed-up! Another bonus of using the phrase query is that you get results highlighting (which is not supported for wildcard queries).