ElasticSearch's nGram tokenizer is very useful for efficient substring matching (at the cost of index size, of course). For example, I have an event message field like this:
/dev/sda1 has failed due to ...
and I would like to find all failure events for all SCSI disks. One option is to store the message field as a not-analyzed string (i.e. as one single term) and use a wildcard query:
GET /events/_search
{
  "query": {
    "wildcard": {
      "message.raw": { "value": "/dev/sd?? has failed*" }
    }
  }
}
This does the job perfectly, but to complete it, ElasticSearch has to scan every value of the message field looking for the pattern at search time. Once the number of documents gets big enough, it becomes slow.
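To make the cost of that scan concrete, here is a minimal Python sketch. It is only a simplified model, not how ElasticSearch is implemented internally: it uses the standard library's fnmatch module as a stand-in for the wildcard syntax, and the sample messages and the wildcard_scan helper are made up for illustration.

```python
from fnmatch import fnmatchcase


def wildcard_scan(values, pattern):
    """Return every value that matches the wildcard pattern.

    Each value is tested in turn, so the cost grows with the number of values.
    """
    return [value for value in values if fnmatchcase(value, pattern)]


# Hypothetical sample messages, made up for illustration.
messages = [
    "/dev/sda1 has failed due to a read error",
    "/dev/sdb2 has failed due to a timeout",
    "/dev/hda1 has failed due to a read error",
    "/dev/sda1 is back online",
]

hits = wildcard_scan(messages, "/dev/sd?? has failed*")
for message in hits:
    print(message)
```

Only the two sd?? failure messages are printed, but every message had to be compared against the pattern to find them.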
One solution is to split the message into substrings at indexing time, with (2, 20) for (min, max) in our example:
# Analyzer definition in settings
"analysis": {
  "analyzer": {
    "substrings": {
      "tokenizer": "standard",
      "filter": ["lowercase", "thengram"]
    }
  },
  "filter": {
    "thengram": { "type": "nGram", "min_gram": 2, "max_gram": 20 }
  }
}
# message field definition in mappings
"message": {
  "type": "string",
  "index": "analyzed",
  "analyzer": "substrings"
}
and use a match_phrase query:
GET events/_search
{
  "query": {
    "match": {
      "message": {
        "query": "dev sd has failed",
        "type": "phrase"
      }
    }
  }
}
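To get a feel for what the (2, 20) setting costs in index size, the following Python sketch counts how many n-grams each word of the example message produces. It is an approximation under stated assumptions: simple_tokens is a rough regex stand-in for the standard tokenizer plus lowercase filter, and both helper names are hypothetical.

```python
import re

MIN_GRAM, MAX_GRAM = 2, 20  # same values as the "thengram" filter above


def ngram_count(length, min_gram=MIN_GRAM, max_gram=MAX_GRAM):
    """Number of n-grams produced from a token of the given length.

    Tokens shorter than min_gram yield nothing in this sketch.
    """
    return sum(length - n + 1 for n in range(min_gram, min(max_gram, length) + 1))


def simple_tokens(text):
    """Rough stand-in for the standard tokenizer plus the lowercase filter."""
    return re.findall(r"[a-z0-9]+", text.lower())


message = "/dev/sda1 has failed due to"
tokens = simple_tokens(message)
counts = {token: ngram_count(len(token)) for token in tokens}
total = sum(counts.values())

print(tokens)  # ['dev', 'sda1', 'has', 'failed', 'due', 'to']
print(counts)  # {'dev': 3, 'sda1': 6, 'has': 3, 'failed': 15, 'due': 3, 'to': 1}
print(total)   # 31
```

In this sketch, six words become 31 indexed terms, and longer words grow faster still; that is the index-size price mentioned at the start.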
The caveat
The above query will return weirdly irrelevant results and, at first glance, it's not obvious why. The caveat is that our custom analyzer is applied both during indexing and during search. So instead of searching for the sequence of terms "dev", "sd", "has", "failed", we are searching for the sequence "de", "ev", "dev", "sd", "ha", "as", "has", etc. To fix this, we need to tell Elastic to use a different analyzer during search (and search only). This can be done either by adding
"analyzer": "standard"
to the query itself (which is error-prone, since it can easily be forgotten) or by specifying it in the mapping definition:
"message": {
  "type": "string",
  "index": "analyzed",
  "analyzer": "substrings",
  "search_analyzer": "standard"
}
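The difference between the two analyzers at search time can be sketched in a few lines of Python. This is an illustration only: standard_terms and substrings_terms are hypothetical, regex-based approximations of the standard analyzer and of the custom substrings analyzer, and the order in which a real nGram filter emits its grams may differ between versions.

```python
import re


def standard_terms(text):
    """Rough stand-in for the standard analyzer: lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def substrings_terms(text, min_gram=2, max_gram=20):
    """Rough stand-in for the custom "substrings" analyzer."""
    terms = []
    for word in standard_terms(text):
        # Emit all grams of one size before moving on to the next size.
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            for start in range(len(word) - size + 1):
                terms.append(word[start:start + size])
    return terms


query = "dev sd has failed"

print(standard_terms(query))        # ['dev', 'sd', 'has', 'failed']
print(substrings_terms(query)[:7])  # ['de', 'ev', 'dev', 'sd', 'ha', 'as', 'has']
print(len(substrings_terms(query))) # 22
```

With the standard analyzer the phrase consists of the four terms we meant to search for; with the substrings analyzer the same query text expands into 22 terms in this sketch, which is why the phrase no longer matches what we expect.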
Worth it?
I took a sample of 1,000,000 events and ran both wildcard and phrase queries that match a 1,000-document subset of it. While both are fast for such a small data set, the difference is quite striking nevertheless:
wildcard query - 30ms
phrase query - 5ms
A sixfold speed-up! Another bonus of using the phrase query is that you get result highlighting (which is not supported for wildcard queries).