Thursday, July 14, 2016

You better have persistent storage for ElasticSearch master nodes

This is followup for my previous post about whether ElasticSearch master nodes should have persistent storage - they better do!. The rest of the post demonstrates how you can have spectacular data loss with ES if master nodes do not save their state to persistent storage.

The theory

Let's say you have the following cluster with single index (single primary shard). You also have an application that constantly writes data to the index

Now what happens if all your master nodes evaporate? Well, you relaunch them with clean disks. The moment masters are up, the cluster is red, since there are no data nodes, and your application can not index data.

Now data nodes start to join. In our example, the second one joins slightly before the first. What happens is that cluster becomes green, since fresh masters do not have any idea that there is other data node that has data and is about to join.

You application happily continues to index data, into newly created index on data node 2.

Now data nodes 1 joins - masters discover that they have some old version of our index and discard it. Data loss!!!

Sounds too esoteric to happen in real life? Here is sad&true story - back in a time we ran our ES master nodes in Kubernetes without persistent disk, i.e. on local EmptyDir volumes only. One day there was short network outage - for less than an hour. Kubelets lost connection to K8s master node and killed the pods. Once the network was back, the pods were started - with clean disk volumes! - and our application resumed running. The only catch is we've lost tons data :)

The reproduction

Let's try to simulate this in practice to see what happens. I'll use the minimal ES cluster by just running three ES instances on my laptop:

  • 1 master node that also servers as a client node
  • 2 data nodes. Lets call them dnode1 and dnode2

Open three shells and lets go:

  1. Start the nodes - each in separate shell
    Master:
    /usr/share/elasticsearch/bin/elasticsearch -Des.node.data=false -Des.node.master=true -Des.node.name=master-client --path.conf=/etc/elasticsearch --default.path.logs=/tmp/master-client/logs --default.path.data=/tmp/master-client
    
    Data 01:
    /usr/share/elasticsearch/bin/elasticsearch -Des.http.enabled=false -Des.node.data=true -Des.node.master=false -Des.node.name=data-01 --path.conf=/etc/elasticsearch --default.path.logs=/tmp/data-01/logs --default.path.data=/tmp/data-01
    
    Data 02:
    /usr/share/elasticsearch/bin/elasticsearch -Des.http.enabled=false -Des.node.data=true -Des.node.master=false -Des.node.name=data-02   --path.conf=/etc/elasticsearch --default.path.logs=/tmp/data-02/logs --default.path.data=/tmp/data-02
    
  2. Index a document:
    curl -XPUT 127.0.0.1:9200/users?pretty -d '{"settings": {"number_of_shards": 1, "number_of_replicas": 0}}'
    curl -XPUT 127.0.0.1:9200/users/user/1 -d '{"name": "Zaar"}'
    
  3. Check on which data node the index has landed. In my case, it was dnode2. Shutdown this data node and the master node (just hit CTRL-C in the shells)
  4. Simulate master data loss by issuing rm -rf /tmp/master-client/
  5. Bring master back (launch the same command)
  6. Index another document:
    curl -XPUT 127.0.0.1:9200/users?pretty -d '{"settings": {"number_of_shards": 1, "number_of_replicas":0}}'
    curl -XPUT 127.0.0.1:9200/users/user/2 -d '{"name": "Hai"}'
    

Now, while dnode2 is still down, we can see that index file exists in data directories of both nodes:

$ ls /tmp/data-0*/elasticsearch/nodes/0/indices/
/tmp/data-01/elasticsearch/nodes/0/indices/:
users

/tmp/data-02/elasticsearch/nodes/0/indices/:
users

However data on dnode2 is now in "Schrodinger's cat" state - neither dead, but not exactly alive either.

Let's bring back the node two and see what happens (I've also set gateway loglevel to TRACE in /etc/elasticsearch/logging.yml for better visibility):

$ /usr/share/elasticsearch/bin/elasticsearch -Des.http.enabled=false -Des.node.data=true -Des.node.master=false -Des.node.name=data-02   --path.conf=/etc/elasticsearch --default.path.logs=/tmp/data-02/logs --default.path.data=/tmp/data-02
[2016-07-01 17:07:13,528][INFO ][node                     ] [data-02] version[2.3.3], pid[11826], build[218bdf1/2016-05-17T15:40:04Z]
[2016-07-01 17:07:13,529][INFO ][node                     ] [data-02] initializing ...
[2016-07-01 17:07:14,265][INFO ][plugins                  ] [data-02] modules [reindex, lang-expression, lang-groovy], plugins [kopf], sites [kopf]
[2016-07-01 17:07:14,296][INFO ][env                      ] [data-02] using [1] data paths, mounts [[/ (/dev/mapper/kubuntu--vg-root)]], net usable_space [21.9gb], net total_space [212.1gb], spins? [no], types [ext4]
[2016-07-01 17:07:14,296][INFO ][env                      ] [data-02] heap size [990.7mb], compressed ordinary object pointers [true]
[2016-07-01 17:07:14,296][WARN ][env                      ] [data-02] max file descriptors [4096] for elasticsearch process likely too low, consider increasing to at least [65536]
[2016-07-01 17:07:16,285][DEBUG][gateway                  ] [data-02] using initial_shards [quorum]
[2016-07-01 17:07:16,513][DEBUG][indices.recovery         ] [data-02] using max_bytes_per_sec[40mb], concurrent_streams [3], file_chunk_size [512kb], translog_size [512kb], translog_ops [1000], and compress [true]
[2016-07-01 17:07:16,563][TRACE][gateway                  ] [data-02] [upgrade]: processing [global-7.st]
[2016-07-01 17:07:16,564][TRACE][gateway                  ] [data-02] found state file: [id:7, legacy:false, file:/tmp/data-02/elasticsearch/nodes/0/_state/global-7.st]
[2016-07-01 17:07:16,588][TRACE][gateway                  ] [data-02] state id [7] read from [global-7.st]
[2016-07-01 17:07:16,589][TRACE][gateway                  ] [data-02] found state file: [id:1, legacy:false, file:/tmp/data-02/elasticsearch/nodes/0/indices/users/_state/state-1.st]
[2016-07-01 17:07:16,598][TRACE][gateway                  ] [data-02] state id [1] read from [state-1.st]
[2016-07-01 17:07:16,599][TRACE][gateway                  ] [data-02] found state file: [id:7, legacy:false, file:/tmp/data-02/elasticsearch/nodes/0/_state/global-7.st]
[2016-07-01 17:07:16,602][TRACE][gateway                  ] [data-02] state id [7] read from [global-7.st]
[2016-07-01 17:07:16,602][TRACE][gateway                  ] [data-02] found state file: [id:1, legacy:false, file:/tmp/data-02/elasticsearch/nodes/0/indices/users/_state/state-1.st]
[2016-07-01 17:07:16,604][TRACE][gateway                  ] [data-02] state id [1] read from [state-1.st]
[2016-07-01 17:07:16,605][DEBUG][gateway                  ] [data-02] took 5ms to load state
[2016-07-01 17:07:16,613][INFO ][node                     ] [data-02] initialized
[2016-07-01 17:07:16,614][INFO ][node                     ] [data-02] starting ...
[2016-07-01 17:07:16,714][INFO ][transport                ] [data-02] publish_address {127.0.0.1:9302}, bound_addresses {[::1]:9302}, {127.0.0.1:9302}
[2016-07-01 17:07:16,721][INFO ][discovery                ] [data-02] elasticsearch/zcQx-01tRrWQuXli-eHCTQ
[2016-07-01 17:07:19,848][INFO ][cluster.service          ] [data-02] detected_master {master-client}{V1gaCRB8S9yj_nWFsq7uCg}{127.0.0.1}{127.0.0.1:9300}{data=false, master=true}, added {{data-01}{FnGrtAwDSDSO2j_B53I4Xg}{127.0.0.1}{127.0.0.1:9301}{master=false},{master-client}{V1gaCRB8S9yj_nWFsq7uCg}{127.0.0.1}{127.0.0.1:9300}{data=false, master=true},}, reason: zen-disco-receive(from master [{master-client}{V1gaCRB8S9yj_nWFsq7uCg}{127.0.0.1}{127.0.0.1:9300}{data=false, master=true}])
[2016-07-01 17:07:19,868][TRACE][gateway                  ] [data-02] [_global] writing state, reason [changed]
[2016-07-01 17:07:19,905][INFO ][node                     ] [data-02] started

At 17:07:16 we see the node found some data on it's own disk, but discarded it at 17:07:19 after joining the cluster. It's data dir is in fact empty:

$ ls /tmp/data-0*/elasticsearch/nodes/0/indices/
/tmp/data-01/elasticsearch/nodes/0/indices/:
users

/tmp/data-02/elasticsearch/nodes/0/indices/:

Invoking stat confirms that data directory was changed right after "writing state" message above:

$ stat /tmp/data-02/elasticsearch/nodes/0/indices/
  File: ‘/tmp/data-02/elasticsearch/nodes/0/indices/’
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: fc01h/64513d    Inode: 1122720     Links: 2
Access: (0775/drwxrwxr-x)  Uid: ( 1000/ haizaar)   Gid: ( 1000/ haizaar)
Access: 2016-07-01 17:08:39.093619141 +0300
Modify: 2016-07-01 17:07:19.920869352 +0300
Change: 2016-07-01 17:07:19.920869352 +0300
 Birth: -

Conclusions

  • Masters' cluster state is at least as important as data. Make sure your master node disks are backed up.
  • If running on K8s - use persistent external volumes (GCEPersistentDisk if running on GKE).
  • If possible, pause indexing after complete master outages until all of the data nodes come back.