This is a follow-up to my previous post on whether Elasticsearch master nodes should have persistent storage - they'd better! The rest of the post demonstrates how you can suffer spectacular data loss with ES if master nodes do not save their state to persistent storage.
The theory
Let's say you have the following cluster with a single index (and a single primary shard). You also have an application that constantly writes data to the index.
Now, what happens if all your master nodes evaporate? Well, you relaunch them with clean disks. The moment the masters are up, the cluster is red, since there are no data nodes yet, and your application cannot index data.
Now the data nodes start to join. In our example, the second one joins slightly before the first. The cluster becomes green, since the fresh masters have no idea that there is another data node that holds data and is about to join.
Your application happily continues to index data into a newly created index on data node 2.
Now data node 1 joins - the masters discover that it holds some old version of our index and discard it. Data loss!!!
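Note that a plain health check will not flag any of this: once the shards the fresh masters know about are allocated, the cluster reports green. A minimal check (shell sketch, run against any client/master node) looks like this:

# Cluster health reflects only the shards the current masters know about;
# it happily reports "green" even though an older copy of the index was discarded.
curl '127.0.0.1:9200/_cluster/health?pretty'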
Sounds too esoteric to happen in real life? Here is a sad-but-true story: back in the day we ran our ES master nodes in Kubernetes without persistent disks, i.e. on local emptyDir volumes only. One day there was a short network outage - less than an hour. The kubelets lost their connection to the K8s master node and killed the pods. Once the network was back, the pods were started - with clean disk volumes! - and our application resumed running. The only catch is that we had lost tons of data :)
The reproduction
Let's try to simulate this in practice to see what happens. I'll use a minimal ES cluster, just running three ES instances on my laptop:
- 1 master node that also serves as a client node
- 2 data nodes. Let's call them dnode1 and dnode2.
Open three shells and let's go:
- Start the nodes - each in a separate shell:
Master:
/usr/share/elasticsearch/bin/elasticsearch -Des.node.data=false -Des.node.master=true -Des.node.name=master-client --path.conf=/etc/elasticsearch --default.path.logs=/tmp/master-client/logs --default.path.data=/tmp/master-client

Data 01:
/usr/share/elasticsearch/bin/elasticsearch -Des.http.enabled=false -Des.node.data=true -Des.node.master=false -Des.node.name=data-01 --path.conf=/etc/elasticsearch --default.path.logs=/tmp/data-01/logs --default.path.data=/tmp/data-01

Data 02:
/usr/share/elasticsearch/bin/elasticsearch -Des.http.enabled=false -Des.node.data=true -Des.node.master=false -Des.node.name=data-02 --path.conf=/etc/elasticsearch --default.path.logs=/tmp/data-02/logs --default.path.data=/tmp/data-02
- Create the users index and index a document:
curl -XPUT 127.0.0.1:9200/users?pretty -d '{"settings": {"number_of_shards": 1, "number_of_replicas": 0}}'
curl -XPUT 127.0.0.1:9200/users/user/1 -d '{"name": "Zaar"}'
- Check which data node the index has landed on (one way to check is shown in the sketch right after this list). In my case, it was dnode2. Shut down this data node and the master node (just hit CTRL-C in their shells).
- Simulate master data loss by issuing rm -rf /tmp/master-client/
- Bring the master back (launch the same command as before).
- Recreate the index and index another document:
curl -XPUT 127.0.0.1:9200/users?pretty -d '{"settings": {"number_of_shards": 1, "number_of_replicas": 0}}'
curl -XPUT 127.0.0.1:9200/users/user/2 -d '{"name": "Hai"}'
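To see which data node the primary shard landed on (as referenced in the list above), the _cat/shards API is handy. The output below is illustrative only - your node names and sizes will differ:

# Ask the cluster where the single primary shard of "users" lives
curl '127.0.0.1:9200/_cat/shards/users?v'
# index shard prirep state   docs store ip        node
# users 0     p      STARTED    1 3.1kb 127.0.0.1 data-02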
Now, while dnode2 is still down, we can see that the users index exists in the data directories of both nodes:
$ ls /tmp/data-0*/elasticsearch/nodes/0/indices/
/tmp/data-01/elasticsearch/nodes/0/indices/:
users

/tmp/data-02/elasticsearch/nodes/0/indices/:
users
However, the data on dnode2 is now in a "Schrödinger's cat" state - not quite dead, but not exactly alive either.
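If you ever hit this state with real data, it may be worth copying the old index data aside before letting the node rejoin, since (as we are about to see) the cluster will delete it on rejoin. A rough sketch, using the paths from this experiment:

# Keep a copy of data-02's soon-to-be-discarded index data before it rejoins the cluster
cp -a /tmp/data-02/elasticsearch/nodes/0/indices /tmp/data-02-indices.backup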
Let's bring dnode2 back and see what happens (I've also set the gateway log level to TRACE in /etc/elasticsearch/logging.yml for better visibility):
$ /usr/share/elasticsearch/bin/elasticsearch -Des.http.enabled=false -Des.node.data=true -Des.node.master=false -Des.node.name=data-02 --path.conf=/etc/elasticsearch --default.path.logs=/tmp/data-02/logs --default.path.data=/tmp/data-02
[2016-07-01 17:07:13,528][INFO ][node             ] [data-02] version[2.3.3], pid[11826], build[218bdf1/2016-05-17T15:40:04Z]
[2016-07-01 17:07:13,529][INFO ][node             ] [data-02] initializing ...
[2016-07-01 17:07:14,265][INFO ][plugins          ] [data-02] modules [reindex, lang-expression, lang-groovy], plugins [kopf], sites [kopf]
[2016-07-01 17:07:14,296][INFO ][env              ] [data-02] using [1] data paths, mounts [[/ (/dev/mapper/kubuntu--vg-root)]], net usable_space [21.9gb], net total_space [212.1gb], spins? [no], types [ext4]
[2016-07-01 17:07:14,296][INFO ][env              ] [data-02] heap size [990.7mb], compressed ordinary object pointers [true]
[2016-07-01 17:07:14,296][WARN ][env              ] [data-02] max file descriptors [4096] for elasticsearch process likely too low, consider increasing to at least [65536]
[2016-07-01 17:07:16,285][DEBUG][gateway          ] [data-02] using initial_shards [quorum]
[2016-07-01 17:07:16,513][DEBUG][indices.recovery ] [data-02] using max_bytes_per_sec[40mb], concurrent_streams [3], file_chunk_size [512kb], translog_size [512kb], translog_ops [1000], and compress [true]
[2016-07-01 17:07:16,563][TRACE][gateway          ] [data-02] [upgrade]: processing [global-7.st]
[2016-07-01 17:07:16,564][TRACE][gateway          ] [data-02] found state file: [id:7, legacy:false, file:/tmp/data-02/elasticsearch/nodes/0/_state/global-7.st]
[2016-07-01 17:07:16,588][TRACE][gateway          ] [data-02] state id [7] read from [global-7.st]
[2016-07-01 17:07:16,589][TRACE][gateway          ] [data-02] found state file: [id:1, legacy:false, file:/tmp/data-02/elasticsearch/nodes/0/indices/users/_state/state-1.st]
[2016-07-01 17:07:16,598][TRACE][gateway          ] [data-02] state id [1] read from [state-1.st]
[2016-07-01 17:07:16,599][TRACE][gateway          ] [data-02] found state file: [id:7, legacy:false, file:/tmp/data-02/elasticsearch/nodes/0/_state/global-7.st]
[2016-07-01 17:07:16,602][TRACE][gateway          ] [data-02] state id [7] read from [global-7.st]
[2016-07-01 17:07:16,602][TRACE][gateway          ] [data-02] found state file: [id:1, legacy:false, file:/tmp/data-02/elasticsearch/nodes/0/indices/users/_state/state-1.st]
[2016-07-01 17:07:16,604][TRACE][gateway          ] [data-02] state id [1] read from [state-1.st]
[2016-07-01 17:07:16,605][DEBUG][gateway          ] [data-02] took 5ms to load state
[2016-07-01 17:07:16,613][INFO ][node             ] [data-02] initialized
[2016-07-01 17:07:16,614][INFO ][node             ] [data-02] starting ...
[2016-07-01 17:07:16,714][INFO ][transport        ] [data-02] publish_address {127.0.0.1:9302}, bound_addresses {[::1]:9302}, {127.0.0.1:9302}
[2016-07-01 17:07:16,721][INFO ][discovery        ] [data-02] elasticsearch/zcQx-01tRrWQuXli-eHCTQ
[2016-07-01 17:07:19,848][INFO ][cluster.service  ] [data-02] detected_master {master-client}{V1gaCRB8S9yj_nWFsq7uCg}{127.0.0.1}{127.0.0.1:9300}{data=false, master=true}, added {{data-01}{FnGrtAwDSDSO2j_B53I4Xg}{127.0.0.1}{127.0.0.1:9301}{master=false},{master-client}{V1gaCRB8S9yj_nWFsq7uCg}{127.0.0.1}{127.0.0.1:9300}{data=false, master=true},}, reason: zen-disco-receive(from master [{master-client}{V1gaCRB8S9yj_nWFsq7uCg}{127.0.0.1}{127.0.0.1:9300}{data=false, master=true}])
[2016-07-01 17:07:19,868][TRACE][gateway          ] [data-02] [_global] writing state, reason [changed]
[2016-07-01 17:07:19,905][INFO ][node             ] [data-02] started
At 17:07:16 we see that the node found some data on its own disk, but it discarded that data at 17:07:19 after joining the cluster. Its data dir is in fact empty:
$ ls /tmp/data-0*/elasticsearch/nodes/0/indices/
/tmp/data-01/elasticsearch/nodes/0/indices/:
users

/tmp/data-02/elasticsearch/nodes/0/indices/:
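To make the loss concrete, we can also query the documents themselves. The exact response bodies are left to the reader, but the first document should now be gone while the second one is still there:

# Document indexed before the master wipe - expect "found": false
curl '127.0.0.1:9200/users/user/1?pretty'
# Document indexed after the master wipe - expect "found": true
curl '127.0.0.1:9200/users/user/2?pretty'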
Invoking stat confirms that the data directory was changed right after the "writing state" message above:
$ stat /tmp/data-02/elasticsearch/nodes/0/indices/
  File: ‘/tmp/data-02/elasticsearch/nodes/0/indices/’
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: fc01h/64513d    Inode: 1122720     Links: 2
Access: (0775/drwxrwxr-x)  Uid: ( 1000/ haizaar)   Gid: ( 1000/ haizaar)
Access: 2016-07-01 17:08:39.093619141 +0300
Modify: 2016-07-01 17:07:19.920869352 +0300
Change: 2016-07-01 17:07:19.920869352 +0300
 Birth: -
Conclusions
- The masters' cluster state is at least as important as the data itself. Make sure your master nodes' disks are persistent and backed up.
- If running on K8s, use persistent external volumes (e.g. GCEPersistentDisk if running on GKE).
- If possible, pause indexing after a complete master outage until all of the data nodes come back.
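One partial mitigation along the lines of the last point - a sketch only, not a substitute for persistent master storage - is Elasticsearch's gateway recovery settings, which keep a freshly restarted cluster blocked for writes until enough data nodes have joined. The flag values below are illustrative and follow the same -Des.* style used earlier in this post:

# Delay cluster state recovery (and therefore writes) after a full restart until
# both data nodes are back, or until 5 minutes after at least one has joined.
/usr/share/elasticsearch/bin/elasticsearch -Des.node.data=false -Des.node.master=true \
    -Des.node.name=master-client --path.conf=/etc/elasticsearch \
    -Des.gateway.expected_data_nodes=2 \
    -Des.gateway.recover_after_data_nodes=1 \
    -Des.gateway.recover_after_time=5m

In the reproduction above, something like this would have given dnode2 a chance to rejoin before the application could recreate the users index on dnode1 alone. It narrows the window, but only persistent master state actually closes it.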