Tuesday, April 5, 2016

Persistent storage for ElasticSearch master nodes?

ElasticSearch master nodes hold the cluster state. I was trying to understand whether these nodes are required to have persistent storage, or whether they can recover from whatever exists on the data nodes. The short answer is: probably yes, they can recover.

Update

You'd better have persistent disks for your ES data nodes - read here why.

Below I describe the tests I've done. But before that - some background on how I got to this question in the first place.

ElasticSearch on Kubernetes

We are working on running ElasticSearch 2.x on Kubernetes on Google Container Engine. There are two options for storing a container's data:
EmptyDir
Part of the local storage on the Kubernetes node is allocated for the pod. If a container in the pod restarts, the data survives. If the pod is killed - the data is lost.
gcePersistentDisk
A Compute Engine persistent disk (created in advance) can be attached to a pod, and the data persists. However, there is a limitation - as of Kubernetes 1.2, a ReplicaSet cannot attach a different disk to each pod it creates, so to run ElasticSearch data nodes, for example, you need to create a separate ReplicaSet (of size 1) for each ES data node.
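As a rough sketch of that workaround (the disk names, zone, template file and its {{ID}} placeholder are my own assumptions for illustration, not part of any particular setup), the per-node provisioning could look something like this:

```shell
# Hypothetical sketch: one persistent disk plus one single-replica
# ReplicaSet per ES data node. es-data-rc-template.yaml with an {{ID}}
# placeholder is an assumed template, not a file from the repo.
for i in 0 1; do
  gcloud compute disks create "es-data-${i}" --size=100GB --zone=europe-west1-d
  sed "s/{{ID}}/${i}/g" es-data-rc-template.yaml | kubectl create -f -
done
```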

ES data nodes should have persistent disks - this is a no-brainer. However, with regard to ES master nodes it's not clear. I've tried to understand where master nodes persist the cluster state, and this thread states "on every node including client nodes". There is also a resolved issue about storing index metadata on data nodes.

Run, Kill, Repeat

So let's see how it behaves in reality.

I created a two-node Kubernetes 1.2 cluster running n1-standard-2 instances (2 CPUs, 7.5GB RAM), and used Paulo Pires' Kubernetes setup for ElasticSearch:

$ git clone https://github.com/pires/kubernetes-elasticsearch-cluster.git
$ cd kubernetes-elasticsearch-cluster
$ vim es-data-rc.yaml  # set replicas to 2
$ vim es-master-rc.yaml  # set replicas to 3
$ vim es-svc.yaml  # set type to ClusterIP

Let's launch it:

$ for i in *.yaml; do kubectl create -f $i; done
$ sleep 1m; kubectl get pods
NAME              READY     STATUS    RESTARTS   AGE
es-client-ats2b   1/1       Running   0          1h
es-data-teodq     1/1       Running   0          1h
es-data-zwml2     1/1       Running   0          1h
es-master-3bosq   1/1       Running   0          1h
es-master-a47om   1/1       Running   0          1h
es-master-c1dy1   1/1       Running   0          1h

We are all good. Let's ingest some data and alter a cluster setting:

$ CLIENTIP=$(kubectl describe pods es-client |grep '^IP' |head -n 1|awk '{print $2}')
$ curl -XPUT $CLIENTIP:9200/_cluster/settings?pretty -d '{"transient": {"discovery.zen.minimum_master_nodes": 2}}'
{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : {
    "discovery" : {
      "zen" : {
        "minimum_master_nodes" : "2"
      }
    }
  }
}
$ curl -XPUT $CLIENTIP:9200/tweets/tweet/1 -d '{"foo": "bar"}'          
{
  "_id": "1",
  "_index": "tweets",
  "_shards": {
    "failed": 0,
    "successful": 2,
    "total": 2
  },
  "_type": "tweet",
  "_version": 1,
  "created": true
}
$ curl $CLIENTIP:9200/_cluster/health?pretty
{
  "cluster_name" : "myesdb",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
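As an aside, grepping `kubectl describe` output for the IP is fragile. A jsonpath query is a sturdier way to grab the client pod's IP - assuming the client pods carry the `component`/`role` labels from Pires' manifests:

```shell
# Assumption: the client pods are labeled component=elasticsearch,role=client
# as in Pires' manifests; jsonpath picks the first matching pod's IP.
CLIENTIP=$(kubectl get pods -l component=elasticsearch,role=client \
  -o jsonpath='{.items[0].status.podIP}')
echo "$CLIENTIP"
```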

The data is there and the cluster is green. Now let's kill all of the masters and recreate them (their data disks will be lost):

$ kubectl delete -f es-master-rc.yaml 
replicationcontroller "es-master" deleted
$ kubectl create -f es-master-rc.yaml       
replicationcontroller "es-master" created

After a few dozen seconds, the new masters will be up again, and we'll see the following in the leader's log:

[2016-04-04 15:57:44,880][INFO ][cluster.service          ] [Elaine Grey] new_master {Elaine Grey}{5NIL5jBbTYadefGzjDLb5A}{10.224.1.7}{10.224.1.7:9300}{data=false, master=true}, added {{Lyja}{CUyGl7w-R86qcOsNSj0xPA}{10.224.1.4}{10.224.1.4:9300}{master=false},{Typhoid Mary}{cWmlEtHuSdCImjHMNM6FsA}{10.224.1.3}{10.224.1.3:9300}{master=false},{Slug}{fQPe2C1FSH2UkuveBFJtbw}{10.224.1.5}{10.224.1.5:9300}{data=false, master=false},}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2016-04-04 15:57:45,056][INFO ][node                     ] [Elaine Grey] started
[2016-04-04 15:57:45,058][INFO ][cluster.service          ] [Elaine Grey] added {{Stacy X}{U4sn1pkWRlGV-zVMW1OeAA}{10.224.1.8}{10.224.1.8:9300}{data=false, master=true},}, reason: zen-disco-join(pending joins after accumulation stop [election closed])
[2016-04-04 15:57:45,711][INFO ][gateway                  ] [Elaine Grey] recovered [0] indices into cluster_state
[2016-04-04 15:57:45,712][INFO ][cluster.service          ] [Elaine Grey] added {{Kiss}{MaNkKlQWR82QHFVhz38Ohg}{10.224.1.6}{10.224.1.6:9300}{data=false, master=true},}, reason: zen-disco-join(join from node[{Kiss}{MaNkKlQWR82QHFVhz38Ohg}{10.224.1.6}{10.224.1.6:9300}{data=false, master=true}])
[2016-04-04 15:57:46,077][INFO ][gateway                  ] [Elaine Grey] auto importing dangled indices [tweets/OPEN] from [{Lyja}{CUyGl7w-R86qcOsNSj0xPA}{10.224.1.4}{10.224.1.4:9300}{master=false}]
[2016-04-04 15:57:47,073][INFO ][cluster.routing.allocation] [Elaine Grey] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[tweets][3]] ...]).
[2016-04-04 15:57:47,567][INFO ][cluster.routing.allocation] [Elaine Grey] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[tweets][3]] ...]).
[2016-04-04 15:58:16,355][INFO ][io.fabric8.elasticsearch.discovery.kubernetes.KubernetesDiscovery] [Elaine Grey] updating discovery.zen.minimum_master_nodes from [-1] to [2]

So we see that:

  • The new master recovered [0] indices from the cluster state - i.e. the cluster state was indeed lost.
  • The new master auto-imported the existing indices, which were "dangling" in ES terminology.
  • It also restored our transient quorum setting.
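We can double-check that last point by re-reading the cluster settings (this check wasn't part of the original run; the restored transient `discovery.zen.minimum_master_nodes` value should appear in the output):

```shell
# Re-read the cluster settings and pick out the restored quorum setting
curl -s $CLIENTIP:9200/_cluster/settings?pretty | grep minimum_master_nodes
```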

And our cluster is green:


$ curl $CLIENTIP:9200/_cluster/health?pretty
{
  "cluster_name" : "myesdb",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 5,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Now let's kill the masters without destroying their disks and see if anything behaves differently:

$ for pod in $(kubectl get pods |grep es-master |awk '{print $1}'); do kubectl exec $pod killall java & done

The master log shows the following:
[2016-04-04 16:33:55,695][INFO ][cluster.service          ] [Neptune] new_master {Neptune}{51QU1wf4T2Ky9QUzK5MkEw}{10.224.1.7}{10.224.1.7:9300}{data=false, master=true}, added {{Slug}{fQPe2C1FSH2UkuveBFJtbw}{10.224.1.5}{10.224.1.5:9300}{data=false, master=false},{Typhoid Mary}{cWmlEtHuSdCImjHMNM6FsA}{10.224.1.3}{10.224.1.3:9300}{master=false},{Lyja}{CUyGl7w-R86qcOsNSj0xPA}{10.224.1.4}{10.224.1.4:9300}{master=false},{Comet Man}{cOLRxsOKTC2OYR6Kuiplxw}{10.224.1.6}{10.224.1.6:9300}{data=false, master=true},}, reason: zen-disco-join(elected_as_master, [1] joins received)
[2016-04-04 16:33:55,867][INFO ][node                     ] [Neptune] started
[2016-04-04 16:33:56,428][INFO ][gateway                  ] [Neptune] recovered [1] indices into cluster_state
[2016-04-04 16:33:57,233][INFO ][cluster.routing.allocation] [Neptune] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[tweets][0], [tweets][0]] ...]).
[2016-04-04 16:33:57,745][INFO ][cluster.routing.allocation] [Neptune] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[tweets][4]] ...]).
[2016-04-04 16:34:01,417][INFO ][cluster.service          ] [Neptune] added {{Nicole St. Croix}{5VcQT4H8RHeAf3R0PW3K4A}{10.224.1.8}{10.224.1.8:9300}{data=false, master=true},}, reason: zen-disco-join(join from node[{Nicole St. Croix}{5VcQT4H8RHeAf3R0PW3K4A}{10.224.1.8}{10.224.1.8:9300}{data=false, master=true}])
[2016-04-04 17:01:55,811][INFO ][io.fabric8.elasticsearch.discovery.kubernetes.KubernetesDiscovery] [Neptune] updating discovery.zen.minimum_master_nodes from [-1] to [2]

So this time 1 index was recovered from the cluster state, and there are no dangling indices.
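As a final sanity check (not part of the original run), we can fetch the document we ingested at the beginning and confirm it survived both rounds of master restarts - the response should contain `"found":true`:

```shell
# Fetch the test document ingested before the master restarts
curl -s $CLIENTIP:9200/tweets/tweet/1
```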

Conclusions

While the master nodes were able to recover both the indices and the transient cluster settings, I was testing only the simplest scenario. This is not enough to make a decision on whether we should maintain persistent disks for master nodes. On the other hand, if we do have persistent disks for the masters - do we need to back up the metadata? And what about ES 5.x? One of the promised features is that the master will hold an ID of the latest index change, to prevent stale data nodes from becoming primaries during network partitioning. This kind of metadata cannot be stored on data nodes.

I'll update this post when I have the answers.