We are building one of our products on a cloud and decided to run it entirely on a Kubernetes cluster. One of the big pains relieved by containers is resource separation between the different processes (modules) of your system. Let's say we have a product that comprises several services that talk to each other ("microservices", as it is now fashionably called). Before containers, or, to be more precise, before Linux kernel control groups were introduced, we had several options to try to ensure that they do not step on each other:
- Run each microservice on a separate VM, which is usually wasteful
- Play with CPU affinity for each microservice, on each VM - this saves you only from CPU hogs, but not from memory leeches, fork bombs, I/O swappers, etc.
This is where containers come into play - they allow you to share your machine between different applications by allocating a required portion of resources to each of them.
## Back to Kubernetes

Kubernetes supports defining limit enforcement on two resource types: CPU and RAM. For each container you can provide a requested, i.e. minimum required, amount of CPU and memory, and a limit that the container should not exceed. Requests are also used for pod scheduling, to ensure that a node provides the minimum amount of resources the pod requested. All these parameters are, of course, translated to docker parameters under the hood.
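For the Docker runtime, `requests.cpu` ends up as docker's `--cpu-shares` (1024 shares per core) and `limits.memory` as `--memory`. A rough sketch of the conversion arithmetic; the helper names here are mine, not part of any Kubernetes API:

```python
# Sketch of how Kubernetes resource quantities map to docker flags.
# 1024 CFS cpu-shares correspond to one full core, so a "500m" request
# becomes 512 shares; "Mi" memory quantities become plain byte counts.

def milli_cpu_to_shares(milli_cpu):
    """Convert CPU millicores (e.g. 500 for "500m") to docker --cpu-shares."""
    return milli_cpu * 1024 // 1000

def mebibytes_to_bytes(mib):
    """Convert a "Mi" memory quantity to the byte value passed to --memory."""
    return mib * 1024 * 1024

print(milli_cpu_to_shares(500))  # "500m" request -> 512 shares
print(mebibytes_to_bytes(100))   # "100Mi" limit  -> 104857600 bytes
```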
Since Kubernetes is quite a new kid on the block, I decided to test how enforcement behaves, to get first-hand experience with it.
So first I created a container cluster on GKE with Kubernetes 1.1.8:
```
gcloud container clusters create limits-test --machine-type n1-highcpu-4 --num-nodes 1
```
Now let's see what we got on our node (scroll right):
```
$ kubectl describe nodes
Non-terminated Pods:  (5 in total)
  Namespace    Name                                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ─────────    ────                                                       ────────────  ──────────  ───────────────  ─────────────
  kube-system  fluentd-cloud-logging-gke-limits-test-aec280e3-node-2tdw  100m (2%)     100m (2%)   200Mi (5%)       200Mi (5%)
  kube-system  heapster-v11-9rqvl                                         100m (2%)     100m (2%)   212Mi (5%)       212Mi (5%)
  kube-system  kube-dns-v9-kbzpd                                          310m (7%)     310m (7%)   170Mi (4%)       170Mi (4%)
  kube-system  kube-ui-v4-7q12m                                           100m (2%)     100m (2%)   50Mi (1%)        50Mi (1%)
  kube-system  l7-lb-controller-v0.5.2-imjry                              110m (2%)     110m (2%)   70Mi (1%)        120Mi (3%)
Allocated resources:
  (Total limits may be over 100%, i.e., overcommitted...)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ────────────  ──────────  ───────────────  ─────────────
  720m (18%)    720m (18%)  702Mi (19%)      752Mi (21%)
```
That's quite interesting already - the minimal resource overhead of Kubernetes is 720 millicores of CPU and 702 megabytes of RAM (not including kube-proxy, of course). However, the second node onward will only run one daemon pod - fluentd, for log collection - so its resource reservation will be significantly lower.
## CPU

Kubernetes defines the CPU resource as compressible, i.e. a pod can get a larger share of CPU if spare cycles are available, and this can be changed back on the fly, without a process restart/kill.
I've created a simple CPU loader that calculates the squares of the integers from 1 to 1000 in a loop on every core and prints the loops/second rate; I packaged it into a docker image and launched it into k8s using the following pod file:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-small
spec:
  containers:
  - image: docker.io/haizaar/cpu-loader:1.1
    name: cpu-small
    resources:
      requests:
        cpu: "500m"
```
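A minimal sketch of such a loader, in the spirit of the `cpu-loader` image above (this is my reconstruction, not the actual image contents):

```python
import multiprocessing
import sys
import time

def one_pass():
    """One unit of work: squares of the integers from 1 to 1000."""
    total = 0
    for i in range(1, 1001):
        total += i * i
    return total

def worker():
    """Repeat one_pass() and print how many loops fit into each second."""
    while True:
        start = time.time()
        loops = 0
        while time.time() - start < 1.0:
            one_pass()
            loops += 1
        print("%d loops/sec" % loops, flush=True)

if __name__ == "__main__" and "--run" in sys.argv:
    # One busy process per core; guarded behind --run so that
    # merely importing this file does not start the load.
    for _ in range(multiprocessing.cpu_count()):
        multiprocessing.Process(target=worker).start()
```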
I've created another, similar pod - just called it cpu-large. Attaching to the pods shortly afterwards, I saw that they get a fair share of CPU:
```
$ kubectl attach cpu-small
13448 loops/sec
13841 loops/sec
13365 loops/sec
13818 loops/sec
14937 loops/sec

$ kubectl attach cpu-large
14615 loops/sec
14448 loops/sec
14089 loops/sec
13755 loops/sec
14267 loops/sec
```
That makes sense - they both requested only 0.5 cores, and the rest was split between them, since nobody else was interested.
So in total this node can crunch ~30k loops/second. Now let's make cpu-large really large and reserve at least 2.5 cores for it, by changing its requests.cpu to 2500m and re-launching it into k8s. According to our settings, this pod should now be able to crunch at least ~25k loops/sec:
```
$ kubectl attach cpu-large
23310 loops/sec
23000 loops/sec
25822 loops/sec
23834 loops/sec
25153 loops/sec
24741 loops/sec
```
And this is indeed the case. Let's see what happened to cpu-small:
```
$ kubectl attach cpu-small
30091 loops/sec
28609 loops/sec
30219 loops/sec
27051 loops/sec
27885 loops/sec
29091 loops/sec
28699 loops/sec
18216 loops/sec
4213 loops/sec
4188 loops/sec
4296 loops/sec
4347 loops/sec
4141 loops/sec
```
First it got all of the CPU while I was re-launching cpu-large, but once the latter was up, the CPU share of cpu-small was reduced. Together they still produce the same ~30k loops/second, but now we control the share ratio.
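These numbers line up with plain CFS share arithmetic: when both pods are CPU-hungry, capacity is divided in proportion to their cpu-shares (1024 per requested core). A quick sanity check (the loaders spend some cycles on timing and printing, hence the observed ~4.2k/~25k rather than an exact 5k/25k):

```python
def expected_split(total_capacity, shares):
    """Divide total capacity proportionally to each pod's cpu-shares."""
    total_shares = sum(shares)
    return [total_capacity * s / total_shares for s in shares]

# cpu-small requests 500m -> 512 shares; cpu-large requests 2500m -> 2560 shares.
# The node crunches ~30k loops/sec in total.
small, large = expected_split(30000, [512, 2560])
print(round(small), round(large))  # -> 5000 25000
```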
What about limits? Well, it turns out that limits are currently not enforced. This is not a big problem for us, because in our deployment strategy we prefer to provide the minimum required CPU share for every pod and, for the rest - be my guest. However, at this point I was glad I did this test, since the documentation was misleading with regards to CPU limits.
## RAM

The RAM resource is incompressible - there is no way to throttle a process on memory usage or to ask it gently to unmalloc some of it. That's why, if a process reaches its RAM limit, it's simply killed.
To see how it's enforced in practice, I, again, created a simple script that allocates memory in chunks up to a predefined limit.
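A sketch of the same idea (the MAXMEM environment variable matches the pod files below; the rest is my reconstruction of the mem-loader image):

```python
import os
import time

CHUNK = 1024 * 1024  # allocate in 1 MB chunks

def allocate_up_to(max_bytes):
    """Allocate memory CHUNK bytes at a time, reporting progress."""
    hoard = []
    allocated = 0
    while allocated < max_bytes:
        # bytearray(n) is zero-filled, so the pages are actually touched.
        hoard.append(bytearray(CHUNK))
        allocated += CHUNK
        print("Reached %d megabytes" % (allocated // CHUNK), flush=True)
    return hoard

if __name__ == "__main__" and "MAXMEM" in os.environ:
    # Runs only when MAXMEM is set, as it is in the pod spec.
    hoard = allocate_up_to(int(os.environ["MAXMEM"]))
    while True:  # hold on to the memory and stand by
        time.sleep(60)
```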
First I tested how requests.memory is enforced. I've created the following pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mem-small
spec:
  containers:
  - image: docker.io/haizaar/mem-loader:1.2
    name: mem-small
    resources:
      requests:
        memory: "100Mi"
    env:
    - name: MAXMEM
      value: "2147483648"
```
and launched it. It happily allocated 2GB of RAM and stood by. Then I created a mem-large pod with a similar configuration, where requests.memory is set to "2000Mi". After I launched the large pod, the following happened:
- mem-large started allocating the desired 2GB of RAM.
- Since my k8s node only had 3.6GB RAM, system froze for dozen seconds or so.
- Since there was no free memory in the system, the kernel Out Of Memory Killer kicked in and killed my small pod:
```
[  609.739039] Out of memory: Kill process 5410 (python) score 1270 or sacrifice child
[  609.746918] Killed process 5410 (python) total-vm:1095580kB, anon-rss:1088056kB, file-rss:0kB
```
I.e. enforcement took place and my small pod was killed, since it consumed more RAM than it requested while the other pod was legitimately requesting memory. However, such behavior is unsuitable in practice, since it causes a "stop-the-world" effect for everything that runs on the particular k8s node.
Now let's see how resources.limits are enforced. To verify that, I killed both of my pods and changed mem-small as follows:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mem-small
spec:
  containers:
  - image: docker.io/haizaar/mem-loader:1.2
    name: mem-small
    resources:
      requests:
        memory: "100Mi"
      limits:
        memory: "100Mi"
    env:
    - name: MAXMEM
      value: "2147483648"
```
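One way to double-check that the limit actually reached the container is to read it back from the cgroup memory controller inside the pod. A sketch, assuming the cgroup v1 layout that Docker used at the time (the path and helper names are mine):

```python
# Default cgroup v1 location of the memory limit, as seen inside a container.
CGROUP_MEM_LIMIT = "/sys/fs/cgroup/memory/memory.limit_in_bytes"

def parse_limit_bytes(raw):
    """Parse the contents of memory.limit_in_bytes into an int."""
    return int(raw.strip())

def read_memory_limit(path=CGROUP_MEM_LIMIT):
    """Read the effective memory limit of the current cgroup."""
    with open(path) as f:
        return parse_limit_bytes(f.read())

# Inside the mem-small container above, read_memory_limit() should
# return 104857600 bytes, i.e. the "100Mi" limit from the pod spec.
```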
After launching it, I saw the following in its output:
```
Reached 94 megabytes
Reached 95 megabytes
Reached 96 megabytes
Reached 97 megabytes
Reached 98 megabytes
Reached 99 megabytes
Reached 99 megabytes
Reached 99 megabytes
Killed
```
I.e. the process was immediately killed after reaching its RAM limit. There is nice evidence of that in the kernel log:
```
[  898.665335] Task in /214bc7c0bcdb1d0bf10b8ab4cff06b451850f9af0894472a403412ea295324ea killed as a result of limit of /214bc7c0bcdb1d0bf10b8ab4cff06b451850f9af0894472a403412ea295324ea
[  898.689794] memory: usage 102400kB, limit 102400kB, failcnt 612
[  898.697490] memory+swap: usage 0kB, limit 18014398509481983kB, failcnt 0
[  898.705930] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
[  898.713672] Memory cgroup stats for /214bc7c0bcdb1d0bf10b8ab4cff06b451850f9af0894472a403412ea295324ea: cache:84KB rss:102316KB rss_huge:0KB mapped_file:4KB writeback:0KB inactive_anon:4KB active_anon:102340KB inactive_file:20KB active_file:16KB unevictable:0KB
[  898.759180] [ pid ]   uid  tgid total_vm   rss nr_ptes swapents oom_score_adj name
[  898.768961] [ 6679]     0  6679      377     1       6        0          -999 sh
[  898.778387] [ 6683]     0  6683    27423 25682      57        0          -999 python
[  898.788280] Memory cgroup out of memory: Kill process 6683 (python) score 29 or sacrifice child
```
## Conclusions

Kubernetes documentation is a bit misleading with regards to limits.cpu. Nevertheless, this mechanism looks perfectly usable for our application. All of the code and configuration used in this post is available in the following gists: