Tuesday, April 26, 2016

How Kubernetes applies resource limits

We are building one of our products on a cloud and decided to run it entirely on a Kubernetes cluster. One of the big pains that containers relieve is resource separation between the different processes (modules) of your system. Let's say we have a product that consists of several services that talk to each other ("microservices", as they are now fashionably called). Before containers, or, to be more precise, before Linux kernel control groups were introduced, we had several options to try to ensure that they do not step on each other:

  • Run each microservice on a separate VM, which is usually wasteful
  • Play with CPU affinity for each microservice, on each VM - this saves you only from CPU hogs, but not from memory leeches, fork bombs, I/O swappers, etc.

This is where containers come into play - they allow you to share your machine between different applications by allocating the required portion of resources to each of them.

Back to Kubernetes

Kubernetes supports limit enforcement on two resource types: CPU and RAM. For each container you can specify a request, i.e. the minimum amount of CPU and memory it needs, and a limit that the container should not exceed. Requests are also used for pod scheduling, to ensure that a node can provide the minimum amount of resources the pod asked for. All these parameters are of course translated to docker parameters under the hood.
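
As a rough illustration of that translation, here is a sketch of the arithmetic involved. It assumes the commonly documented convention of 1024 docker CPU shares per full core and a memory limit passed in bytes; the helper names are made up for this example and are not Kubernetes code, and only the "m" and "Ki"/"Mi"/"Gi" suffixes are handled.

```python
# Hypothetical helpers, not Kubernetes code: they only illustrate the arithmetic.

SHARES_PER_CORE = 1024  # docker's default --cpu-shares weight, assumed to mean one core
MIN_SHARES = 2          # smallest weight the kernel accepts for cpu.shares

BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_cpu_millicores(quantity: str) -> int:
    """Turn a CPU quantity such as "500m" or "2.5" into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def cpu_request_to_shares(quantity: str) -> int:
    """Scale millicores to relative CPU shares (1000m -> 1024 shares)."""
    shares = parse_cpu_millicores(quantity) * SHARES_PER_CORE // 1000
    return max(shares, MIN_SHARES)


def memory_to_bytes(quantity: str) -> int:
    """Turn a memory quantity such as "100Mi" into bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # a plain number is already in bytes


print(cpu_request_to_shares("500m"))  # relative weight for half a core
print(memory_to_bytes("100Mi"))       # byte value for a 100Mi limit
```

Note that CPU shares are only relative weights: they matter when containers compete for CPU, which is exactly what the CPU test further down exercises.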

Since Kubernetes is quite a new gorilla on the block, I decided to test how enforcement behaves to get first-hand experience with it.

So first I created a container cluster on GKE with Kubernetes 1.1.8:

gcloud container clusters create limits-test --machine-type n1-highcpu-4 --num-nodes 1

Now let's see what we got on our node (scroll right):

$ kubectl describe nodes
Non-terminated Pods:            (5 in total)
  Namespace                     Name                                                                    CPU Requests    CPU Limits      Memory Requests Memory Limits
  ─────────                     ────                                                                    ────────────    ──────────      ─────────────── ─────────────
  kube-system                   fluentd-cloud-logging-gke-limits-test-aec280e3-node-2tdw                100m (2%)       100m (2%)       200Mi (5%)      200Mi (5%)
  kube-system                   heapster-v11-9rqvl                                                      100m (2%)       100m (2%)       212Mi (5%)      212Mi (5%)
  kube-system                   kube-dns-v9-kbzpd                                                       310m (7%)       310m (7%)       170Mi (4%)      170Mi (4%)
  kube-system                   kube-ui-v4-7q12m                                                        100m (2%)       100m (2%)       50Mi (1%)       50Mi (1%)
  kube-system                   l7-lb-controller-v0.5.2-imjry                                           110m (2%)       110m (2%)       70Mi (1%)       120Mi (3%)
Allocated resources:
  (Total limits may be over 100%, i.e., overcommitted...)
  CPU Requests  CPU Limits      Memory Requests Memory Limits
  ────────────  ──────────      ─────────────── ─────────────
  720m (18%)    720m (18%)      702Mi (19%)     752Mi (21%)

That's quite interesting already - the minimal resource overhead of Kubernetes is 720 millicores of CPU and 702 MiB of RAM (not including kubelet and kube-proxy, of course). However, from the second node onward only one daemon pod will run - fluentd for log collection - so the resource reservation there will be significantly lower.
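
To double-check those totals, the per-pod numbers from the output above can simply be added up. The snippet below is only a sanity check of the arithmetic; the pod names are shortened, and the 4000 millicores figure assumes the 4 vCPUs of an n1-highcpu-4 machine.

```python
# Requests and limits copied from the `kubectl describe nodes` output above.
# Tuples are (cpu request in millicores, memory request in Mi, memory limit in Mi).
system_pods = {
    "fluentd-cloud-logging": (100, 200, 200),
    "heapster": (100, 212, 212),
    "kube-dns": (310, 170, 170),
    "kube-ui": (100, 50, 50),
    "l7-lb-controller": (110, 70, 120),
}

NODE_MILLICORES = 4 * 1000  # assumption: n1-highcpu-4 exposes 4 vCPUs

cpu_total = sum(cpu for cpu, _, _ in system_pods.values())
mem_request_total = sum(req for _, req, _ in system_pods.values())
mem_limit_total = sum(limit for _, _, limit in system_pods.values())
cpu_percent = round(100 * cpu_total / NODE_MILLICORES)

print(f"CPU requests: {cpu_total}m ({cpu_percent}%)")
print(f"Memory requests: {mem_request_total}Mi, memory limits: {mem_limit_total}Mi")
```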

CPU

Kubernetes defines the CPU resource as compressible, i.e. a pod can get a larger part of the CPU share if there is CPU available, and this can be changed back on the fly, without restarting or killing the process.

I've created a simple CPU loader that calculates squares of integers from 1 to 1000 in a loop on every core and prints the number of loops per second; packaged it into a docker image and launched it into k8s using the following pod file:

apiVersion: v1
kind: Pod
metadata:
  name: cpu-small
spec:
  containers:
  - image: docker.io/haizaar/cpu-loader:1.1
    name: cpu-small
    resources:
      requests:
        cpu: "500m"
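
For readers who want a feel for what such a loader looks like, here is a minimal sketch. It is not the code from the image above, just an assumed equivalent: one worker process per core, each counting how many times it can square the integers 1 to 1000 within a sampling interval. All function names are invented for this example.

```python
import multiprocessing
import time


def count_loops(duration: float) -> int:
    """Count how many passes over 1..1000 (squaring each) fit into `duration` seconds."""
    loops = 0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for i in range(1, 1001):
            square = i * i  # the busy work; the result is deliberately discarded
        loops += 1
    return loops


def worker(counter, duration: float) -> None:
    """Run one measurement and add the result to the shared counter."""
    loops = count_loops(duration)
    with counter.get_lock():
        counter.value += loops


def measure_all_cores(duration: float = 1.0) -> int:
    """Start one worker per core and return the combined loop count."""
    counter = multiprocessing.Value("q", 0)
    procs = [
        multiprocessing.Process(target=worker, args=(counter, duration))
        for _ in range(multiprocessing.cpu_count())
    ]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    return counter.value


if __name__ == "__main__":
    # Print a few one-second samples; a real loader would keep going forever.
    for _ in range(3):
        print(f"{measure_all_cores()} loops/sec")
```

The absolute numbers depend entirely on the machine and the interpreter, so only the ratio between pods is meaningful.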

I've created another pod similar to this - just called it cpu-large. Attaching to pods shortly afterwards, I saw that they get a fair share of CPU:

$ kubectl attach cpu-small
13448 loops/sec
13841 loops/sec
13365 loops/sec
13818 loops/sec
14937 loops/sec

$ kubectl attach cpu-large
14615 loops/sec
14448 loops/sec
14089 loops/sec
13755 loops/sec
14267 loops/sec

That makes sense - they both requested only .5 cores and the rest was split between them, since nobody else was interested. So in total this node can crunch ~30k loops/second. Now let's make cpu-large really large and reserve at least 2.5 cores for it by changing its requests.cpu to 2500m and re-launching it into k8s. According to our settings, this pod should now be able to crunch at least ~25k loops/sec:

$ kubectl attach cpu-large
23310 loops/sec
23000 loops/sec
25822 loops/sec
23834 loops/sec
25153 loops/sec
24741 loops/sec

And this is indeed the case. Let's see what happened to cpu-small:

$ kubectl attach cpu-small
30091 loops/sec
28609 loops/sec
30219 loops/sec
27051 loops/sec
27885 loops/sec
29091 loops/sec
28699 loops/sec
18216 loops/sec
4213 loops/sec
4188 loops/sec
4296 loops/sec
4347 loops/sec
4141 loops/sec

First it got all of the CPU while I was re-launching cpu-large, but once the latter was up, the CPU share for cpu-small was reduced. Together they will produce the same ~30k loops/second, but we now control the share ratio.
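
The observed split matches what a proportional division by requests would predict. The sketch below does that arithmetic under two simplifying assumptions of mine: the node delivers ~30k loops/second in total, and only the two loader pods compete for CPU (the mostly idle system pods are ignored).

```python
def expected_split(total_loops: float, requests_m: dict) -> dict:
    """Divide the node's throughput between busy pods in proportion to their CPU requests."""
    total_request = sum(requests_m.values())
    return {
        name: total_loops * request / total_request
        for name, request in requests_m.items()
    }


# 500m vs 2500m is a 1:5 ratio, so ~30k loops/sec splits into ~5k and ~25k.
split = expected_split(30000, {"cpu-small": 500, "cpu-large": 2500})
for name, loops in split.items():
    print(f"{name}: ~{loops:.0f} loops/sec")
```

The measured ~4.2k for cpu-small is a little below this estimate, which is plausible given that the system pods take some CPU too.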

What about limits? Well, it turns out that limits are currently not enforced. This is not a big problem for us, because in our deployment strategy we prefer to provide the minimum required CPU share for every pod and for the rest - be my guest. However, at this point I was glad I did this test, since the documentation was misleading with regard to CPU limits.

RAM

The RAM resource is incompressible, because there is no way to throttle a process's memory usage or gently ask it to unmalloc some of it. That's why a process that reaches its RAM limit is simply killed.

To see how it's enforced in practice, I, again, created a simple script that allocates memory in chunks up to a predefined maximum.
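
A minimal sketch of such a script is shown below. It is not the code from the actual image: the 1 MiB chunk size, the 16 MiB default and the function name are my assumptions; only the MAXMEM variable and the "Reached N megabytes" message come from the pod file and output in this post.

```python
import os

CHUNK_BYTES = 1024 * 1024  # assumed chunk size: 1 MiB per allocation


def allocate(max_bytes: int, chunk_bytes: int = CHUNK_BYTES, report=print) -> list:
    """Allocate memory chunk by chunk until `max_bytes` is reached; return the chunks."""
    chunks = []
    allocated = 0
    while allocated + chunk_bytes <= max_bytes:
        # Non-zero fill, so the pages are really written and backed by RAM.
        chunks.append(b"x" * chunk_bytes)
        allocated += chunk_bytes
        report(f"Reached {allocated // (1024 * 1024)} megabytes")
    return chunks


if __name__ == "__main__":
    # MAXMEM is given in bytes; the small default keeps a local run harmless.
    max_bytes = int(os.environ.get("MAXMEM", str(16 * 1024 * 1024)))
    held = allocate(max_bytes)
    print(f"Holding {len(held)} chunks")
```

A real loader would keep the process alive after the last allocation so that the memory stays in use.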

First I tested how requests.memory is enforced. I created the following mem-small pod:

apiVersion: v1
kind: Pod
metadata:
  name: mem-small
spec:
  containers:
  - image: docker.io/haizaar/mem-loader:1.2
    name: mem-small
    resources:
      requests:
        memory: "100Mi"
    env:
    - name: MAXMEM
      value: "2147483648"

and launched it. It happily allocated 2GB of RAM and stood by. Then I created a mem-large pod with a similar configuration where requests.memory is set to "2000Mi". After I launched the large pod, the following happened:

  • mem-large started allocating the desired 2GB of RAM.
  • Since my k8s node only had 3.6GB of RAM, the system froze for a dozen seconds or so.
  • Since there was no free memory left in the system, the kernel's Out Of Memory Killer kicked in and killed the mem-small pod:
[  609.739039] Out of memory: Kill process 5410 (python) score 1270 or sacrifice child
[  609.746918] Killed process 5410 (python) total-vm:1095580kB, anon-rss:1088056kB, file-rss:0kB

I.e. enforcement took place and my small pod was killed, since it consumed more RAM than it requested while the other pod was legitimately claiming memory. However, such behavior is unsuitable in practice, since it causes a "stop-the-world" effect for everything that runs on that particular k8s node.

Now let's see how resources.limits are enforced. To verify that, I killed both of my pods and changed mem-small as follows:

apiVersion: v1
kind: Pod
metadata:
  name: mem-small
spec:
  containers:
  - image: docker.io/haizaar/mem-loader:1.2
    name: mem-small
    resources:
      requests:
        memory: "100Mi"
      limits:
        memory: "100Mi"
    env:
    - name: MAXMEM
      value: "2147483648"

After launching it I saw the following in its output:

Reached 94 megabytes
Reached 95 megabytes
Reached 96 megabytes
Reached 97 megabytes
Reached 98 megabytes
Reached 99 megabytes
Reached 99 megabytes
Reached 99 megabytes
Killed

I.e. the process was killed immediately after reaching its RAM limit. There is nice evidence of that in the dmesg output:

[  898.665335] Task in /214bc7c0bcdb1d0bf10b8ab4cff06b451850f9af0894472a403412ea295324ea killed as a result of limit of /214bc7c0bcdb1d0bf10b8ab4cff06b451850f9af0894472a403412ea295324ea
[  898.689794] memory: usage 102400kB, limit 102400kB, failcnt 612
[  898.697490] memory+swap: usage 0kB, limit 18014398509481983kB, failcnt 0
[  898.705930] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
[  898.713672] Memory cgroup stats for /214bc7c0bcdb1d0bf10b8ab4cff06b451850f9af0894472a403412ea295324ea: cache:84KB rss:102316KB rss_huge:0KB mapped_file:4KB writeback:0KB inactive_anon:4KB active_anon:102340KB inactive_file:20KB active_file:16KB unevictable:0KB
[  898.759180] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  898.768961] [ 6679]     0  6679      377        1       6        0          -999 sh
[  898.778387] [ 6683]     0  6683    27423    25682      57        0          -999 python
[  898.788280] Memory cgroup out of memory: Kill process 6683 (python) score 29 or sacrifice child
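
The limit reported by the kernel lines up with the pod file: "100Mi" is 100 * 1024 = 102400 kB in the kernel's units (where 1 kB means 1024 bytes). The sketch below parses the usage line from the dmesg output above and checks that; the helper name and the regular expression are mine.

```python
import re

SUFFIX_TO_KB = {"Ki": 1, "Mi": 1024, "Gi": 1024 * 1024}


def quantity_to_kb(quantity: str) -> int:
    """Convert a quantity such as "100Mi" into kB as printed by the kernel (1 kB = 1024 bytes)."""
    suffix = quantity[-2:]
    return int(quantity[:-2]) * SUFFIX_TO_KB[suffix]


# Line copied from the dmesg output above.
DMESG_LINE = "[  898.689794] memory: usage 102400kB, limit 102400kB, failcnt 612"

match = re.search(r"usage (\d+)kB, limit (\d+)kB", DMESG_LINE)
usage_kb, limit_kb = (int(value) for value in match.groups())

print(f"limit from pod file: {quantity_to_kb('100Mi')} kB")
print(f"limit from dmesg:    {limit_kb} kB, usage {usage_kb} kB")
```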

Conclusions

Kubernetes documentation is a bit misleading with regard to resources.limits.cpu. Nevertheless, this mechanism looks perfectly useful for applications. All of the code and configuration used in this post is available in the following gists: