Tuesday, September 18, 2018

Reducing docker image sizes

Several years ago I started a new greenfield project. Given the state of technology back then, I decided to go all-in on container technology: Docker and Kubernetes. We dove head-first into all of the new technologies and got our application up and running pretty fast. Back then the majority of the Docker Library was based on Debian, and it resulted in quite large images: our average Python app container image weighed about 700-1000MB. Finally the time has come to rectify that.

Why do you care

Docker images are not pulled very often, and 1GB is not that big a number in the age of clouds, so why should you care? Your mileage may vary, but these were our reasons:
  • Image pull speed - on GCE it takes about 30 seconds to pull a 1GB image. While downloads when pulling from GCR are almost instant, extraction takes a notable amount of time. When a GKE node crashes and pods migrate to other nodes, the image pull time adds to your application downtime. For comparison, pulling the 40MB coredns image from GCR takes only 1.3 seconds.
  • Disk space on GKE nodes - when you have lots of containers and update them often, you may end up with disk space pressure. Same goes for developers' laptops.
  • Deploying off-cloud - pulling gigabytes of data is no fun when you try that over a saturated 4G network during a conference demo.

Here are the strategies currently available on the market.

Use alpine based images

Sounds trivial, right? Alpine images have been around for quite some time, and the majority of the Docker Library has an -alpine variant. But not all alpine images were born the same:

Docker Library alpine variant of Python:


$ docker pull python:3.6-alpine
$ docker images python:3.6-alpine --format '{{.Size}}'
74.2MB

DIY alpine Python:


$ cat Dockerfile
FROM alpine:3.8
RUN apk add --no-cache python3
$ docker build -t alpython .
$ docker images alpython --format '{{.Size}}'
56.2MB

That is a 25% size reduction compared to the Docker Library Python!

Note: There is another "space-saving" project that takes a slightly different approach - instead of providing a complete (albeit smaller) Linux distro, they provide a minimal runtime base image for each language. Have a look at Distroless.
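For illustration, a Distroless-based Dockerfile is about as short as it gets. This is just a sketch - the image name comes from the Distroless project, but treat the exact tag and entrypoint behavior as assumptions and check their README before relying on it:

```
# Sketch of a Distroless-based image; verify the image name/tag upstream.
FROM gcr.io/distroless/python3
COPY app.py /
# Distroless images ship no shell, so only the exec (JSON) form of CMD works.
CMD ["python3", "/app.py"]
```

The trade-off is the mirror image of alpine: no shell and no package manager means smaller images and a smaller attack surface, but also no `docker exec` debugging inside the container.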

Avoid unnecessary layers

It's quite natural to write your Dockerfile as follows:

FROM alpine:3.8
RUN apk add --no-cache build-base
COPY hello.c /
RUN gcc -Wall -o hello hello.c
RUN apk del build-base
RUN rm -rf hello.c

It provides nice reuse of the layer cache, e.g. if hello.c changes, we can still reuse the installation of the build-base package from cache. There is one problem though - in the above example, the resulting image weighs 157MB(!), though the actual hello binary is just 10KB:


$ cat hello.c 
#include <stdio.h>
#include <stdlib.h>

int main() {
 printf("Hello world!\n");
 return EXIT_SUCCESS;
}
$ docker build -t layers .
$ docker images layers --format '{{.Size}}'
157MB
$ docker run --rm -ti layers ls -lah /hello
-rwxr-xr-x    1 root     root       10.4K Sep 18 10:45 /hello

The reason is that each line in a Dockerfile produces a new layer that constitutes a part of the image, even though the final FS layout may not contain all of the files. You can see the hidden "culprits" using docker history:


$ docker history layers
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
8c85cd4cd954        16 minutes ago      /bin/sh -c rm -rf hello.c                       0B                  
b0f981eae17a        17 minutes ago      /bin/sh -c apk del build-base                   20.6kB              
5f5c41aaddac        17 minutes ago      /bin/sh -c gcc -Wall -o hello hello.c           10.6kB              
e820eacd8a70        18 minutes ago      /bin/sh -c #(nop) COPY file:380754830509a9a2…   104B                
0617b2ee0c0b        18 minutes ago      /bin/sh -c apk add --no-cache build-base        153MB               
196d12cf6ab1        6 days ago          /bin/sh -c #(nop)  CMD ["/bin/sh"]              0B                  
<missing>           6 days ago          /bin/sh -c #(nop) ADD file:25c10b1d1b41d46a1…   4.41MB       

The last entry, the <missing> one, is our base image; the third line from the top is our binary, and the rest is just junk.

Squash those squishy bugs!

You can build docker images with the --squash flag. What it does is essentially leave your image with just two layers: the one you started FROM, and another one that contains only the files visible in the resulting FS (minus the FROM image).

It plays nice with the layer cache - all intermediate images are still cached, so building similar docker images will yield a high cache hit rate. A small catch: it's still considered experimental, though the feature has been available since Docker 1.13 (Jan 2017). To enable it, run your dockerd with --experimental or add "experimental": true to your /etc/docker/daemon.json. I'm also not sure about its support in SaaS container builders, but you can always spin up your own docker daemon.
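For reference, the daemon config file variant looks like this (edit /etc/docker/daemon.json and restart dockerd afterwards):

```
{
  "experimental": true
}
```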

Let's see it in action:


# Same Dockerfile as above
$ docker build --squash -t layers:squashed .
$ docker images layers:squashed --format '{{.Size}}'
4.44MB

This is exactly our alpine image with 10KB of hello binary:


$ docker inspect layers:squashed | jq '.[].RootFS.Layers'  # Just two layers as promised
[
  "sha256:df64d3292fd6194b7865d7326af5255db6d81e9df29f48adde61a918fbd8c332",
  "sha256:5b55011753b4704fdd9efef0ac8a56e51a552b237238af1ba5938e20e019f440"
]
$ mkdir /tmp/img && docker save layers:squashed | tar -xC /tmp/img; du -hsc /tmp/img/*
52K /tmp/img/118227640c4bf55636e129d8a2e1eaac3e70ca867db512901b35f6247b978cdd
4.5M /tmp/img/1341a124286c4b916d8732b6ae68bf3d9753cbb0a36c4c569cb517456a66af50
4.0K /tmp/img/712000f83bae1ca16c4f18e025c0141995006f01f83ea6d9d47831649a7c71f9.json
4.0K /tmp/img/manifest.json
4.0K /tmp/img/repositories
4.6M total

Neat!

Nothing is perfect though. Squashing your layers reduces potential for reusing them when pulling images. Consider the following:


$ cat Dockerfile
FROM alpine:3.8
RUN apk add --no-cache python3
RUN apk add --no-cache --virtual .build-deps build-base openssl-dev python3-dev
RUN pip3 install pycrypto==2.6.1
RUN apk del .build-deps
# my.py is just one "import Crypto" line
COPY my.py /

$ docker build -t mycrypto .
$ docker build --squash -t mycrypto:squashed .
$ docker images mycrypto
REPOSITORY          TAG                 IMAGE ID            CREATED              SIZE
mycrypto            squashed            9a1e85fa63f0        11 seconds ago       58.6MB
mycrypto            latest              53b3803aa92f        About a minute ago   246MB

The difference is very positive - compared to the basic Python Alpine image I built earlier, the squashed one here is just a couple of megabytes larger. The squashed image again has just two layers: the alpine base, and everything else (Python, pycrypto, and our code) squashed into one.

And here is the downside: if you have 10 such Python apps on your Docker/Kubernetes host, you are going to download and store Python 10 times. Instead of having 1 alpine layer (2MB), one Python layer (~50MB) and 10 app layers (10x2MB), which is ~75MB total, we end up with ~600MB.

One way to avoid this is to use proper base images, e.g. instead of basing everything on alpine, we can build our own Python base image and work FROM it.
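A sketch of that idea - the registry and image names below are made up for illustration:

```
# Dockerfile.base - the shared Python base, built and pushed once:
FROM alpine:3.8
RUN apk add --no-cache python3

# Dockerfile of each app - hypothetical base image name; squashing now
# only collapses the app-specific layers, while the Python layer is shared:
FROM myregistry/python-base:3.8
COPY my.py /
```

All apps built FROM the shared base reuse its layers on every node, so the ~50MB Python layer is pulled and stored once instead of ten times.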

Let's combine

Another widely employed technique is combining RUN instructions to avoid "spilling over" unnecessary layers, i.e. the above Dockerfile can be rewritten as follows:

$ cat Dockerfile-comb 
FROM alpine:3.8
RUN apk add --no-cache python3  # Other Python apps will reuse it
RUN set -ex && \
 apk add --no-cache --virtual .build-deps build-base openssl-dev python3-dev && \
 pip3 install pycrypto==2.6.1 && \
 apk del .build-deps
COPY my.py /

$ docker build -f Dockerfile-comb -t mycrypto:comb .
$ docker images mycrypto
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
mycrypto            comb                4b89e6ea6f72        7 seconds ago       59MB
mycrypto            squashed            9a1e85fa63f0        38 minutes ago      58.6MB
mycrypto            latest              53b3803aa92f        39 minutes ago      246MB

$ docker inspect  mycrypto:comb | jq '.[].RootFS.Layers'
[
  "sha256:df64d3292fd6194b7865d7326af5255db6d81e9df29f48adde61a918fbd8c332",
  "sha256:f9ac7d1d908f7d2afb3c724bbd5845f034aa41048afcf953672dfefdb43582d0",
  "sha256:10c59ffc3c3cb7aefbeed9db7e2dc94a39e4896941e55e26c6715649bf6c1813",
  "sha256:f0ac8bc96a6b044fe0e9b7d9452ecb6a01c1112178abad7aa80236d18be0a1f9"
]

The end result is similar to a squashed one and now we can control the layers.

Downsides? There are some.

One is cache reuse, or lack thereof. Every single image will have to install build-base over and over. Consider a real example with a 70-line RUN instruction: the image may take 10 minutes to build, and changing a single line in that huge instruction will start it all over.

The second is that the development experience is somewhat hackish - you descend from Dockerfile mastery into shell witchery. E.g. you can easily overlook a space character that crept in after a trailing backslash. This increases development time and ups our frustration - we are all human.
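A quick sanity check I find handy for exactly that failure mode (a home-grown sketch, not part of any official tooling): grep the Dockerfile for whitespace after a trailing backslash before building.

```shell
# Simulate the bug: a stray space after the trailing backslash silently
# breaks the shell line continuation inside a RUN instruction.
printf 'RUN apk add --no-cache \\ \n    build-base\n' > /tmp/Dockerfile.check

# Flag any line where a backslash is followed only by trailing whitespace.
grep -nE '\\[[:space:]]+$' /tmp/Dockerfile.check
```

Wiring this into CI as a lint step catches the problem before a 10-minute build wastes your time.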

Multi-stage builds

This feature is so amazing that I wonder why it is not better known. It seems like only hard-core docker builders are aware of it.

The idea is to allow one image to borrow artifacts from another. Let's apply it to the example that compiles C code:


$ cat Dockerfile-multi 
FROM alpine:3.8 AS builder
RUN apk add --no-cache build-base
COPY hello.c /
RUN gcc -Wall -o hello hello.c
RUN apk del build-base

FROM alpine:3.8
COPY --from=builder /hello /

$ docker build -f Dockerfile-multi -t layers:multi .
$ docker images layers
REPOSITORY          TAG                 IMAGE ID            CREATED              SIZE
layers              multi               98329d4147f0        About a minute ago   4.42MB
layers              squashed            712000f83bae        2 hours ago          4.44MB
layers              latest              a756fa351578        2 hours ago          157MB

That is, the size is as good as it gets (even a bit better, since our squashed variant still has a couple of apk metadata files left over). It works just great for toolchains that produce clearly distinguishable artifacts. Here is another (simplified) example for nodejs:


FROM alpine:3.8 AS builder
RUN apk add --no-cache nodejs
COPY src /src
WORKDIR /src
RUN npm install
RUN ./node_modules/.bin/jspm install
RUN ./node_modules/.bin/gulp export  # Outputs to ./build

FROM alpine:3.8
RUN apk add --no-cache nginx
COPY --from=builder /src/build /srv/www

It's trickier for other toolchains like Python, where it's not immediately clear how to copy artifacts after pip-install'ing your app. The proper way to do it, for Python, is yet to be discovered (by me).
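One direction that looks worth experimenting with - a sketch, not a proven recipe, and pip's --prefix has known caveats around script shebangs and path rewriting - is installing into an isolated prefix in the builder stage and copying that tree over:

```
FROM alpine:3.8 AS builder
RUN apk add --no-cache python3 build-base python3-dev openssl-dev
# Install into a separate prefix so the artifacts are easy to locate.
RUN pip3 install --prefix=/install pycrypto==2.6.1

FROM alpine:3.8
RUN apk add --no-cache python3
# Copy the prefix onto a path Python actually searches; adjust to taste.
COPY --from=builder /install /usr
COPY my.py /
```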

I will not describe other perks of this feature since Docker's documentation on the subject is quite verbose.

Conclusion

As you can probably tell, there is no single ultimate method to rule them all. Alpine images are a no-brainer; multi-stage builds provide a nice & clean separation, but I miss RUN --from=...; squashing has its trade-offs; and humongous RUN instructions are still a necessary evil.

We use the multi-stage approach for our nodejs images and mega-RUNs for the Python ones. When I find a clean way to extract pip's artifacts, I will definitely move to multi-stage builds there as well.