Friday, September 21, 2018

Docker multi-stage builds for Python apps

I have previously spoken highly of the multi-stage Docker image build approach, though it was not immediately clear how to apply it to Python applications.

In Python you install application dependencies and (preferably) the application itself using the pip tool. When run during an image build, pip installs everything under /usr, so there is no immediate way to copy the artifacts (that is, the app and its dependencies installed by pip) into the next build stage.

The solution I came up with is to coerce pip into installing everything under a dedicated directory. There are many ways of doing so, but from my experiments I found that installing with the --user flag and properly setting PYTHONUSERBASE is the most convenient way to install both Python libraries and app binaries (e.g. entrypoint scripts).
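To illustrate the mechanism outside of Docker, here is a rough sketch; /pyroot and the Python 3.6 paths are just assumptions for this example, and pycrypto is only an example package:

# pip puts everything under PYTHONUSERBASE instead of /usr:
# libraries go to $PYTHONUSERBASE/lib/python3.6/site-packages/
# and console_scripts entry points go to $PYTHONUSERBASE/bin/
PYTHONUSERBASE=/pyroot pip install --user pycrypto==2.6.1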

Eventually it's quite straightforward, and I wonder why I didn't find any formal guides on this.

Without further ado, let's see how it can be done.

Setup

Let's use a sample Python Hello World project that contains a proper setup.py to install both the app's libraries and its entrypoint script.

Note: I urge you to use setup.py even if you don't plan to distribute your app. Simply copying your Python sources into a Docker image will eventually break - you may end up copying __pycache__ directories, tests, test fixtures, etc. Having a working setup.py makes it easy to use your app as an installable component in other apps/images.

Let's setup our test environment:


git clone git@github.com:haizaar/python-helloworld.git
cd python-helloworld/
# Add some artificial requirements to make the example more real
echo pycrypto==2.6.1 > requirements.txt

The Dockerfile

All the "magic" is happening below. I've added inline comments to ease on reading.

FROM alpine:3.8 AS builder

ENV LANG C.UTF-8

# This is our runtime
RUN apk add --no-cache python3
RUN ln -sf /usr/bin/pip3 /usr/bin/pip
RUN ln -sf /usr/bin/python3 /usr/bin/python

# This is dev runtime
RUN apk add --no-cache --virtual .build-deps build-base python3-dev
# Using latest versions, but pinning them
RUN pip install --upgrade pip==18.0
RUN pip install --upgrade setuptools==40.4.1

COPY . /build
WORKDIR /build

# This is where pip will install to
ENV PYROOT /pyroot
# A convenience to have console_scripts in PATH
ENV PATH $PYROOT/bin:$PATH

# THE MAIN COURSE #

# Install dependencies
RUN PYTHONUSERBASE=$PYROOT pip install --user -r requirements.txt
# Install our application
RUN PYTHONUSERBASE=$PYROOT pip install --user .

####################
# Production image #
####################
FROM alpine:3.8 AS prod
# This is our runtime, again
# It would be better to refactor this into a separate base image to avoid instruction duplication
RUN apk add --no-cache python3
RUN ln -sf /usr/bin/pip3 /usr/bin/pip
RUN ln -sf /usr/bin/python3 /usr/bin/python

ENV PYROOT /pyroot
ENV PATH $PYROOT/bin:$PATH
ENV PYTHONPATH $PYROOT/lib/python
# This is crucial for pkg_resources to work
ENV PYTHONUSERBASE $PYROOT

# Finally, copy artifacts
COPY --from=builder $PYROOT/lib/ $PYROOT/lib/
# In most cases we don't need entry points provided by other libraries
COPY --from=builder $PYROOT/bin/helloworld_in_python $PYROOT/bin/

CMD ["helloworld_in_python"]

Let's see that it works:


$ docker build -t pyhello .
$ docker run --rm -ti pyhello
Hello, world
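And to check what we gained size-wise (the exact number will vary with the base image and dependencies):

$ docker images pyhello --format '{{.Size}}'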

As I mentioned before, it's really straightforward. I've already packed one of our real apps using this approach and so far it works well.

Using pipenv?

If you use pipenv, which I like a lot, you can still apply this approach. Instead of doing pipenv install --system --deploy, do the following in your Dockerfile:

ENV REQS /tmp/requirements.txt
RUN pipenv lock -r > $REQS
RUN PYTHONUSERBASE=$PYROOT pip install --user -r $REQS

Tuesday, September 18, 2018

Reducing docker image sizes

Several years ago I started a new greenfield project. Based on the state of technology affairs back then, I decided to go all-in on container technology - Docker and Kubernetes. We dove head first into all of the new technologies and got our application up and running pretty fast. Back then the majority of the Docker Library was based on Debian, which resulted in quite large images - our average Python app container image weighs about 700-1000MB. Finally the time has come to rectify that.

Why do you care

Docker images are not pulled too often, and 1GB is not a big number in the age of clouds, so why care? Your mileage may vary, but these are our reasons:
  • Image pull speed - on GCE it takes about 30 seconds to pull a 1GB image. While downloads when pulling from GCR are almost instant, extraction takes a notable amount of time. When a GKE node crashes and pods migrate to other nodes, the image pull time adds to your application downtime. For comparison, pulling the 40MB coredns image from GCR takes only 1.3 seconds.
  • Disk space on GKE nodes - when you have lots of containers and update them often, you may end up with disk space pressure. Same goes for developers' laptops.
  • Deploying off-cloud - pulling gigabytes of data is no fun when you try that over saturated 4G network during a conference demo.

Here are the strategies currently available on the market.

Use alpine based images

Sounds trivial, right? Alpine images have been around for quite some time, and the majority of the Docker Library has an -alpine variant. But not all alpine images are born the same:

Docker Library alpine variant of Python:


$ docker pull python:3.6-alpine
$ docker images python:3.6-alpine --format '{{.Size}}'
74.2MB

DIY alpine Python:


$ cat Dockerfile
FROM alpine:3.8
RUN apk add --no-cache python3
$ docker build -t alpython .
$ docker images alpython --format '{{.Size}}'
56.2MB

This is about a 25% size reduction compared to the Docker Library Python!

Note: There is another "space-saving" project that takes a slightly different approach - instead of providing a complete Linux distro, albeit a smaller one, they provide a minimal runtime base image for each language. Have a look at Distroless.

Avoid unnecessary layers

It's quite natural to write your Dockerfile as follows:

FROM alpine:3.8
RUN apk add --no-cache build-base
COPY hello.c /
RUN gcc -Wall -o hello hello.c
RUN apk del build-base
RUN rm -rf hello.c

It provides nice reuse of the layer cache, e.g. if hello.c changes, we can still reuse the installation of the build-base package from cache. There is one problem though - in the above example the resulting image weighs 157MB(!), while the actual hello binary is just 10KB:


$ cat hello.c 
#include <stdio.h>
#include <stdlib.h>

int main() {
 printf("Hello world!\n");
 return EXIT_SUCCESS;
}
$ docker build -t layers .
$ docker images layers --format '{{.Size}}'
157MB
$ docker run --rm -ti layers ls -lah /hello
-rwxr-xr-x    1 root     root       10.4K Sep 18 10:45 /hello

The reason is that each line in a Dockerfile produces a new layer that constitutes a part of the image, even though the final FS layout may not contain all of the files. You can see the hidden culprits using docker history:


$ docker history layers
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
8c85cd4cd954        16 minutes ago      /bin/sh -c rm -rf hello.c                       0B                  
b0f981eae17a        17 minutes ago      /bin/sh -c apk del build-base                   20.6kB              
5f5c41aaddac        17 minutes ago      /bin/sh -c gcc -Wall -o hello hello.c           10.6kB              
e820eacd8a70        18 minutes ago      /bin/sh -c #(nop) COPY file:380754830509a9a2…   104B                
0617b2ee0c0b        18 minutes ago      /bin/sh -c apk add --no-cache build-base        153MB               
196d12cf6ab1        6 days ago          /bin/sh -c #(nop)  CMD ["/bin/sh"]              0B                  
<missing>           6 days ago          /bin/sh -c #(nop) ADD file:25c10b1d1b41d46a1…   4.41MB       

The last one - the missing one - is our base image, the third line from the top is our binary, and the rest is just junk.

Squash those squishy bugs!

You can build docker images with the --squash flag. What it does is essentially leave your image with just two layers - the one you started FROM, and another one that contains only the files visible in the resulting FS (minus the FROM image).

It plays nice with the layer cache - all intermediate images are still cached, so building similar docker images will yield a high cache hit rate. A small catch - it's still considered experimental, though the feature has been available since Docker 1.13 (Jan 2017). To enable it, run your dockerd with --experimental or add "experimental": true to your /etc/docker/daemon.json. I'm also not sure about its support in SaaS container builders, but you can always spin up your own docker daemon.
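For reference, one way to switch it on for a locally managed daemon (a sketch that assumes systemd and that /etc/docker/daemon.json does not exist yet):

echo '{"experimental": true}' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker
docker version | grep Experimental   # the Server section should now say "true"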

Let's see it in action:


# Same Dockerfile as above
$ docker build --squash -t layers:squashed .
$ docker images layers:squashed --format '{{.Size}}'
4.44MB

This is exactly our alpine image with 10KB of hello binary:


$ docker inspect layers:squashed | jq '.[].RootFS.Layers'  # Just two layers as promised
[
  "sha256:df64d3292fd6194b7865d7326af5255db6d81e9df29f48adde61a918fbd8c332",
  "sha256:5b55011753b4704fdd9efef0ac8a56e51a552b237238af1ba5938e20e019f440"
]
$ mkdir /tmp/img && docker save layers:squashed | tar -xC /tmp/img; du -hsc /tmp/img/*
52K /tmp/img/118227640c4bf55636e129d8a2e1eaac3e70ca867db512901b35f6247b978cdd
4.5M /tmp/img/1341a124286c4b916d8732b6ae68bf3d9753cbb0a36c4c569cb517456a66af50
4.0K /tmp/img/712000f83bae1ca16c4f18e025c0141995006f01f83ea6d9d47831649a7c71f9.json
4.0K /tmp/img/manifest.json
4.0K /tmp/img/repositories
4.6M total

Neat!

Nothing is perfect though. Squashing your layers reduces potential for reusing them when pulling images. Consider the following:


$ cat Dockerfile
FROM alpine:3.8
RUN apk add --no-cache python3
RUN apk add --no-cache --virtual .build-deps build-base openssl-dev python3-dev
RUN pip3 install pycrypto==2.6.1
RUN apk del .build-deps
COPY my.py /  # Just one "import Crypto" line

$ docker build -t mycrypto .
$ docker build --squash -t mycrypto:squashed .
$ docker images mycrypto
REPOSITORY          TAG                 IMAGE ID            CREATED              SIZE
mycrypto            squashed            9a1e85fa63f0        11 seconds ago       58.6MB
mycrypto            latest              53b3803aa92f        About a minute ago   246MB

The difference is striking - compared to the basic Python Alpine image I built earlier, the squashed one here is just 2 megabytes larger. The squashed image again has just two layers: the alpine base, and everything else (Python, pycrypto, and our code) squashed into one.

And here is the downside: If you have 10 such Python apps on your Docker/Kubernetes host, you are going to download and store Python 10 times, and instead of having 1 alpine layer (2MB), one Python layer (~50MB) and 10 app layers (10x2MB) which is ~75MB, we end up with ~600MB.

One way to avoid this is to use proper base images, e.g. instead of basing on alpine, we can build our own Python base image and work FROM it.
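A minimal sketch of that approach (the image name is made up; a Dockerfile can be fed to docker build over stdin when no COPY/ADD context is needed):

docker build -t mycompany/alpython:3.8 - <<'EOF'
FROM alpine:3.8
RUN apk add --no-cache python3
EOF
# App images then start with: FROM mycompany/alpython:3.8
# and the ~50MB Python layer is shared between all of them.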

Let's combine

Another widely employed technique is combining RUN instructions to avoid "spilling over" unnecessary layers, i.e. the above Dockerfile can be rewritten as follows:

$ cat Dockerfile-comb 
FROM alpine:3.8
RUN apk add --no-cache python3  # Other Python apps will reuse it
RUN set -ex && \
 apk add --no-cache --virtual .build-deps build-base openssl-dev python3-dev && \
 pip3 install pycrypto==2.6.1 && \
 apk del .build-deps
COPY my.py /

$ docker build -f Dockerfile-comb -t mycrypto:comb .
$ docker images mycrypto
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
mycrypto            comb                4b89e6ea6f72        7 seconds ago       59MB
mycrypto            squashed            9a1e85fa63f0        38 minutes ago      58.6MB
mycrypto            latest              53b3803aa92f        39 minutes ago      246MB

$ docker inspect  mycrypto:comb | jq '.[].RootFS.Layers'
[
  "sha256:df64d3292fd6194b7865d7326af5255db6d81e9df29f48adde61a918fbd8c332",
  "sha256:f9ac7d1d908f7d2afb3c724bbd5845f034aa41048afcf953672dfefdb43582d0",
  "sha256:10c59ffc3c3cb7aefbeed9db7e2dc94a39e4896941e55e26c6715649bf6c1813",
  "sha256:f0ac8bc96a6b044fe0e9b7d9452ecb6a01c1112178abad7aa80236d18be0a1f9"
]

The end result is similar to a squashed one and now we can control the layers.

Downsides? There are some.

One is cache reuse, or the lack thereof. Every single image will have to install build-base over and over. Consider a real example with a 70-line-long RUN instruction: the image may take 10 minutes to build, and changing a single line in that huge instruction will start it all over.

Second, the development experience becomes somewhat hackish - you descend from Dockerfile mastery into shell witchery. E.g. you can easily overlook a space character that crept in after a trailing backslash. This increases development time and ups our frustration - we are all human.

Multi-stage builds

This feature is so amazing that I wonder why it is not better known. It seems like only hard-core docker builders are aware of it.

The idea is to allow one image to borrow artifacts from another image. Let's apply it to the example that compiles C code:


$ cat Dockerfile-multi 
FROM alpine:3.8 AS builder
RUN apk add --no-cache build-base
COPY hello.c /
RUN gcc -Wall -o hello hello.c
RUN apk del build-base

FROM alpine:3.8
COPY --from=builder /hello /

$ docker build -f Dockerfile-multi -t layers:multi .
$ docker images layers
REPOSITORY          TAG                 IMAGE ID            CREATED              SIZE
layers              multi               98329d4147f0        About a minute ago   4.42MB
layers              squashed            712000f83bae        2 hours ago          4.44MB
layers              latest              a756fa351578        2 hours ago          157MB

That is, the size is as good as it gets (even a bit better, since our squashed variant still has a couple of apk metadata files left behind). It works just great for toolchains that produce clearly distinguishable artifacts. Here is another (simplified) example for nodejs:


FROM alpine:3.8 AS builder
RUN apk add --no-cache nodejs
COPY src /src
WORKDIR /src
RUN npm install
RUN ./node_modules/.bin/jspm install
RUN ./node_modules/.bin/gulp export  # Outputs to ./build

FROM alpine:3.8
RUN apk add --no-cache nginx
COPY --from=builder /src/build /srv/www

It's trickier for other toolchains like Python, where it's not immediately clear how to copy the artifacts after pip-installing your app. The proper way to do it for Python is yet to be discovered (by me).

I will not describe other perks of this feature since Docker's documentation on the subject is quite verbose.

Conclusion

As you can probably tell, there is no one ultimate method to rule them all. Alpine images are a no-brainer; multi-stage builds provide nice & clean separation, but I miss RUN --from=...; squashing has its trade-offs; and humongous RUN instructions are still a necessary evil.

We use the multi-stage approach for our nodejs images and mega-RUNs for the Python ones. When I find a clean way to extract pip's artifacts, I will definitely move to multi-stage builds there as well.

Monday, August 6, 2018

Running docker multi-stage builds on GKE

I recently worked on reducing docker image sizes for our applications, and one of the approaches is to use docker multi-stage builds. It all worked well on my dev machine, but then I pushed the new Dockerfiles to CI and everything broke, complaining that our docker server is way too old.

The thing is that GKE K8s nodes still use docker server v17.03, even on the latest K8s 1.10 they have available. If you, like us, run your Jenkins on GKE as well, and use the K8s node's docker server for image builds, then this GKE lag will bite you one day.
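You can quickly check what your nodes are actually running; containerRuntimeVersion is part of the node status, so something like this works:

kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.containerRuntimeVersion}'
# e.g. docker://17.3.2 docker://17.3.2 docker://17.3.2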

There is a solution though - run your own docker server and make Jenkins use it. Fortunately the community thought about it before us, and the official docker images for docker itself include a -dind flavour, which stands for Docker-In-Docker.

Our Jenkins used to talk to the host's docker server through /var/run/docker.sock mounted from the host. Now, instead, we run DinD as a deployment and talk to it over TCP:


apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: dind
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        component: dind
    spec:
      containers:
      - name: dind
        image: docker:18.06.0-ce-dind
        env:
        - name: DOCKER_HOST
          value: tcp://0.0.0.0:2375
        args:
          - dockerd
          - --storage-driver=overlay2
          - --host=tcp://0.0.0.0:2375
        ports:
        - name: http
          containerPort: 2375
        securityContext:
          privileged: true
        volumeMounts:
        - name: varlibdocker
          mountPath: /var/lib/docker
        livenessProbe:
          httpGet:
            path: /v1.38/info
            port: http
        readinessProbe:
          httpGet:
            path: /v1.38/info
            port: http
      volumes:
      - name: varlibdocker
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: dind
  labels:
    component: dind
spec:
  selector:
    component: dind
  ports:
  - name: http
    targetPort: http
    port: 2375

After loading this into your cluster, add the following environment variable to your Jenkins containers: DOCKER_HOST=tcp://dind:2375, and verify that you are now talking to your new & shiny docker server 18.06:


root@jenkins-...-96d867487-rb5r8:/# docker version
Client:
 Version: 17.12.0-ce
 API version: 1.35
 Go version: go1.9.2
 Git commit: c97c6d6
 Built: Wed Dec 27 20:05:38 2017
 OS/Arch: linux/amd64

Server:
 Engine:
  Version: 18.06.0-ce
  API version: 1.38 (minimum version 1.12)
  Go version: go1.10.3
  Git commit: 0ffa825
  Built: Wed Jul 18 19:13:39 2018
  OS/Arch: linux/amd64
  Experimental: false
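For the record, one way to wire that variable into an existing Jenkins deployment (the deployment name here is illustrative):

kubectl set env deployment/jenkins DOCKER_HOST=tcp://dind:2375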

Caveat: the setup I'm describing uses emptyDir to store the built docker images and cache, i.e. restarting the pod will empty the cache. It's good enough for my needs, but you may consider using a PV/PVC for persistence, which is trivial to set up on GKE. Using emptyDir will also consume disk space from your K8s node - something to watch out for if you don't have an automatic job that purges older images.

Another small bonus of this solution is that running docker images on your Jenkins pod will now only return the images you have built. Previously this list would also include the images of containers currently running on the node.

Thursday, December 7, 2017

Quick test for GCP inter-zone networking

Prologue: it took a year to move Down Under and another 6 months to settle here, or at least to start feeling settled, but it looks like I'm back to writing, at least.

I'm in the process of designing how to move our systems to multi-zone deployment in GCP and wanted to have a brief understanding of the network latency and speed impacts. My Google-fu didn't yield any recent benchmarks on the subject, so I decided to run a couple of quick checks myself and share the results.

Setup

We are running in the us-central1 region and using n1-highmem-8 (8 vCPUs / 52GB RAM) instances as our main workhorse. I've set up one instance in each of the zones a, b, and c, with an additional instance in zone a to measure intra-zone latency.

VMCREATOR='gcloud compute instances create
                  --machine-type=n1-highmem-8
                  --image-project=ubuntu-os-cloud
                  --image=ubuntu-1604-xenial-v20171121a'

$VMCREATOR --zone=us-central1-a us-central1-a-1 us-central1-a-2
$VMCREATOR --zone=us-central1-b us-central1-b
$VMCREATOR --zone=us-central1-c us-central1-c

Latency

I used ping to measure latency, the flooding version of it:

root@us-central1-a-1 $ ping -f -c 100000 us-central1-b
Here are the results:
A ↔ A:  rtt min/avg/max/mdev = 0.041/0.072/2.882/0.036 ms, ipg/ewma 0.094/0.066 ms
A ↔ B:  rtt min/avg/max/mdev = 0.132/0.193/7.032/0.073 ms, ipg/ewma 0.209/0.213 ms
A ↔ C:  rtt min/avg/max/mdev = 0.123/0.189/4.110/0.060 ms, ipg/ewma 0.205/0.190 ms
B ↔ C:  rtt min/avg/max/mdev = 0.123/0.176/4.399/0.047 ms, ipg/ewma 0.189/0.161 ms

While inter-zone latency is twice the intra-zone latency, it's still within typical LAN figures. The mean deviation is quite low as well. Too bad ping can't compute percentiles.

Throughput

I used the iperf tool to measure throughput. Both unidirectional (each way) and bidirectional throughput was measured.
  • Server side: iperf -s
  • Client side: iperf -c <server> -t 60 -r and iperf -c <server> -t 60 -d

Note: iperf has a bug where, in client mode, it ignores any parameters specified before the client host, therefore it's crucial to specify the host as the first parameter.
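For example, measuring from us-central1-a-1 against the zone b instance:

iperf -c us-central1-b -t 60 -r   # each direction, one after another
iperf -c us-central1-b -t 60 -d   # both directions simultaneously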

Here are the results. All throughput numbers are in gigabits per second.

Zones    Send    Receive    Send + Receive
A & A    12.0    13.9       8.12 + 10.1
A & B    7.96    8.22       4.57 + 6.30
A & C    6.87    8.51       3.97 + 5.98
B & C    5.75    7.51       3.05 + 3.96

Conclusion

I remember reading in the GCP docs that their zones are kilometers away from each other, yet, according to the quick tests above, they can still be treated as one huge 10Gbit LAN - that's pretty impressive. I know such technology has been available for quite some time, but it's still impressive to have it readily available to anyone, anytime.

Saturday, April 15, 2017

My sugar findings

The posts in this blog are usually about technology. However, I've been on vacation for the last week and have spent several days reading about sugar and the products containing it, mostly on Wikipedia. Below is a summary of my findings. Please note that I have not studied either chemistry or biology since 9th grade, so please bear with me regarding possible inaccuracies.

Appetizer: in 2015 the world produced 177 million tons of sugar (all types combined). That is 24 kilograms per person per year, or about 70 grams per day, and surely much more in industrialized countries.

Monosaccharides

AKA “simple sugars”. These are the most basic types of sugar - they cannot be further hydrolyzed into simpler compounds. Those relevant for humans are glucose, fructose and galactose - the only ones the human body can absorb directly through the small intestine. Glucose can be used directly by body cells, while fructose and galactose are directed to the liver for further pre-processing.

Glucose is not “bad” per se - it's the fuel of most living organisms on earth, including humans. However, high amounts of glucose, as of other monosaccharides, can lead to insulin resistance (diabetes) and obesity. Another problem related to the intake of simple sugars is that they fuel the acid-producing bacteria living in the mouth, which leads to dental caries.

Sources

The primary sources of monosaccharides in the human diet are fruits (both fresh and dried), honey and, more recently, HFCS - High Fructose Corn Syrup. On top of that, inverted sugar is also in use, but I will cover it separately later on.

While fruits contain a high percentage of fructose, it comes together with a good amount of other beneficial nutrients, e.g. dietary fiber, vitamin C and potassium. For that reason, fruits should not be discarded because of their fructose content - overall they are healthy products and are commonly not the reason for overweight or obesity. For example, two thirds of Australians are overweight or obese, while the average Australian eats only about one piece of fruit a day.

Note: it’s quite common in the food industry to treat dried fruits with sulfur dioxide, which is a toxic gas in its natural form. The health effects of this substance are still disputed, but since it’s done to increase shelf life and enhance the visual appeal of the product, i.e. to benefit the producer and not the end user, I do not see a reason to buy dried fruits treated with it. Moreover, I’ve seen products labeled as organic that still contained sulfur dioxide, i.e. the fruits themselves were of organic origin, but were treated with sulfur dioxide nonetheless.

Honey, on the other hand, while generally perceived as “healthy food”, is actually a bunch of empty calories. Average honey consists of 80% sugars and 17% water - in particular, 38% fructose and 31% glucose. Since honey is a supersaturated liquid, containing more sugar than water, the glucose tends to crystallize into solid granules floating in fructose syrup.

Note: one interesting source of honey is a honeydew secretion.

Finally, HFCS is a sweetener produced from corn starch by breaking its carbohydrates down into glucose and fructose. The resulting solution is about 50/50 glucose/fructose (in their free form), but may vary between manufacturers. This sweetener has been generally available since the 1970s, shortly after the discovery of the enzymes necessary for its manufacturing process. There were some health concerns about HFCS, however nowadays they are generally dismissed - i.e. HFCS is no better or worse than any other added sugar, which, again, in case of excess intake can lead to obesity and diabetes.

Disaccharides

A disaccharide is a sugar formed by two joined monosaccharides. The most common examples are:
  • Lactose: glucose + galactose
  • Maltose: glucose + glucose
  • Sucrose: glucose + fructose
Disaccharides cannot be absorbed by the human body as they are - they need to be broken down, or hydrolyzed, into monosaccharides. To speed up the process and allow fast enough absorption, enzymes are secreted by the small intestine, where disaccharides are hydrolyzed and absorbed. A dedicated enzyme is secreted for each disaccharide type, e.g. lactase, maltase and sucrase. Insufficient secretion, or lack thereof, results in intolerance to certain types of disaccharides, i.e. an inability to absorb them in the small intestine. In such a case they are passed on into the large intestine, where various bacteria metabolize them, and the resulting fermentation process produces gases leading to detrimental health effects.

Another issue with disaccharides is that they, together with monosaccharides, provide food to the acid-producing bacteria that lead to dental caries. Sucrose particularly shines here, enabling anaerobic environments that boost acid production by the bacteria.

Lactose is naturally found in dairy products, but some sources say that it’s often added to bread, snacks, cereals, etc. I don’t quite remember lactose being listed on products, at least in Israel, and though I did not research the subject, my guess is that this is because it would make the products milk-kosher and thus limit their consumption. I did not study lactose any further. Maltose is a major component of brown rice syrup - this is how I initially stumbled upon it.

Sucrose, or “table sugar”, or just “sugar”, is the king of disaccharides and of all sweeteners altogether. The rest of this post will be mainly dedicated to it, but let’s finish with maltose first.

Maltose

My discovery of maltose started with reading the nutrition facts of an organic, i.e. perceived “healthy”, candy that listed “rice syrup”. Reading further, I found out that it’s a sweetener produced by breaking down the starch of whole brown rice. The traditional way to produce the syrup is to cook the rice and then add a small amount of sprouted barley grains - something that I should definitely try at home some time. Most of the current production is performed using industrial methods, as one would expect.

The outcome is, again, sweet, empty calories, for better or worse. Traditionally prepared syrup can contain up to 10% protein, however it’s usually removed in industrial products. Other than that, again - empty calories.

Sucrose

Without further ado, let’s get to sucrose, the most common of all sugars. Since Wikipedia has a quite good and succinct article on sucrose, I will only mention the topics that particularly thrilled me.

Note: interestingly enough, before the introduction of industrial sugar manufacturing methods, honey was the primary source of sweeteners in most parts of the world.

Humans have been extracting sucrose from sugar cane since about 500BC. The process is quite laborious and involves extracting the juice from crushed canes and boiling it to reduce the water content; then, while cooling, sucrose crystallizes out. Such sugar is considered Non-centrifugal cane sugar (NCS). Today’s processes are quite optimized and use agents like lime (the mineral, not the citrus fruit) and activated carbon for purification and filtering. The result is raw sugar, which is then further refined into pure sucrose and molasses (the residue).

In the 19th century, the sugar beet plant joined the sugar party. A slightly different process is used, but it also results in sucrose and molasses. Beet molasses is considered unpalatable to humans, while cane molasses is heavily used in the food industry.

While it’s generally agreed that regular white sugar (sucrose) is “bad”, in recent years there has been a trend to substitute it with various kinds of brown sugar, which are considered healthier. Let’s explore what brown sugars are.

Brown sugar is a sucrose-based sugar that has a distinctive brown color due to the presence of molasses. It’s obtained either by stopping the refinement process at different stages, or by re-adding molasses to pure white sugar. Regardless of the method, the only non-sugar nutritional value of brown sugars comes from their molasses, and since typical brown sugar contains no more than 10% molasses, its difference from white sugar is negligible, nutrition-wise. Bottom line - use brown sugars, e.g. demerara, muscovado, panela, etc., because you like their taste, not because they are healthier.

This leads to the conclusion that molasses is the only health-beneficial product of the sugar industry. The strongest, blackstrap molasses, contains significant amounts of vitamin B6 and of minerals like calcium, magnesium, iron, and manganese, with one tablespoon providing 20% of the daily value.

The only outstanding detrimental effect of sucrose that I have discovered (compared to other sugars) is its increased effect on tooth decay.

Misc

Caramel

Heating sugars, particularly sucrose, produces caramel. Sucrose first decomposes into glucose and fructose, which then build up into new compounds. Surprisingly enough, this process is not well understood.

Inverted sugar

Inverted sugar syrup is produced by splitting sucrose into its components - fructose and glucose. The resulting product is alluringly sweet, even compared to sucrose. The simplest way to obtain inverted sugar is to dissolve some sucrose in water and heat it. Citric acid (1g per kg of sugar) can be added to catalyze the process. Baking soda can be used later to neutralize the acid and thus remove the sour taste.

Sucrose inversion occurs when preparing jams, since fruits naturally contain acids. Inverted sugar provides strong preserving qualities for products that use it - this is what gives jams a relatively long shelf life even without additional preservatives.

Tuesday, November 22, 2016

Elasticsearch pipeline aggregations - monitoring used capacity

Let's say I want to set up a simple monitoring system for my desktop. The desktop uses LVM and has three volumes, v1, v2 and v3, all belonging to the vg1 volume group. I would like to monitor the used capacity of these volumes, and of the whole system, over time. It's easy to write a script that samples the used capacity of the volumes and pushes it to Elasticsearch. All I need to store is:

{
  "name": "v1",
  "ts": 1479762877,
  "used_capacity": 1288404287488
}
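A minimal sketch of such a sampling script (the mount points, the index/type names and the local ES address are my assumptions, not necessarily the author's actual setup):

TS=$(date +%s)
for VOL in v1 v2 v3; do
  # assumes each volume is mounted at /data/<name>
  USED=$(df -B1 --output=used "/data/$VOL" | tail -1 | tr -d ' ')
  curl -s -XPOST "localhost:9200/capacity_history/sample" \
       -d "{\"name\": \"$VOL\", \"ts\": $TS, \"used_capacity\": $USED}"
done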

OK, so I've put the script into cron to run every 5 minutes and the data starts pouring in. Let's do some BI on it! The first thing to find out is how full my desktop is, i.e. the total used capacity across all volumes. Sounds like an easy job for Kibana, doesn't it? Well, not really.

Part 1: Naive failure

Let's say each of my volumes is ~1TB full. Trying to chart an area viz in Kibana with an Average aggregation over used_capacity returns useless results:

The real total system capacity is ~3TB, but Kibana, rightfully, shows that AVG(v1, v2, v3) => AVG(1TB, 1TB, 1TB) => 1TB. So maybe I need Sum? Not good either:

I got a ~17TB capacity number, which is not even close to reality. This happens because Kibana uses a simple Date Histogram with a nested Sum aggregation, i.e.

  • Divide selected date range into ts buckets. 30 minutes in my example.
  • Calculate Sum of used_capacity values of all documents that fall in bucket.
That's why the larger the bucket, the weirder the results look.

This happens because Kibana is only capable of either: $$ \underbrace{\text{SUM}\left(\begin{array}{c}v1, v1, v1,...\\ v2, v2, v2,...\\ v3, v3, v3,...\\ \end{array}\right)}_{ts\ bucket} \quad\text{or}\quad \underbrace{\text{AVG}\left(\begin{array}{c}v1, v1, v1,...\\ v2, v2, v2,...\\ v3, v3, v3,...\\ \end{array}\right)}_{ts\ bucket} $$ While what I need is: $$ \underbrace{\text{SUM}\left(\begin{array}{c}\text{AVG}(v1, v1, v1,...)\\ \text{AVG}(v2, v2, v2,...)\\ \text{AVG}(v3, v3, v3,...)\\ \end{array}\right)}_{ts\ bucket} $$ So how to achieve this?

Part 2: Poor man's solution

The post title promised pipeline aggregations and I'll get there. The problem with pipeline aggregations is that they are not supported in Kibana. So, is there still a way to get along with Kibana? Sort of. I can leverage the fact that my sampling script takes the capacity values of all volumes at exactly the same time, i.e. each batch of volume metrics is pushed to ES with the same ts value. Now, if I force Kibana to use a ts bucket length of 1 minute, I can guarantee that any given bucket only contains documents belonging to a single sample batch (that's because I send measurements to ES every 5 minutes, which is much more than the 1m bucket size).

One can argue that this generates LOTS of buckets - and they would be right, but there is one optimization point to consider. The ES Date Histogram aggregation supports automatic pruning of buckets that do not have a minimum number of documents. The default is 0, which means empty buckets are returned, but Kibana wisely sets it to 1. Now let's say I want to see the capacity chart for the last 7 days, which is 7*24*60=10080 points (buckets); however, since I take measurements only every 5 minutes, most of the buckets will be pruned and we are left with only about 2000, which is fair enough for a Full HD screen. A nice side-effect of this is that it forces Kibana to draw really smooth charts :) Let's see it in action:

The above graph shows capacity data for the last 7 days. The key point is to open the Advanced section of the X-Axis dialog and put {"interval": "1m"} in the JSON Input field - this overrides Kibana's automatic interval. The bottom legend, which says "ts per 3 hours", is lying, but that's the least of the evils. Also note how smooth the graph line is.

Part 3: Pipeline aggregations!

The above solution works, but it does not scale beyond a single system - getting measurements from multiple systems at exactly the same time is tricky. Another drawback is that trying to look at several months of data will result in tens of thousands of buckets, which will burden ES, the network and Kibana.

The right solution is to implement the correct formula. I need something like this:

SELECT AVG(used_capacity), ts FROM
    (SELECT SUM(used_capacity) AS used_capacity, DATE(ts) AS ts FROM capacity_history GROUP BY DATE(ts), name)
GROUP BY ts

Elasticsearch has supported this since version 2.0 with pipeline aggregations:

GET capacity_history/_search
{
  "size": 0,
  "aggs": {
    "ts": {
      "date_histogram": {"interval": "1h", "field": "ts"},
      "aggs": {
        "vols": {
          "terms": {"field": "name.raw", "size": 0},
          "aggs": {
            "cap": {
              "avg": {"field": "logical_capacity"}
            }
          }
        },
        "total_cap": {
          "sum_bucket": {
            "buckets_path": "vols>cap"
}}}}}}

Response

  "aggregations": {
    "ts": {
      "buckets": [
        {
          "key_as_string": "1479600000",
          "key": 1479600000000,
          "doc_count": 36,
          "vols": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "v1",
                "doc_count": 12,
                "cap": {
                  "value": 1073741824000
                }
              },
              {
                "key": "v2",
                "doc_count": 12,
                "cap": {
                  "value": 1073741824000
                }
              },
              {
                "key": "v3",
                "doc_count": 12,
                "cap": {
                  "value": 1072459894784
                }
              }
            ]
          },
          "total_cap": {
            "value": 3219943542784
          }
        },
        ...
Since we only need the ts bucket key and the value of the total_cap aggregation, we can ask ES to filter the results to include only the data we need. In case we have lots of volumes, this can reduce the amount of returned data by orders of magnitude!
GET capacity_history/_search?filter_path=aggregations.ts.buckets.key,aggregations.ts.buckets.total_cap.value,took,_shards,timed_out
...
{
  "took": 92,
  "timed_out": false,
  "_shards": {
    "total": 70,
    "successful": 70,
    "failed": 0
  },
  "aggregations": {
    "ts": {
      "buckets": [
        {
          "key": 1479600000000,
          "total_cap": {
            "value": 3219943542784
          }
        },
        {
          "key": 1479603600000,
          "total_cap": {
            "value": 3220228083712
          }
        },
        ...
NOTE: I suggest always returning the meta timed_out and _shards fields, to make sure you do not get partial data.

This method is generic and will work regardless of the time alignment of the samples; the bucket size can be adjusted to return a sensible number of data points. The major drawback is that it is not supported by stock Kibana, so you will need your own custom framework to visualize it.

Thursday, July 14, 2016

You'd better have persistent storage for Elasticsearch master nodes

This is a followup to my previous post about whether Elasticsearch master nodes should have persistent storage - they'd better! The rest of the post demonstrates how you can suffer spectacular data loss with ES if the master nodes do not save their state to persistent storage.

The theory

Let's say you have the following cluster with a single index (single primary shard). You also have an application that constantly writes data to the index.

Now what happens if all your master nodes evaporate? Well, you relaunch them with clean disks. The moment masters are up, the cluster is red, since there are no data nodes, and your application can not index data.

Now the data nodes start to join. In our example, the second one joins slightly before the first. What happens is that the cluster becomes green, since the fresh masters have no idea that there is another data node, holding data, that is about to join.

Your application happily continues to index data, into the newly created index on data node 2.

Now data node 1 joins - the masters discover that it carries an old version of our index and discard it. Data loss!!!

Sounds too esoteric to happen in real life? Here is a sad & true story - back in the day we ran our ES master nodes in Kubernetes without persistent disks, i.e. on local EmptyDir volumes only. One day there was a short network outage - less than an hour. The kubelets lost connection to the K8s master node and killed the pods. Once the network was back, the pods were started - with clean disk volumes! - and our application resumed running. The only catch is that we had lost tons of data :)

The reproduction

Let's try to simulate this in practice to see what happens. I'll use a minimal ES cluster by running three ES instances on my laptop:

  • 1 master node that also serves as a client node
  • 2 data nodes. Let's call them dnode1 and dnode2

Open three shells and let's go:

  1. Start the nodes - each in separate shell
    Master:
    /usr/share/elasticsearch/bin/elasticsearch -Des.node.data=false -Des.node.master=true -Des.node.name=master-client --path.conf=/etc/elasticsearch --default.path.logs=/tmp/master-client/logs --default.path.data=/tmp/master-client
    
    Data 01:
    /usr/share/elasticsearch/bin/elasticsearch -Des.http.enabled=false -Des.node.data=true -Des.node.master=false -Des.node.name=data-01 --path.conf=/etc/elasticsearch --default.path.logs=/tmp/data-01/logs --default.path.data=/tmp/data-01
    
    Data 02:
    /usr/share/elasticsearch/bin/elasticsearch -Des.http.enabled=false -Des.node.data=true -Des.node.master=false -Des.node.name=data-02   --path.conf=/etc/elasticsearch --default.path.logs=/tmp/data-02/logs --default.path.data=/tmp/data-02
    
  2. Index a document:
    curl -XPUT 127.0.0.1:9200/users?pretty -d '{"settings": {"number_of_shards": 1, "number_of_replicas": 0}}'
    curl -XPUT 127.0.0.1:9200/users/user/1 -d '{"name": "Zaar"}'
    
  3. Check on which data node the index has landed. In my case, it was dnode2. Shutdown this data node and the master node (just hit CTRL-C in the shells)
  4. Simulate master data loss by issuing rm -rf /tmp/master-client/
  5. Bring master back (launch the same command)
  6. Index another document:
    curl -XPUT 127.0.0.1:9200/users?pretty -d '{"settings": {"number_of_shards": 1, "number_of_replicas":0}}'
    curl -XPUT 127.0.0.1:9200/users/user/2 -d '{"name": "Hai"}'
    

Now, while dnode2 is still down, we can see that the index exists in the data directories of both nodes:

$ ls /tmp/data-0*/elasticsearch/nodes/0/indices/
/tmp/data-01/elasticsearch/nodes/0/indices/:
users

/tmp/data-02/elasticsearch/nodes/0/indices/:
users

However the data on dnode2 is now in a "Schrodinger's cat" state - not quite dead, but not exactly alive either.

Let's bring node two back and see what happens (I've also set the gateway loglevel to TRACE in /etc/elasticsearch/logging.yml for better visibility):

$ /usr/share/elasticsearch/bin/elasticsearch -Des.http.enabled=false -Des.node.data=true -Des.node.master=false -Des.node.name=data-02   --path.conf=/etc/elasticsearch --default.path.logs=/tmp/data-02/logs --default.path.data=/tmp/data-02
[2016-07-01 17:07:13,528][INFO ][node                     ] [data-02] version[2.3.3], pid[11826], build[218bdf1/2016-05-17T15:40:04Z]
[2016-07-01 17:07:13,529][INFO ][node                     ] [data-02] initializing ...
[2016-07-01 17:07:14,265][INFO ][plugins                  ] [data-02] modules [reindex, lang-expression, lang-groovy], plugins [kopf], sites [kopf]
[2016-07-01 17:07:14,296][INFO ][env                      ] [data-02] using [1] data paths, mounts [[/ (/dev/mapper/kubuntu--vg-root)]], net usable_space [21.9gb], net total_space [212.1gb], spins? [no], types [ext4]
[2016-07-01 17:07:14,296][INFO ][env                      ] [data-02] heap size [990.7mb], compressed ordinary object pointers [true]
[2016-07-01 17:07:14,296][WARN ][env                      ] [data-02] max file descriptors [4096] for elasticsearch process likely too low, consider increasing to at least [65536]
[2016-07-01 17:07:16,285][DEBUG][gateway                  ] [data-02] using initial_shards [quorum]
[2016-07-01 17:07:16,513][DEBUG][indices.recovery         ] [data-02] using max_bytes_per_sec[40mb], concurrent_streams [3], file_chunk_size [512kb], translog_size [512kb], translog_ops [1000], and compress [true]
[2016-07-01 17:07:16,563][TRACE][gateway                  ] [data-02] [upgrade]: processing [global-7.st]
[2016-07-01 17:07:16,564][TRACE][gateway                  ] [data-02] found state file: [id:7, legacy:false, file:/tmp/data-02/elasticsearch/nodes/0/_state/global-7.st]
[2016-07-01 17:07:16,588][TRACE][gateway                  ] [data-02] state id [7] read from [global-7.st]
[2016-07-01 17:07:16,589][TRACE][gateway                  ] [data-02] found state file: [id:1, legacy:false, file:/tmp/data-02/elasticsearch/nodes/0/indices/users/_state/state-1.st]
[2016-07-01 17:07:16,598][TRACE][gateway                  ] [data-02] state id [1] read from [state-1.st]
[2016-07-01 17:07:16,599][TRACE][gateway                  ] [data-02] found state file: [id:7, legacy:false, file:/tmp/data-02/elasticsearch/nodes/0/_state/global-7.st]
[2016-07-01 17:07:16,602][TRACE][gateway                  ] [data-02] state id [7] read from [global-7.st]
[2016-07-01 17:07:16,602][TRACE][gateway                  ] [data-02] found state file: [id:1, legacy:false, file:/tmp/data-02/elasticsearch/nodes/0/indices/users/_state/state-1.st]
[2016-07-01 17:07:16,604][TRACE][gateway                  ] [data-02] state id [1] read from [state-1.st]
[2016-07-01 17:07:16,605][DEBUG][gateway                  ] [data-02] took 5ms to load state
[2016-07-01 17:07:16,613][INFO ][node                     ] [data-02] initialized
[2016-07-01 17:07:16,614][INFO ][node                     ] [data-02] starting ...
[2016-07-01 17:07:16,714][INFO ][transport                ] [data-02] publish_address {127.0.0.1:9302}, bound_addresses {[::1]:9302}, {127.0.0.1:9302}
[2016-07-01 17:07:16,721][INFO ][discovery                ] [data-02] elasticsearch/zcQx-01tRrWQuXli-eHCTQ
[2016-07-01 17:07:19,848][INFO ][cluster.service          ] [data-02] detected_master {master-client}{V1gaCRB8S9yj_nWFsq7uCg}{127.0.0.1}{127.0.0.1:9300}{data=false, master=true}, added {{data-01}{FnGrtAwDSDSO2j_B53I4Xg}{127.0.0.1}{127.0.0.1:9301}{master=false},{master-client}{V1gaCRB8S9yj_nWFsq7uCg}{127.0.0.1}{127.0.0.1:9300}{data=false, master=true},}, reason: zen-disco-receive(from master [{master-client}{V1gaCRB8S9yj_nWFsq7uCg}{127.0.0.1}{127.0.0.1:9300}{data=false, master=true}])
[2016-07-01 17:07:19,868][TRACE][gateway                  ] [data-02] [_global] writing state, reason [changed]
[2016-07-01 17:07:19,905][INFO ][node                     ] [data-02] started

At 17:07:16 we see the node found some data on its own disk, but discarded it at 17:07:19 after joining the cluster. Its data dir is in fact empty:

$ ls /tmp/data-0*/elasticsearch/nodes/0/indices/
/tmp/data-01/elasticsearch/nodes/0/indices/:
users

/tmp/data-02/elasticsearch/nodes/0/indices/:

Invoking stat confirms that the data directory was changed right after the "writing state" message above:

$ stat /tmp/data-02/elasticsearch/nodes/0/indices/
  File: ‘/tmp/data-02/elasticsearch/nodes/0/indices/’
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: fc01h/64513d    Inode: 1122720     Links: 2
Access: (0775/drwxrwxr-x)  Uid: ( 1000/ haizaar)   Gid: ( 1000/ haizaar)
Access: 2016-07-01 17:08:39.093619141 +0300
Modify: 2016-07-01 17:07:19.920869352 +0300
Change: 2016-07-01 17:07:19.920869352 +0300
 Birth: -

Conclusions

  • Masters' cluster state is at least as important as data. Make sure your master node disks are backed up.
  • If running on K8s - use persistent external volumes (GCEPersistentDisk if running on GKE); see the sketch below.
  • If possible, pause indexing after complete master outages until all of the data nodes come back.
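For GKE, a minimal sketch of provisioning such a disk (the name, size and zone are illustrative); it can then be referenced from the master pod spec as a gcePersistentDisk volume, or you can let a PVC with the default storage class provision one for you:

gcloud compute disks create es-master-0-state --size=10GB --zone=us-central1-a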