<h1>Rolling velocity of unladen human - the story of inner trust</h1>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOK85Vhnnv0EKfq0x9K6WG17gPi84wfMm8_wFA_PDqzlH5-UEH8QLAEqA6sko6epWuaZjeT7MVs54s6pKNGOjPmv-OLq2SkGdDoTY_zcyI43nCOmqzJB6P1DVMIWCIsbrnW0AwCn51tWBZb2PcOBlk_Fazo1jk6uHYQzYHyq_LGIgphCQG5Mht4dfz9HUo/s3174/Screenshot_20240326_030506.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="800" data-original-height="1814" data-original-width="3174" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOK85Vhnnv0EKfq0x9K6WG17gPi84wfMm8_wFA_PDqzlH5-UEH8QLAEqA6sko6epWuaZjeT7MVs54s6pKNGOjPmv-OLq2SkGdDoTY_zcyI43nCOmqzJB6P1DVMIWCIsbrnW0AwCn51tWBZb2PcOBlk_Fazo1jk6uHYQzYHyq_LGIgphCQG5Mht4dfz9HUo/s600/Screenshot_20240326_030506.png"/></a></div>
<p>Working from home has been the new normal for me for quite a while - I've been doing it for the past 7 years and don't see myself longing for daily trips to the office any time soon. But it has its challenges - I need to get out regularly, in the broad sense of the word. Otherwise a subtle sense of unrest begins to accumulate, layer by layer, until nothing seems enjoyable anymore.
<p>For me, the solution lies not in traditional avenues of escape like bars and clubs, but rather in the exhilarating rush of speed. At the age of 33 I obtained my motorcycle licence, relishing the sensation of propelling myself through the air. Yet motorcycling is not a carefree experience, even more so as you grow older and have a dear wife, a daughter and other responsibilities. I still enjoy it immensely, but it's not the kind of experience where you just let go and enjoy the wind. That is, not the thing I would do when I'm tired and have some layers of stress to undo.
<p>During a conversation with my psychologist I confirmed that my inner love for the sensation of wind rushing past my skin is true. Determined to experience this freedom without the need for control, I considered various options. Skydiving with an instructor crossed my mind, but then the eureka moment struck: a theme park! I always loved them when I was a kid but was too scared to do the larger rides.
<p>Now, at this stage of my life, it looked like a perfect opportunity both to embrace childhood dreams & fears and to enjoy the wind!
<p>Living in Melbourne, <a href="https://www.gumbuya.com.au/">Gumbuya</a> looked like the best option. With my ticket booked, I showed up ready to try my best.
<p>My first conquest was <a href="https://www.youtube.com/watch?v=PrPry_LlKCw">TNT</a>, though I chose a seat farther from the front. Even before we started the climb, adrenaline stirred, coaxing layers of pent-up stress to the surface. And I was getting ready to scream, giving myself permission to scream as much as I needed. It felt like a well-mixed cocktail of "doing the right thing", taking the right care of myself, and being so nervous & scared at the same time. As we hurtled downwards, it came time to scream. And I screamed, screamed my lungs out. At one point of the ride, a young lady on my left shouted, asking if I was OK. I could only give her a thumbs up.
<p>Once disembarked, with my heart pumping, I had only one thought: "AGAIN!". And on I went, again, and then again, just enjoying myself screaming through the wind.
<p>Then I felt my body and brain were well overstimulated and it was time to absorb. I went for a 30-minute walk around the place until my body stopped feeling all spongy and my heart rate halved back to normal.
<p>The next challenge was <a href="https://www.youtube.com/watch?v=66KYOJBX5J8">Project Zero</a>. That scared me a lot. At one point I got to the front of the line but ultimately baulked. The grip of fear was too tight. I could've forced myself, but realised I was there to have fun, and it wasn't fun. Just fear.
<p>I went for another walk, thinking. The whole situation with Project Zero didn't make sense - it's absolutely safe, nothing can happen to me, yet I'm scared, utterly scared. It was irrational - when entering a freeway on my motorbike, I hit cruise control at 100km/h, take my hands off the handlebars and enjoy the ride, steering the bike with my body lean - this is <i>objectively</i> a much scarier situation, with danger of death included. Again, irrational.
<p>I thought about astronauts for a moment - they really were launched into the void, without any guard rails... Somehow the thought about astronauts lifted me up and gave me courage. I was still scared, but for some reason, courage contained that fear. Renaming Project Zero to Gagarin in my head, I took a ride, in the middle seat. Surprisingly it wasn't as scary as my first time on TNT, potentially because on Gagarin your body has many more touch points with the sled, whereas on TNT you sit in a kind of a harness.
<p>Then it hit me - <b>my fear was caused by lack of trust!</b> I didn't trust, so I naturally tried to be in control, of which you have zero when strapped to a moving cart, and then you panic! All I had to do was trust these contraptions to give me a good ride and have fun. It was a matter of truly letting go! From there on, it was all just one great ride. I did Gagarin again, in the front seat - and it was nice! I did TNT twice again, in the front row, and didn't scream - didn't want to, just leaned on the wind and enjoyed the ride.
<p>The final test of my rediscovered inner trust was riding <a href="https://www.youtube.com/watch?v=LMa8c6iKiws">Rebel</a>. I just sank into the harness and enjoyed the physics. My jaw was so relaxed I had to pick it up after gravity pulled it out during the downswing. Sitting with your legs up in the air at the height of a ten-story building is supposed to be scary, but since I <i>trusted</i> that nothing could happen to me, I had no reason to long for control, and thus no reason to be scared & stressed by the fact that I obviously couldn't be in control. And after a full day of such active meditation it was an easy thing to do.
<p>I'm now curious if a theme park will do the trick the next time I feel I need to "get out".
<h1>Guarding GitHub secrets in your organization</h1>
<p>The GitHub Actions <a href="https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions">Secrets</a> security model is a breeze to understand when you are dealing with public, open-source projects.
Here, contributors are encouraged to fork your repository and submit changes through a pull request.
<p>But when you're working within a GitHub organization, where collaboration happens on private repositories, the game changes a bit. Forking is usually disabled and contributors are encouraged to commit their proposed
changes to a separate branch and then do a pull request to the main branch. To keep things secure, branch protection rules are set for sensitive branches like <code>main</code> to prevent unauthorized push.
<p>Sounds secure, right? But here's the catch: GitHub Actions <em>secrets</em> are indeed available when running your workflow from <em>a branch</em>.
This is a stark contrast to the fork model for public repos, where secrets are <em>not</em> passed to the runner when a workflow is triggered from a forked repository [<a href="#one">1</a>].
<p>People who are new to GitHub organizations may be caught off guard here. For instance, if you have a secret, e.g. a personal access token, that allows pushing into a protected branch, you can request
this secret in your workflow that <em>runs in a branch</em> and still push your code into the protected branch, effectively bypassing the branch protection rules.
<p>One may argue that if you don't
trust your colleagues to such an extent then you definitely have other, more important issues to solve (<a href="https://en.wikipedia.org/wiki/Conway%27s_law">Conway's Law</a>), but let's face it -
we all occasionally do silly things innocently - it's part of being human.
<p>But don't worry, there are native solutions to guard against such mishaps.
<p>Let's dive into a practical use case. Suppose I'm a repo admin and would like to periodically run a code generation workflow, say, to update my auto-generated OpenTofu configuration based on some external data. My <code>main</code>
branch is protected, requiring all pull requests to be manually approved - this is, again, to encourage everyone in my organization to contribute by creating branches in my repo, while leaving me with the final say on each pull request
to decide if the contribution makes sense.
<p>One approach I could take is to issue a personal access token (PAT), define it as a secret in <em>the repo</em>, and then use it in the checkout action to interact with the git repo as myself:
<pre><code class="language-yaml">
steps:
- name: Checkout
uses: actions/checkout@v4.1.1
with:
token: ${{ secrets.my_personal_access_token }}
- name: Code gen
run: ./generate_tf.sh
- name: Commit changes if any
uses: stefanzweifel/git-auto-commit-action@v5.0.0
</code></pre>
<p>While this approach works and <em>seems</em> to do what I want, it's <strong>severely flawed</strong>:
<ul>
<li>Any org member with write access to the repo can access my PAT, e.g. when running a workflow in a branch, and hence commit to the <code>main</code> branch.
<li>Making things worse, they will get write access <strong>to any repository I have write access to</strong>, including my private repositories outside the organization in question!
<ul>
<li>Yes, with <a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens#fine-grained-personal-access-tokens">Fine-grained access tokens</a>
I can now limit token access to a particular repo, but as of Feb 2024 that's limited only to repos in my own account, i.e. to have a PAT applicable to repos in my organization, I would need to create a legacy token
that allows writes everywhere my user has access.
</ul>
</ul>
<p>Bottom line - don't use PATs in organizations unless you are really willing to give access to everything you have on GitHub.
<p>"Of course" I hear you saying, "let's not use <em>my</em> user, let's use <em>a</em> user. E.g. let's create a user called <em>gh-bot</em>, allow it to bypass pull request protection in certain repos, issue a PAT for them and use it?"
- Sure, it will work but is similarly problematic, though not as dangerous as with your own PATs:
<ul>
<li>You need to take care of guarding credentials for that user, such as using a password manager, etc. Whoever has access to these credentials will have write access to the relevant repos.
<li>If you reuse this user over multiple repos, then whoever gets access to the PAT (again, by running a workflow in the branch on one repo) will have access to all those repos as well.
<li>You can of course define one such user for every repo, but it's wasteful because you'll need to pay for each of them to have them in your org, and credential management gets even worse.
</ul>
<p>So what should we do? - We need some kind of machine user, a service account in GCP terms or an IAM Role in AWS terms. Thankfully there is such a thing -
<a href="https://docs.github.com/en/apps/creating-github-apps/authenticating-with-a-github-app/making-authenticated-api-requests-with-a-github-app-in-a-github-actions-workflow">GitHub Apps</a>.
Yes, it may sound confusing - we need an identity, not another app to develop; but don't worry, it provides exactly that and you won't need to write any app code. Here is how it works:
<ul>
<li>You create a GitHub app. You don't need to use a real Homepage URL or Callback URL, nor implement webhooks - just enter values that make sense, context-wise. Following my previous example, let's call it <code>codegen.myrepo.mycompany.com</code>.
<li>In the app's permissions section, grant the app <code>Read & Write</code> access to <code>Contents</code>.
<li>Add the app to the list of entities allowed to bypass pull request protection rules for your repo's <code>main</code> branch.
<li>Store App ID and private key in your repo env vars / secrets.
<li>Use <code>actions/create-github-app-token</code> to create a temporary token to access your GitHub repository.
</ul>
Here is how your workflow will look:
<pre><code class="language-yaml">
steps:
- uses: actions/create-github-app-token@v1.7.0
id: codegen-bot-token
with:
app-id: ${{ vars.CODEGEN_APP_ID }}
private-key: ${{ secrets.CODEGEN_PRIVATE_KEY }}
- name: Checkout
uses: actions/checkout@v4.1.1
with:
token: ${{ steps.codegen-bot-token.outputs.token }}
- name: Code gen
run: ./generate_tf.sh
- name: Commit changes if any
uses: stefanzweifel/git-auto-commit-action@v5.0.0
</code></pre>
<p>This solves the issue of having proper machine users - you can have one app (or more!) per repo and there is neither overhead of managing them nor additional cost incurred.
<p>There is still one issue remaining - any user in your org with write access to the repo can still access the app's private key defined in the repo's secrets and hence bypass the <code>main</code> branch protection rules.
Indeed, the blast radius is now limited to just one branch in one repo, but still, it feels pointless to have branch protection that can be easily (and accidentally) bypassed by any user in your org.
<p>Thankfully, again, there is a native solution for this - GitHub Actions <a href="https://docs.github.com/en/actions/deployment/targeting-different-environments/using-environments-for-deployment">Environments</a>.
(Not to be confused with GitHub Actions <a href="https://docs.github.com/en/actions/learn-github-actions/variables">Environment Variables</a>.) Environments allow fine-grained control over who can access the secrets and env vars defined
in them.
<p>So instead of storing our app ID / private key in the repo level-secrets, let's:
<ul>
<li>Create a new environment called "codegen"
<li>Limit access to that environment to the <code>main</code> branch
<li>Move our app ID and secret from the repo level to the environment level
</ul>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRMSwFb9-b6T_Kxy0lGhWOzNYuQLV6iqODVh8v4nzkgnrlx_ydwS6NAI0L3wD0l1kaWIhIDLnKtkpNsvZNWDoJy-i06BqHmeHy1sq6RAUb9K-ecIVJpwLmu371CYro4eJ2CAY2sPhu5zYREmYkhKZxMsviZCXobR2F5VaZsMPHI7GEKQdCavclYWy6V6dB/s1600/Screenshot_20240207_160427.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="800" data-original-height="900" data-original-width="1237" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRMSwFb9-b6T_Kxy0lGhWOzNYuQLV6iqODVh8v4nzkgnrlx_ydwS6NAI0L3wD0l1kaWIhIDLnKtkpNsvZNWDoJy-i06BqHmeHy1sq6RAUb9K-ecIVJpwLmu371CYro4eJ2CAY2sPhu5zYREmYkhKZxMsviZCXobR2F5VaZsMPHI7GEKQdCavclYWy6V6dB/s1600/Screenshot_20240207_160427.png"/></a></div>
<p>The last touch is to request access to this environment within the workflow file:
<pre><code class="language-yaml">
environment: codegen
steps:
...
</code></pre>
<p>With that change in place, re-running your workflow will only work if run on the <code>main</code> branch. If a colleague tries to run it on any other branch in the repo (with innocuous intentions most of the time), they won't be allowed to:
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHJcIe1HsskM0AxlcNdBEKIX9TSIZS7K2qpm6WWmP-yt2Z8ezxFIc2GTV-iiwISFWLdb9JdPK-bve-OnwgUbtR69Thvhu0AD_-n8rXgrmfKXmC4phI6SMffpJrvaZaitKIDU3GYgHlUm0qvC7uJQQqpt82kaypRz6ygEymZEKGgZtrlCp8q70yPsG_r2JV/s1600/not-allowed.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="800" data-original-height="527" data-original-width="922" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHJcIe1HsskM0AxlcNdBEKIX9TSIZS7K2qpm6WWmP-yt2Z8ezxFIc2GTV-iiwISFWLdb9JdPK-bve-OnwgUbtR69Thvhu0AD_-n8rXgrmfKXmC4phI6SMffpJrvaZaitKIDU3GYgHlUm0qvC7uJQQqpt82kaypRz6ygEymZEKGgZtrlCp8q70yPsG_r2JV/s1600/not-allowed.png"/></a></div>
<p> That's it! Now our repo access is properly configured!
<p>As you can see, designing GitHub Actions workflows with the principle of least privilege in mind is not obvious, and quite often a well-working setup can be surprisingly open to unintended access. I hope this summary of my research on the subject will save you some time.
<h1>References</h1>
<ul>
<li id="one"> [1] https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions#using-secrets-in-a-workflow
<li><a href="https://docs.github.com/en/apps/creating-github-apps/authenticating-with-a-github-app/making-authenticated-api-requests-with-a-github-app-in-a-github-actions-workflow">Making authenticated API requests with a GitHub App in a GitHub Actions workflow<a>
<li>A community <a href="https://github.com/orgs/community/discussions/25305#discussioncomment-8256560">thread</a> pointing me in this direction - thank you Florian!
</ul>
<h1>Practical DRY vs code readability</h1>
<p>About 13 years ago, the Django docs formally introduced me to the DRY principle - Don't Repeat Yourself, as in "Don't pollute your project with almost identical pieces of code, but rather refactor them into a shared, reusable code base".
<p>It sounds easy and is indeed something people have been doing since at least the Multics days. However, trying to enforce this principle on myself was quite a struggle. Whenever I spotted repetition emerging while writing code, I felt compelled to refactor it immediately. Needless to say, attempting to build shared functionality for something still in heavy development is tiresome. The context switch between writing what you want to create and refactoring as you go wore me out heavily.
<p>It took me some time to come to terms with the perfectionist in myself: I do not <em>have</em> to DRY - it's a guide, not a law.
<p>Nowadays, when prototyping, I shamelessly repeat myself. Once I get the core functionality working, I review it and apply the following principles of my own:
<ol>
<li>Readability trumps DRY
<li>Repeating something twice is totally fine
<li>If repeating logic appears in three or more places - I'll refactor it out <strong>IF</strong> it doesn't sacrifice readability
<li>If I can't refactor without sacrificing readability - I probably have to redesign my code altogether
</ol>
<h1>Kick-start your new project with terraformed GCP-GitHub actions key-less auth</h1>
<p>In the past, to gain access to our GCP env inside GitHub Actions, we used GitHub secrets to store GCP service account keys. It worked, but for me it always felt like walking a thin line. Thankfully, GitHub now supports OIDC tokens and we can <a href="https://cloud.google.com/blog/products/identity-security/enabling-keyless-authentication-from-github-actions">set up</a> GCP Workload Identity Federation to grant key-less access for our GitHub Actions to our GCP environment.
<p>There are plenty of guides out there on how to do it, but it takes some effort to follow them, particularly if you want to terraform everything - it adds the extra work of bootstrapping the terraform configuration itself (using local state to create remote state storage, uploading state, switching to impersonation, etc.). Hence, after repeating this a couple of times, I decided to create a repository template to save time for me and hopefully for you as well.
<p>Here it is: <a href="https://github.com/zarmory/gcp-github-federation-terraform">https://github.com/zarmory/gcp-github-federation-terraform</a>
<h2>What do you get?</h2>
After cloning and configuring this repo, with a couple of commands, you'll get the following:
<ul>
<li>Terraform state bucket created
<li>Terraform service account created and permissions assigned
<li>GitHub OIDC federation set up
<li>Sample GitHub Actions workflows to validate and apply your configuration
</ul>
<p>All in all just ~100 lines of terraform code, including comments. Basically, just clone, configure and start building.
<p>All of the code is meant to serve as a working example to encourage you to hack and modify it (rather than a highly abstracted reusable module of sorts).
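<p>If you want a feel for what the federation part of the repo does before cloning it, here is a rough sketch of the equivalent <code>gcloud</code> commands (the project, pool, repo and service account names below are made up for illustration; the repo's terraform additionally handles the state bucket and the service account permissions):
<pre><code class="lang-bash">
PROJECT_ID=my-playground-project          # placeholder
GITHUB_REPO=my-org/my-repo                # placeholder
PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format='value(projectNumber)')

# Workload identity pool + OIDC provider trusting GitHub Actions tokens
gcloud iam workload-identity-pools create github-pool \
  --project="$PROJECT_ID" --location=global

gcloud iam workload-identity-pools providers create-oidc github-provider \
  --project="$PROJECT_ID" --location=global \
  --workload-identity-pool=github-pool \
  --issuer-uri="https://token.actions.githubusercontent.com" \
  --attribute-mapping="google.subject=assertion.sub,attribute.repository=assertion.repository" \
  --attribute-condition="assertion.repository == '${GITHUB_REPO}'"

# Let workflows from that repo impersonate the terraform service account
gcloud iam service-accounts add-iam-policy-binding \
  "terraform@${PROJECT_ID}.iam.gserviceaccount.com" \
  --project="$PROJECT_ID" \
  --role="roles/iam.workloadIdentityUser" \
  --member="principalSet://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/github-pool/attribute.repository/${GITHUB_REPO}"
</code></pre>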
<p>This is merely an announcement post - if interested, please continue to the repo <a href="https://github.com/zarmory/gcp-github-federation-terraform">README</a> for further details.
<h1>Getting GNU Make cloud-ready</h1>
<p>The title looks cheeky, but so is the issue.
<p>After the release of GNU Make 4.2, the team went quiet for 4 years; however, since the COVID times they've been releasing a new minor version about once a year. Surprisingly, the upgrade to GNU Make 4.4 caused issues with the Google Cloud SDK - the <code>gcloud</code> command.
<p>When working on internal projects, I like to have the <code>CLOUDSDK_CORE_PROJECT</code> environment variable populated, but I don't want to preset it to a fixed value, because every person on the team has their own playground project, which I want the tool to use as the deployment target. So I came up with the following Makefile:
<pre><code class="lang-makefile">
CLOUDSDK_CORE_PROJECT ?= $(shell gcloud config get-value project)
export CLOUDSDK_CORE_PROJECT
release:
@echo Deploying to project $(CLOUDSDK_CORE_PROJECT)
</code></pre>
<p>This way my toolchain will pick the user's default project, which usually points to their playground. And if someone wants things done differently, they can set <code>CLOUDSDK_CORE_PROJECT</code> explicitly, e.g. through <code>.envrc</code> - nice and simple.
<p>This worked very well for years until I upgraded my system and started hitting the following cryptic errors when running <code>make</code>:
<pre><code class="lang-text">
$ make
ERROR: (gcloud.config.get-value) The project property is set to the empty string, which is invalid.
To set your project, run:
$ gcloud config set project PROJECT_ID
or to unset it, run:
$ gcloud config unset project
ERROR: (gcloud.config.get-value) The project property is set to the empty string, which is invalid.
To set your project, run:
$ gcloud config set project PROJECT_ID
or to unset it, run:
$ gcloud config unset project
Deploying to project
</code></pre>
<p>After quite a bit of reading and bisecting upgraded packages (which is relatively easy with NixOS) I found that Make 4.4.x is the culprit. Reading through the <a href="https://lists.gnu.org/archive/html/help-make/2022-10/msg00020.html">release notes</a> I was surprised to find a long list of backward incompatibility warnings - quite astonishing for such a mature and feature-complete tool as GNU Make. Among them, the following paragraph caught my attention:
<blockquote>
Previously makefile variables marked as export were not exported to commands
started by the $(shell ...) function. Now, all exported variables are
exported to $(shell ...). If this leads to recursion during expansion, then
for backward-compatibility the value from the original environment is used.
To detect this change search for 'shell-export' in the .FEATURES variable.
</blockquote>
<p>Bingo! After that I could quickly reproduce the issue:
<pre><code class="lang-text">
$ CLOUDSDK_CORE_PROJECT= gcloud config get-value project
ERROR: (gcloud.config.get-value) The project property is set to the empty string, which is invalid.
To set your project, run:
$ gcloud config set project PROJECT_ID
or to unset it, run:
$ gcloud config unset project
</code></pre>
<p>So what happens is:
<ul>
<li><code>CLOUDSDK_CORE_PROJECT</code> is not set, so Make evaluates the <code>$(shell ...)</code> default
<li>Since this variable is <code>export</code>'ed, Make makes it available to the shell, assigning the empty string as its value, which breaks <code>gcloud</code>
</ul>
<p>The fix is simple, though hacky:
<pre><code class="lang-makefile">
CLOUDSDK_CORE_PROJECT ?= $(shell unset CLOUDSDK_CORE_PROJECT; gcloud config get-value project)
export CLOUDSDK_CORE_PROJECT
release:
@echo Deploying to project $(CLOUDSDK_CORE_PROJECT)
</code></pre>
<p>I.e. if the variable is set, the default will not be called. But if it's not, we clear the empty variable from the subshell environment, thus preventing things from breaking.
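<p>To illustrate the fixed behaviour (project names below are made up), the <code>$(shell ...)</code> default is only consulted when the variable isn't already set:
<pre><code class="lang-text">
# Nothing set - gcloud's configured default project is picked up
$ make release
Deploying to project my-playground-project

# Explicit value (e.g. from .envrc) wins and the default is never evaluated
$ CLOUDSDK_CORE_PROJECT=prod-project make release
Deploying to project prod-project
</code></pre>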
<p>Though trivial in the end, such small issues can easily eat up several hours of one's time, hence I'm sharing this hopefully useful nugget of knowledge on how to make your Makefile cloud-ready :)
<h1>Pointing Mozilla SOPS into the right direction</h1>
<p><a href="https://github.com/mozilla/sops">Mozilla SOPS</a> is a neat way to manage your secrets in git. I've been using it a lot over the last few years in various projects and so far I'm very happy with it.
<p>Today I stumbled upon a problem where sops refused to decrypt my file:
<pre><code class="lang-bash">
Failed to get the data key required to decrypt the SOPS file.
Group 0: FAILED
projects/foo/locations/global/keyRings/foo-keyring/cryptoKeys/foo-global-key: FAILED
- | Error decrypting key: googleapi: Error 403: Cloud Key
| Management Service (KMS) API has not been used in project
| 123xxxxxx before or it is disabled. Enable it by visiting
| https://console.developers.google.com/apis/api/cloudkms.googleapis.com/overview?project=123xxxxxxx
| then retry. If you enabled this API recently, wait a few
| minutes for the action to propagate to our systems and
| retry.
| Details:
| [
| {
| "@type": "type.googleapis.com/google.rpc.Help",
| "links": [
| {
| "description": "Google developers console API
| activation",
| "url":
| "https://console.developers.google.com/apis/api/cloudkms.googleapis.com/overview?project=123xxxxxxx"
| }
| ]
| },
| {
| "@type": "type.googleapis.com/google.rpc.ErrorInfo",
| "domain": "googleapis.com",
| "metadata": {
| "consumer": "projects/123xxxxxxx",
| "service": "cloudkms.googleapis.com"
| },
| "reason": "SERVICE_DISABLED"
| }
| ]
| , accessNotConfigured
Recovery failed because no master key was able to decrypt the file. In
order for SOPS to recover the file, at least one key has to be successful,
but none were.
</code></pre>
<p>That is, sops complained that the Google KMS service I use to encrypt/decrypt the keys behind the scenes is disabled in my project. Which didn't make sense - after all, I created the KMS keys in that project, so the service must be enabled.
I inspected the project id <code>123xxxxxxx</code> the error was referring to and was surprised to find out that it belongs to a project <code>bar</code> and not the project <code>foo</code> I was working on (the one where the KMS keys were stored).
<p>After checking environment variables and the KMS key location in the encrypted file, I had no other option but to try <code>strace</code> on the sops binary to find out what causes sops to go with project <code>bar</code> instead of <code>foo</code>. And bingo - it looked at the <code>~/.config/gcloud/application_default_credentials.json</code> file, which has a <code>quota_project_id</code> parameter pointing straight to <code>bar</code>.
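<p>If you'd rather skip the <code>strace</code> part, a quick way to check which quota project your Application Default Credentials point at (the output below is illustrative) is:
<pre><code class="lang-bash">
$ jq -r '.quota_project_id' ~/.config/gcloud/application_default_credentials.json
bar
</code></pre>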
<p>One easy fix is to run <code>gcloud auth application-default set-quota-project foo</code>. It basically tells the Google SDK to use <code>foo</code> as the billing project when calling the KMS service (the KMS API distinguishes between the <em>calling project</em> and the <em>resource-owning project</em>, as explained <a href="https://cloud.google.com/kms/quotas">here</a>). It works, but it's a fragile solution - if you are working on several projects in parallel, you need to remember to switch back and forth to the correct project, since these particular application-default settings cannot be controlled from environment variables.
<p>What if there was a way to simply tell sops (and others) to use the project owning the resource (the KMS key in my case) as the billing project as well? Apparently there is:
<pre><code class="lang-bash">
gcloud auth application-default login --disable-quota-project
...
Credentials saved to file: [~/.config/gcloud/application_default_credentials.json]
These credentials will be used by any library that requests Application Default Credentials (ADC).
WARNING:
Quota project is disabled. You might receive a "quota exceeded" or "API not enabled" error. Run $ gcloud auth application-default set-quota-project to add a quota project.
</code></pre>
And voilà - it works!
<h1>How to STOP deletion of a Cloud Storage bucket in GCP Cloud Console</h1>
<p>What if you need to delete a Cloud Storage bucket with lots of objects (tens of thousands or more)? As per the GCP <a href="https://cloud.google.com/storage/docs/best-practices#:~:text=If%20you%20want%20to%20bulk%20delete%20a%20hundred%20thousand%20or%20more%20objects">docs</a>, your options are:
<ul>
<li>Cloud Console
<li>Object Lifecycle policies (a sketch of this option follows the list)
</ul>
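<p>For completeness, the Object Lifecycle route boils down to attaching a rule that deletes every object older than 0 days and letting GCS grind through the bucket in the background - a rough sketch (the bucket name is made up):
<pre><code class="lang-bash">
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "Delete"}, "condition": {"age": 0}}
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://my-doomed-bucket
</code></pre>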
<p>Cloud Console is very handy in this scenario - a couple of clicks and you are done. However, using it to delete a bucket is akin to hiring an anonymous hitman - once you kick off the job, there is no way to stop it. But what if it was the wrong bucket? Sitting and watching your precious data melting away is a bitter experience.
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjZmWD2eSLlYP5piCWYF-axBn1RsO3PXapZUuC6mHGb-3FQlfExDOQahHJwYysBChO35kxNBhdhZZr14SboDnYhNZJa7-Odm6ce65ujaCqXN1iaTeFLbdTx_v4yLWIwz6vfEbQZiN_D_OrbRvG98LHQ1rvBk9g_V6cCXoRYoUgohPIIR-TNHN_FXnDR4w=s2640" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="600" data-original-height="1244" data-original-width="2640" src="https://blogger.googleusercontent.com/img/a/AVvXsEjZmWD2eSLlYP5piCWYF-axBn1RsO3PXapZUuC6mHGb-3FQlfExDOQahHJwYysBChO35kxNBhdhZZr14SboDnYhNZJa7-Odm6ce65ujaCqXN1iaTeFLbdTx_v4yLWIwz6vfEbQZiN_D_OrbRvG98LHQ1rvBk9g_V6cCXoRYoUgohPIIR-TNHN_FXnDR4w=s600"/></a></div>
<p>As you see, the above UI has no "OMG! Wrong bucket! Stop It Please!" button. However, there is a hack to still abort a deletion job (otherwise I wouldn't be writing this post, right? :)
<p>To abort the deletion job, all you need to do is call your fellow cloud admin and ask them to deprive you, temporarily (or not?), of write permissions to the bucket in question. Once your user is no longer able to delete objects from the bucket in question, the deletion job will fail:
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiqS2nbTNKEtjUtaQn0Dj5eyG9K9FBH8qwkuOEpei61ZzkJZq47QLejPDyuq-O3DFI4PlKVRdSzP392KRvxgoUVfwRwNPyj59lgNBeI63Dbhfus7mw_ok0-OG-z8iADMb5RAMPEci7suiBxssJZUHKcwP0fqmKPTNE4BPJpTAEctaRL49PNg-WFlD0Xfg=s3708" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="600" data-original-height="1390" data-original-width="3708" src="https://blogger.googleusercontent.com/img/a/AVvXsEiqS2nbTNKEtjUtaQn0Dj5eyG9K9FBH8qwkuOEpei61ZzkJZq47QLejPDyuq-O3DFI4PlKVRdSzP392KRvxgoUVfwRwNPyj59lgNBeI63Dbhfus7mw_ok0-OG-z8iADMb5RAMPEci7suiBxssJZUHKcwP0fqmKPTNE4BPJpTAEctaRL49PNg-WFlD0Xfg=s600"/></a></div>
<p>Cloud Console performs deletions on behalf of your user, so once your permissions have been snipped, it aborts the deletion job. I only wish there was a "Cancel" button in the Console UI to save us from using this inconvenient hack.
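<p>For the record, if your fellow admin prefers the command line, revoking your write access on the bucket can look roughly like this (the user, role and bucket below are made-up examples - the exact role to remove depends on how your access was granted):
<pre><code class="lang-bash">
gsutil iam ch -d "user:you@example.com:objectAdmin" gs://wrong-bucket
</code></pre>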
<p>Of course, the data that was deleted up until the abort is already gone (15,000 objects in my example above), and the only way to restore it, aside from backups, is to have had <a href="https://cloud.google.com/storage/docs/object-versioning">Object Versioning</a> set up in advance.
<h1>Google Chrome with taskbar entry but no actual window - how to fix</h1>
<p>I recently got a new laptop - a Lenovo X1 Yoga Gen 6 - and thought that it was a now-or-never opportunity to dive head-first into NixOS, which was something I had cherished for a long time.
Anyhow, I have displays with different DPIs and hence need to run Wayland. Things are still a bit shaky with Wayland, at least on NixOS with KDE, but it's getting better every week! - that's why I'm running on the unstable channel.
<p>Every now and then after an upgrade it happens that Chromium (and Chrome) opens up but doesn't show a window. There is a taskbar entry, it responds to right-click, shows recent docs in the right-click pop-up, etc., but no matter what I do, there is no window shown, which of course makes it unusable. This is how it looks:
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEis8TxuSvE_o7I0Xab-t6RNq_y5pj6cGLl6QbV_hRAVe1xe9ZxFhbTAFV0uhFWHQhLP5IXgezi31WXlCHmSWDjBAtHLov7EaM9vG39bfWnZT9_lXThE0wtnq_tRoTeLhF5PtWQS0VZNqC3DBfSBe4lEz9Kn0zUivxUIK_byXVWEE1m4s2Z63iq6pONzSA" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" data-original-height="268" data-original-width="721" src="https://blogger.googleusercontent.com/img/a/AVvXsEis8TxuSvE_o7I0Xab-t6RNq_y5pj6cGLl6QbV_hRAVe1xe9ZxFhbTAFV0uhFWHQhLP5IXgezi31WXlCHmSWDjBAtHLov7EaM9vG39bfWnZT9_lXThE0wtnq_tRoTeLhF5PtWQS0VZNqC3DBfSBe4lEz9Kn0zUivxUIK_byXVWEE1m4s2Z63iq6pONzSA"/></a></div>
<h1>How to fix it?</h1>
<p><strong>TL;DR;</strong>
<pre><code class="lang-bash">
cd ~/.config/chromium/Default
cat Preferences |jq 'del(.browser.window_placement, .browser.app_window_placement)' |sponge Preferences
</code></pre>
<p>How did I figure it out? I'm no Chrome dev so I did it a CLI way:
<ol>
<li>Copied my profile <code>~/.config/chromium</code> aside, removed the original and checked that Chromium starts. I.e. it's a configuration issue
<li>Used <a href="https://en.wikipedia.org/wiki/Binary_search_algorithm" target="_blank">binary search</a> to determine which files in the profile cause the issue - namely, each time I <code>rsync -av ~/.config/chromium{.old,}/Default/</code>, removed some files, and checked if it helped. Eventually I figured out that <code>Preferences</code> file is the offender
<li>Now all was left is to compare the original and newly generated <code>Preferences</code> files. It's a single-line JSON file and I had to format it with <code>jq</code> tool first. Looking at the (huge) diff I was lucky to notice that <code>.browser.window_placement</code> configuration is different; and after copying <code>Prefences</code> from my original backup and dropping this attribute my Chromimum came back to life. Since I use Chromium web apps I had to reset <code>.browser.app_window_placement</code> as well
</ol>
A bit of patience, a bit of luck and here we are! The same cure worked for Google Chrome. Hope this will help someone who stumbles upon a similar issue. Of course, if you have several Chrome/Chromium profiles, you need to patch their <code>Preferences</code> too.
<p><strong>Update Jan 2022:</strong> Apparently the above hack works only partially and the issue kept triggering until Chromium 97 landed in NixOS; it has never happened since.
<h1>So you wanna host docs?</h1>
<p>It's been long overdue to have an internal web server with an always up-to-date copy of our project documentation. We use Sphinx, and GitHub does render RST to a certain extent, so I kept delaying this task till now, but it's about time to tie up loose ends - so I embarked on a journey to find a way to host internal documentation with minimal maintenance effort and cost.
<p>With today's SaaS and cloud technologies it should've been easy, right? Just watch my private repo on GitHub for changes in the <code>docs</code> directory, pull it, run Sphinx to build static HTML pages and then upload them somewhere - not too complicated, is it? Let's see how the story unfolded.
<h2>ReadTheDocs.com</h2>
<p>That probably would've been the kinder choice - to support free software development. However, I wanted my own theme and my own domain, and that meant going for the Advanced plan at $150/month, which is quite expensive for our small team where we write docs more often than we read them :)
Still, it would've worked well with minimal setup effort, and they have a no-frills 30-day trial.
<p>Ruling out the readthedocs.com SaaS, I decided to set up a docs hosting pipeline myself using GCP tools - the cloud we use the most.
<h2>A convoluted trail through GCP</h2>
<p>It was time to catch up on recent GCP products I hadn't yet had a chance to try. To build the docs, Google Cloud Build sounded great - and it is indeed. They even have a dedicated app on the GitHub marketplace so that builds can be triggered from pull requests and the build status reflected on GitHub. That part was pretty straightforward. For hosting, I decided to upload the docs to Google Cloud Storage and figure out later how to host them privately. After some tinkering I ended up with the following <code>cloudbuild.yaml</code>:
<pre><code class="lang-yaml">
steps:
# Prepare builder image - speeds up the future builds
# It just makes sure that docbuilder:current image is present on local machine.
# I have a dedicated docbuilder for each version of docs/requirements.txt.
# I could've just started each time from python docker and pip install requirements.txt
# but lxml takes several minutes to build, which is waste...
# This optimization brings build time from 7 minutes to 1!
- name: alpine:3.10
entrypoint: sh
args:
- "-c"
- |
set -ex
apk add --no-cache docker
cd docs
IMAGE=gcr.io/$PROJECT_ID/doc-builder:$(sha1sum < requirements.txt | cut -f 1 -d ' ')
if ! docker pull $$IMAGE; then
docker build -t $$IMAGE .
docker push $$IMAGE
fi
docker tag $$IMAGE docbuilder:current
# Build the docs - reuse the image we just have built
- name: docbuilder:current
entrypoint: sh
args:
- "-c"
- |
cd docs
make html
# Publish docs
# We can't use Cloud Build artifacts since they do not support recursive upload
# https://stackoverflow.com/questions/52828977/can-google-cloud-build-recurse-through-directories-of-artifacts
- name: gcr.io/cloud-builders/gsutil
args: ["-m", "cp", "-r", "docs/_build/html/*", "gs://my-docs/$BRANCH_NAME/"]
</code></pre>
<p>This approach looked very promising; for example, I can build docs for different branches and access them simply by URL path suffixes. It also highlights one of Cloud Build's strengths - you can just run shell scripts of your choice to do anything you like. Finally, Cloud Build provides you with 120 free minutes <em>a day</em>, meaning I can build my docs every 15 minutes without it costing me a penny.
<p>Unfortunately, I hit a hosting dead end pretty quickly. I want to use GCP Identity Aware Proxy (IAP) for guarding access, and it does not work with Cloud Storage yet, though it felt quite natural (to me) to expect that it should. I explored ideas about running a container that
would mount the Cloud Storage bucket and serve it behind IAP, but if I end up hosting a container, I'm better off just building my docs into a static file server. I would have to give up on the ability to host docs from multiple branches together, but the solution of running a container in privileged mode with pre- and post-hooks to mount GCS through FUSE didn't sound very clean and would've deprived me of using managed Cloud Run (more on that below). I briefly explored the Cloud Filestore (not Fi<strong>r</strong>estore) path, but their minimum volume size is 1TB, which is $200/month - such a waste.
<p>Looks like I need to build my docs into a static-server container, so why not try hosting it on Cloud Run? With the amount of traffic to our docs it would only cost us... nothing, since we'd stay well within the free tier. However, lack of IAP support hit me again. Cloud Run supports Google Sign-In, meaning it can validate your bearer tokens, but there is still no authentication proxy support. Hopefully they will implement one soon, since it's highly anticipated - by me at least.
<p>At that point I went back to the IAP docs to reassess my options: App Engine, GCE, or GKE. I obviously decided on GKE since I had a utility GKE cluster with some spare capacity I could leech on. I ruled out App Engine - no one in my team, including myself, had any experience with it, and with the GKE option readily available I saw no reason to start acquiring any.
<p>From this point on things went pretty smoothly. I created the following Dockerfile:
<pre><code class="lang-docker">
FROM python:3.7.5-alpine3.10 AS builder
RUN apk add --no-cache build-base libxml2-dev libxslt-dev graphviz
WORKDIR /build
COPY requirements.txt ./
RUN pip install --upgrade -r ./requirements.txt
COPY . ./
RUN make html
########################
FROM nginx:1.17.5-alpine AS runner
COPY --from=builder /build/_build/html /usr/share/nginx/html
</code></pre>
<p>And use the following build config:
<pre><code class="lang-yaml">
steps:
- name: gcr.io/cloud-builders/docker
args: [
"build", "-t", "gcr.io/$PROJECT_ID/docs-server:$BRANCH_NAME-$COMMIT_SHA", "docs",
]
- name: gcr.io/cloud-builders/docker
args: [
"push", "gcr.io/$PROJECT_ID/docs-server:$BRANCH_NAME-$COMMIT_SHA",
]
- name: gcr.io/cloud-builders/gke-deploy
args:
- run
- --filename=docs/deploy # You can pass a directory here, but you'll need to read gke-deploy code to find it out
- --image=gcr.io/$PROJECT_ID/docs-server:$BRANCH_NAME-$COMMIT_SHA
- --location=us-central1-a
- --cluster=my-cluster
</code></pre>
<p>My build time unfortunately went up to 7 minutes again, which I tried to mitigate by using Kaniko, but hit a show-stopper <a href="https://github.com/GoogleContainerTools/kaniko/issues/870">bug</a> where it does not recognize changes in files copied between stages. Hopefully they fix it soon. Either that or GCS will support IAP :). For reference, the relevant Cloud Build step with Kaniko would've looked like this (instead of the docker build/push steps above):
<pre><code class="lang-yaml">
- name: gcr.io/kaniko-project/executor:latest
args:
- --cache=true
- --cache-ttl=336h # 2 weeks
- --context=/workspace/docs
- --dockerfile=/workspace/docs/Dockerfile
- --destination=gcr.io/$PROJECT_ID/docs-server:$BRANCH_NAME-$COMMIT_SHA
</code></pre>
<p> My <code>docs/deploy</code> dir contained a single K8s <code>deployment.yaml</code> file to create the K8s Deployment object. <code>gke-deploy</code> can create one by default, but it also creates a horizontal pod autoscaler, which was really overkill for my task. So here is my deployment:
<pre><code class="lang-yaml">
apiVersion: apps/v1
kind: Deployment
metadata:
name: docs-server
labels:
app: docs-server
spec:
replicas: 1
selector:
matchLabels:
app: docs-server
template:
metadata:
labels:
app: docs-server
spec:
containers:
- name: nginx
image: gcr.io/my-project/docs-server:latest # Will be overridden by gke-deploy
ports:
- containerPort: 80
</code></pre>
<p>At this point I had a pipeline that builds my docs into a static server and deploys it as a pod into one of my GKE clusters. The only thing left was to expose it to my team, securely, through IAP. This is where GKE comes in handy - you can request a Load Balancer
with an SSL certificate and IAP directly through K8s manifests! Just follow the guides: <a href="https://cloud.google.com/kubernetes-engine/docs/how-to/managed-certs">1</a>, <a href="https://cloud.google.com/kubernetes-engine/docs/tutorials/http-balancer">2</a>, and <a href="https://cloud.google.com/iap/docs/enabling-kubernetes-howto">3</a>.
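<p>To give a flavour of what the IAP part of those guides boils down to, here is a trimmed sketch - a BackendConfig enabling IAP, attached to the service via an annotation. The names and the OAuth client secret are placeholders, the ManagedCertificate and Ingress objects from guides 1 and 2 are omitted, and you should check the exact apiVersion against your GKE version:
<pre><code class="lang-bash">
kubectl apply -f - <<'EOF'
apiVersion: cloud.google.com/v1   # v1beta1 on older GKE versions
kind: BackendConfig
metadata:
  name: docs-iap
spec:
  iap:
    enabled: true
    oauthclientCredentials:
      secretName: docs-oauth-client   # K8s secret holding client_id/client_secret
---
apiVersion: v1
kind: Service
metadata:
  name: docs-server
  annotations:
    cloud.google.com/backend-config: '{"default": "docs-iap"}'
spec:
  type: NodePort   # GKE Ingress needs NodePort (or NEGs) to attach the backend
  selector:
    app: docs-server
  ports:
  - port: 80
    targetPort: 80
EOF
</code></pre>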
<p> And here we are - I now have my private docs, on a custom domain, secured behind IAP, to share with my GCP teammates. All in all, even if I ran it on a dedicated GKE cluster with a single f1-micro instance, it would cost me less than $20 per month, meaning that if I factor in the cost of my time to set it up, the price difference between host-your-own and the ReadTheDocs Advanced plan would pay off in less than 2 years :)
<h1>Testing Lua "classes" speed</h1>
<em>"Testing" is a bit too strong a word for what I've done here, but the numbers are still interesting.</em>
<p>I developed a "smart" reverse proxy recently where I decided to use <a href="https://openresty.org">OpenResty</a> platform - it's basically Nginx + Lua + goodies. Lua is the first class language so theoretically you can implement anything with it.
<p>After the couple of weeks I spent with Lua, it strongly reminds me of JavaScript 5 - while it's a complete language, it's very "raw" in the sense that while it has constructs to do anything, there is no <em>standard</em> (as in "industry-standard") way to do many things, classes being one of them. Having a strong Python background, I'm used to spending my time mostly on business logic and not googling around to find the best 3rd-party set/dict/etc. implementation. Many praise Lua's standard library asceticism (which reminds me of similar sentiments in the JS 5 days), but most of the time I get paid to create products, not tools. Also, the lack of a uniform way to do common tasks results in a quite non-uniform code base.
<p>Having said the above, I chose OpenResty. I already had Nginx deployed, so switching to OpenResty was a natural extension. It was exactly what I was looking for - a scriptable proxy - which is OpenResty's primary goal as a project. I didn't want to take a generic web server and write a middleware/plugin for it - it sounded a bit too adventurous and risky from a security perspective. So <strike>getting back to JS 5 days</strike> using a niche language like Lua was a good trade-off.
<p>Eventually I liked Lua. There is a special cuteness to it - I often find myself smiling while reading Lua code. In particular, it provided great relief from the Nginx <a href="https://www.nginx.com/resources/wiki/start/topics/depth/ifisevil/">IF evilness</a> I had to use before.
<p>Let's get to the point of this post, shall we? While imbuing my proxy with some logic, I decided to check which of the class-like approaches in Lua is the fastest. I ended up with 3 contenders:
<ul>
<li><a href="https://www.lua.org/pil/13.html">Metatables</a>
<li><a href="https://www.lua.org/pil/6.1.html">Closures</a>
<li><a href="https://stevedonovan.github.io/Penlight/api/libraries/pl.class.html">pl.class</a> - part of the excellent PenLight Lua library that aims to complement Lua with Python-inspired data types and utilities. This class implementation is also metatable-based but involves more internal boilerplate to support, e.g. inheritance.
</ul>
<p>I implemented class to test object member access, method invocation, and method chaining. The code is in the <a href="https://gist.github.com/haizaar/efcc083eb048d915a774ca901b0047ac">gist</a>.
<h2>Let's run it</h2>
<p> I used LuaJIT 2.1.0-beta3 that is supplied with the latest OpenResty docker image. pl.class documents two ways to define a class, hence I had two versions to see if there is any difference.
<p>Initialization speed
<pre><code class="lang-text">
Func: 815,112,512 calls/sec
Metatable: 815,737,335 calls/sec
Closure: 2,459,325 calls/sec
PLClass1: 1,536,435 calls/sec
PLClass2: 1,545,817 calls/sec
</code></pre>
<p>Initialization + call speed
<pre><code class="lang-text">
Metatable: 816,309,204 calls/sec
Closure: 2,104,911 calls/sec
PLClass1: 1,390,997 calls/sec
PLClass2: 1,453,514 calls/sec
</code></pre>
<p>We can see that Metatable is as fast as our baseline plain <code>func</code>. Also, with metatables, invocation does not affect speed - probably the JIT is doing an amazing job here (considering the code is trivial and predictable).
<p>Closures are much slower and invocation has a cost. pl.class, while the most syntactically rich, is the slowest one and
also takes a hit from invocation.
<h2>Conclusions</h2>
<p>Being myself a casual Lua developer, I prefer Closure approach:
<ul>
<li>It promotes composition
<li>Easy to understand - no implicit <code>self</code> var
<li>More importantly, it's unambiguous to use - no one needs to think whether you access something by dot or colon
</ul>
<p>Again, I'm a <em>casual</em> Lua developer. Had I spent more time with it, I assume my brain would adjust to things like the implicit <code>self</code> and maybe my recommendation would change.
<p>For pure speed metatable is the way, though I wonder what difference it will make in real application (<a href="https://tech.zarmory.com/2015/11/time-your-assumptions.html">time your assumptions</a>).
<p>Out of curiosity, I did similar tests <a href="https://gist.github.com/haizaar/efcc083eb048d915a774ca901b0047ac#file-bench-py">in Python</a> (where there is one sane way to write this code). The results were surprising:
<p>CPython3.7
<pre><code class="lang-text">
Benchmarking init
Func: 18,378,052 ops/sec
Class: 4,760,040 ops/sec
Closure: 2,825,914 ops/sec
Benchmarking init+invoke
Class: 1,742,217 ops/sec
Closure: 1,549,709 ops/sec
</code></pre>
<p>PyPy3.6-7.1.1:
<pre><code class="lang-text">
Benchmarking init
Func: 1,076,386,157 ops/sec
Class: 247,935,234 ops/sec
Closure: 189,527,406 ops/sec
Benchmarking init+invoke
Class: 1,073,107,020 ops/sec
Closure: 175,466,657 ops/sec
</code></pre>
<p>On CPython, if you want to do anything with your classes besides initializing them, there is not much difference between Class and Closure. "Func" aside, the performance is on par with Lua.
<p>PyPy just shines - its JIT outperforms LuaJIT by far. The fact that the speed of init+invoke on Class is similar to the raw Func benchmark tells something about its ability to trace code that does nothing :)
<h2>On the emotional side</h2>
<p>Don't believe benchmarks - lies, damn lies, and benchmarks :)
<p>Seriously though, before thinking "why didn't they embed Python", other aspects should be contemplated:
<ul>
<li>Memory. Lua uses much less of it. An array of 10 million strings, 10 bytes each, weighs 400MB in Lua versus 700+MB in CPython/PyPy.
<li>Python was a synchronous language originally with async support introduced much later. Nginx is an async server, hence Lua fits there more naturally, but I'm speculating here.
<li>Everyone says that Lua is much easier to embed.
</ul>
Finally, both can do <a href="https://luajit.org/ext_ffi.html">amazing things</a> through <a href="https://cffi.readthedocs.io/en/latest/overview.html#main-mode-of-usage">FFI</a>.
<h1>A warning about JSON serialization</h1>
<p>I added caching capabilities to one of my projects by using <a href="https://aiocache.readthedocs.io/en/latest/">aiocache</a> with a JSON serializer. While doing that I came across a strange issue where I was putting <code>{1: "a"}</code> in the cache, but received <code>{"1": "a"}</code> on retrieval - the <code>1</code> integer key came back as the <code>"1"</code> string. First I thought it was a <a href="https://github.com/argaen/aiocache/issues/441">bug</a> in aiocache, but the maintainer kindly pointed out that JSON, being <em>Javascript</em> Object Notation, does not allow mapping keys to be non-strings.
<p>However, there is a point here that's worth paying attention to - it looks like JSON libraries, at least in Python and Chrome/Firefox, will happily accept <code>{1: "a"}</code> for <em>encoding</em>
but will silently convert the keys to strings. This may lead to quite subtle bugs, as in my earlier example - cache hits will return data different from the original.
<pre><code class="lang-python">
>>> import json
>>> json.dumps({1:"a"})
'{"1": "a"}'
>>> json.loads('{1:"a"}')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/.../lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/home/.../lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/.../lib/python3.6/json/decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
</code></pre>
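<p>The cache round trip from my example can be reproduced in one line - encoding succeeds, but the decoded mapping silently comes back with a string key:
<pre><code class="lang-bash">
$ python3 -c 'import json; print(json.loads(json.dumps({1: "a"})))'
{'1': 'a'}
</code></pre>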
<h1>How Google turned a simple feature into configuration disaster</h1>
<p><em>A grain of frustration with a light at the end of the tunnel.</em>
<p>A while back, G Suite had a simple way to create email distribution lists - in the admin panel you could simply create a group of users, e.g. support@example.com, add a couple of members, decide whether users outside of your organization can send email to the group address, and you were done.
<p>Today I tried to do the same - to create a distribution list - with the current version of G Suite for Business, which ended up costing hours of effort and a lengthy conversation with G Suite support.
<p>First I tried to create a group and to send an email to its address - nope, does not work:
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2P4_D3SNTMom0O-entKIKSbYCHMeQuMf3NdIYbv135PKs0ErlWuWi4pGBFW9LPVvYT1xhhMp3DXjIpNuhUwPHUZe7UIebXZ7xKOxAHicTyPdmSs41C_I68GQbn6lqEW-eQTnCqyBdV0hV/s1600/Selection_999%2528026%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2P4_D3SNTMom0O-entKIKSbYCHMeQuMf3NdIYbv135PKs0ErlWuWi4pGBFW9LPVvYT1xhhMp3DXjIpNuhUwPHUZe7UIebXZ7xKOxAHicTyPdmSs41C_I68GQbn6lqEW-eQTnCqyBdV0hV/s400/Selection_999%2528026%2529.png" width="400" height="106" data-original-width="1375" data-original-height="363" /></a></div>
<p>Grou-what? A Google Group? I don't want no Google Group! I want, you know, a distribution list!
<p>Clicking on "Access Settings" of the group properties in G Suite leads to a particular settings page on... groups.google.com. Just one of 21(!) other setting pages! My day schedule didn't include a ramp up on Google Groups, so I hooked into support chat straight away. The Google guy explained to me that with G Suite Business the only option is to configure a particular Google Group to behave as a distribution list.
<p>First we worked on enabling people outside of the organization to post (send emails) to the group. Once we configured the setting I was about to jump away to test it, but the guy told me, citing:
<pre>
Zaar Hai: OK. Saving and testing. Can you please hold on?
G Suite Support, Jay: Wait.
Zaar Hai: OK :)
G Suite Support, Jay: That is not going to work right away.
G Suite Support, Jay: We have what we called propagation.
G Suite Support, Jay: You need to wait for 24 hours propagation for the changes to take effect.
G Suite Support, Jay: Most of the time it works in less than 24 hours.
Zaar Hai: Seriously??
Zaar Hai: I thought I'm dealing with Google...
G Suite Support, Jay: Yes, we are Google.
</pre>
<p>24 hours! After soothing myself, I decided to give it a shot - it worked! The configuration <em>propagated</em> quite fast it seems.
<p>It was still not a classic distribution list though, since all of the correspondence was archived and ready to be seen on groups.google.com.
I didn't want this behaviour, so we kept digging. Eventually the guy asked me to reset the group to the <em>Email list</em> type:
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZ82RQPlUGV1bh1AWRQ35i9xELLD6quVYHeYnuF_CtmeI3OslnqIUuoDBI-plXlNxM-VVdgiVvAozltHdnzYSSWjZHn1afuVxjd1o4kBWb0gTNX_M4VMD-72nvHs-2y2HdPDkCWMZoqSuq/s1600/Selection_999%2528031%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZ82RQPlUGV1bh1AWRQ35i9xELLD6quVYHeYnuF_CtmeI3OslnqIUuoDBI-plXlNxM-VVdgiVvAozltHdnzYSSWjZHn1afuVxjd1o4kBWb0gTNX_M4VMD-72nvHs-2y2HdPDkCWMZoqSuq/s400/Selection_999%2528031%2529.png" width="400" height="223" data-original-width="926" data-original-height="517" /></a></div>
<p>The messages still got archived though, so we blamed it on <em>the propagation</em> and the guy advised me to come back the next day if it still didn't work.
<p>Well, after taking a 24-hour break, it still didn't. I did a bit of settings exploration myself and found that there is a dedicated toggle responsible for message archiving. Turns out the reset does not clear it. Once disabled, it <em>propagated</em> within a minute.
<p><em>That was the frustration part. Now the light - a guide on how to set up a distribution list with G Suite.</em>
<h2>How to configure a G Suite group to behave like a distribution list</h2>
<h3>Step 1: Create a group</h3>
<p>Create a group in the G Suite admin console. If you need just an internal mailing list, that is, one for members only, and are fine with the message archiving, then you are done. If you need outside users to be able to send emails to it (like you probably do with, e.g. sales@example.com), then read on.
<h3>Step 2: Enabling external access</h3>
<ul>
<li>Go to groups.google.com.
<li>Click on "My groups" and then on <em>manage</em> link under the name of the group in question
<li>On the settings page, navigate to <em>Permissions -> Basic permis...</em> in the menu on the left
<li>In the Post row drop-down select "Anyone on the web". Click Save and you should be done
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpE8xlyMHjCnkbkBQ9oNqIXqSoMlmHtJK4kgGFDmbWCCeDoqC2oo2dUgcrYxKXv6GlzClySJ-8qDAD_4arPiyKuouSMKHfB9vinynoPKGokBJ59uKhNLsi564cQNw3bqIwEd38W8PFbyt_/s1600/Selection_999%2528032%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpE8xlyMHjCnkbkBQ9oNqIXqSoMlmHtJK4kgGFDmbWCCeDoqC2oo2dUgcrYxKXv6GlzClySJ-8qDAD_4arPiyKuouSMKHfB9vinynoPKGokBJ59uKhNLsi564cQNw3bqIwEd38W8PFbyt_/s400/Selection_999%2528032%2529.png" width="400" height="289" data-original-width="976" data-original-height="706" /></a></div>
</ul>
<p>This is almost a classic distribution list - we only need to disable archiving.
<h3>Step 3: Disable archiving</h3>
Eventually I discovered that archiving is controlled by the toggle located under <em>Information -> Content control</em> in the settings menu:
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3-GbM0bGnYmoUPVQxytTmcUtbRsJMtv_3iBoxHKpX60cU0mgYu1Z0L5FChN8cpQdeHrFKK1SBcBoEdP4mwO25dYS1WSxj79DahZf6G_gFpRVRvhGQZtvb42pC5MnCWJPaP4p5cDLZVcPq/s1600/Selection_999%2528033%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3-GbM0bGnYmoUPVQxytTmcUtbRsJMtv_3iBoxHKpX60cU0mgYu1Z0L5FChN8cpQdeHrFKK1SBcBoEdP4mwO25dYS1WSxj79DahZf6G_gFpRVRvhGQZtvb42pC5MnCWJPaP4p5cDLZVcPq/s400/Selection_999%2528033%2529.png" width="400" height="313" data-original-width="1251" data-original-height="978" /></a></div>
<p>In my case, the change went into effect immediately.
<h2>Afterthoughts</h2>
<ul>
<li>Doing all of the above steps may be quite daunting for a system administrator who needs to manage many groups. Why not have a shortcut right in the G Suite admin console to make it easier? (A scripted alternative is sketched right after this list.)
<li>The 24h propagation period sounds like a blast from the past. The Google guy told me that any G Suite setting change can take up to 24 hours to take effect. Back to the <strike>future</strike> present: Google <a href="https://en.wikipedia.org/wiki/Spanner_(database)">offers</a> a distributed database with cross-continental ACID transactions, which makes me wonder about the reasons behind the 24h propagation period.
</ul>
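<p>For admins who do need to manage many groups, the same two settings appear to be exposed through the Groups Settings API, so the whole flow can be scripted. A rough sketch - the endpoint, field names and values reflect my reading of the API docs and should be verified, and you need an OAuth token carrying the apps.groups.settings scope:
<pre><code class="lang-bash">
GROUP="support@example.com"
TOKEN="$(gcloud auth print-access-token)"   # or any other way to obtain a suitable OAuth token

# Allow anyone on the web to post to the group and disable archiving
curl -X PATCH \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"whoCanPostMessage": "ANYONE_CAN_POST", "isArchived": "false"}' \
     "https://www.googleapis.com/groups/v1/groups/$GROUP?alt=json"
</code></pre>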
Zaar Haihttp://www.blogger.com/profile/03019277728569320787noreply@blogger.com0tag:blogger.com,1999:blog-528513339167803855.post-14602845312173499892018-09-26T22:56:00.001+10:002018-09-26T22:56:45.216+10:00Setting GCE snapshot labels en masseI'm working on our cloud costs analysis and one of the things to do here is to assign labels, e.g. <code>service=ci</code>, to our resources. We make heavy use of GCE PD snapshots for database backups and I want to label them as well, per service.
<p>You can do it through:
<li>Cloud console, max 200 snapshots at a time
<li><code>gcloud</code>, one at a time
<p>The problem is... I have thousands of snapshots to label. Running <code>gcloud</code> in a simple <code>for</code> loop takes up to several seconds per iteration, so the whole process would take a day. Therefore I crafted a script to do it in parallel, which, thanks to the lesser-known features of <code>xargs</code>, turned out to be really simple:
<pre><code class="lang-bash">
#!/bin/bash
set -e
NAME=$1
LABELS=$2
JOBS=${JOBS:-10}
if ! ([[ "$NAME" ]] && [[ "$LABELS" ]]); then
echo "Usage: $0 <name substring> <labels>"
echo "Label format: k1=v1,k2=v2"
exit 1
fi
gcloud compute snapshots list \
--filter "name~$NAME" \
--format="table[no-heading](name)" | \
xargs -I @ -P $JOBS -t gcloud compute snapshots update --update-labels=$LABELS @
</code></pre>
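<p>Assuming the script is saved as, say, <code>label-snapshots.sh</code> (the name is arbitrary), a typical run looks like this:
<pre><code class="lang-bash">
# Label every snapshot whose name contains "ci-db" with service=ci,
# running 20 gcloud processes in parallel instead of the default 10
JOBS=20 ./label-snapshots.sh ci-db service=ci
</code></pre>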
Note that each <code>gcloud</code> instance is in the air for several seconds and occupies ~80MB of RAM, i.e. running on 10 jobs can easily consume about 1GB of RAM. Obviously doing it through the GCP APIs by writing dedicated, say Python, code would not have that RAM issue, but it is not worth the effort in this case.Zaar Haihttp://www.blogger.com/profile/03019277728569320787noreply@blogger.com1tag:blogger.com,1999:blog-528513339167803855.post-86693722357123485222018-09-21T15:48:00.003+10:002021-02-22T14:09:45.827+11:00Docker multi-stage builds for Python appsPreviously I <a href="https://tech.zarmory.com/2018/09/reducing-docker-image-sizes.html#multi-stage">spoke highly of</a> the multi-stage Docker image build approach, though it was not immediately clear how to apply it to Python applications.
<p>In Python you install application dependencies and (preferably) the application itself using the <em>pip</em> tool. When we run it during an image build, pip just installs everything under <code>/usr</code>, so there is no immediate way to copy the artifacts (that is, the app and its dependencies installed by pip) into the next build stage.
<p>The solution that I came up with is to coerce pip to install everything into a dedicated directory. There are <a href="https://stackoverflow.com/questions/2915471/install-a-python-package-into-a-different-directory-using-pip">many ways</a> of doing so, but from my experiments I found installing with the <code>--user</code> flag and properly setting <code>PYTHONUSERBASE</code> to be the most convenient way to install both Python libraries and app binaries (e.g. entrypoint scripts).
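<p>Outside of Docker the idea looks roughly like this (a sketch; the target directory and package name are just placeholders):
<pre><code class="lang-bash">
export PYTHONUSERBASE=/pyroot            # where "user" installs should land
pip install --user --ignore-installed some-package
ls /pyroot/lib/python*/site-packages     # libraries end up here
ls /pyroot/bin                           # console_scripts end up here
</code></pre>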
<p>In the end it's quite straightforward, and I wonder why I didn't find any formal guides on this.
<p>One caveat I came across later: if you have packages already installed system-wide as part of pip/pipenv/setuptools dependencies, pip will not reinstall them under <code>/pyroot</code>, hence dependencies will be missing in the production image - this is the reason for using the <code>--ignore-installed</code> flag.
<p>Without further ado, let's see how it can be done.
<h2>Setup</h2>
Let's use a sample Python <em>Hello World</em> project that contains a proper <code>setup.py</code> to install both the app's libs and the entrypoint script.
<p><strong>Note:</strong> I urge you to use <code>setup.py</code> even if you don't plan to distribute your app. Simply copying your Python sources into a Docker image will eventually break - you may end up copying <code>__pycache__</code> directories, tests, test fixtures, etc. Having a working <code>setup.py</code> makes it easy to use your app as an installable component in other apps/images.
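<p>For reference, a minimal <code>setup.py</code> for such a project might look like the sketch below (the package name and module path are assumptions - the linked sample repo has its own layout):
<pre><code class="lang-bash">
$ cat setup.py
from setuptools import setup, find_packages

setup(
    name="helloworld",
    version="0.1.0",
    packages=find_packages(exclude=["tests"]),
    entry_points={
        "console_scripts": [
            # Installs the helloworld_in_python script referenced by the Dockerfile below
            "helloworld_in_python = helloworld.main:main",
        ],
    },
)
</code></pre>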
<p>Let's set up our test environment:
<pre><code class="lang-bash">
git clone git@github.com:haizaar/python-helloworld.git
cd python-helloworld/
# Add some artificial requirements to make the example more real
echo pycrypto==2.6.1 > requirements.txt
</code></pre>
<h2>The Dockerfile</h2>
All the "magic" is happening below. I've added inline comments to ease on reading.
<pre><code class="lang-docker">
FROM alpine:3.8 AS builder
ENV LANG C.UTF-8
# This is our runtime
RUN apk add --no-cache python3
RUN ln -sf /usr/bin/pip3 /usr/bin/pip
RUN ln -sf /usr/bin/python3 /usr/bin/python
# This is dev runtime
RUN apk add --no-cache --virtual .build-deps build-base python3-dev
# Using latest versions, but pinning them
RUN pip install --upgrade pip==19.0.1
RUN pip install --upgrade setuptools==40.4.1
# This is where pip will install to
ENV PYROOT /pyroot
# A convenience to have console_scripts in PATH
ENV PATH $PYROOT/bin:$PATH
ENV PYTHONUSERBASE $PYROOT
# THE MAIN COURSE #
WORKDIR /build
# Install dependencies
COPY requirements.txt ./
RUN pip install --user --ignore-installed -r requirements.txt
# Install our application
COPY . ./
RUN pip install --user .
####################
# Production image #
####################
FROM alpine:3.8 AS prod
# This is our runtime, again
# It would be better to refactor this into a separate base image to avoid instruction duplication
RUN apk add --no-cache python3
RUN ln -sf /usr/bin/pip3 /usr/bin/pip
RUN ln -sf /usr/bin/python3 /usr/bin/python
ENV PYROOT /pyroot
ENV PATH $PYROOT/bin:$PATH
ENV PYTHONPATH $PYROOT/lib/python
# This is crucial for pkg_resources to work
ENV PYTHONUSERBASE $PYROOT
# Finally, copy artifacts
COPY --from=builder $PYROOT/lib/ $PYROOT/lib/
# In most cases we don't need entry points provided by other libraries
COPY --from=builder $PYROOT/bin/helloworld_in_python $PYROOT/bin/
CMD ["helloworld_in_python"]
</code></pre>
<p>Let's see that it works:
<pre><code class="lang-bash">
$ docker build -t pyhello .
$ docker run --rm -ti pyhello
Hello, world
</code></pre>
<p>As I mentioned before - it's really straightforward. I've since managed to pack one of our real apps this way, and it works well so far.
<h2 id="pipenv">Using pipenv?</h2>
If you use <a href="https://pipenv.readthedocs.io/en/latest/">pipenv</a>, which I like a lot, you can happily apply the same approach.
It's a bit <a href="https://github.com/pypa/pipenv/issues/3160">tricky</a> to coerce pipenv to install into a separate dir, but this command does the trick:
<pre><code class="lang-docker">
# THE MAIN COURSE #
WORKDIR /build
# Install dependencies
COPY Pipfile Pipfile.lock ./
# --ignore-installed is vital to re-install packages that are already present
# (e.g. brought by pipenv dependencies) into $PYROOT
# Need to use pip eventually because of https://github.com/pypa/pipenv/issues/4453
RUN set -ex && \
export HOME=/tmp && \
pipenv lock -r | pip install --user --ignore-installed -r /dev/stdin
# Install our application
COPY . ./
RUN pip install --user .
</code></pre>
Zaar Haihttp://www.blogger.com/profile/03019277728569320787noreply@blogger.com7tag:blogger.com,1999:blog-528513339167803855.post-69378946867701947532018-09-19T00:14:00.000+10:002018-09-21T13:08:40.833+10:00Reducing docker image sizes<p>Several years ago I started a new greenfield project. Based on the state of technology affairs back then, I decided to go all in on container technology - Docker and Kubernetes. We dove head first into all of the new technologies and had our application up and running pretty fast. Back then the majority of the Docker Library was based on Debian, which resulted in quite large images - our average Python app container image weighed about 700-1000MB. Finally the time has come to rectify that.
<h2>Why do you care</h2>
Docker images are not pulled too often and 1GB is not too big of a number in the age of clouds, so why do you care? Your mileage may vary, but these are our reasons:
<ul>
<li> Image pull speed - on GCE it takes about 30 seconds to pull a 1GB image. While downloads when pulling from GCR are almost instant, extraction takes a notable amount of time. When a GKE node crashes and pods migrate to other nodes, the image pull time <em>adds to your application downtime</em>. To compare - pulling the 40MB coredns image from GCR takes only 1.3 seconds.
<li> Disk space on GKE nodes - when you have lots of containers and update them often, you may end up with disk space pressure. Same goes for developers' laptops.
<li> Deploying off-cloud - pulling gigabytes of data is no fun when you try that over a saturated 4G network during a conference demo.
</ul>
<p>Here are the strategies currently available on the market.
<h2>Use alpine based images</h2>
Sounds trivial, right? Alpine images have been around for quite some time already and the majority of the Docker Library has an <code>-alpine</code> variant. But not all alpine images are born the same:
<p>Docker Library alpine variant of Python:
<pre><code class="lang-bash">
$ docker pull python:3.6-alpine
$ docker images python:3.6-alpine --format '{{.Size}}'
74.2MB
</code></pre>
<p>DIY alpine Python:
<pre><code class="lang-bash">
$ cat Dockerfile
FROM alpine:3.8
RUN apk add --no-cache python3
$ docker build -t alpython .
$ docker images alpython --format '{{.Size}}'
56.2MB
</code></pre>
<p>This is a 25% size reduction compared to the Docker Library Python!
<p><strong>Note:</strong> There is another "space-saving" project that takes a slightly different approach - instead of providing a complete Linux distro, albeit a smaller one, they provide a minimal <em>runtime</em> base image for each language. Have a look at <a href="https://github.com/GoogleContainerTools/distroless">Distroless</a>.
<h2>Avoid unnecessary layers</h2>
It's quite natural to write your Dockerfile as follows:
<pre><code class="lang-docker">
FROM alpine:3.8
RUN apk add --no-cache build-base
COPY hello.c /
RUN gcc -Wall -o hello hello.c
RUN apk del build-base
RUN rm -rf hello.c
</code></pre>
<p>It provides nice reuse of the layer cache, e.g. if <code>hello.c</code> changes, then we can still reuse the installation of the <code>build-base</code> package from the cache.
There is one problem though - in the above example, the resulting image weighs <strong>157MB</strong>(!) although the actual <code>hello</code> binary is just 10KB:
<pre><code class="lang-bash">
$ cat hello.c
#include <stdio.h>
#include <stdlib.h>
int main() {
printf("Hello world!\n");
return EXIT_SUCCESS;
}
$ docker build -t layers .
$ docker images layers --format '{{.Size}}'
157MB
$ docker run --rm -ti layers ls -lah /hello
-rwxr-xr-x 1 root root 10.4K Sep 18 10:45 /hello
</code></pre>
<p>The reason is that each line in a Dockerfile produces a new layer that constitutes a part of the image, even though the final FS layout may not contain all of the files.
You can see the hidden culprits using <code>docker history</code>:
<pre><code class="lang-bash">
$ docker history layers
IMAGE CREATED CREATED BY SIZE COMMENT
8c85cd4cd954 16 minutes ago /bin/sh -c rm -rf hello.c 0B
b0f981eae17a 17 minutes ago /bin/sh -c apk del build-base 20.6kB
5f5c41aaddac 17 minutes ago /bin/sh -c gcc -Wall -o hello hello.c 10.6kB
e820eacd8a70 18 minutes ago /bin/sh -c #(nop) COPY file:380754830509a9a2… 104B
0617b2ee0c0b 18 minutes ago /bin/sh -c apk add --no-cache build-base 153MB
196d12cf6ab1 6 days ago /bin/sh -c #(nop) CMD ["/bin/sh"] 0B
<missing> 6 days ago /bin/sh -c #(nop) ADD file:25c10b1d1b41d46a1… 4.41MB
</code></pre>
<p>The last layer - the missing one - is our base image, the third line from the top is our binary, and the rest is just junk.
<h3>Squash those squishy bugs!</h3>
You can build docker images with the <code>--squash</code> flag. What it does is essentially leave your image with just two layers - the one you started FROM, and another one that contains only the files that are visible in the resulting FS (minus the FROM image).
<p>It plays nice with the layer cache - all intermediate images are still cached, so building similar docker images will yield a high cache hit rate. A small catch - it's still considered experimental, though the feature has been available since Docker 1.13 (Jan 2017). To enable it, run your dockerd with <code>--experimental</code> or add <code>"experimental": true</code> to your <code>/etc/docker/daemon.json</code>. I'm also not sure about its support among SaaS container builders, but you can always <a href="https://tech.zarmory.com/2018/08/running-docker-multi-stage-builds-on-gke.html">spin</a> up your own docker daemon.
<p>Let's see it in action:
<pre><code class="lang-bash">
# Same Dockerfile as above
$ docker build --squash -t layers:squashed .
$ docker images layers:squashed --format '{{.Size}}'
4.44MB
</code></pre>
<p>This is exactly our alpine image with 10KB of <code>hello</code> binary:
<pre><code class="lang-bash">
$ docker inspect layers:squashed | jq '.[].RootFS.Layers' # Just two layers as promised
[
"sha256:df64d3292fd6194b7865d7326af5255db6d81e9df29f48adde61a918fbd8c332",
"sha256:5b55011753b4704fdd9efef0ac8a56e51a552b237238af1ba5938e20e019f440"
]
$ mkdir /tmp/img && docker save layers:squashed | tar -xC /tmp/img; du -hsc /tmp/img/*
52K /tmp/img/118227640c4bf55636e129d8a2e1eaac3e70ca867db512901b35f6247b978cdd
4.5M /tmp/img/1341a124286c4b916d8732b6ae68bf3d9753cbb0a36c4c569cb517456a66af50
4.0K /tmp/img/712000f83bae1ca16c4f18e025c0141995006f01f83ea6d9d47831649a7c71f9.json
4.0K /tmp/img/manifest.json
4.0K /tmp/img/repositories
4.6M total
</code></pre>
<p>Neat!
<p>Nothing is perfect though. Squashing your layers reduces potential for reusing them when <em>pulling images</em>. Consider the following:
<pre><code class="lang-bash">
$ cat Dockerfile
FROM alpine:3.8
RUN apk add --no-cache python3
RUN apk add --no-cache --virtual .build-deps build-base openssl-dev python3-dev
RUN pip3 install pycrypto==2.6.1
RUN apk del .build-deps
# my.py contains just one "import Crypto" line
COPY my.py /
$ docker build -t mycrypto .
$ docker build --squash -t mycrypto:squashed .
$ docker images mycrypto
REPOSITORY TAG IMAGE ID CREATED SIZE
mycrypto squashed 9a1e85fa63f0 11 seconds ago 58.6MB
mycrypto latest 53b3803aa92f About a minute ago 246MB
</code></pre>
<p>The difference is very positive - compared to the basic Python Alpine image I built earlier, the squashed one here is just 2 megabytes larger. The squashed image has, again, just two layers: the alpine base, and the rest - our Python, pycrypto, and our code - squashed into one.
<p><strong>And here is the downside:</strong> If you have 10 such Python apps on your Docker/Kubernetes host, you are going to download and store Python 10 times, and instead of having 1 alpine layer (2MB), one Python layer (~50MB) and 10 app layers (10x2MB) which is ~75MB, we end up with ~600MB.
<p>One way to avoid this is to use proper base images, e.g. instead of basing on alpine, we can build our own Python base image and work FROM it.
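<p>For example, something along these lines (a sketch - the base image name and tag are made up; the app Dockerfile reuses the pycrypto example from above):
<pre><code class="lang-bash">
$ cat Dockerfile.pybase    # built once and pushed, e.g. as myregistry/alpython:3.8
FROM alpine:3.8
RUN apk add --no-cache python3

$ cat Dockerfile           # every Python app then starts FROM the shared base
FROM myregistry/alpython:3.8
RUN apk add --no-cache --virtual .build-deps build-base openssl-dev python3-dev
RUN pip3 install pycrypto==2.6.1
RUN apk del .build-deps
COPY my.py /
</code></pre>
<p>This way the ~50MB Python layer is pulled and stored once per node, and each application image only adds its own few megabytes on top.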
<h3>Let's combine</h3>
Another technique which is widely employed is combining RUN instructions to avoid "spilling over" unnecessary layers. I.e. the above Dockerfile can be rewritten as follows:
<pre><code class="lang-bash">
$ cat Dockerfile-comb
FROM alpine:3.8
RUN apk add --no-cache python3 # Other Python apps will reuse it
RUN set -ex && \
apk add --no-cache --virtual .build-deps build-base openssl-dev python3-dev && \
pip3 install pycrypto==2.6.1 && \
apk del .build-deps
COPY my.py /
$ docker build -f Dockerfile-comb -t mycrypto:comb .
$ docker images mycrypto
REPOSITORY TAG IMAGE ID CREATED SIZE
mycrypto comb 4b89e6ea6f72 7 seconds ago 59MB
mycrypto squashed 9a1e85fa63f0 38 minutes ago 58.6MB
mycrypto latest 53b3803aa92f 39 minutes ago 246MB
$ docker inspect mycrypto:comb | jq '.[].RootFS.Layers'
[
"sha256:df64d3292fd6194b7865d7326af5255db6d81e9df29f48adde61a918fbd8c332",
"sha256:f9ac7d1d908f7d2afb3c724bbd5845f034aa41048afcf953672dfefdb43582d0",
"sha256:10c59ffc3c3cb7aefbeed9db7e2dc94a39e4896941e55e26c6715649bf6c1813",
"sha256:f0ac8bc96a6b044fe0e9b7d9452ecb6a01c1112178abad7aa80236d18be0a1f9"
]
</code></pre>
<p>The end result is similar to a squashed one and now we can control the layers.
<p>Downsides? There are some.
<p>One is cache reuse, or the lack thereof. Every single image will have to install <code>build-base</code> over and over. Consider a <a href="https://github.com/docker-library/python/blob/0c0365d804c2ef4ee8edef652e6a39cdf461e3b2/3.6/alpine3.8/Dockerfile">real</a> example with a 70-line-long RUN instruction. Your image may take 10 minutes to build, and changing a single line in that huge instruction will start it all over.
<p>Second is that the development experience is somewhat hackish - you descend from Dockerfile mastery into shell witchery. E.g. you can easily overlook a space character that crept in after a trailing backslash. This increases development time and ups our frustration - we are all human.
<h2 id="multi-stage">Multi-stage builds</h2>
This <a href="https://docs.docker.com/develop/develop-images/multistage-build/">feature</a> is so amazing that I wonder why it is not very famous. It seems like only hard-core docker builders are aware of it.
<p>The idea is to allow one image to borrow artifacts from another image. Let's apply it to the example that compiles C code:
<pre><code class="lang-bash">
$ cat Dockerfile-multi
FROM alpine:3.8 AS builder
RUN apk add --no-cache build-base
COPY hello.c /
RUN gcc -Wall -o hello hello.c
RUN apk del build-base
FROM alpine:3.8
COPY --from=builder /hello /
$ docker build -f Dockerfile-multi -t layers:multi .
$ docker images layers
REPOSITORY TAG IMAGE ID CREATED SIZE
layers multi 98329d4147f0 About a minute ago 4.42MB
layers squashed 712000f83bae 2 hours ago 4.44MB
layers latest a756fa351578 2 hours ago 157MB
</code></pre>
<p>That is, the size is as good as it gets (even a bit better, since our squashed variant still has a couple of apk metadata files left behind). It works just great for toolchains
that produce clearly distinguishable artifacts. Here is another (simplified) example for nodejs:
<pre><code class="lang-docker">
FROM alpine:3.8 AS builder
RUN apk add --no-cache nodejs
COPY src /src
WORKDIR /src
RUN npm install
RUN ./node_modules/.bin/jspm install
RUN ./node_modules/.bin/gulp export # Outputs to ./build
FROM alpine:3.8
RUN apk add --no-cache nginx
COPY --from=builder /src/build /srv/www
</code></pre>
<p>It's more tricky for other toolchains like Python, where it's not immediately clear how to copy the artifacts after pip-install'ing your app. The proper way to do it for Python is yet to be discovered (by me).
<p>I will not describe other perks of this feature since Docker's documentation on the subject is quite verbose.
<h2>Conclusion</h2>
As you can probably tell, there is no one ultimate method to rule them all. Alpine images are a no-brainer; multi-stage provides nice & clean separation, but I miss <code>RUN --from=...</code>; squashing has its trade-offs; and humongous RUN instructions are still a necessary evil.
<p>We use the multi-stage approach for our nodejs images and mega-RUNs for the Python ones. When I find a clean way to extract pip's artifacts, I will definitely move to multi-stage builds there as well.
Zaar Haihttp://www.blogger.com/profile/03019277728569320787noreply@blogger.com0tag:blogger.com,1999:blog-528513339167803855.post-4500641219401791542018-08-06T23:54:00.000+10:002018-09-19T22:45:21.577+10:00Running docker multi-stage builds on GKE<p>I recently worked on reducing docker image sizes for our applications and one of the approaches is to use docker <a href="https://docs.docker.com/develop/develop-images/multistage-build/">multi-stage builds</a>. It all worked well on my dev machine, but then I pushed the new Dockerfiles to CI and it all fell apart, complaining that our docker server is way too old.
<p>The thing is that GKE K8s nodes still use docker server v17.03, even on the latest K8s 1.10 they have available. If you, like us, run your Jenkins on GKE as well and use the K8s node's docker server for image builds, then this GKE lag will bite you one day.
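<p>A quick way to see which server version your builds actually hit is to ask the docker CLI from within the Jenkins pod (a one-liner sketch):
<pre><code class="lang-bash">
# The server version is the one that matters for multi-stage build support
docker version --format 'client={{.Client.Version}} server={{.Server.Version}}'
</code></pre>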
<p>There is a solution though - run your own docker server and make Jenkins use it. Fortunately the community has thought about this before, and the official docker images for docker itself include a <code>-dind</code> flavour, which stands for Docker-In-Docker.
<p>Our Jenkins used to talk to the host's docker server through <code>/var/run/docker.sock</code> mounted from the host. Now instead we run DinD as a deployment and talk to it over TCP:
<pre><code class="lang-yaml">
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: dind
spec:
replicas: 1
strategy:
type: Recreate
template:
metadata:
labels:
component: dind
spec:
containers:
- name: dind
image: docker:18.06.0-ce-dind
env:
- name: DOCKER_HOST
value: tcp://0.0.0.0:2375
args:
- dockerd
- --storage-driver=overlay2
- -H tcp://0.0.0.0:2375
ports:
- name: http
containerPort: 2375
securityContext:
privileged: true
volumeMounts:
- name: varlibdocker
mountPath: /var/lib/docker
livenessProbe:
httpGet:
path: /v1.38/info
port: http
readinessProbe:
httpGet:
path: /v1.38/info
port: http
volumes:
- name: varlibdocker
emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
name: dind
labels:
component: dind
spec:
selector:
component: dind
ports:
- name: http
targetPort: http
port: 2375
</code></pre>
<p>After loading it into your cluster you can add the following environment variable to your Jenkins containers: <code>DOCKER_HOST=tcp://dind:2375</code> and verify that you are now talking to your new & shiny docker server 18.06:
<pre><code class="lang-bash">
root@jenkins-...-96d867487-rb5r8:/# docker version
Client:
Version: 17.12.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:05:38 2017
OS/Arch: linux/amd64
Server:
Engine:
Version: 18.06.0-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.3
Git commit: 0ffa825
Built: Wed Jul 18 19:13:39 2018
OS/Arch: linux/amd64
Experimental: false
</code></pre>
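<p>If Jenkins itself runs as a Kubernetes Deployment, one quick way to inject that variable (a sketch - substitute your actual deployment name) is:
<pre><code class="lang-bash">
kubectl set env deployment/jenkins DOCKER_HOST=tcp://dind:2375
</code></pre>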
<p><strong>Caveat:</strong> the setup I'm describing uses <code>emptyDir</code> to store built docker images and cache, i.e. restarting the pod will empty the cache. It's good enough for my needs, but you may consider using a PV/PVC for persistence, which on GKE is trivial to set up. Using <code>emptyDir</code> will also consume disk space from your K8s node - something to watch for if you don't have an automatic job that purges older images.
<p>Another small bonus of this solution is that running <code>docker images</code> on your Jenkins pod will now only return images you have built. Previously this list would also include the images of containers currently running on the node.
<!-- This loads YAML highlighter -->
<script src=https://cdnjs.cloudflare.com/ajax/libs/prettify/r298/lang-yaml.min.js></script>
Zaar Haihttp://www.blogger.com/profile/03019277728569320787noreply@blogger.com0tag:blogger.com,1999:blog-528513339167803855.post-10428571616908800022017-12-07T22:38:00.001+11:002017-12-12T14:43:56.959+11:00Quick test for GCP inter-zone networking<p><em>Prologue: It took a year to move to Down Under and another 6 months to settle here, or at least to start feeling settled, but it looks like I'm back to writing, at least.</em>
<p>I'm in the process of designing how to move our systems to multi-zone deployment in GCP and wanted to have a brief understanding of the network latency and speed impacts. My Google-fu didn't yield any recent benchmarks on the subject, so I decided to run a couple of quick checks myself and share the results.
<h2>Setup</h2>
<p>We are running in the <code>us-central1</code> region and using <code>n1-highmem-8</code> (8 CPUs / 52GB RAM) instances as our main workhorse. I've set up one instance in each of the zones - <code>a</code>, <code>b</code>, and <code>c</code>, with an additional instance in zone <code>a</code> to measure intra-zone latency.
<pre class="code prettyprint">
# No trailing backslashes inside the quotes - the unquoted $VMCREATOR expansions below
# rely on plain word splitting
VMCREATOR='gcloud compute instances create
    --machine-type=n1-highmem-8
    --image-project=ubuntu-os-cloud
    --image=ubuntu-1604-xenial-v20171121a'
$VMCREATOR --zone=us-central1-a us-central1-a-1 us-central1-a-2
$VMCREATOR --zone=us-central1-b us-central1-b
$VMCREATOR --zone=us-central1-c us-central1-c
</pre>
<h2>Latency</h2>
<p>I used ping to measure latency, the flooding version of it:
<pre class="code prettyprint">
root@us-central1-a-1 $ ping -f -c 100000 us-central1-b
</pre>
Here are the results:
<dl>
<dt>A <i class="fa fa-arrows-h fa-lg" aria-hidden="true"></i> A
<dd>rtt min/<strong>avg</strong>/max/mdev = 0.041/<strong>0.072</strong>/2.882/0.036 ms, ipg/ewma 0.094/0.066 ms
<dt>A <i class="fa fa-arrows-h fa-lg" aria-hidden="true"></i> B
<dd>rtt min/<strong>avg</strong>/max/mdev = 0.132/<strong>0.193</strong>/7.032/0.073 ms, ipg/ewma 0.209/0.213 ms
<dt>A <i class="fa fa-arrows-h fa-lg" aria-hidden="true"></i> C
<dd>rtt min/<strong>avg</strong>/max/mdev = 0.123/<strong>0.189</strong>/4.110/0.060 ms, ipg/ewma 0.205/0.190 ms
<dt>B <i class="fa fa-arrows-h fa-lg" aria-hidden="true"></i> C
<dd>rtt min/<strong>avg</strong>/max/mdev = 0.123/<strong>0.176</strong>/4.399/0.047 ms, ipg/ewma 0.189/0.161 ms
</dl>
<p>While inter-zone latency is twice as high as intra-zone latency, it's still within typical LAN figures. The mean deviation is quite low as well. Too bad that ping can't count percentiles.
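<p>If you do want rough percentiles, a plain (non-flooding) ping plus a bit of awk gets you an estimate - a sketch, with an arbitrary count and interval:
<pre class="code prettyprint">
ping -c 1000 -i 0.01 us-central1-b \
  | sed -n 's/.*time=\([0-9.]*\) ms/\1/p' \
  | sort -n \
  | awk '{a[NR]=$1} END {printf "p50=%s p95=%s p99=%s\n", a[int(NR*0.50)], a[int(NR*0.95)], a[int(NR*0.99)]}'
</pre>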
<h2>Throughput</h2>
I used the iperf tool to measure throughput. Both unidirectional (each way) and bidirectional throughput were measured.
<ul>
<li> Server side: <code>iperf -s</code>
<li> Client side: <code>iperf -c <host> -t 60 -r</code> and <code>iperf -c <host> -t 60 -d</code>
</ul>
<p><strong>Note:</strong> iperf has a bug where, in client mode, it ignores any parameters specified before the client host, therefore it's crucial to specify the host as the first parameter.
<p>Here are the results. All throughput numbers are in gigabits per second.
<table>
<tr><th>Zones</th><th>Send</th><th>Receive</th><th>Send + Receive</th></tr>
<tr><td>A & A</td><td>12.0</td><td>13.9</td><td>8.12 + 10.1</td>
<tr><td>A & B</td><td>7.96</td><td>8.22</td><td>4.57 + 6.30</td>
<tr><td>A & C</td><td>6.87</td><td>8.51</td><td>3.97 + 5.98</td>
<tr><td>B & C</td><td>5.75</td><td>7.51</td><td>3.05 + 3.96</td>
</table>
<style>
td, th {
padding-right: 30px;
}
</style>
<p>
<h2>Conclusion</h2>
<p>I remember reading in the GCP docs that their zones are kilometers away from each other; yet, according to the above quick tests, they can still be treated as one huge 10Gbit LAN - that's pretty impressive. I know such technology has been available for quite some time already, but it's still impressive to have it readily available to anyone, anytime.
<script>prettyPrint()</script>
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css">
Zaar Haihttp://www.blogger.com/profile/03019277728569320787noreply@blogger.com0tag:blogger.com,1999:blog-528513339167803855.post-4725310684319918222017-04-15T23:34:00.000+10:002017-04-15T23:34:57.307+10:00My sugar findingsThe posts in this blog are usually about technology subjects. However, I've been on vacation for the last week and
have spent several days reading about sugar and products containing it, mostly from Wikipedia. Below is a summary of my findings. Please note that I have not studied either chemistry or biology since 9th grade, so please bear with me on possible inaccuracies.
<p><strong>Appetizer:</strong> In 2015, the world <a href="https://www.statista.com/statistics/273230/world-production-of-sugar/">produced</a> 177 million tons of sugar (all types combined). This is 24 kilograms per person per year, or roughly 70 grams per day, and surely much higher in industrialized countries.
<h2>Monosaccharides</h2>
AKA “Simple sugars”. These are the most basic types of sugar - they cannot be further hydrolyzed into simpler compounds. Those relevant for humans are glucose, fructose and galactose - they are the only ones that the human body can directly absorb through the small intestine. Glucose can be used directly by body cells, while fructose and galactose are directed to the liver for further <a href="https://en.wikipedia.org/wiki/Fructose#Fructose_digestion_and_absorption_in_humans">pre-processing</a>.
<p>Glucose is not “bad” per se - it’s the fuel of most living organisms on Earth, including humans. However, high amounts of glucose, as well as of other monosaccharides, can lead to <a href="https://en.wikipedia.org/wiki/Insulin_resistance">insulin resistance</a> (diabetes) and obesity. Another problem related to the intake of simple sugars is that they fuel the acid-producing bacteria living in the mouth, which leads to <a href="https://en.wikipedia.org/wiki/Tooth_decay">dental caries</a>.
<h3>Sources</h3>
Primary sources of monosaccharides in the human diet are fruits (both fresh and dried), honey and, more recently, HFCS - High Fructose Corn Syrup. On top of that, <a href="https://en.wikipedia.org/wiki/Inverted_sugar_syrup">inverted sugar</a> is also in use, but I will cover it separately later on.
<p>
While fruits contain a high percentage of fructose, it comes together with a good amount of other beneficial nutrients, e.g. dietary fiber, vitamin C and potassium. For that reason, fruits should not be discarded because of their fructose content - overall they are healthy products and are commonly not a cause of overweight or obesity. For example, two thirds of Australians are <a href="http://www.huffingtonpost.com.au/2016/06/28/sugars-the-difference-between-fructose-glucose-and-sucrose/">overweight or obese</a>, while the average Australian eats only about one piece of fruit a day.
<p>
<p><blockquote>Note: It’s quite common in the food industry to treat dried fruits with sulfur dioxide, which is a toxic gas in its natural form. The health effects of this substance are still disputed, but since it’s done to increase shelf life and enhance the visual appeal of the product, i.e. to benefit the producer and not the end user, I do not see a reason to buy dried fruits treated with it. Moreover, I’ve seen products labeled as organic that still contained sulfur dioxide, i.e. the fruits themselves were of organic origin, but were treated with it anyway.</blockquote>
<p>
Honey, on the other hand, while generally perceived as a “healthy food”, is actually a bunch of <a href="https://en.wikipedia.org/wiki/Empty_calorie">empty calories</a>. Average honey consists of 80% sugars and 17% water - in particular, 38% fructose and 31% glucose. Since honey is a supersaturated liquid, containing more sugar than water, the glucose tends to crystallize into solid granules floating in the fructose syrup.
<p><blockquote>Note: one interesting source of honey is a <a href="https://en.wikipedia.org/wiki/Honeydew_(secretion)">honeydew</a> secretion.</blockquote>
<p>
Finally, <a href="https://en.wikipedia.org/wiki/High-fructose_corn_syrup">HFCS</a> is a sweetener produced from corn starch by breaking its carbohydrates into glucose and fructose. The resulting solution is roughly 50/50 glucose/fructose (in their free form), but may vary between manufacturers. This sweetener has been generally available since 1970, shortly after the discovery of the enzymes necessary for its manufacturing process. There were some health concerns about HFCS, however nowadays they are generally dismissed - i.e. HFCS is no better or worse than any other added sugar, which, again, in case of excess intake can lead to obesity and diabetes.
<h2>Disaccharides</h2>
A disaccharide is a sugar formed by two joined monosaccharides. The most common examples are:
<ul>
<li>Lactose: glucose + galactose
<li>Maltose: glucose + glucose
<li>Sucrose: glucose + fructose
</ul>
Disaccharides cannot be absorbed by the human body as they are, but need to be broken down, or hydrolyzed, into monosaccharides. To speed up the process and allow fast enough absorption, <a href="https://en.wikipedia.org/wiki/Enzyme">enzymes</a> are secreted by the small intestine, where disaccharides are hydrolyzed and absorbed. A dedicated enzyme is secreted for each disaccharide type, e.g. lactase, maltase and sucrase. Insufficient secretion, or the lack thereof, results in intolerance to certain types of disaccharides, i.e. the inability to absorb them in the small intestine. In such a case they are passed on into the large intestine, where various bacteria metabolize them, and the resulting fermentation process produces gases, leading to detrimental health effects.
<p>
Another issue with disaccharides is that they, together with monosaccharides, provide food to acid-producing bacteria, leading to dental caries. Sucrose particularly <a href="https://en.wikipedia.org/wiki/Sucrose#Tooth_decay">shines</a> here, allowing anaerobic environments that boost acid production by the bacteria.
<p>
Lactose is naturally found in dairy products, but some sources say that it’s often added to bread, snacks, cereals, etc. I don’t quite remember lactose being listed on products, at least in Israel, and though I did not research on the subject, my guess is this is because it will convert products to milk-kosher, and thus can limit their consumption by end user. I did not study lactose any further.
Maltose is a major component of <a href="https://en.wikipedia.org/wiki/Brown_rice_syrup">brown rice syrup</a> - this is how I’ve stumbled upon it initially.
<p>
Sucrose, or “table sugar”, or just “sugar”, is the king of disaccharides, and of all sweeteners for that matter. The rest of this post will be mainly dedicated to it, but let’s finish with maltose first.
<h3>Maltose</h3>
My discovery of maltose started with reading the nutrition facts of an organic, i.e. perceived “healthy”, candy that listed “rice syrup”. Reading further, I found out that it’s a sweetener produced by breaking down the starch of whole brown rice. The traditional way to produce the syrup is to cook the rice and then add a small amount of sprouted barley grains - something that I should definitely try at home some time. Most of the current production uses industrial methods, as one would expect.
<p>
The outcome is, again, sweet, empty calories, for better or worse. Traditionally prepared syrup can contain up to 10% protein, however it’s usually <a href="http://www.mitoku.com/products/brownricemalt/making_ricemalt.html">removed</a> in industrial products. Other than that, again - empty calories.
<h3>Sucrose</h3>
Without further ado, let’s get to sucrose, the most common of all sugars. Since Wikipedia has quite a good and succinct article on <a href="https://en.wikipedia.org/wiki/Sucrose">sucrose</a>, I will only mention the topics that particularly thrilled me.
<p><blockquote>Note: Interestingly enough, before the introduction of industrial sugar manufacturing methods, honey was the primary source of sweeteners in most <a href="https://en.wikipedia.org/wiki/Sugar#Ancient_times_and_Middle_Ages">parts of the world</a>.</blockquote>
<p>
Humans have extracted sucrose from sugar cane since about 500BC. The process is quite laborious and involves extracting juice from crushed canes and boiling it to reduce the water content; then, while cooling, sucrose crystallizes out. Such sugar is called non-centrifugal cane sugar (NCS). Today the processes are quite optimized and use agents like <a href="https://en.wikipedia.org/wiki/Calcium_oxide">lime</a> (not to be confused with the citrus fruit) and activated carbon for purification and filtering. The result is raw sugar, which is then further purified into pure sucrose and molasses (the residue).
<p>
In the 19th century, the sugar beet plant joined the sugar party. A slightly different process is used, but it also results in sucrose and molasses. Beet molasses is considered unpalatable by humans, while cane molasses is heavily used in the food industry.
<p>
While it’s generally agreed that regular white sugar (sucrose) is “bad”, in recent years there has been a trend to substitute it with various kinds of brown sugar, which are considered healthier. Let’s explore what brown sugars actually are.
<p>
<a href="https://en.wikipedia.org/wiki/Brown_sugar">Brown sugar</a> is a sucrose based sugar that has a distinctive brown color due to presence of molasses. It’s either obtained by stopping refinement process at different stages, or by <a href="https://en.wikipedia.org/wiki/Brown_sugar#Production">re-adding</a> molasses to pure white sugar. Regardless of the method, the only non-sugar nutritional value of brown sugars comes from their molasses, and since typical brown sugar does not contain more than 10% of molasses, its difference to white sugar is negligible, nutrition wise. Bottom line - use brown sugars, e.g. demerara, muscovado, panela, etc. because you like their taste and not because they are healthier.
<p>
This leads to the conclusion that molasses is the only health-beneficial product of the sugar industry. The strongest, blackstrap molasses, contains a significant amount of vitamin B6 and minerals like calcium, magnesium, iron, and manganese, with one tablespoon providing 20% of the daily value.
<p>
The only outstanding detrimental effect of sucrose that I have discovered (compared to other sugars) is its increased effect on <a href="https://en.wikipedia.org/wiki/Sucrose#Tooth_decay">tooth decay</a>.
<h2>Misc</h2>
<h3>Caramel</h3>
Heating sugars, particularly sucrose, produces caramel. Sucrose first gets decomposed into glucose and fructose, which then build up into new compounds. Surprisingly enough, this process <a href="https://en.wikipedia.org/wiki/Caramelization#Process">is not well understood</a>.
<br><br>
<h3>Inverted sugar</h3>
Inverted sugar syrup is produced by splitting sucrose into its components - fructose and glucose. The resulting product is alluringly sweet, even compared to sucrose. The simplest way to obtain inverted sugar is to dissolve some sucrose in water and heat it. Citric acid (1g per kg of sugar) can be added to catalyze the process. Baking soda can be used later to neutralize the acid and thus remove the sour taste.
<p>
Sucrose inversion occurs when preparing jams, since fruits naturally contain acids. Inverted sugar provides strong preserving qualities for products that use it - this is what gives jams a relatively long shelf life even without additional preservatives.
Zaar Haihttp://www.blogger.com/profile/03019277728569320787noreply@blogger.com0tag:blogger.com,1999:blog-528513339167803855.post-71063783408865740432016-11-22T10:24:00.000+11:002016-11-22T19:37:13.074+11:00Elasticsearch pipeline aggregations - monitoring used capacity<p>Let's say I want to set up a simple monitoring system for my desktop. The desktop uses <a href="https://en.wikipedia.org/wiki/Logical_Volume_Manager_(Linux)">LVM</a> and has three volumes <code>v1</code>, <code>v2</code> and <code>v3</code>, all belonging to the <code>vg1</code> volume group. I would like to monitor the used capacity of these volumes, and of the whole system, over time. It's easy to write a script that samples the used capacity of the volumes and pushes it to ElasticSearch. All I need to store is:
<pre class="code prettyprint lang-js">
{
"name": "v1",
"ts": 1479762877,
"used_capacity": 1288404287488
}
</pre>
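<p>The sampling script itself is not the point of this post, but for completeness it can be as simple as the sketch below (the volume mount points and the document type are assumptions):
<pre class="code prettyprint">
#!/bin/bash
# Sample the used capacity of each volume and push one document per volume to ES
TS=$(date +%s)
for NAME in v1 v2 v3; do
    USED=$(df -B1 --output=used "/mnt/$NAME" | tail -1 | tr -d ' ')
    curl -s -XPOST "localhost:9200/capacity_history/sample" \
         -d "{\"name\": \"$NAME\", \"ts\": $TS, \"used_capacity\": $USED}"
done
</pre>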
<p>OK, so I've put the script into <em>cron</em> to run every 5 minutes and the data starts pouring in. Let's do some BI on it! The first thing to find out is how full my desktop is, i.e. the total used capacity across all volumes. Sounds like an easy job for Kibana, doesn't it? Well, not really.
<h2>Part 1: Naive failure</h2>
<p>Let's say each of my volumes is ~1TB full. Trying to chart an area viz in Kibana with the Average aggregation over <code>used_capacity</code> returns useless results (click the image below to enlarge):<br>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOhqsM_o4DHbCREsQH-9k5F6EEzoSehWfvJ4DAXhw9xnORjyWKR7GCPhqUEzdF4QJozK7Zgp8i2Ifurz-edyCUblU0N6T94FyePHBBQpe2DSmkW6wn2vV-Gm8Mx56E-aQqJoDuff8a5MFg/s1600/capacity-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOhqsM_o4DHbCREsQH-9k5F6EEzoSehWfvJ4DAXhw9xnORjyWKR7GCPhqUEzdF4QJozK7Zgp8i2Ifurz-edyCUblU0N6T94FyePHBBQpe2DSmkW6wn2vV-Gm8Mx56E-aQqJoDuff8a5MFg/s640/capacity-1.png" width="640" height="333" /></a></div>
<p>The real total system capacity is ~3TB, but Kibana, rightfully, shows that AVG(v1, v2, v3) => AVG(1TB, 1TB, 1TB) => 1TB. So maybe I need Sum? Not good either:<br>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUNzjWZE5SwyzHYVZh1QN3Znypbqwmk0Oi4o3pOJqIlxmdY34HQbWAB2qaJahyTv80wcj505CmkJSOrMP_iBjfGeXs_f_H2BlK7gJN10U8m47vPn1G5o4mMxfh0PFs7dg0TNrWyELyVv0B/s1600/capacity-2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUNzjWZE5SwyzHYVZh1QN3Znypbqwmk0Oi4o3pOJqIlxmdY34HQbWAB2qaJahyTv80wcj505CmkJSOrMP_iBjfGeXs_f_H2BlK7gJN10U8m47vPn1G5o4mMxfh0PFs7dg0TNrWyELyVv0B/s640/capacity-2.png" width="640" height="327" /></a></div>
<p>I got a ~17TB capacity number, which is not even close to reality. This happens because Kibana uses a simple Date Histogram with a nested Sum aggregation, i.e.
<ul>
<li>Divide the selected date range into <code>ts</code> buckets. 30 minutes in my example.
<li>Calculate the Sum of the <code>used_capacity</code> values of all documents that fall into the bucket.
</ul>
That's why the larger the bucket, the weirder the results look.
<p>In other words, Kibana is only capable of either:
$$
\underbrace{\text{SUM}\left(\begin{array}{c}v1, v1, v1,...\\ v2, v2, v2,...\\ v3, v3, v3,...\\ \end{array}\right)}_{ts\ bucket}
\quad\text{or}\quad
\underbrace{\text{AVG}\left(\begin{array}{c}v1, v1, v1,...\\ v2, v2, v2,...\\ v3, v3, v3,...\\ \end{array}\right)}_{ts\ bucket}
$$
While what I need is:
$$
\underbrace{\text{SUM}\left(\begin{array}{c}\text{AVG}(v1, v1, v1,...)\\ \text{AVG}(v2, v2, v2,...)\\ \text{AVG}(v3, v3, v3,...)\\ \end{array}\right)}_{ts\ bucket}
$$
So how to achieve this?
<h2>Part 2: Poor man's solution</h2>
The post title promised pipeline aggregations and I'll get there. The problem with pipeline aggregations is that they are not supported in Kibana. So, is there still a way to get along with Kibana? - sort of. I can leverage the fact that my sampling script takes capacity values of all volumes at exactly the same time, i.e. each bunch of volume metrics is pushed to ES with the same <code>ts</code> value. Now, if I force Kibana to use a <code>ts bucket</code> length of 1 minute, I can guarantee that any given bucket will only contain documents belonging to a single sample batch (that's because I send measurements to ES every 5 minutes, which is much larger than the 1m bucket size).
<p>One can argue that this generates LOTS of buckets - and they would be right, but there is one optimization point to consider. The ES Date histogram aggregation supports automatic pruning of buckets that do not have a minimum number of documents. The default is 0, which means empty buckets are returned, but Kibana wisely sets it to 1. Now let's say I want to see a capacity chart for the last 7 days, which is 7*24*60=10080 points (buckets); however, since I take measurements only every 5 minutes, most of the buckets will be pruned and we are left with only ~2000, which is fair enough for a Full HD screen. The nice side-effect of this is that it forces Kibana to draw really smooth charts :) Let's see it in action:<br>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOWEzVB6Qro6PIXOCn8J1F07PSA-vs33wOefM4XUXxd-J-ELNHsHD9maoKCeqGM6ShJIpqMnzkQb7rFNtTf7TQZS_AMQ5aLzW9NLRvQQeacGMUU-HyC0DY2vHBOtpWmKiKoom0uGAXFo2H/s1600/capacity-3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOWEzVB6Qro6PIXOCn8J1F07PSA-vs33wOefM4XUXxd-J-ELNHsHD9maoKCeqGM6ShJIpqMnzkQb7rFNtTf7TQZS_AMQ5aLzW9NLRvQQeacGMUU-HyC0DY2vHBOtpWmKiKoom0uGAXFo2H/s640/capacity-3.png" width="640" height="333" /></a></div>
<p>The above graph shows capacity data for the last 7 days. The key point is to open the <em>Advanced</em> section of the X-Axis dialog and put <code>{"interval": "1m"}</code> in the JSON Input field - this overrides Kibana's automatic interval. The bottom legend, which says "ts per 3 hours", is lying, but it's the least of evils. Also note how smooth the graph line is.
<h2>Part 3: Pipeline aggregations!</h2>
<p>The above solution works, but does not scale well beyond a single system - getting measurements from multiple systems at exactly the same time is tricky. Another drawback is that trying to look at several months of data will result in tens of thousands of buckets, which is a burden on ES, on the network and on Kibana.
<p>The right solution is to implement the correct formula. I need something like this:
<pre class="code prettyprint">
SELECT AVG(used_capacity), ts FROM
(SELECT SUM(used_capacity) AS used_capacity, DATE(ts) AS ts FROM capacity_history GROUP BY DATE(ts), name)
GROUP BY ts
</pre>
<p>Elasticsearch supports this since version 2.0 with Pipeline aggregations:
<pre class="code prettyprint">
GET capacity_history/_search
{
"size": 0,
"aggs": {
"ts": {
"date_histogram": {"interval": "1h", "field": "ts"},
"aggs": {
"vols": {
"terms": {"field": "name.raw", "size": 0},
"aggs": {
"cap": {
"avg": {"field": "logical_capacity"}
}
}
},
"total_cap": {
"sum_bucket": {
"buckets_path": "vols>cap"
}}}}}}
</pre>
<p>Response
<pre class="code prettyprint">
"aggregations": {
"ts": {
"buckets": [
{
"key_as_string": "1479600000",
"key": 1479600000000,
"doc_count": 36,
"vols": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "v1",
"doc_count": 12,
"cap": {
"value": 1073741824000
}
},
{
"key": "v2",
"doc_count": 12,
"cap": {
"value": 1073741824000
}
},
{
"key": "v3",
"doc_count": 12,
"cap": {
"value": 1072459894784
}
}
]
},
"total_cap": {
"value": 3219943542784
}
},
...
</pre>
Since we only need the ts bucket key and the value of the <code>total_cap</code> aggregation, we can ask ES to filter the results to include only the data we need. In case we have lots of volumes, this can reduce the amount of returned data by orders of magnitude!
<pre class="code prettyprint">
GET capacity_history/_search?filter_path=aggregations.ts.buckets.key,aggregations.ts.buckets.total_cap.value,took,_shards,timed_out
...
</pre>
<pre class="code prettyprint">
{
"took": 92,
"timed_out": false,
"_shards": {
"total": 70,
"successful": 70,
"failed": 0
},
"aggregations": {
"ts": {
"buckets": [
{
"key": 1479600000000,
"total_cap": {
"value": 3219943542784
}
},
{
"key": 1479603600000,
"total_cap": {
"value": 3220228083712
}
},
...
</pre>
<strong>NOTE:</strong> I suggest always returning the meta <code>timed_out</code> and <code>_shards</code> fields to make sure you do not get partial data.
<p>This method is generic and will work regardless of the time alignment of the samples; the bucket size can be adjusted to return the same amount of data points. The major drawback is that it is not supported by stock Kibana, and thus you will need your own custom framework to visualize this.
<script>prettyPrint()</script>
<script type="text/javascript" async
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>Zaar Haihttp://www.blogger.com/profile/03019277728569320787noreply@blogger.com0tag:blogger.com,1999:blog-528513339167803855.post-70447178742916807412016-07-14T09:09:00.000+10:002016-11-23T19:44:11.951+11:00You better have persistent storage for ElasticSearch master nodes<p>This is followup for my <a href="http://tech.zarmory.com/2016/04/persistent-storage-for-elasticsearch.html">previous post</a> about whether ElasticSearch master nodes should have persistent storage - <strong>they better do!</strong>. The rest of the post demonstrates how you can have spectacular data loss with ES if master nodes do not save their state to persistent storage.
<h2>The theory</h2>
<p>Let's say you have the following cluster with a single index (single primary shard). You also have an application that constantly writes data to the index.
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGqZiteDU-fS9flr3H9LDD8JbAq9jmrnACgTb01Bxicl_DZiIWUznw47-a3JSYtxD3921vT9MSuecX2QnxUFNG4KgXiWqV0ce9eTNoWowit0lHMN-t2Zq0gLLfcT1ldOr_Hu0CXhhyC6P7/s1600/1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img style="background:none; box-shadow:none;" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGqZiteDU-fS9flr3H9LDD8JbAq9jmrnACgTb01Bxicl_DZiIWUznw47-a3JSYtxD3921vT9MSuecX2QnxUFNG4KgXiWqV0ce9eTNoWowit0lHMN-t2Zq0gLLfcT1ldOr_Hu0CXhhyC6P7/s1600/1.png" /></a></div>
<p>Now what happens if all your master nodes evaporate? Well, you relaunch them with clean disks.
The moment the masters are up, the cluster is red, since there are no data nodes yet, and your application cannot index data.
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiwyNBI_4qygc5MuBiJ-RJednLuJO3WGuweUEYRCp2mnhS450j4RTFj58aVhn9n9QAhyphenhyphenbzknWQNx3aM9oe5fDR37V5vFDfIQSyXldV1-JYy5aGfEy6aUJRFALSeoqtaLTSBaQgssa_fO6J/s1600/2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img style="background:none; box-shadow:none;" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiwyNBI_4qygc5MuBiJ-RJednLuJO3WGuweUEYRCp2mnhS450j4RTFj58aVhn9n9QAhyphenhyphenbzknWQNx3aM9oe5fDR37V5vFDfIQSyXldV1-JYy5aGfEy6aUJRFALSeoqtaLTSBaQgssa_fO6J/s1600/2.png" /></a></div>
<p>Now the data nodes start to join. In our example, the second one joins slightly before the first. What happens is that the cluster becomes green, since the fresh masters have no idea that there is another data node that has data and is about to join.
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhh0kMebjqTjtzGzn5CCTBzyNeu8glyJDXBYLvYpCsT10AVffDB9LIzZsfL-TxcsJHhamwTofkgbJl1rikW-csu1z2EZQ_N-C3uzJJI6vYezOi_TASWXjVIYpfVpMcZgkn6aD6h0_-dEW17/s1600/3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img style="background:none; box-shadow:none;" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhh0kMebjqTjtzGzn5CCTBzyNeu8glyJDXBYLvYpCsT10AVffDB9LIzZsfL-TxcsJHhamwTofkgbJl1rikW-csu1z2EZQ_N-C3uzJJI6vYezOi_TASWXjVIYpfVpMcZgkn6aD6h0_-dEW17/s1600/3.png" /></a></div>
<p>Your application happily continues to index data, into a newly created index on data node 2.
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh628xJkgpAUINNTuz7FWbCbeiZcWL1o9mjHxR7rbhkx5h05a7JRzodYD6c1NMvAJefiuvH6MkPq9UUhrjlnXksTbVIRokScEOt_tgwtu3IBz4tV59fLIDWHWxDW0qbmH7XKhrt6dOaR3QI/s1600/4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img style="background:none; box-shadow:none;" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh628xJkgpAUINNTuz7FWbCbeiZcWL1o9mjHxR7rbhkx5h05a7JRzodYD6c1NMvAJefiuvH6MkPq9UUhrjlnXksTbVIRokScEOt_tgwtu3IBz4tV59fLIDWHWxDW0qbmH7XKhrt6dOaR3QI/s1600/4.png" /></a></div>
<p>Now data node 1 joins - the masters decide that it holds some old version of our index and discard it. Data loss!!!
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGxLs50Yqv-ivB_AgSUnwq6kxpGv8QmU2R1LqpTTBdF4um_A_e-_NPpQgxy3cmE3sI32EEiP1JbjT0Z4kFyQiym8hHJNMIO76zOEcbM1LYmRS1xbeNIIxJLi_Kty_W5CkdQ6x1qmrmEdC6/s1600/5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img style="background:none; box-shadow:none;" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGxLs50Yqv-ivB_AgSUnwq6kxpGv8QmU2R1LqpTTBdF4um_A_e-_NPpQgxy3cmE3sI32EEiP1JbjT0Z4kFyQiym8hHJNMIO76zOEcbM1LYmRS1xbeNIIxJLi_Kty_W5CkdQ6x1qmrmEdC6/s1600/5.png" /></a></div>
<p>Sounds too esoteric to happen in real life? Here is a sad & true story - back in the day we ran our ES master nodes in Kubernetes without persistent disks, i.e. on local <em>emptyDir</em> volumes only. One day there was a short network outage - less than an hour. The kubelets lost connection to the K8s master node and killed the pods. Once the network was back, the pods were started - with clean disk volumes! - and our application resumed running. The only catch is that we lost tons of data :)
<h2>The reproduction</h2>
<p>Let's try to simulate this in practice to see what happens. I'll use a minimal ES cluster by just running three ES instances on my laptop:
<ul>
<li>1 master node that also serves as a client node
<li>2 data nodes. Let's call them <code>dnode1</code> and <code>dnode2</code>
</ul>
<p>Open three shells and let's go:
<ol>
<li>Start the nodes - each in a separate shell<br>
Master:
<pre class="code prettyprint">
/usr/share/elasticsearch/bin/elasticsearch -Des.node.data=false -Des.node.master=true -Des.node.name=master-client --path.conf=/etc/elasticsearch --default.path.logs=/tmp/master-client/logs --default.path.data=/tmp/master-client
</pre>
Data 01:
<pre class="code prettyprint">
/usr/share/elasticsearch/bin/elasticsearch -Des.http.enabled=false -Des.node.data=true -Des.node.master=false -Des.node.name=data-01 --path.conf=/etc/elasticsearch --default.path.logs=/tmp/data-01/logs --default.path.data=/tmp/data-01
</pre>
Data 02:
<pre class="code prettyprint">
/usr/share/elasticsearch/bin/elasticsearch -Des.http.enabled=false -Des.node.data=true -Des.node.master=false -Des.node.name=data-02 --path.conf=/etc/elasticsearch --default.path.logs=/tmp/data-02/logs --default.path.data=/tmp/data-02
</pre>
<li>Create an index and index a document:
<pre class="code prettyprint">
curl -XPUT 127.0.0.1:9200/users?pretty -d '{"settings": {"number_of_shards": 1, "number_of_replicas": 0}}'
curl -XPUT 127.0.0.1:9200/users/user/1 -d '{"name": "Zaar"}'
</pre>
<li>Check which data node the index has landed on (see the <code>_cat/shards</code> example right after this list). In my case, it was dnode2. Shut down this data node and the master node (just hit <code>CTRL-C</code> in their shells)
<li>Simulate master data loss by issuing <code>rm -rf /tmp/master-client/</code>
<li>Bring the master back (launch the same command as before)
<li>Re-create the index and index another document:
<pre class="code prettyprint">
curl -XPUT 127.0.0.1:9200/users?pretty -d '{"settings": {"number_of_shards": 1, "number_of_replicas":0}}'
curl -XPUT 127.0.0.1:9200/users/user/2 -d '{"name": "Hai"}'
</pre>
</ol>
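<p>To check where the single shard of the <code>users</code> index landed (step 3 above), you can query the <code>_cat/shards</code> API on the master/client node - the output below is only illustrative of the format:
<pre class="code prettyprint">
$ curl '127.0.0.1:9200/_cat/shards/users?v'
index shard prirep state   docs store ip        node
users 0     p      STARTED    1 3.1kb 127.0.0.1 data-02
</pre>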
<p>Now, while dnode2 is still down, we can see that the index directory exists in the data directories of both nodes:
<pre class="code prettyprint">
$ ls /tmp/data-0*/elasticsearch/nodes/0/indices/
/tmp/data-01/elasticsearch/nodes/0/indices/:
users
/tmp/data-02/elasticsearch/nodes/0/indices/:
users
</pre>
<p>However, the data on <code>dnode2</code> is now in a "Schrödinger's cat" state - not dead, but not exactly alive either.
<p>Let's bring node two back and see what happens (I've also set the gateway log level to TRACE in <code>/etc/elasticsearch/logging.yml</code> for better visibility):
<pre class="code prettyprint">
$ /usr/share/elasticsearch/bin/elasticsearch -Des.http.enabled=false -Des.node.data=true -Des.node.master=false -Des.node.name=data-02 --path.conf=/etc/elasticsearch --default.path.logs=/tmp/data-02/logs --default.path.data=/tmp/data-02
[2016-07-01 17:07:13,528][INFO ][node ] [data-02] version[2.3.3], pid[11826], build[218bdf1/2016-05-17T15:40:04Z]
[2016-07-01 17:07:13,529][INFO ][node ] [data-02] initializing ...
[2016-07-01 17:07:14,265][INFO ][plugins ] [data-02] modules [reindex, lang-expression, lang-groovy], plugins [kopf], sites [kopf]
[2016-07-01 17:07:14,296][INFO ][env ] [data-02] using [1] data paths, mounts [[/ (/dev/mapper/kubuntu--vg-root)]], net usable_space [21.9gb], net total_space [212.1gb], spins? [no], types [ext4]
[2016-07-01 17:07:14,296][INFO ][env ] [data-02] heap size [990.7mb], compressed ordinary object pointers [true]
[2016-07-01 17:07:14,296][WARN ][env ] [data-02] max file descriptors [4096] for elasticsearch process likely too low, consider increasing to at least [65536]
[2016-07-01 17:07:16,285][DEBUG][gateway ] [data-02] using initial_shards [quorum]
[2016-07-01 17:07:16,513][DEBUG][indices.recovery ] [data-02] using max_bytes_per_sec[40mb], concurrent_streams [3], file_chunk_size [512kb], translog_size [512kb], translog_ops [1000], and compress [true]
[2016-07-01 17:07:16,563][TRACE][gateway ] [data-02] [upgrade]: processing [global-7.st]
[2016-07-01 17:07:16,564][TRACE][gateway ] [data-02] found state file: [id:7, legacy:false, file:/tmp/data-02/elasticsearch/nodes/0/_state/global-7.st]
[2016-07-01 17:07:16,588][TRACE][gateway ] [data-02] state id [7] read from [global-7.st]
[2016-07-01 17:07:16,589][TRACE][gateway ] [data-02] found state file: [id:1, legacy:false, file:/tmp/data-02/elasticsearch/nodes/0/indices/users/_state/state-1.st]
[2016-07-01 17:07:16,598][TRACE][gateway ] [data-02] state id [1] read from [state-1.st]
[2016-07-01 17:07:16,599][TRACE][gateway ] [data-02] found state file: [id:7, legacy:false, file:/tmp/data-02/elasticsearch/nodes/0/_state/global-7.st]
[2016-07-01 17:07:16,602][TRACE][gateway ] [data-02] state id [7] read from [global-7.st]
[2016-07-01 17:07:16,602][TRACE][gateway ] [data-02] found state file: [id:1, legacy:false, file:/tmp/data-02/elasticsearch/nodes/0/indices/users/_state/state-1.st]
[2016-07-01 17:07:16,604][TRACE][gateway ] [data-02] state id [1] read from [state-1.st]
[2016-07-01 17:07:16,605][DEBUG][gateway ] [data-02] took 5ms to load state
[2016-07-01 17:07:16,613][INFO ][node ] [data-02] initialized
[2016-07-01 17:07:16,614][INFO ][node ] [data-02] starting ...
[2016-07-01 17:07:16,714][INFO ][transport ] [data-02] publish_address {127.0.0.1:9302}, bound_addresses {[::1]:9302}, {127.0.0.1:9302}
[2016-07-01 17:07:16,721][INFO ][discovery ] [data-02] elasticsearch/zcQx-01tRrWQuXli-eHCTQ
[2016-07-01 17:07:19,848][INFO ][cluster.service ] [data-02] detected_master {master-client}{V1gaCRB8S9yj_nWFsq7uCg}{127.0.0.1}{127.0.0.1:9300}{data=false, master=true}, added {{data-01}{FnGrtAwDSDSO2j_B53I4Xg}{127.0.0.1}{127.0.0.1:9301}{master=false},{master-client}{V1gaCRB8S9yj_nWFsq7uCg}{127.0.0.1}{127.0.0.1:9300}{data=false, master=true},}, reason: zen-disco-receive(from master [{master-client}{V1gaCRB8S9yj_nWFsq7uCg}{127.0.0.1}{127.0.0.1:9300}{data=false, master=true}])
[2016-07-01 17:07:19,868][TRACE][gateway ] [data-02] [_global] writing state, reason [changed]
[2016-07-01 17:07:19,905][INFO ][node ] [data-02] started
</pre>
<p>At 17:07:16 we see the node found some data on its own disk, but discarded it at 17:07:19 after joining the cluster.
Its data dir is in fact empty:
<pre class="code prettyprint">
$ ls /tmp/data-0*/elasticsearch/nodes/0/indices/
/tmp/data-01/elasticsearch/nodes/0/indices/:
users
/tmp/data-02/elasticsearch/nodes/0/indices/:
</pre>
<p>Invoking <code>stat</code> confirms that the data directory was changed right after the "writing state" message above:
<pre class="code prettyprint">
$ stat /tmp/data-02/elasticsearch/nodes/0/indices/
File: ‘/tmp/data-02/elasticsearch/nodes/0/indices/’
Size: 4096 Blocks: 8 IO Block: 4096 directory
Device: fc01h/64513d Inode: 1122720 Links: 2
Access: (0775/drwxrwxr-x) Uid: ( 1000/ haizaar) Gid: ( 1000/ haizaar)
Access: 2016-07-01 17:08:39.093619141 +0300
Modify: 2016-07-01 17:07:19.920869352 +0300
Change: 2016-07-01 17:07:19.920869352 +0300
Birth: -
</pre>
<h2>Conclusions</h2>
<ul>
<li>Masters' cluster state is at least as important as data. Make sure your master node disks are backed up.
<li>If running on K8s, use persistent external volumes (gcePersistentDisk if running on GKE) - see the pod spec sketch right after this list.
<li>If possible, pause indexing after complete master outages until all of the data nodes come back.
</ul>
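<p>For illustration only - the image, disk name and mount path below are assumptions, not our actual setup - a master pod that keeps its data directory on a GCE persistent disk would look roughly like this:
<pre class="code prettyprint">
apiVersion: v1
kind: Pod
metadata:
  name: es-master
spec:
  containers:
  - name: es-master
    image: elasticsearch:2.3.3
    volumeMounts:
    - name: es-master-data
      mountPath: /usr/share/elasticsearch/data
  volumes:
  - name: es-master-data
    gcePersistentDisk:
      pdName: es-master-disk   # GCE disk created in advance
      fsType: ext4
</pre>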
<script>prettyPrint()</script>Zaar Haihttp://www.blogger.com/profile/03019277728569320787noreply@blogger.com2tag:blogger.com,1999:blog-528513339167803855.post-7008693477939277342016-04-27T01:46:00.000+10:002016-04-29T15:25:36.850+10:00How Kubernetes applies resource limits<p>We are building one of our products on a cloud and decided to run it entirely on a Kubernetes cluster. One of the big pains relieved by containers is resource separation between different processes (modules) of your system. Let's say we have a product that comprises several services that talk to each other ("microservices" as it is now fashionably called). Before containers, or, to be more precise, before Linux kernel control groups were introduced, we had several options to try to ensure that they do not step on each other:
<ul>
<li> Run each microservice on a separate VM, which is usually wasteful
<li> Play with CPU affinity for each microservice, on each VM - this saves you only from CPU hogs, but not from memory leeches, fork bombs, I/O swappers, etc.
</ul>
<p>This is where containers come into play - they allow you to share your machine between different applications by allocating the required portion of resources to each of them.
<h2>Back to Kubernetes</h2>
Kubernetes supports defining limit enforcement on two resource types: CPU and RAM. For each container you can provide a <em>requested</em>, i.e. minimum required, amount of CPU and memory, and a <em>limit</em> that the container should not pass. <em>Requested</em> is also used for pod scheduling, to ensure that a node can provide the minimum amount of resources the pod requested. All these parameters are of course <a href="http://kubernetes.io/docs/user-guide/compute-resources/#how-pods-with-resource-limits-are-run">translated</a> to Docker parameters under the hood.
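<p>In pod spec terms, both knobs live under a container's <code>resources</code> section - an illustrative fragment (the values here are arbitrary):
<pre class="code prettyprint">
resources:
  requests:          # minimum guaranteed; used by the scheduler
    cpu: "500m"
    memory: "100Mi"
  limits:            # ceiling the container should not pass
    cpu: "1"
    memory: "200Mi"
</pre>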
<p>Since Kubernetes is quite a new gorilla on the block, I decided to test how enforcement behaves, to get first-hand experience with it.
<p>So first I created a container cluster on GKE with Kubernetes 1.1.8:
<pre class="code prettyprint">
gcloud container clusters create limits-test --machine-type n1-highcpu-4 --num-nodes 1
</pre>
<p>Now let's see what we got on our node (scroll right):
<pre class="code prettyprint">
$ kubectl describe nodes
Non-terminated Pods: (5 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
───────── ──── ──────────── ────────── ─────────────── ─────────────
kube-system fluentd-cloud-logging-gke-limits-test-aec280e3-node-2tdw 100m (2%) 100m (2%) 200Mi (5%) 200Mi (5%)
kube-system heapster-v11-9rqvl 100m (2%) 100m (2%) 212Mi (5%) 212Mi (5%)
kube-system kube-dns-v9-kbzpd 310m (7%) 310m (7%) 170Mi (4%) 170Mi (4%)
kube-system kube-ui-v4-7q12m 100m (2%) 100m (2%) 50Mi (1%) 50Mi (1%)
kube-system l7-lb-controller-v0.5.2-imjry 110m (2%) 110m (2%) 70Mi (1%) 120Mi (3%)
Allocated resources:
(Total limits may be over 100%, i.e., overcommitted...)
CPU Requests CPU Limits Memory Requests Memory Limits
──────────── ────────── ─────────────── ─────────────
720m (18%) 720m (18%) 702Mi (19%) 752Mi (21%)
</pre>
<p>That's quite interesting already - the minimal resource overhead of Kubernetes is 720 millicores of CPU and 702 megabytes of RAM (not including <code>kubelet</code> and <code>kube-proxy</code> of course). However, the second node onward will only run one <em>daemon</em> pod - <code>fluentd</code> for log collection - so the resource reservation there will be significantly lower.
<h2>CPU</h2>
Kubernetes <a href="https://github.com/kubernetes/kubernetes/blob/release-1.2/docs/proposals/resource-qos.md#compressible-resource-guarantees">defines</a> the CPU resource as <em>compressible</em>, i.e. a pod can get a larger share of the CPU if there is spare capacity, and this can be taken back on the fly, without a process restart/kill.
<p>I've created a simple <a href="https://gist.github.com/haizaar/91469f5c4dfdef1f1965#file-cpu_loader-py">CPU loader</a> that calculates squares of the integers from 1 to 1000 in a loop on every core and prints the loops/second count; packaged it into a <a href="https://hub.docker.com/r/haizaar/cpu-loader/">docker image</a> and launched it into k8s using the following pod file:
<pre class="code prettyprint">
apiVersion: v1
kind: Pod
metadata:
name: cpu-small
spec:
containers:
- image: docker.io/haizaar/cpu-loader:1.1
name: cpu-small
resources:
requests:
cpu: "500m"
</pre>
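<p>For reference, the loader boils down to something like the following - a simplified sketch of the idea, not the exact published gist (which, among other things, aggregates the count across cores):
<pre class="code prettyprint">
# cpu_loader.py - burn CPU on every core and report loops/sec (sketch)
import time
from multiprocessing import Pool, cpu_count

def burn(_):
    while True:
        start, loops = time.time(), 0
        while time.time() - start < 1:
            for i in range(1, 1001):
                i * i          # squares of 1..1000
            loops += 1
        print("%d loops/sec" % loops)

if __name__ == "__main__":
    Pool(cpu_count()).map(burn, range(cpu_count()))
</pre>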
<p>I've created another pod similar to this one - just called it <code>cpu-large</code>. Attaching to the pods shortly afterwards, I saw that they each get a fair share of CPU:
<pre class="code prettyprint">
$ kubectl attach cpu-small
13448 loops/sec
13841 loops/sec
13365 loops/sec
13818 loops/sec
14937 loops/sec
$ kubectl attach cpu-large
14615 loops/sec
14448 loops/sec
14089 loops/sec
13755 loops/sec
14267 loops/sec
</pre>
<p>That makes sense - they both requested only .5 cores and the rest was split between them, since nobody else was interested.
So in total this node can crunch ~30k loops/second. Now let's make <code>cpu-large</code> really large and reserve at least 2.5 cores for it by <a href="https://gist.github.com/haizaar/91469f5c4dfdef1f1965#file-cpu-large-yaml">changing</a> its <code>requests.cpu</code> to 2500m and re-launching it into k8s. According to our settings, this pod should now be able to crunch at least ~25k loops/sec:
<pre class="code prettyprint">
$ kubectl attach cpu-large
23310 loops/sec
23000 loops/sec
25822 loops/sec
23834 loops/sec
25153 loops/sec
24741 loops/sec
</pre>
<p>And this is indeed the case. Let's see what happened to <code>cpu-small</code>:
<pre class="code prettyprint">
$ kubectl attach cpu-small
30091 loops/sec
28609 loops/sec
30219 loops/sec
27051 loops/sec
27885 loops/sec
29091 loops/sec
28699 loops/sec
18216 loops/sec
4213 loops/sec
4188 loops/sec
4296 loops/sec
4347 loops/sec
4141 loops/sec
</pre>
<p>First it got all of the CPU while I was re-launching <code>cpu-large</code>, but once the latter was up, the CPU share for <code>cpu-small</code> was reduced. Together they will produce the same ~30k loops/second, but we now control the share ratio.
<p>What about limits? Well, it <a href="https://groups.google.com/forum/#!searchin/google-containers/limits/google-containers/rVqIY3d0yWU/6qSdRmlnAwAJ">turns out</a> that currently CPU limits are not enforced. This is not a big problem for us, because in our deployment strategy we prefer to provide the minimum required CPU share for every pod and, for the rest - be my guest. However, at this point I was glad I did this test, since the documentation was misleading with regard to CPU limits.
<h2>RAM</h2>
The RAM resource is <em>uncompressible</em>, because there is no way to throttle a process on memory usage or gently ask it to <em>unmalloc</em> some of it. That's why, if a process reaches its RAM limit, it is simply killed.
<p>To see how it's enforced in practice, I, again, created a simple <a href="https://gist.github.com/haizaar/607f43e282c4e0f8737a#file-mem_loader-py">script</a> that allocates memory in chunks up to a predefined limit.
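<p>The allocator itself is trivial - roughly along these lines (a sketch; like the published gist, it takes the target size from the <code>MAXMEM</code> environment variable):
<pre class="code prettyprint">
# mem_loader.py - allocate memory in 1MB chunks up to MAXMEM bytes (sketch)
import os
import time

maxmem = int(os.environ.get("MAXMEM", 2 * 1024 ** 3))
chunk = 1024 * 1024          # 1MB at a time
hog = []

while len(hog) * chunk < maxmem:
    hog.append(bytearray(chunk))   # bytearray zero-fills, so pages are really touched
    print("Reached %d megabytes" % len(hog))

time.sleep(3600)             # hold on to the memory and stand by
</pre>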
<p>First I've tested how <code>requests.memory</code> are enforced. I've created the following <code>mem-small</code> pod:
<pre class="code prettyprint">
apiVersion: v1
kind: Pod
metadata:
name: mem-small
spec:
containers:
- image: docker.io/haizaar/mem-loader:1.2
name: mem-small
resources:
requests:
memory: "100Mi"
env:
- name: MAXMEM
value: "2147483648"
</pre>
<p>and launched it. It happily allocated 2GB of RAM and stood by. Then I created a <code>mem-large</code> pod with a similar configuration where <code>requests.memory</code> is set to "2000Mi". After I launched the large pod, the following happened:
<ul>
<li><code>mem-large</code> started allocating its desired 2GB of RAM.
<li>Since my k8s node only had 3.6GB of RAM, the system froze for a dozen seconds or so.
<li>Since there was no free memory in the system, the kernel Out Of Memory killer kicked in and killed the <strong><code>mem-small</code></strong> pod:
</ul>
<pre class="code prettyprint">
[ 609.739039] Out of memory: Kill process 5410 (python) score 1270 or sacrifice child
[ 609.746918] Killed process 5410 (python) total-vm:1095580kB, anon-rss:1088056kB, file-rss:0kB
</pre>
<p>I.e. enforcement took place and my small pod was killed, since it consumed more RAM than it requested while the other pod was legitimately requesting memory. However, such behavior is unsuitable in practice since it causes a "stop-the-world" effect for everything that runs on that particular k8s node.
<p>Now let's see how <code>resources.limits</code> are enforced. To verify that, I've killed both of my pods and changed <code>mem-small</code> as follows:
<pre class="code prettyprint">
apiVersion: v1
kind: Pod
metadata:
name: mem-small
spec:
containers:
- image: docker.io/haizaar/mem-loader:1.2
name: mem-small
resources:
requests:
memory: "100Mi"
limits:
memory: "100Mi"
env:
- name: MAXMEM
value: "2147483648"
</pre>
<p>After launching it I saw the following in its output:
<pre class="code prettyprint">
Reached 94 megabytes
Reached 95 megabytes
Reached 96 megabytes
Reached 97 megabytes
Reached 98 megabytes
Reached 99 megabytes
Reached 99 megabytes
Reached 99 megabytes
Killed
</pre>
<p>I.e. the process was immediately killed after reaching its RAM limit. There is nice evidence of that in the <code>dmesg</code> output:
<pre class="code prettyprint">
[ 898.665335] Task in /214bc7c0bcdb1d0bf10b8ab4cff06b451850f9af0894472a403412ea295324ea killed as a result of limit of /214bc7c0bcdb1d0bf10b8ab4cff06b451850f9af0894472a403412ea295324ea
[ 898.689794] memory: usage 102400kB, limit 102400kB, failcnt 612
[ 898.697490] memory+swap: usage 0kB, limit 18014398509481983kB, failcnt 0
[ 898.705930] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
[ 898.713672] Memory cgroup stats for /214bc7c0bcdb1d0bf10b8ab4cff06b451850f9af0894472a403412ea295324ea: cache:84KB rss:102316KB rss_huge:0KB mapped_file:4KB writeback:0KB inactive_anon:4KB active_anon:102340KB inactive_file:20KB active_file:16KB unevictable:0KB
[ 898.759180] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 898.768961] [ 6679] 0 6679 377 1 6 0 -999 sh
[ 898.778387] [ 6683] 0 6683 27423 25682 57 0 -999 python
[ 898.788280] Memory cgroup out of memory: Kill process 6683 (python) score 29 or sacrifice child
</pre>
<h2>Conclusions</h2>
Kubernetes documentation is a bit misleading with regard to <code>resources.limits.cpu</code>. Nevertheless, this mechanism looks perfectly usable in practice. All of the code and configuration used in this post is available in the following gists:
<ul>
<li> <a href="https://gist.github.com/haizaar/91469f5c4dfdef1f1965">CPU Loader</a>
<li> <a href="https://gist.github.com/haizaar/607f43e282c4e0f8737a">Mem Loader</a>
</ul>
<script>prettyPrint()</script>Zaar Haihttp://www.blogger.com/profile/03019277728569320787noreply@blogger.com0tag:blogger.com,1999:blog-528513339167803855.post-29951790641005064982016-04-18T01:44:00.002+10:002018-08-02T17:24:07.323+10:00Kubernetes cluster access by fixed IP<p>If you:
<ul>
<li>Have Kubernetes cluster running in GKE
<li>Connected GKE to your company network through VPN
<li><strong>Puzzled how to assign a fixed IP to a particular k8s service</strong>
</ul>
<p>Then read on.
<h3>Update July 2018</h3>
<p>GCP/GKE now <a href="https://cloud.google.com/kubernetes-engine/docs/how-to/internal-load-balancing">supports</a>
Internal Load Balancing for GKE clusters. I.e. now you can simply request a fixed IP on your network that will
route to your cluster service.
<p>The only limitation is that it cannot multiplex several services under a single IP (on different ports);
therefore I still utilize the "fixed-ip-proxy" described below, but now it has a static configuration that points to
the fixed IPs of the Internal Load Balancers (see the sketch below). Alternatively, one can do the above multiplexing inside K8s itself
(using e.g. nginx or <a href="https://github.com/kubernetes/contrib/tree/master/for-demos/proxy-to-service">proxy-to-service</a>)
and use an internal LB to expose this multiplexing service.
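<p>For illustration, a service exposed on a fixed internal IP would look roughly like this - the annotation is the one documented by GKE as of this update, while the selector, name and IP are made up:
<pre class="code prettyprint">
apiVersion: v1
kind: Service
metadata:
  name: kibana
  annotations:
    cloud.google.com/load-balancer-type: "Internal"
spec:
  type: LoadBalancer
  loadBalancerIP: 10.10.1.1   # optional; otherwise an internal IP is auto-assigned
  selector:
    app: kibana
  ports:
  - port: 5601
    targetPort: 5601
</pre>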
<h3>Prologue</h3>
The ideal solution would be to configure a k8s service to use a GCP LoadBalancer and
have the latter provide a private IP only. However, as of April 2016,
LoadBalancers on GCP do not provide a private-IP-only option, though GCP
solution engineers said this feature "is coming".
<p>Therefore the only option we have is to run a dedicated VM with a fixed IP
and proxy traffic through it.
<h3>The approach</h3>
A Kubernetes service itself provides two relevant ways to access the pods behind it:
<dl>
<dt>ClusterIP</dt>
<dd>
By default, every service has a virtual ClusterIP (which can be manually set to a
predefined address) which can be used to access pods behind the service. However
for this to work, a client has to have kube-proxy running on its host as explained
<a href="http://kubernetes.io/docs/user-guide/services/#virtual-ips-and-service-proxies">here</a>.
</dd>
<dt>NodePort</dt>
<dd>
A k8s service can be configured to expose a certain port on every k8s node, which
will be redirected to the service's pods (this comes on top of ClusterIP).
</dd>
</dl>
<p>The ClusterIP approach obviously is not feasible outside the k8s cluster, so we are only left
with the NodePort approach (see the spec sketch right below). The problem is that k8s node IPs are not static and may change.
That's why we need a dedicated VM which has a fixed IP.
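<p>For reference, a NodePort service spec looks roughly like this (the name and selector are hypothetical; port 30601 is the one reused in the NGINX example further down):
<pre class="code prettyprint">
apiVersion: v1
kind: Service
metadata:
  name: kibana
spec:
  type: NodePort
  selector:
    app: kibana
  ports:
  - port: 5601        # ClusterIP port
    targetPort: 5601  # container port
    nodePort: 30601   # exposed on every k8s node
</pre>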
<p>After we have a VM, we can either
<ul>
<li> Join it to the k8s cluster, so the service's NodePort will be exposed on the VM's fixed IP as well.
<li> Run a reverse HTTP proxy on the VM to forward traffic to the k8s nodes, together with a script that monitors the k8s nodes and updates the proxy configuration when necessary.
</ul>
<p>I chose the second option because it allows a single VM to proxy requests for
multiple k8s clusters and is easier to set up.
<h3>The setup</h3>
<h4>Create an instance</h4>
Let's create a VM and assign it a static IP. The below is my interpretation of
the <a href="https://cloud.google.com/compute/docs/instances-and-network#set_a_static_target_ip_address">official</a> guide.
<p>Create an instance first:
<pre class="code prettyprint">
gcloud compute instances create fixed-ip-proxy --can-ip-forward
</pre>
The last switch is crucial here.
<p>I chose the IP for my testing cluster to be 10.10.1.1. Let's add it to the instance:
<pre class="code prettyprint">
cat <<EOF >>/etc/network/interfaces.d/eth0-0
auto eth0:0
iface eth0:0 inet static
address 10.10.1.1
netmask 255.255.255.255
EOF
</pre>
<p>Now change <code>/etc/network/interfaces</code> and make sure that
<code>source-directory /etc/network/interfaces.d</code> line <strong>comes last</strong>. Apply your new
configuration by running:
<pre class="code prettyprint">
sudo service networking restart
</pre>
<p>The final step is to instruct GCE to forward traffic destined to 10.10.1.1 to
the new instance:
<pre class="code prettyprint">
gcloud compute routes create fixed-ip-production \
--next-hop-instance fixed-ip-proxy \
--next-hop-instance-zone us-central1-b \
--destination-range 10.10.1.1/32
</pre>
<p>To add more IPs (adding a dedicated IP per cluster is a good practice), add another
file under <code>/etc/network/interfaces.d/</code> and add a GCE route.
<h4>NGINX configuration</h4>
Install NGINX:
<pre class="code prettyprint">
sudo apt-get install nginx
</pre>
<p>Install Google Cloud Python SDK:
<pre class="code prettyprint">
sudo easy_install pip
sudo pip install --upgrade google-api-python-client
</pre>
<p>Now download the IP watcher script:
<pre class="code prettyprint">
sudo wget -O /root/nginx-ip-watch https://gist.githubusercontent.com/haizaar/f19bdf9e5a6e278c57b96cce945b4fd9/raw/79f11225825607ba78ba84221d27439c1669a492/nginx-ip-watch
sudo chmod 755 /root/nginx-ip-watch
</pre>
<p><strong>NOTE:</strong> You are downloading my script that will run as root on your machine - read its contents first!
<p>Test the script:
<pre class="code prettyprint">
$ sudo /root/nginx-ip-watch -h
usage: Watch GKE node IPs for changes [-h] -p PROJECT -z ZONES
name gke-prefix listen-ip listen-port
target-port
positional arguments:
name Meaningful name of your forwarding rule
gke-prefix GKE node prefix to monitor and forward to
listen-ip IP listen on
listen-port Port to listen on
target-port IP listen on
optional arguments:
-h, --help show this help message and exit
-p PROJECT, --project PROJECT
Project to list instances for
-z ZONES, --zones ZONES
Zones to list instances for
</pre>
<p>Now let's set up NGINX to listen for HTTP traffic on <code>10.10.1.1:5601</code> and forward it to
the GKE <code>testing</code> cluster nodes on port <code>30601</code> by adding the following to
<code>/etc/cron.d/nginx-ip-watch</code>:
<pre class="code prettyprint">
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
* * * * * root /root/nginx-ip-watch kibana-testing -p my-project -z us-central1-a gke-testing 10.10.1.1 5601 30601
</pre>
<p>After that, within one minute, your forwarding should be up and running. For more services,
just keep adding more lines in the cron file. This will work well for a dozen or so services.
After that, I would refactor the solution to issue only one <code>gcloud compute instances list</code>
command per minute.
<p>Since we are using NGINX in load-balancer mode, checking GKE hosts only once a minute is
good enough even during cluster upgrades - NGINX will detect and blacklist a shutting-down GKE
node by itself.
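<p>To give an idea of what the watcher produces, the generated NGINX configuration for the example above might look roughly like this - a sketch only, since the actual file name, layout and node IPs are determined by the script at runtime:
<pre class="code prettyprint">
# /etc/nginx/conf.d/kibana-testing.conf (illustrative)
upstream kibana-testing {
    server 10.128.0.3:30601;   # gke-testing node 1
    server 10.128.0.4:30601;   # gke-testing node 2
}
server {
    listen 10.10.1.1:5601;
    location / {
        proxy_pass http://kibana-testing;
    }
}
</pre>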
<h3>Epilogue</h3>
Create a snapshot of your instance to keep a backup of your work every time you change it.
Don't forget to issue the <code>sync</code> command on the system before taking a snapshot of the disk.
<br><br>
<h3>Update</h3>
<ul>
<li>The first version of my script used the <code>gcloud</code> command line util to fetch the instances list. It <a href="https://groups.google.com/forum/?fromgroups#!topic/google-cloud-sdk/TzZf0_iD8xg">turned out</a> that <code>gcloud</code>
performs logging to <code>~/.config/gcloud/logs</code> and spits out 500KB on every invocation. To mitigate this, I've updated my script to use the Google Cloud Python SDK and bypass the gcloud util completely.
<li>As Vadim points out below, you can now <a href="https://cloud.google.com/compute/docs/instances-and-network#specify_an_internal_ip_address_at_instance_creation">specify</a> a fixed internal IP at instance creation time. Though you'll still need the setup above if you want to have more than one IP per instance.
</ul>
<script>prettyPrint()</script>Zaar Haihttp://www.blogger.com/profile/03019277728569320787noreply@blogger.com0tag:blogger.com,1999:blog-528513339167803855.post-39403624677673451152016-04-05T08:05:00.001+10:002016-07-14T09:12:14.247+10:00Persistent storage for ElasticSearch master nodes? <p>ElasticSearch master nodes hold the cluster state. I was trying to understand whether these nodes are required to have persistent storage, or whether they can recover from whatever exists on the data nodes. The short answer is: <em>probably</em> yes.
<h3>Update</h3>
You better have persistent disk for your ES data nodes - read <a href="http://tech.zarmory.com/2016/07/you-better-have-persistent-storage-for.html">here</a> why.
<p>Below I describe the tests I've done. But before that - some background on how I got to this question in the first place.
<h2>ElasticSearch on Kubernetes</h2>
We are working on running ElasticSearch 2.x on Kubernetes on Google Container Engine. There are two options to store data for a container:
<dl>
<dt>EmptyDir
<dd>Part of the local storage on the Kubernetes node is allocated for the pod. If the pod's container restarts, the data survives. If the pod is killed - the data is lost.
<dt>gcePersistentDisk
<dd>A Compute Engine persistent disk (created in advance) can be attached to a pod. The data persists. However there is a limitation - as of Kubernetes 1.2, a ReplicaSet cannot attach a different disk to each pod that it creates, thus to run ElasticSearch data nodes, for example, you need to create a separate ReplicaSet (of size 1) for each ES data node.
</dl>
<p>ES data nodes should have persistent disks - this is a no-brainer. However, with regard to ES master nodes it's not clear. I've tried to understand where master nodes persist cluster state, and <a href="https://discuss.elastic.co/t/where-cluster-metadata-is-stored/44540/3">this</a> thread states "on every node including client nodes". There is also a resolved <a href="https://github.com/elastic/elasticsearch/issues/8823">issue</a> about storing index metadata on data nodes.
<h2>Run, Kill, Repeat</h2>
So let's see how it behaves in reality.
<p>I created a two-node Kubernetes 1.2 cluster running n1-standard-2 instances (2 CPUs, 7.5GB RAM) and used <a href="https://github.com/pires">Paulo Pires</a>' Kubernetes setup for ElasticSearch:
<pre class="code prettyprint">
$ git clone https://github.com/pires/kubernetes-elasticsearch-cluster.git
$ cd kubernetes-elasticsearch-cluster
$ vim es-data-rc.yaml # set replicas to 2
$ vim es-master-rc.yaml # set replicas to 3
$ vim es-svc.yaml # set type to ClusterIP
</pre>
<p>Let's launch it into the air:
<pre class="code prettyprint">
$ for i in *.yaml; do kubectl create -f $i; done
$ sleep 1m; kubectl get pods
NAME READY STATUS RESTARTS AGE
es-client-ats2b 1/1 Running 0 1h
es-data-teodq 1/1 Running 0 1h
es-data-zwml2 1/1 Running 0 1h
es-master-3bosq 1/1 Running 0 1h
es-master-a47om 1/1 Running 0 1h
es-master-c1dy1 1/1 Running 0 1h
</pre>
<p>We are all good. Let's ingest some data and alter the cluster settings:
<pre class="code prettyprint">
$ CLIENTIP=$(kubectl describe pods es-client |grep '^IP' |head -n 1|awk '{print $2}')
$ curl -XPUT $CLIENTIP:9200/_cluster/settings?pretty -d '{"transient": {"discovery.zen.minimum_master_nodes": 2}}'
{
"acknowledged" : true,
"persistent" : { },
"transient" : {
"discovery" : {
"zen" : {
"minimum_master_nodes" : "2"
}
}
}
}
$ curl -XPUT $CLIENTIP:9200/tweets/tweet/1 -d '{"foo": "bar"}'
{
"_id": "1",
"_index": "tweets",
"_shards": {
"failed": 0,
"successful": 2,
"total": 2
},
"_type": "tweet",
"_version": 1,
"created": true
}
$ curl $CLIENTIP:9200/_cluster/health?pretty
{
"cluster_name" : "myesdb",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 6,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
</pre>
<p>The data is there and the cluster is green. Now let's kill all of the masters and recreate them (their data disks will be lost):
<pre class="code prettyprint">
$ kubectl delete -f es-master-rc.yaml
replicationcontroller "es-master" deleted
$ kubectl create -f es-master-rc.yaml
replicationcontroller "es-master" created
</pre>
<p>After a few dozen seconds, the new masters will be up again and we'll see the following in the leader's log:
<pre class="code prettyprint">
[2016-04-04 15:57:44,880][INFO ][cluster.service ] [Elaine Grey] new_master {Elaine Grey}{5NIL5jBbTYadefGzjDLb5A}{10.224.1.7}{10.224.1.7:9300}{data=false, master=true}, added {{Lyja}{CUyGl7w-R86qcOsNSj0xPA}{10.224.1.4}{10.224.1.4:9300}{master=false},{Typhoid Mary}{cWmlEtHuSdCImjHMNM6FsA}{10.224.1.3}{10.224.1.3:9300}{master=false},{Slug}{fQPe2C1FSH2UkuveBFJtbw}{10.224.1.5}{10.224.1.5:9300}{data=false, master=false},}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2016-04-04 15:57:45,056][INFO ][node ] [Elaine Grey] started
[2016-04-04 15:57:45,058][INFO ][cluster.service ] [Elaine Grey] added {{Stacy X}{U4sn1pkWRlGV-zVMW1OeAA}{10.224.1.8}{10.224.1.8:9300}{data=false, master=true},}, reason: zen-disco-join(pending joins after accumulation stop [election closed])
[2016-04-04 15:57:45,711][INFO ][gateway ] [Elaine Grey] recovered [0] indices into cluster_state
[2016-04-04 15:57:45,712][INFO ][cluster.service ] [Elaine Grey] added {{Kiss}{MaNkKlQWR82QHFVhz38Ohg}{10.224.1.6}{10.224.1.6:9300}{data=false, master=true},}, reason: zen-disco-join(join from node[{Kiss}{MaNkKlQWR82QHFVhz38Ohg}{10.224.1.6}{10.224.1.6:9300}{data=false, master=true}])
[2016-04-04 15:57:46,077][INFO ][gateway ] [Elaine Grey] auto importing dangled indices [tweets/OPEN] from [{Lyja}{CUyGl7w-R86qcOsNSj0xPA}{10.224.1.4}{10.224.1.4:9300}{master=false}]
[2016-04-04 15:57:47,073][INFO ][cluster.routing.allocation] [Elaine Grey] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[tweets][3]] ...]).
[2016-04-04 15:57:47,567][INFO ][cluster.routing.allocation] [Elaine Grey] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[tweets][3]] ...]).
[2016-04-04 15:58:16,355][INFO ][io.fabric8.elasticsearch.discovery.kubernetes.KubernetesDiscovery] [Elaine Grey] updating discovery.zen.minimum_master_nodes from [-1] to [2]
</pre>
<p>So we see that:
<ul>
<li> The new master recovered 0 indices from cluster state - i.e. the cluster state was indeed lost.
<li> The new master auto imported existing indices, which were "dangling" in ES terminology.
<li> It also restored our transient quorum setting.
</ul>
<p>And our cluster is green:
<pre class="code prettyprint">
$ curl $CLIENTIP:9200/_cluster/health?pretty
{
"cluster_name" : "myesdb",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 6,
"number_of_data_nodes" : 2,
"active_primary_shards" : 5,
"active_shards" : 10,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
</pre>
<p>Now let's kill the masters without destroying their disks and see if it behaves any differently:
<pre class="code prettyprint">
$ for pod in $(kubectl get pods |grep es-master |awk '{print $1}'); do kubectl exec $pod killall java & done
</pre>
The master log shows the following:
<pre class="code prettyprint">
[2016-04-04 16:33:55,695][INFO ][cluster.service ] [Neptune] new_master {Neptune}{51QU1wf4T2Ky9QUzK5MkEw}{10.224.1.7}{10.224.1.7:9300}{data=false, master=true}, added {{Slug}{fQPe2C1FSH2UkuveBFJtbw}{10.224.1.5}{10.224.1.5:9300}{data=false, master=false},{Typhoid Mary}{cWmlEtHuSdCImjHMNM6FsA}{10.224.1.3}{10.224.1.3:9300}{master=false},{Lyja}{CUyGl7w-R86qcOsNSj0xPA}{10.224.1.4}{10.224.1.4:9300}{master=false},{Comet Man}{cOLRxsOKTC2OYR6Kuiplxw}{10.224.1.6}{10.224.1.6:9300}{data=false, master=true},}, reason: zen-disco-join(elected_as_master, [1] joins received)
[2016-04-04 16:33:55,867][INFO ][node ] [Neptune] started
[2016-04-04 16:33:56,428][INFO ][gateway ] [Neptune] recovered [1] indices into cluster_state
[2016-04-04 16:33:57,233][INFO ][cluster.routing.allocation] [Neptune] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[tweets][0], [tweets][0]] ...]).
[2016-04-04 16:33:57,745][INFO ][cluster.routing.allocation] [Neptune] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[tweets][4]] ...]).
[2016-04-04 16:34:01,417][INFO ][cluster.service ] [Neptune] added {{Nicole St. Croix}{5VcQT4H8RHeAf3R0PW3K4A}{10.224.1.8}{10.224.1.8:9300}{data=false, master=true},}, reason: zen-disco-join(join from node[{Nicole St. Croix}{5VcQT4H8RHeAf3R0PW3K4A}{10.224.1.8}{10.224.1.8:9300}{data=false, master=true}])
[2016-04-04 17:01:55,811][INFO ][io.fabric8.elasticsearch.discovery.kubernetes.KubernetesDiscovery] [Neptune] updating discovery.zen.minimum_master_nodes from [-1] to [2]
</pre>
<p>So 1 index was recovered from the cluster state and there are no dangling indices this time.
<h2>Conclusions</h2>
While the master nodes were able to recover both the indices and the cluster transient settings, I was testing only the simplest scenario. This is not enough to make a decision on whether we should maintain persistent disks for master nodes. On the other hand, if we do have persistent disks for the masters - do we need to back up the metadata?
And what about ES 5.x? One of the promised features is that the master will hold an ID of the latest index change, to prevent stale data nodes from becoming primaries during network partitioning. This kind of metadata cannot be stored on data nodes.
<p>I'll update this post when I have the answers.
<script>prettyPrint()</script>Zaar Haihttp://www.blogger.com/profile/03019277728569320787noreply@blogger.com0tag:blogger.com,1999:blog-528513339167803855.post-44758146073066676552016-03-25T11:20:00.001+11:002016-03-25T11:20:55.795+11:00Caveat with ElasticSearch nGram tokenizer<em>Finally got some time to blog about ElasticSearch. I've been using it extensively for the last two years, but my findings are rather lengthy. Finally I've got something small to share.</em>
<p>The ElasticSearch <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html">nGram tokenizer</a> is very useful for efficient substring matching (at the cost of index size, of course). For example, I have an event message field like this
<pre class="code prettyprint">
/dev/sda1 has failed due to ...
</pre>
and I would like to find all failure events for all SCSI disks. One option is to store the message field as a not-analyzed string (i.e. one single term) and use a <code>wildcard</code> query:
<pre class="code prettyprint">
GET /events
{
"query": {
"wildcard": {
"message.raw": {
"value": "/dev/sd?? has failed*"
}
}
}
}
</pre>
This will do the job perfectly, but to complete it, ElasticSearch will scan every value of the message field looking for the pattern at search time. Once the number of documents gets big enough, this will become slow.
<p>One solution is to split the message into substrings at indexing time, with (2, 20) for (min, max) gram lengths in our example:
<pre class="code prettyprint">
# Analyzer definition in settings
"analysis": {
"analyzer": {
"substrings": {
"tokenizer": "standard",
"filter": ["lowercase", "thengram"]
}
},
"filter": {
"thengram": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20
}
}
}
# message field definition in mappings
"message": {
"type": "string",
"index": "analyzed",
"analyzer": "substrings"
}
</pre>
and use <code>match_phrase</code> query:
<pre class="code prettyprint">
GET events/_search
{
"query": {
"match": {
"message": {
"query": "dev sd has failed",
"type": "phrase",
}
}
}
}
</pre>
<h2>The caveat</h2>
The above query will return weirdly irrelevant results and, at first glance, it's not obvious why. The caveat is that our custom analyzer is applied both during indexing <em>and</em> search. So instead of searching for the sequence of terms "dev", "sd", "has", "failed", we are searching for the sequence "de", "ev", "dev", "sd", "ha", "as", "has", etc. To fix this we need to tell Elastic to use a different analyzer during search (and search only). This can be done either by adding <code>"analyzer": "standard"</code> to the query itself (which is error prone, since it can easily be forgotten) or by specifying it in the mapping definition:
<pre class="code prettyprint">
"message": {
"type": "string",
"index": "analyzed",
"analyzer": "substrings",
"search_analyzer": "standard"
}
</pre>
<h2>Worth it?</h2>
I took a sample of 1,000,000 events and ran both <code>wildcard</code> and <code>phrase</code> queries that match a 1,000-doc subset of it. While both are fast on such a small data set, the difference is quite striking nevertheless:
<ul>
<li><code>wildcard</code> query - 30ms
<li> <code>phrase</code> query - 5ms
</ul>
<p>A 6x speed-up! Another bonus of using the <code>phrase</code> query is that you can get results highlighting (which is not supported for <code>wildcard</code> queries).
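<p>By the way, no external timer is needed for such comparisons - every search response carries a <code>took</code> field with the server-side execution time in milliseconds, e.g.:
<pre class="code prettyprint">
$ curl -s 'localhost:9200/events/_search?pretty' -d '
{"query": {"match": {"message": {"query": "dev sd has failed", "type": "phrase"}}}}' | grep '"took"'
  "took" : 5,
</pre>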
<script>prettyPrint()</script>Zaar Haihttp://www.blogger.com/profile/03019277728569320787noreply@blogger.com0tag:blogger.com,1999:blog-528513339167803855.post-7495214659588138312016-02-25T09:32:00.000+11:002016-02-25T09:32:11.833+11:00Patching binariesMy friend asked for help - he has a legacy system that he wants to migrate to new hardware. His Linux OS is 10 years old and it is becoming more and more challenging to find hardware to run it on. Long story short, I was asked to make his ten-year-old binaries run on a modern Ubuntu.
<p>Fortunately Linux has very impressive <a href="http://unix.stackexchange.com/questions/47495/oldest-binary-working-on-linux">ABI compatibility</a>, so my job was down to arranging executables and their dependent libraries. Well, almost.
<p>There are three ways of telling an executable (or actually the <code>ld.so</code> interpreter) where to search for its libraries:
<ul>
<li>Setting <em>rpath</em> on the executable itself.
<li>Setting <code>LD_LIBRARY_PATH</code> environment variable.
<li>Changing the system-wide configuration for <code>ld.so</code> to look into additional directories.
</ul>
<p>The binaries were setuid, and thus <code>LD_LIBRARY_PATH</code> was ruled out.
<p>Next, I tried to overcome it by putting the libraries in <code>/opt/old-stuff/lib</code> and adding that directory to <code>/etc/ld.so.conf.d/z-old-stuff.conf</code>. This gave me some progress, but I hit the wall with naming collisions - my old binary was relying on an older <code>libreadline</code> and I had two <code>libreadline.so.5</code> libs - one in <code>/lib</code> and one in <code>/opt/old-stuff/lib</code>. The latter was obviously further down the search path, since otherwise it would break practically every command-line tool in the system.
<p>So I needed to make my binary use its own specific version of <code>libreadline</code> while leaving everything else using the default one. The only way to go was rpath. Fortunately there is a nifty utility out there called <a href="https://nixos.org/patchelf.html"><code>patchelf</code></a>:
<pre class="code prettyprint">
patchelf --set-rpath /opt/old-stuff/lib /opt/old-stuff/bin/foo
</pre>
That almost did the trick. The caveat was that <code>foo</code> was using another library, and only that library itself utilized <code>libreadline</code>. So the solution was to set rpath on all the libraries as well:
<pre class="code prettyprint">
for file in /opt/old-stuff/lib/*; do
patchelf --set-rpath /opt/old-stuff/lib "$file"
done
</pre>
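<p>A quick way to verify the result is to ask the dynamic linker itself - with rpath in place, <code>ldd</code> should resolve the colliding library from the private directory (the output below is illustrative):
<pre class="code prettyprint">
$ ldd /opt/old-stuff/bin/foo | grep readline
        libreadline.so.5 => /opt/old-stuff/lib/libreadline.so.5 (0x00007f...)
</pre>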
Overall that was quite a shift from my current daily programming routine. I haven't had to think about linkers for quite a while, and it was fun to get a taste of this stuff again.
<script>prettyPrint()</script>Zaar Haihttp://www.blogger.com/profile/03019277728569320787noreply@blogger.com0