Tuesday, November 28, 2023

Practical DRY vs code readability

About 13 years ago, the Django docs formally introduced me to the DRY principle - Don't Repeat Yourself, as in "Don't pollute your project with almost identical pieces of code, but rather refactor them into a shared, reusable code base".

It sounds easy and is indeed something people have been doing since at least the Multics days. However, trying to enforce this principle on myself was quite a struggle. Whenever I spotted repetition emerging while writing code, I felt compelled to refactor it immediately. Needless to say, attempting to build shared functionality for something still in heavy development is tiresome. The context switch between writing what you want to create and refactoring as you go wore me out.

It took me some time to come to terms with the perfectionist in myself and accept that I do not have to DRY everything - it's a guideline, not a law.

Nowadays, when prototyping, I shamelessly repeat myself. Once I get the core functionality working, I review it and apply the following principles of my own:

  1. Readability trumps DRY
  2. Repeating something twice is totally fine
  3. If repeating logic appears in three or more places - I'll refactor it out IF it doesn't sacrifice readability (see the sketch below)
  4. If I can't refactor without sacrificing readability - I probably have to redesign my code altogether
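
To make rules 2 and 3 concrete, here is a small illustrative sketch (the commands and bucket names are made up for the example):


# Two occurrences: leave the repetition alone (rule 2)
gsutil -m rsync -r ./site gs://prod-bucket
gsutil -m rsync -r ./site gs://staging-bucket

# Third occurrence: extract a helper, but only because it stays readable (rule 3)
deploy_site() {
    local bucket="$1"
    gsutil -m rsync -r ./site "gs://${bucket}"
}
deploy_site prod-bucket
deploy_site staging-bucket
deploy_site qa-bucket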

Kick-start your new project with terraformed GCP-GitHub Actions key-less auth

In the past, to gain access to our GCP environment inside GitHub Actions, we used GitHub secrets to store GCP service account keys. It worked, but to me it always felt like walking a thin line. Thankfully, GitHub now supports OIDC tokens, and we can set up GCP Workload Identity Federation to grant our GitHub Actions key-less access to our GCP environment.

There are plenty of guides out there on how to do it, but it takes some effort to follow them, particularly if you want to terraform everything - that adds the extra work of bootstrapping the Terraform configuration itself (using local state to create the remote state storage, uploading the state, switching to impersonation, etc.). Hence, after repeating this a couple of times, I decided to create a repository template to save time for me and hopefully for you as well.

Here it is: https://github.com/zarmory/gcp-github-federation-terraform

What do you get?

After cloning and configuring this repo, with a couple of commands, you'll get the following:
  • Terraform state bucket created
  • Terraform service account created and permissions assigned
  • GitHub OIDC federation set up
  • Sample GitHub Actions workflows to validate and apply your configuration

All in all, it's just ~100 lines of Terraform code, including comments. Basically, just clone, configure and start building.
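
For a sense of what those resources look like, here is roughly the by-hand gcloud equivalent of the federation part (a hedged sketch - the pool, provider, repository and service account names are placeholders; the repo's Terraform is the source of truth):


# Create a workload identity pool and an OIDC provider trusting GitHub's token issuer
gcloud iam workload-identity-pools create github-pool --location=global

gcloud iam workload-identity-pools providers create-oidc github-provider \
    --location=global --workload-identity-pool=github-pool \
    --issuer-uri="https://token.actions.githubusercontent.com" \
    --attribute-mapping="google.subject=assertion.sub,attribute.repository=assertion.repository" \
    --attribute-condition="assertion.repository == 'MY_ORG/MY_REPO'"

# Allow workflows from that repository to impersonate the Terraform service account
gcloud iam service-accounts add-iam-policy-binding terraform@MY_PROJECT.iam.gserviceaccount.com \
    --role="roles/iam.workloadIdentityUser" \
    --member="principalSet://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/github-pool/attribute.repository/MY_ORG/MY_REPO"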

All of the code is meant to serve as a working example to encourage you to hack and modify it (rather than a highly abstracted reusable module of sorts).

This is merely an announcement post - if interested, please continue to the repo README for further details.

Thursday, November 16, 2023

Getting GNU Make cloud-ready

The title looks cheeky, but so is the issue.

After the release of GNU Make 4.2, the team went quiet for four years; however, since the COVID times they have been releasing a new minor version about once a year. Surprisingly, the upgrade to GNU Make 4.4 caused issues with the Google Cloud SDK - the gcloud command.

When working on internal projects, I like to have the CLOUDSDK_CORE_PROJECT environment variable populated, but I don't want to preset it to a fixed value, because every person on the team has their own playground project, which I want the tooling to use as the deployment target. So I came up with the following Makefile:


CLOUDSDK_CORE_PROJECT ?= $(shell gcloud config get-value project)
export CLOUDSDK_CORE_PROJECT

release:
	@echo Deploying to project $(CLOUDSDK_CORE_PROJECT)

This way my toolchain will pick up the user's default project, which usually points to their playground. And if someone wants things done differently, they can set CLOUDSDK_CORE_PROJECT explicitly, e.g. through .envrc - nice and simple.
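
For example, an override in .envrc (for direnv users) is just a single line - the project ID below is, of course, a placeholder:


export CLOUDSDK_CORE_PROJECT=my-playground-project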

This worked very well for years until I upgraded my system and started hitting the following cryptic errors when running make:


$ make
ERROR: (gcloud.config.get-value) The project property is set to the empty string, which is invalid.
To set your project, run:

  $ gcloud config set project PROJECT_ID

or to unset it, run:

  $ gcloud config unset project
ERROR: (gcloud.config.get-value) The project property is set to the empty string, which is invalid.
To set your project, run:

  $ gcloud config set project PROJECT_ID

or to unset it, run:

  $ gcloud config unset project
Deploying to project

After quite a bit of reading and bisecting upgraded packages (which is relatively easy with NixOS) I found that Make 4.4.x is the culprit. Reading through the release notes, I was surprised to find a long list of backward-incompatibility warnings - quite astonishing for a tool as mature and feature-complete as GNU Make. Among them, the following paragraph caught my attention:

Previously makefile variables marked as export were not exported to commands started by the $(shell ...) function. Now, all exported variables are exported to $(shell ...). If this leads to recursion during expansion, then for backward-compatibility the value from the original environment is used. To detect this change search for 'shell-export' in the .FEATURES variable.

Bingo! After that I could quickly reproduce the issue:


$ CLOUDSDK_CORE_PROJECT= gcloud config get-value project
ERROR: (gcloud.config.get-value) The project property is set to the empty string, which is invalid.
To set your project, run:

  $ gcloud config set project PROJECT_ID

or to unset it, run:

  $ gcloud config unset project
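
As the release notes suggest, you can also confirm that your Make has the new behaviour by looking for 'shell-export' in the .FEATURES variable. A quick throwaway check (a sketch):


$ cat > /tmp/features.mk <<'EOF'
$(info $(filter shell-export,$(.FEATURES)))
all: ;
EOF
$ make -s -f /tmp/features.mk
shell-export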

So what happens is:

  • CLOUDSDK_CORE_PROJECT is not set, so Make evaluates the default value - the $(shell gcloud ...) call
  • Since this variable is export'ed, Make now makes it available to that subshell with the empty string as its value, which breaks gcloud

The fix is simple, though hacky:


CLOUDSDK_CORE_PROJECT ?= $(shell unset CLOUDSDK_CORE_PROJECT; gcloud config get-value project)
export CLOUDSDK_CORE_PROJECT

release:
	@echo Deploying to project $(CLOUDSDK_CORE_PROJECT)

That is, if the variable is already set, the default is never evaluated. But if it's not, we clear the empty variable from the subshell environment before calling gcloud, thus preventing things from breaking.

Trivial in the end, such small issues can easily eat up several hours of one's time, hence I'm sharing this hopefully useful nugget of knowledge on how to make your Makefile cloud-ready :)

Friday, November 11, 2022

Pointing Mozilla SOPS in the right direction

Mozilla SOPS is a neat way to manage your secrets in git. I've been using it a lot over the last few years in various projects, and so far I'm very happy with it.

Today I stumbled upon a problem where sops refused to decrypt my file:


Failed to get the data key required to decrypt the SOPS file.

Group 0: FAILED
  projects/foo/locations/global/keyRings/foo-keyring/cryptoKeys/foo-global-key: FAILED
    - | Error decrypting key: googleapi: Error 403: Cloud Key
      | Management Service (KMS) API has not been used in project
      | 123xxxxxx before or it is disabled. Enable it by visiting
      | https://console.developers.google.com/apis/api/cloudkms.googleapis.com/overview?project=123xxxxxxx
      | then retry. If you enabled this API recently, wait a few
      | minutes for the action to propagate to our systems and
      | retry.
      | Details:
      | [
      |   {
      |     "@type": "type.googleapis.com/google.rpc.Help",
      |     "links": [
      |       {
      |         "description": "Google developers console API
      | activation",
      |         "url":
      | "https://console.developers.google.com/apis/api/cloudkms.googleapis.com/overview?project=123xxxxxxx"
      |       }
      |     ]
      |   },
      |   {
      |     "@type": "type.googleapis.com/google.rpc.ErrorInfo",
      |     "domain": "googleapis.com",
      |     "metadata": {
      |       "consumer": "projects/123xxxxxxx",
      |       "service": "cloudkms.googleapis.com"
      |     },
      |     "reason": "SERVICE_DISABLED"
      |   }
      | ]
      | , accessNotConfigured

Recovery failed because no master key was able to decrypt the file. In
order for SOPS to recover the file, at least one key has to be successful,
but none were.

That is, sops complained that the Google KMS service I use to encrypt/decrypt the keys behind the scenes is disabled in my project. Which didn't make sense - after all, I created the KMS keys in that project, so the service must be enabled. I inspected the project id 123xxxxxxx the error was referring to and was surprised to find out that it belongs to project bar and not project foo I was working on (and where the KMS keys are stored).

After checking the environment variables and the KMS key location recorded in the encrypted file, I had no option left but to strace the sops binary to find out what causes sops to go for project bar instead of foo. And bingo - it reads the ~/.config/gcloud/application_default_credentials.json file, which has a quota_project_id parameter pointing straight at bar.
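
Incidentally, once you know about that file, the check is easy without strace (a quick sketch; the output is whatever quota project is currently set, if any):


$ jq -r '.quota_project_id' ~/.config/gcloud/application_default_credentials.json
bar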

One easy fix is to run gcloud auth application-default set-quota-project foo. It basically tells the Google SDK to use foo as the billing project when calling the KMS service (the KMS API distinguishes between the calling project and the resource-owning project, as explained here). It works, but it's a fragile solution - if you are working on several projects in parallel you need to remember to switch back and forth to the correct project, since these particular application-default settings cannot be controlled via environment variables.

What if there was a way to simply tell sops (and others) to use the project owning the resource (the KMS key in my case) as the billing project as well? Apparently there is:


gcloud auth application-default login --disable-quota-project
...
Credentials saved to file: [~/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).
WARNING: 
Quota project is disabled. You might receive a "quota exceeded" or "API not enabled" error. Run $ gcloud auth application-default set-quota-project to add a quota project.

And voilà - it works!

Friday, January 21, 2022

How to STOP deletion of a Cloud Storage bucket in GCP Cloud Console

What if you need to delete a Cloud Storage bucket with lots of objects (tens of thousands or more)? As per GCP docs your options are:

  • Cloud Console
  • Object Lifecycle policies

Cloud Console is very handy in this scenario - a couple of clicks and you are done. However, using it to delete a bucket is akin to hiring an anonymous hitman - once you kick off the job there is no way to stop it. But what if it was the wrong bucket? Sitting and watching your precious data melt away is a bitter experience.

As you can see, the Console UI has no "OMG! Wrong bucket! Stop It Please!" button. However, there is a hack to abort a running deletion job (otherwise I wouldn't be writing this post, right? :)

To abort the deletion job, all you need to do is call your fellow cloud admin and ask them to deprive you, temporarily (or not?), of write permissions to the bucket in question. Once your user is no longer able to delete objects from that bucket, the deletion job will fail.

Cloud Console performs deletions on behalf of your user, so once your permissions have been snipped it aborts the deletion job. I only wish there was a "Cancel" button in the Console UI to save us from this inconvenient hack.
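
If you happen to be the admin yourself, the "snipping" can be as simple as removing your own role binding on the bucket - an illustrative sketch (the user, role and bucket names are placeholders; your access may be granted differently, e.g. at the project level):


# Drop my own objectAdmin binding on the bucket to make the deletion job fail
$ gsutil iam ch -d user:me@example.com:objectAdmin gs://my-precious-bucket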

Of course, the data deleted before the abort is already gone (15,000 objects in my case), and the only way to restore it, unless you have backups, is to have had Object Versioning set up in advance.

Tuesday, December 28, 2021

Google Chrome with taskbar entry but no actual window - how to fix

I recently got a new laptop - a Lenovo X1 Yoga Gen 6 - and thought it was a now-or-never opportunity to dive head-first into NixOS, something I had been cherishing for a long time. Anyhow, I have displays with different DPIs and hence need to run Wayland. Things are still a bit shaky with Wayland, at least on NixOS with KDE, but they are getting better every week - that's why I'm running on the unstable channel.

Every now and then after an upgrade it happens that Chromium (and Chrome) open up but don't show a window. There is a taskbar entry, they respond to right-click, show recent docs in the right-click pop-up, etc., but no matter what I do, there is no window shown, which makes the browser unusable, of course.

How to fix it?

TL;DR:


cd ~/.config/chromium/Default
# sponge (from moreutils) buffers stdin fully before overwriting the file
cat Preferences | jq 'del(.browser.window_placement, .browser.app_window_placement)' | sponge Preferences

How did I figure it out? I'm no Chrome dev, so I did it the CLI way:

  1. Copied my profile ~/.config/chromium aside, removed the original and checked that Chromium starts fine - i.e. it's a configuration issue
  2. Used binary search to determine which files in the profile cause the issue - namely, each time I ran rsync -av ~/.config/chromium{.old,}/Default/, removed some files, and checked whether it helped. Eventually I figured out that the Preferences file is the offender
  3. Now all that was left was to compare the original and the newly generated Preferences files. It's a single-line JSON file, so I had to format it with the jq tool first (see the sketch right after this list). Looking at the (huge) diff, I was lucky to notice that the .browser.window_placement configuration differed; after copying Preferences from my original backup and dropping this attribute, my Chromium came back to life. Since I use Chromium web apps, I had to reset .browser.app_window_placement as well
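
The comparison step looked roughly like this (a sketch - the paths assume the backup lives at ~/.config/chromium.old):


jq . ~/.config/chromium.old/Default/Preferences > /tmp/prefs-old.json
jq . ~/.config/chromium/Default/Preferences > /tmp/prefs-new.json
diff -u /tmp/prefs-old.json /tmp/prefs-new.json | less
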
A bit of patience, a bit of luck, and here we are! The same cure worked for Google Chrome. Hope this helps someone who stumbles upon a similar issue. Of course, if you have several Chrome/Chromium profiles, you need to patch their Preferences too.

Update Jan 2022: Apparently the above hack works only partially and the issue kept triggering until Chromium 97 landed in NixOS - it has never happened since.

Friday, November 22, 2019

So you wanna host docs?

It's been long overdue to have an internal web server with an always up-to-date copy of our project documentation. We use Sphinx, and GitHub does render RST to a certain extent, so I kept delaying this task until now, but it's about time to tie up loose ends - so I embarked on a journey to find a way to host internal documentation with minimal maintenance effort and cost.

With today's SaaS and cloud technologies it should be easy, right? Just watch my private repo on GitHub for changes in the docs directory, pull it, run Sphinx to build static HTML pages and then upload them somewhere - not too complicated, is it? Let's see how the story unfolded.

ReadTheDocs.com

That would probably have been the kinder choice - supporting free software development. However, I wanted my own theme and my own domain, and that meant going for the Advanced plan at $150/month, which is quite expensive for our small team, where we write docs more often than we read them :) Still, it would have worked well with minimal setup effort, and they have a no-frills 30-day trial.

Ruling out the readthedocs.com SaaS, I decided to set up a docs hosting pipeline myself using the GCP toolbox - the cloud we use the most.

A convoluted trail through GCP

It was time to catch up on recent GCP products I had not yet had a chance to try. For building the docs, Google Cloud Build sounded great - and it is indeed. They even have a dedicated app on the GitHub Marketplace so that builds can be triggered from pull requests and the build status reflected on GitHub. That part worked pretty much out of the box. For hosting, I decided to upload the docs to Google Cloud Storage and figure out later how to serve them privately. After some tinkering I ended up with the following cloudbuild.yaml:


steps:
  # Prepare builder image - speeds up the future builds
  # It just makes sure that docbuilder:current image is present on local machine.
  # I have a dedicated docbuilder for each version of docs/requirements.txt.
  # I could've just started each time from python docker and pip install requirements.txt
  # but lxml takes several minutes to build, which is waste...
  # This optimization brings build time from 7 minutes to 1!
  - name: alpine:3.10
    entrypoint: sh
    args:
      - "-c"
      - |
        set -ex
        apk add --no-cache docker
        cd docs
        IMAGE=gcr.io/$PROJECT_ID/doc-builder:$(sha1sum < requirements.txt | cut -f 1 -d ' ')
        if ! docker pull $$IMAGE; then
            docker build -t $$IMAGE .
            docker push $$IMAGE
        fi
        docker tag $$IMAGE docbuilder:current
  # Build the docs - reuse the image we just have built
  - name: docbuilder:current
    entrypoint: sh
    args:
      - "-c"
      - |
        cd docs
        make html
  # Publish docs
  # We can't use Cloud Build artifacts since they do not support recursive upload
  # https://stackoverflow.com/questions/52828977/can-google-cloud-build-recurse-through-directories-of-artifacts
  - name: gcr.io/cloud-builders/gsutil
    args: ["-m", "cp", "-r", "docs/_build/html/*", "gs://my-docs/$BRANCH_NAME/"]

This approach looked very promising: for example, I could build docs for different branches and access them simply via URL path suffixes. It also highlights one of Cloud Build's strengths - you can just run shell scripts of your choice to do anything you like. Finally, Cloud Build gives you 120 free build minutes a day, meaning I could build my docs every 15 minutes (96 roughly one-minute builds) without it costing me a penny.

Unfortunately, I hit a hosting dead end pretty quickly. I wanted to use GCP Identity-Aware Proxy (IAP) to guard access, and it does not work with Cloud Storage yet, though it seemed quite natural (to me) to expect that it should. I explored the idea of running a container that would mount the Cloud Storage bucket and serve it behind IAP, but if I end up hosting a container anyway, I'm better off just building my docs into a static file server. I would have to give up the ability to host docs from multiple branches side by side, but running a container in privileged mode with pre- and post-hooks to mount GCS through FUSE didn't sound very clean and would have deprived me of using managed Cloud Run (more on that below). I briefly explored the Cloud Filestore (not Firestore) path, but its minimum volume size is 1TB, which is about $200/month - such a waste.

So it looks like I need to build my docs into a static-server container - why not try hosting it on Cloud Run? With the amount of traffic our docs get, it would only cost us... nothing, since we'd stay well within the free tier. However, the lack of IAP support hit me again. Cloud Run supports Google Sign-In, meaning it can validate your bearer tokens, but there is still no authentication proxy support. Hopefully they will implement one soon, since it's highly anticipated - by me at least.
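
For completeness, token-based access to a private Cloud Run service looks roughly like this (a sketch - the URL is a placeholder), which is fine for scripts but not for teammates clicking links in a browser:


$ curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
    https://docs-server-xxxxxxxx-uc.a.run.app/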

At that point I went back to the IAP docs to reassess my options: App Engine, GCE, or GKE. I decided on GKE, since I had a utility GKE cluster with some spare capacity I could leech off. I ruled out App Engine - no one on my team, including myself, had any experience with it, and with the GKE option readily available I saw no reason to start acquiring any.

From this point on it went pretty straightforward. I created the following Dockerfile:


FROM python:3.7.5-alpine3.10 AS builder

RUN apk add --no-cache build-base libxml2-dev libxslt-dev graphviz
WORKDIR /build
COPY requirements.txt ./
RUN pip install --upgrade -r ./requirements.txt
COPY . ./
RUN make html

########################

FROM nginx:1.17.5-alpine AS runner

COPY --from=builder /build/_build/html /usr/share/nginx/html

And used the following build config:


steps:
  - name: gcr.io/cloud-builders/docker
    args: [
      "build", "-t", "gcr.io/$PROJECT_ID/docs-server:$BRANCH_NAME-$COMMIT_SHA", "docs",
    ]
  - name: gcr.io/cloud-builders/docker
    args: [
      "push", "gcr.io/$PROJECT_ID/docs-server:$BRANCH_NAME-$COMMIT_SHA",
    ]
  - name: gcr.io/cloud-builders/gke-deploy
    args:
      - run
      - --filename=docs/deploy  # You can pass a directory here, but you'll need to read gke-deploy code to find it out
      - --image=gcr.io/$PROJECT_ID/docs-server:$BRANCH_NAME-$COMMIT_SHA
      - --location=us-central1-a
      - --cluster=my-cluster

Unfortunately, my build time went back up to 7 minutes, which I tried to mitigate by using Kaniko, but I hit a show-stopper bug where it does not recognize changes in files copied between stages. Hopefully they fix it soon. Either that, or GCS will support IAP :). For reference, the relevant Cloud Build step with Kaniko would have looked like this (instead of the docker build/push steps above):


 - name: gcr.io/kaniko-project/executor:latest
   args:
     - --cache=true
     - --cache-ttl=336h  # 2 weeks
     - --context=/workspace/docs
     - --dockerfile=/workspace/docs/Dockerfile
     - --destination=gcr.io/$PROJECT_ID/docs-server:$BRANCH_NAME-$COMMIT_SHA

My docs/deploy dir contained a single K8s deployment.yaml file that creates the Deployment object. gke-deploy can generate one by default, but it also creates a horizontal pod autoscaler, which was really overkill for my task. So here is my deployment:


apiVersion: apps/v1
kind: Deployment
metadata:
  name: docs-server
  labels:
    app: docs-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: docs-server
  template:
    metadata:
      labels:
        app: docs-server
    spec:
      containers:
      - name: nginx
        image: gcr.io/my-project/docs-server:latest  # Will be overridden by gke-deploy
        ports:
        - containerPort: 80

At this point I had a pipeline that builds my docs into a static server and deploys it as a pod into one of my GKE clusters. The only thing left was to expose it to my team, securely, through IAP. This is where GKE comes in handy - you can request a load balancer with an SSL certificate and IAP directly through K8s manifests! Just follow the guides: 1, 2, and 3.
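
The IAP part boils down to attaching a BackendConfig to the Service behind the Ingress. A rough sketch of the idea (the names and the OAuth secret are placeholders, and the exact apiVersion depends on your GKE version - check the guides above for the details):


# Create the OAuth client secret IAP needs, then enable IAP via a BackendConfig
kubectl create secret generic docs-oauth \
    --from-literal=client_id=CLIENT_ID \
    --from-literal=client_secret=CLIENT_SECRET

kubectl apply -f - <<'EOF'
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: docs-iap
spec:
  iap:
    enabled: true
    oauthclientCredentials:
      secretName: docs-oauth
EOF

# Point the Service that the Ingress uses at this BackendConfig
kubectl annotate service docs-server cloud.google.com/backend-config='{"default": "docs-iap"}'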

And here we are - I now have my private docs on a custom domain, secured behind IAP, to share with my GCP teammates. All in all, even if I ran it on a dedicated GKE cluster with a single f1-micro instance, it would cost me less than $20 per month, meaning that even after factoring in the cost of my time to set it up, the price difference between host-your-own and the ReadTheDocs Advanced plan would pay it off in less than 2 years :)