Reading the news, sometimes it feels like the key challenge facing startup CTOs
is whether they pursue their service orchestration strategy or their autonomous
vehicle strategy first. That said, I do believe that any technology company with
more than a handful of engineers should indeed have an orchestration strategy,
and to that end I’ve been spending time lately getting more familiar with Kubernetes
and Google Container Engine (GKE)
The specific question I had in mind was whether I, personally, could get the level
of automation, integration and flexibility I get at work–where I’m benefitting from
significant investment into AWS AutosScaling Groups tooling, Jenkins integrations and
an entire ecosystem of development tooling–and get it for “almost free”, without
creating too much maintenance overhead for me to maintain.
To answer the question, I ported this blog to Kubernetes and GKE this weekend.
At this point, its entirely served from GKE, and the AWS infrastructure has been spun down.
(I’ll migrate systemsandpapers.com in the next week or
Create a GKE hosted Kubernetes cluster, using three n1-standard-1 instances
(each has 1.8 GB RAM).
Write a Dockerfile, initially a transliteration of my existing Packer
configuration which ran Nginx and the Go binary in the same VM.
(Previously I had been baking AMIs.)
Sync my blog’s shoddy codebase, named daedalus, from Github to Google Source Repositories,
on the reasonable but unvalidated assumption that it’ll reduce build latencies to have it locally.
Create a build trigger in Container Builder to automatically
build Daedalus' Dockerfile each time I push to the master branch, and upload the resulting
image into Container Registry, where it can be
used by GKE. (I had a short detour into tagging container images with branch instead of tag,
and it became clear that Kubernetes does not care for that approach, plus it breaks rollbacks.)
Get this approach working fully, including creating an ingress
resource with an HTTP(S) Google Load Balancer routing traffic to daedalus.
It was at this point that things detoured slightly, as GLBs don’t provide automatic SSL certificate
management capabilities similar to AWS ALBs, and it became clear that I would need to manage my own
Let’s Encrypt is the obvious solution, and it seemed like
kube-lego was the most common integration,
so I set that up. It did work, but the specific behavior I want ended up
being harder to model: I want all non SSL traffic to get passed through, and all SSL
traffic to get terminated and passed through. kube-lego wanted specific routing rules
for every domain and protocol (sort of necessary following from its approach to capturing the /.well-known/* path as required for
answering Let’s Encrypt’s domain ownership challenges),
and it claimed the route for unmatched traffic, directing it
to a default Nginx config as opposed to letting me point it to my daedalus nodes.
This didn’t work well given the number of domains I wanted to handle (mostly just redirecting
them to lethain.com, but enough that having a rule per domain was going to exceed my understanding
of GLB’s restrictions), so I was back at the drawing board.
After some thought, I decided the best solution was to split out my Nginx and SSL termination
work into its own deployment, such that I have two: frontend that hosts Nginx and does
SSL termination, and daedalus which only runs the Go binary.
It took a bit of configuration, but I was able to get this working, with a loadbalancer type service
that directed TCP traffic to my frontend deployment, which in turn reverse-proxied to
the daedalus deployment’s ClusterIP. I updated the Nginx configurations to serve the Let’s Encrypt
challenge tokens directly, and was able to pass the challenges.
(I wanted to get a new cert to lethain.com before switching DNS, so I actually ended up hacking
my AWS' instances' Nginx configs to reverse proxy to my GCP load balancer for the challenge files,
although in later readings I believe a 301 redirect would have worked just as well and been
a bit less annoying.)
This was working pretty well, with the caveat that I only had a single frontend pod, a necessary
simplification because certbot was writing local files and if we had multiple pods then some
wouldn’t have the file to respond to the challenge accurately.
After some reading, the simplest solution was to create a cert-refresh deployment, which I run
with a single instance, and then have the frontend pods reverse-proxy /.well-known/* back to
that instance. For the things that really matter, e.g. serving traffic, there is high availability;
cert renewal is technically no longer highly available, but the requirements are that it runs a refresh
script once a day, so it can fulfill its needs even with extraordinarily poor uptime.
This was quite a bit of debugging and hacking around to get working, but in the end it culminated in
six yaml files (three deployments, three services) and three Dockerfiles, such that I could recreate
the entire environment in a couple of minutes (ok, let’s call it thirty) if I had to.
More interestingly, it’s by far the most mature, convenient enviroment I’ve ever had for my
personal infrastructure. The entire pipeline after pushing to Git to having the container uploaded
is automated, and my deployment at this point is simply updating a deployment.yaml to point to the
new container’s tag, and Kubernetes does a incremental rollout of the new version, while respecting healthchecks
and pausing the rollout if healthchecks fail.
I can even rollback via the command line, which is a novel new idea for my
Events are sent to Google PubSub during container builds,
so it’s my guess that I’ll be able to add wiring there to have a complete continuous integration
pipeline with another hour or two of hacking around. I think, excitingly, that’s not just true for my personal applications like daedalus,
but also for the supporting infrastructure such as the frontend nodes. If I make invalid Nginx
configurations, I’ll no longer break my websites! (I’ve worked at real companies that couldn’t
make that same claim!)
Overall, I’ve been extremely impressed, and will move the remainder of my infrastructure over the
next week or two, giving me a slightly more complex environment to manage. Kubernetes was quite fun to
get more comfortable with, and the Google Cloud is mostly excellent, although it still hasn’t shaken
the feel that while its ideas may be simpler than AWS', its execution is still rougher around the
edges (for example, the experience of getting gsutil working is meaningfully more confusing than awscli).