Sketching out failure injection on Kubernetes.
tl;dr - lethain/k8s-fault-injection is a proof of concept for fault injection on Kubernetes.
Netflix’s Chaosmonkey was, for many software engineers, their first introduction to fault injection, and it did a remarkable job of popularizing the concept. As popular as the idea has become, adoption remains fairly low at most companies, especially smaller ones.
In part I believe that is related to the overarching trend of moving towards immutable infrastructure (Docker containers, stateless services, Kubernetes, etc), which greatly narrows the ways in which failures occur (statefulness is the gateway to unrecoverable failure). It’s also because the tools remain underintegrated.
If you’re using Spinnaker, then enabling Chaosmonkey can be as simple as leaving a checkbox checked when you provision a new service, and ideally we’ll get to a place where AWS Auto Scaling groups and the like opt users into this behavior as well!
As an experiment towards that end, I wrote up a simple proof of concept, on GitHub at lethain/k8s-fault-injection, which allows Kubernetes deployments to opt in to having their pods periodically terminated.
This specific implementation is quite poor, but I think it does a reasonable job of exploring what you could do fairly easily to start running a simple fault injection program of your own if you’re running on Kubernetes.
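To make the opt-in idea concrete, here is a minimal sketch of the core loop, not the code from the repository. It assumes deployments opt in through an annotation (the key `fault-injection/enabled` is made up for illustration), and it works on plain dictionaries rather than the real Kubernetes API, so the `deployments`, `pods`, and `terminate` arguments are all hypothetical stand-ins.

```python
import random

# Hypothetical annotation key; the proof of concept may use a different opt-in mechanism.
OPT_IN_ANNOTATION = "fault-injection/enabled"


def opted_in(deployments):
    """Return the deployments whose annotations opt in to fault injection."""
    return [
        d for d in deployments
        if d.get("annotations", {}).get(OPT_IN_ANNOTATION) == "true"
    ]


def pick_victim(deployment, pods, rng=random):
    """Pick one random pod belonging to the deployment, or None if it has none."""
    candidates = [p for p in pods if p["deployment"] == deployment["name"]]
    return rng.choice(candidates) if candidates else None


def run_once(deployments, pods, terminate, rng=random):
    """One pass: terminate a single random pod for each opted-in deployment."""
    terminated = []
    for deployment in opted_in(deployments):
        victim = pick_victim(deployment, pods, rng)
        if victim is not None:
            terminate(victim["name"])  # caller supplies the actual delete call
            terminated.append(victim["name"])
    return terminated
```

In a real cluster, `terminate` would wrap whatever actually deletes the pod, and `run_once` would be invoked on a schedule. Keeping the delete call injectable also makes it easy to do a dry run that only logs which pods would have been terminated.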