Netflix’s Chaosmonkey was, for many software engineers,
their first introduction to fault injection,
and it did a remarkable job of popularizing the concept. As popular as the idea has become, adoption
remains fairly low at most companies, especially smaller ones.
In part, I believe that is related to the overarching trend of moving towards immutable
infrastructure (Docker containers, stateless services, Kubernetes, etc.), which greatly
narrows the ways in which failures occur (statefulness is the gateway to unrecoverable failure).
It’s also because the tools remain underintegrated.
If you’re using Spinnaker, then enabling Chaosmonkey can be
as simple as never unchecking a checkbox
when you provision a new service, and ideally we’ll get to a place where AWS Auto Scaling groups
and similar tools opt users into this behavior as well!
As an experiment towards that end, I wrote up a simple proof of concept,
on GitHub at lethain/k8s-fault-injection,
which allows Kubernetes deployments to opt in to having their pods periodically terminated.
This specific implementation is quite poor, but I think it does a reasonable job of
exploring what you could do fairly easily to start running a simple fault injection
program of your own if you’re running on Kubernetes.
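
To make the opt-in idea concrete, here is a minimal sketch in Python. It is not the code from lethain/k8s-fault-injection. It assumes a hypothetical label, `fault-injection=enabled`, set on a deployment's pod template so that its pods carry it, and it assumes `kubectl` is on the path and already pointed at the right cluster. All function names are made up for illustration.

```python
import json
import random
import subprocess

# Hypothetical opt-in label (an assumption, not taken from the original project).
OPT_IN_SELECTOR = "fault-injection=enabled"


def parse_pods(pod_list):
    """Extract (namespace, name) for Running pods from `kubectl get pods -o json` output."""
    return [
        (item["metadata"]["namespace"], item["metadata"]["name"])
        for item in pod_list.get("items", [])
        if item.get("status", {}).get("phase") == "Running"
    ]


def list_opted_in_pods(selector=OPT_IN_SELECTOR):
    """Ask kubectl for every pod, in any namespace, that carries the opt-in label."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-l", selector, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_pods(json.loads(result.stdout))


def choose_victim(pods, rng=random):
    """Pick one pod at random, or None when nothing has opted in."""
    return rng.choice(pods) if pods else None


def terminate(namespace, name):
    """Delete the pod; its deployment's ReplicaSet should start a replacement."""
    subprocess.run(["kubectl", "delete", "pod", name, "-n", namespace], check=True)


def run_once():
    """One round of fault injection: terminate a single opted-in pod, if any exist."""
    victim = choose_victim(list_opted_in_pods())
    if victim is not None:
        terminate(*victim)
    return victim
```

Calling `run_once()` on a schedule gives you the "periodically" part. Keeping the selection logic (`parse_pods`, `choose_victim`) separate from the `kubectl` calls means you can check which pods would be targeted without touching a cluster.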