Usable QA Environments
Ok, everyone raise their hands.
If you don’t have a QA environment, put your hand down.
If your deploys don’t always go through QA, put your hand down.
If your QA environment is broken right now, hand down.
If you’ve considered deleting your QA environment and restarting from scratch, hand down.
If you’ve had a production outage in the past week due to an issue your “QA env should have caught”, hand down.
Anyone left with a hand up?
Production environments have a clear mandate: keep your customers happy. I’d argue that quality assurance environments have an equally clear mission: keep your developers happy. Yet it’s surprisingly common to hear that companies don’t have an effective QA environment, or don’t have one at all.
This is a shame, because along with unit tests to verify each component’s local correctness, a QA environment is a remarkably effective tool for developing working software.
They are an excellent testing ground to stage code changes for integration tests to run against (ideally triggered after every deploy to QA), and more importantly, they are perhaps the only testing ground for system-level changes to your environment, like configuration changes. They can also be great for load testing!
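As a rough sketch of that first use case, here is the kind of minimal post-deploy smoke check a pipeline might run against QA after every deploy; the base URL, environment variable, and paths are hypothetical, not a prescription:

    import os
    import sys

    import requests

    # Hypothetical QA base URL, injected by the deploy pipeline.
    QA_BASE_URL = os.environ.get("QA_BASE_URL", "https://qa.internal.example.com")

    # A couple of endpoints to exercise right after a deploy; a real
    # integration suite would go much further than a health check.
    SMOKE_PATHS = ["/healthz", "/api/v1/users/self"]


    def run_smoke_tests() -> bool:
        ok = True
        for path in SMOKE_PATHS:
            try:
                resp = requests.get(QA_BASE_URL + path, timeout=5)
                if resp.status_code >= 400:
                    print(f"FAIL {path}: HTTP {resp.status_code}")
                    ok = False
                else:
                    print(f"OK   {path}")
            except requests.RequestException as exc:
                print(f"FAIL {path}: {exc}")
                ok = False
        return ok


    if __name__ == "__main__":
        # A non-zero exit fails the pipeline step that invoked the check.
        sys.exit(0 if run_smoke_tests() else 1)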
But they’re certainly not easy.
Once you do have a QA environment set up and running for a while, the first challenge you often run into is that it’s broken or unusable far more frequently than your production environment. Often a single person or team ends up “responsible for the QA environment”, mirroring the historical developer/operations split in which operations “owned production.”
The best solution I’ve seen is the same one we’ve seen gradually adopted to address the poorly aligned dev/ops split in production: sharing ownership of the QA environment broadly across all the teams who use it. Specifically, I believe that means your observability stack should alert owning teams to issues in QA, just as it would for a production outage. (Perhaps inverting the primary/secondary rotation, so that the secondary gets QA pages first, reducing the likelihood of distracting the primary oncall during a real outage. You could even use the QA oncall as the training rotation for your production rotation!)
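To make the routing idea concrete, here is a minimal sketch of an alert router that inverts primary and secondary for QA pages; the team names, rotation model, and page function are invented for illustration:

    from dataclasses import dataclass


    @dataclass
    class Rotation:
        """Hypothetical oncall rotation for a single team."""
        primary: str
        secondary: str


    ROTATIONS = {"payments": Rotation(primary="alice", secondary="bob")}


    def route_alert(team: str, environment: str) -> str:
        """Return who should be paged first for an alert.

        Production pages go to the primary oncall; QA pages invert the
        rotation so breakage still gets attention without distracting
        the primary during a real outage.
        """
        rotation = ROTATIONS[team]
        if environment == "production":
            return rotation.primary
        return rotation.secondary


    if __name__ == "__main__":
        print(route_alert("payments", "qa"))  # -> "bob"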
Alerting on QA issues is fairly controversial, but I think the argument for it is straightforward: any breakage in QA is a harbinger of a similar breakage in production, so you’re far better off getting interrupted by a problem when you have time to diagnose and debug it than when it’s impacting your customers.
Once you’re treating QA like a critical environment worthy of active repair, the next major failure mode is having so little activity that it doesn’t accurately simulate your production environment.
There are loosely two approaches to addressing this: creating synthetic load and replaying production traffic. The former is easier to set up, but requires active maintenance and the crafting of new test cases to exercise new functionality. The latter avoids that ongoing test authoring, but it’s often quite a challenge to keep environments consistent enough that traffic against production can be usefully replayed against QA (if QA is down for an hour due to a breakage that doesn’t reach production, then replaying requests may be more or less impossible due to inconsistent state).
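To make the replay approach concrete, here is a minimal sketch that replays a hypothetical JSON-lines request log against QA while preserving the original pacing; the log format and QA hostname are assumptions:

    import json
    import time

    import requests

    # Hypothetical JSON-lines request log with fields: ts, method, path, body.
    TRAFFIC_LOG = "requests.jsonl"
    QA_BASE_URL = "https://qa.internal.example.com"  # assumed QA hostname


    def replay_traffic(speedup: float = 1.0) -> None:
        """Replay logged production requests against QA, preserving pacing."""
        last_ts = None
        with open(TRAFFIC_LOG) as f:
            for line in f:
                entry = json.loads(line)
                # Sleep the gap between the original requests (scaled by
                # speedup) so QA sees a realistic arrival pattern, not a burst.
                if last_ts is not None:
                    time.sleep(max(0.0, (entry["ts"] - last_ts) / speedup))
                last_ts = entry["ts"]
                requests.request(
                    entry["method"],
                    QA_BASE_URL + entry["path"],
                    json=entry.get("body"),
                    timeout=10,
                )


    if __name__ == "__main__":
        replay_traffic(speedup=2.0)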
The best strategy I’ve seen here is to take sanitized, partial daily (or fresher!) snapshots from production and then replay traffic starting from the moment the snapshot began (very likely injecting a position marker into your traffic log to help synchronize the snapshot and the traffic logs). Sanitization is critical because it’s rare to treat QA snapshots with the same tender loving care you give production data, so you don’t want anything sensitive in there. Taking a partial subset is important because it keeps the dataset small enough to generate frequently and to load onto laptops or small virtual machines.
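Here is a rough sketch of what the sanitize-and-subset step could look like, assuming a hypothetical JSON-lines export; the field names and sampling rate are purely illustrative:

    import hashlib
    import json

    # Hypothetical snapshot format: JSON-lines rows exported from production.
    SNAPSHOT_IN = "prod_snapshot.jsonl"
    SNAPSHOT_OUT = "qa_snapshot.jsonl"

    DROP_FIELDS = {"ssn", "credit_card"}   # never belongs in QA
    HASH_FIELDS = {"email", "phone"}       # keep joins working, hide values
    SAMPLE_EVERY_N = 100                   # keep ~1% of rows so QA stays small


    def sanitize(row: dict) -> dict:
        cleaned = {}
        for key, value in row.items():
            if key in DROP_FIELDS:
                continue
            if key in HASH_FIELDS:
                # One-way hash preserves uniqueness without exposing the value.
                cleaned[key] = hashlib.sha256(str(value).encode()).hexdigest()[:16]
            else:
                cleaned[key] = value
        return cleaned


    def main() -> None:
        with open(SNAPSHOT_IN) as src, open(SNAPSHOT_OUT, "w") as dst:
            for i, line in enumerate(src):
                if i % SAMPLE_EVERY_N != 0:
                    continue
                dst.write(json.dumps(sanitize(json.loads(line))) + "\n")


    if __name__ == "__main__":
        main()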
If you can invest in getting it set up, this combination of alerting on QA issues and fixing them just as if they were production issues (and indeed, they are simply a prelude), recreating state daily from snapshots, and generating ongoing load by replaying request logs gives you a solid, usable, and useful environment.
Perhaps equally interesting is the question of why it’s still so hard to put an effective QA environment together. Even as tooling like Docker, rkt, and Kubernetes combines to make service provisioning relatively trivial, it’s easy to feel like we’re still operating in the aughts.
I’m rather optimistic!
We’re in the early phase of figuring out how to take full advantage of generalized scheduling tools like Kubernetes, and we’ll start to see some pretty amazing integrations and tools over the next few years. It’s only a matter of time before we see a Chaos Monkey for Kubernetes, or load-generating tooling that relies on k8s for service discovery (and backs off if it causes too many nodes to go out of rotation) while simultaneously scheduling its own load-generating nodes on k8s! Altogether, it feels like a rich vein of relatively untapped open source opportunities.
And I could use all of them.
Today.
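For the curious, here is a rough sketch (using the official Kubernetes Python client) of the load-generating half of that wishlist: discover a service’s ready endpoints through the Kubernetes API and back off when too many drop out of rotation. The namespace, service name, and thresholds are invented for illustration:

    import time

    import requests
    from kubernetes import client, config

    NAMESPACE = "qa"            # assumed namespace
    SERVICE = "frontend"        # assumed service to load test
    MIN_READY_FRACTION = 0.5    # back off if fewer than half the endpoints are ready


    def ready_addresses(v1: client.CoreV1Api) -> tuple:
        """Return (ready, total) endpoint addresses for the service."""
        endpoints = v1.read_namespaced_endpoints(SERVICE, NAMESPACE)
        ready, total = 0, 0
        for subset in endpoints.subsets or []:
            addresses = subset.addresses or []
            not_ready = subset.not_ready_addresses or []
            ready += len(addresses)
            total += len(addresses) + len(not_ready)
        return ready, total


    def generate_load() -> None:
        config.load_kube_config()  # or load_incluster_config() when running on k8s
        v1 = client.CoreV1Api()
        while True:
            ready, total = ready_addresses(v1)
            if total == 0 or ready / total < MIN_READY_FRACTION:
                # Too many endpoints out of rotation: back off rather than pile on.
                time.sleep(30)
                continue
            # Cluster-internal service DNS; only resolvable from inside the cluster.
            requests.get(
                f"http://{SERVICE}.{NAMESPACE}.svc.cluster.local/healthz",
                timeout=5,
            )
            time.sleep(0.1)


    if __name__ == "__main__":
        generate_load()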