Forecasting synthetic metrics.
Imagine you woke up one day and found yourself responsible for a Site Reliability Engineering team. By 10AM, you’ve downloaded a free copy of the SRE book, and are starting to get the hang of things. Then an incident strikes: oh no! Folks rally to mitigate user impact, shortly followed by diagnosing and remediating the underlying cause. The team’s response was amazing, but your users depend on you and you feel like today you let them down. Your shoulders are a bit heavier than just a few hours ago. You sit down with your team and declare your bold leader-y goal: next quarter we’ll have zero incidents.
Your team doesn’t know you well enough yet to give direct feedback when you have a bad idea, so instead they come back to you the next day with a list of projects to accomplish your goal. Their proposals range from the reliable (delete all your software and go home) to riskier options (adding passive healthchecks). You open your mouth to pass judgement on their ideas, pause with your jaw hanging ajar, and then close it. A small problem has emerged: you have no idea how to pick between projects.
It’s straightforward to measure your historical reliability using user impact, revenue impact or performance against SLOs. However, historical measures are not very helpful when it comes to determining future work to prioritize. Which work will be most impactful, and how much of that work needs to get done to make us predictably reliable rather than reliable through good luck?
One of my favorite tools for measuring complex areas is synthetic metrics. Synthetic metrics compose a variety of input metrics into a simplified view. For example, you might create a service quality score calculated from whether the service has been deployed recently, has zero undeployed CVE patches, has proper healthchecks, has tests which complete in less than ten seconds, and what not. Instead of having to talk about each of those aspects individually, you’re able to talk more generally about the service’s state. Even more useful, you can start to describe the distribution of service healthiness for all the services across your company.
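As a toy illustration of composing inputs into one score, here’s a minimal sketch in Python; the specific fields, thresholds, and the equal weighting are hypothetical choices, not a prescribed formula.

```python
from dataclasses import dataclass

@dataclass
class Service:
    # Hypothetical inputs, standing in for whatever signals you actually track.
    days_since_last_deploy: int
    undeployed_cve_patches: int
    has_healthchecks: bool
    test_runtime_seconds: float

def quality_score(svc: Service) -> float:
    """Compose a handful of input signals into a single 0.0-1.0 quality score."""
    checks = [
        svc.days_since_last_deploy <= 7,   # deployed recently
        svc.undeployed_cve_patches == 0,   # no outstanding CVE patches
        svc.has_healthchecks,              # proper healthchecks
        svc.test_runtime_seconds < 10,     # tests complete in under ten seconds
    ]
    return sum(checks) / len(checks)       # equal weights, for simplicity

# With a score per service, you can look at the distribution across the company.
services = [Service(3, 0, True, 6.0), Service(40, 2, True, 90.0)]
print(sorted(quality_score(s) for s in services))  # [0.25, 1.0]
```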
When I first started experimenting with synthetic metrics, I read an excellent piece by Ryan McGeehan on risk forecasting for security. The techniques McGeehan discusses don’t quite solve the problem of prioritizing future work, but as I ruminated on them I fell into an interesting idea: can we use the forecast of synthetic metrics to determine and prioritize work?
Brian Delahunty recently moved into Stripe’s Head of Reliability role, and as part of that transition I’ve gotten the chance to partner with him, Davin Bogan, Drew Fradette, Grace Flintermann and a bunch of other great folks to rethink how we should select the most impactful reliability projects. What’s written here reflects a great deal of collective thought from those discussions, as well as inspiration from Niels Provos’ approach to security planning.
I’ve also previously written about using systems modeling to inform a reliability strategy, which has some similarities to the approach described here.
Forecasting reliability
Let’s dig into a specific example of how we can forecast synthetic metrics. I’ll focus on reliability, but I believe this technique is generally applicable to any area which typically grades execution against lagging indicators instead of leading indicators. It’s particularly valuable for areas with high-variance lagging indicators like security breaches, major incidents, and so on.
Before we can forecast a synthetic metric, we have to design the synthetic metric itself. To design a synthetic metric for reliability, a useful question to ask ourselves is what would need to be true for us to believe our systems were predictably reliable?
A quick thanks to David Judd who has spent a great deal of time digging into this particular question, and whose thinking has deeply influenced my own.
Some factors you might consider when calculating your reliability are:
- How safely do you make changes? This certainly includes deploying code changes, but also feature flags, infrastructure changes and what not.
- How many fault levels do you have which are backed by only a single fault domain?
- How long has it been since you verified the redundancies within each fault level?
- How much headroom do you have for traffic spikes?
- How sustainable are your on-call rotations in terms of having enough ramped-up folks and appropriate page rate?
There are an infinite number of factors you could include here, and what you pick is going to depend on your specific architecture. What’s most important is that it should be staggeringly obvious what sort of project you’d undertake to improve each of these inputs. Too many unsafe changes? Build safer change management tooling. Haven’t exercised fault level redundancy frequently? Run a game day. Stay away from output metrics like having fewer incidents, which immediately require you to answer broad questions of approach.
To keep our example simple, let’s imagine we focus the first version of our reliability metric on: making safe changes, eliminating single-domain fault levels, and verifying redundancy within fault levels.
percent_safe_changes =
    (percent_safe_feature_flag_changes * 0.3) +
    (percent_safe_code_changes * 0.3) +
    (percent_safe_infra_changes * 0.4)

percent_redundant_fault_levels = (some calculation)
percent_recently_exercised_domains = (some calculation)

reliability_forecast =
    (percent_safe_changes * 0.4) +
    (percent_redundant_fault_levels * 0.4) +
    (percent_recently_exercised_domains * 0.2)
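If it helps to experiment, here’s a rough, runnable translation of the sketch above; the input values are invented placeholders, and the two `(some calculation)` inputs are left as arguments rather than specified.

```python
def reliability_score(
    pct_safe_feature_flag_changes: float,
    pct_safe_code_changes: float,
    pct_safe_infra_changes: float,
    pct_redundant_fault_levels: float,
    pct_recently_exercised_domains: float,
) -> float:
    """Combine the inputs above into a single 0.0-1.0 reliability score."""
    pct_safe_changes = (
        pct_safe_feature_flag_changes * 0.3
        + pct_safe_code_changes * 0.3
        + pct_safe_infra_changes * 0.4
    )
    return (
        pct_safe_changes * 0.4
        + pct_redundant_fault_levels * 0.4
        + pct_recently_exercised_domains * 0.2
    )

# Placeholder inputs, purely illustrative.
print(reliability_score(0.9, 0.8, 0.6, 0.5, 0.25))  # ≈ 0.55
```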
With some simple arithmetic and experimentation, you’ll compute a score that reflects your reliability risk as it stands today. Then by analyzing the trends within those numbers, you can forecast what that score will become a year from now. If you’ve invested heavily in quality ratchets, then you may find that you’ll become considerably more reliable over the next year without beginning any new initiatives.
You may, on the other hand, find that you’re spiraling towards doom. Either way, this is the starting point for your planning.
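One simple way to get from today’s score to a forecast is to extrapolate the trend in your recent scores; the quarterly history below is made up, and a straight-line fit is just one of many reasonable choices.

```python
# Hypothetical quarterly reliability scores, oldest to newest.
history = [0.42, 0.45, 0.47, 0.52]

# Average quarter-over-quarter change, projected forward.
slope = (history[-1] - history[0]) / (len(history) - 1)

def forecast(quarters_ahead: int) -> float:
    """Naive linear extrapolation, clamped to the 0.0-1.0 range."""
    return min(1.0, max(0.0, history[-1] + slope * quarters_ahead))

print(forecast(4))  # the score a year from now, ≈ 0.65
```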
This measure then allows you to calculate the impact of each considered project, and to prioritize them based on the return on investment for future reliability. This also helps you structure effective requests to other teams. With this calculation, it’s now extremely clear what you would want to ask other teams to focus on, as well as why the work matters.
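Concretely, that prioritization can be as simple as estimating each project’s score delta and dividing by its cost; the projects, deltas, and engineer-week estimates below are all invented for illustration.

```python
# Each tuple: (project, estimated reliability score delta, estimated engineer-weeks); all hypothetical.
candidates = [
    ("safer feature flag tooling", 0.04, 6),
    ("game day exercising zone failover", 0.02, 2),
    ("second fault domain for the primary database", 0.08, 20),
]

# Rank by forecasted reliability improvement per engineer-week of investment.
for name, delta, weeks in sorted(candidates, key=lambda p: p[1] / p[2], reverse=True):
    print(f"{name}: +{delta:.2f} score for {weeks} eng-weeks")
```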
A quarter later, you can measure your updated reliability score and compare it against your forecast, determining if your projects had the expected impact. Now you’re able to review and evolve your reliability strategy and execution against something predictable, decoupling your approach from underreacting to narrow escapes or overreacting to unfortunate falls.
Translate score into impact
Now that you’re forecasting reliability, you might say that you are trending towards 52% reliability next year, and 40% the year after that. Some folks will be motivated by the scores feeling low, but these are pretty abstract numbers. How much should the company prefer a reliability forecast of 80% over a forecast of 79%?
It’s powerful to translate the forecasted score into a forecasted result. For reliability, we can calculate the impact of a serious incident and treat one minus the score as the probability of such an incident occurring in the next year. A simple version might take the average daily revenue for the next year and multiply it by 1.0 minus your reliability forecast.
impact = (1.0 - reliability_forecast) * avg_daily_revenue_next_year
Is this a particularly robust calculation? No, it’s really not. But are you having a conversation about the dollar impact of your reliability planning? Why, yes. Yes, you are.
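To make the earlier 80% versus 79% question concrete, you can run both forecasts through the same formula; the revenue figure here is a made-up placeholder.

```python
def impact(reliability_forecast: float, avg_daily_revenue: float) -> float:
    """Expected dollar impact, following the formula above."""
    return (1.0 - reliability_forecast) * avg_daily_revenue

avg_daily_revenue_next_year = 2_000_000  # hypothetical figure
print(impact(0.80, avg_daily_revenue_next_year))  # ≈ $400,000
print(impact(0.79, avg_daily_revenue_next_year))  # ≈ $420,000: one point of forecast ≈ $20k here
```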
Iterate to alignment
Folks will initially disagree with your forecasts and impact numbers. That’s the entire point! Each time someone disagrees with your forecast or projected impact, that is exactly the opportunity you’re looking for to refine your methodology. Metrics become valuable through repeated exposure to a medium-sized group with consistent membership over the course of months. It’s only this repeated application that can refine a complex metric into one that reflects both your worldview and the worldview of your stakeholders.
There is a lot to learn from Perri’s approach to product management in Escaping the Build Trap. It’s the results that matter, not shipping projects, and the first version of a metric never gets the results quite right.
Becoming valuable
Sometimes teams working in areas like security or reliability find themselves smothered by the sensation that they are either (a) performing badly or (b) performing well enough to ignore. No place on the continuum between bad and ignored feels particularly good.
The twin techniques of forecasting synthetic metrics and translating those metrics into impact will extend your continuum to include acknowledged value to your company and your users. Even when things happen to be going well, you can showcase the underlying risk that requires continued investment. When things are going poorly, you can show you’re doing the right work, even if it’s not showing its impact yet.
I’m quite curious to hear from more folks trying similar approaches!
As an aside, I also want to mention how useful this approach can be for evaluating the quality of incident remediations. Many incident programs emphasize that folks must have remediations for each incident, and perhaps a strict deadline for completing them, but it’s often unclear whether the remediations are of high quality. A good synthetic metric for reliability makes answering this question easy: a remediation’s quality is the extent to which it shifts the reliability score in a positive direction.
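As a sketch of what that grading could look like in practice, you might score each remediation by the reliability delta it produces once shipped; the numbers here are, again, illustrative.

```python
def remediation_quality(score_before: float, score_after: float) -> float:
    """A remediation's quality is the shift it produces in the reliability score."""
    return score_after - score_before

# Hypothetical: recompute the reliability score after each remediation lands.
print(remediation_quality(0.55, 0.58))  # ≈ +0.03, a remediation that moved the needle
print(remediation_quality(0.55, 0.55))  # 0.0, a remediation that only satisfied the process
```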