How to build a reliability program.
This is draft-y as f.
Reliability programs are an engine of learning that transmutes incidents into reliability, and any company providing critical infrastructure should have one. Well, probably any company that believes they’re doing something valuable and has reached moderate scale should have one, even if you’re not providing infrastructure.
You see some companies try to avoid thinking through the details of a full reliability program by creating a Site Reliability Engineering (SRE) organization, but that’s a bit like creating a product organization because you don’t have a product strategy: the Levenshtein distance is small but the gap remains massive.
This is an area where each year I feel like I know twice as much as the year before, and even since February, I think I’ve learned enough to take another stab at describing an effective approach.
Precursor to value
yeah, this does matter, increasingly so as you grow
very few businesses start out prioritizing reliability: it’s easy enough at small scale for small customers, reliability isn’t your value, reliability is a prerequisite to deliver value
but for large users / enterprise, yeah, reliability is a big part of your core proposition (your users are extremely vulnerable to markets, users, reputation impact)
now your reliability is the only thing that allows your users to capture value from your product so… now reliability is your core business value
this isn’t implicitly true for some products, but e.g. Twitter’s lack of reliability was crippling since it breaks folks’ addiction … err “engagement loops” with these products
Start simple
A bit later we’ll get into how to cultivate reliability in large, complex organizations, but if you’re a smaller company, you don’t need to jump there directly.
Until you’re about one hundred engineers, you can get far with a few fundamentals:
- alerting on core business flows
- a mailing list that anyone at your company can email to trigger an alert
- an on-call rotation - https://increment.com/on-call/on-call-at-any-size/
- metrics and goals on health of those core business flows and health of on-call rotation
- review those metrics and goals periodically to kick off a reliability sprint
- a couple senior engineers who schedule an incident retrospective to mitigate particularly painful incidents, roughly along the lines of this classic Allspaw writeup from 2012
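As a rough illustration of the first item in that list, here is a minimal sketch of a check on a core business flow. The flow names, the 99% threshold and the minimum-traffic cutoff are all assumptions for illustration, and in practice this logic usually lives in whatever monitoring tool you already run rather than in hand-written code.

```python
from dataclasses import dataclass


@dataclass
class FlowWindow:
    """Counts for one core business flow over a recent time window."""
    name: str
    attempts: int
    successes: int


def should_alert(window, min_success_rate=0.99, min_attempts=50):
    """Return True when the flow's success rate drops below the threshold.

    Windows with very little traffic are skipped to avoid paging on noise.
    Both default values are hypothetical and should be tuned per flow.
    """
    if window.attempts < min_attempts:
        return False
    return window.successes / window.attempts < min_success_rate


# Made-up numbers for two hypothetical flows.
checkout = FlowWindow(name="checkout", attempts=1000, successes=962)
signup = FlowWindow(name="signup", attempts=400, successes=399)

alerting = [w.name for w in (checkout, signup) if should_alert(w)]
print(alerting)  # ['checkout']
```

The point is less the arithmetic than the framing: the alert is defined in terms of a business flow succeeding, not in terms of a host or a process being up.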
Becoming consistent
the above approach isn’t enough, gotta go bigger
why not?
- decide it’s worth having a reliability program
the right proactive work has higher ROI than reactive work
the wrong proactive work has zero ROI
Useful goals
In Escaping the Build Trap, Melissa Perri says that “in product-led organizations, people are rewarded for learning and achieving goals”, and the aim of measuring a reliability program is quite similar: we want to understand our rate of learning and whether we’re becoming more reliable.
- user-goal focus
- don’t measure incidents
- do measure severity of incidents, impact should go down
- high dimensionality challenges
- cohorts, cohorts, cohorts
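To make the "measure impact, not incident counts" idea concrete, here is a small sketch that sums impact per quarter and per user cohort. Everything in it is an assumption: the `Incident` fields, the cohort names, the use of user-minutes as a stand-in for severity, and the data itself, which is made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Incident:
    quarter: str                # e.g. "Q3"
    cohort: str                 # hypothetical cohorts: "enterprise", "self-serve"
    user_minutes_impacted: int  # affected users multiplied by minutes of impact


def impact_by_cohort(incidents):
    """Sum user-minutes of impact per (quarter, cohort) pair."""
    totals = defaultdict(int)
    for incident in incidents:
        totals[(incident.quarter, incident.cohort)] += incident.user_minutes_impacted
    return dict(totals)


# Made-up data for illustration only.
incidents = [
    Incident("Q3", "enterprise", 1200),
    Incident("Q3", "self-serve", 90),
    Incident("Q3", "self-serve", 60),
    Incident("Q4", "enterprise", 300),
    Incident("Q4", "self-serve", 400),
]

for (quarter, cohort), minutes in sorted(impact_by_cohort(incidents).items()):
    print(quarter, cohort, minutes)
```

In this toy data the self-serve cohort goes from two incidents to one between quarters while its impact rises from 150 to 400 user-minutes, which is the kind of signal a raw incident count hides.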
If you’re looking for more general thoughts on goals, I’ve previously written more general advice on goals and baselines as well as measuring areas that defy easy measurement.
Strategy
The strategy guiding your incident program is
Engine of learning (and impact)
https://melissaperri.com/blog/2015/07/22/the-product-kata
Based on my experience designing organizational programs and fostering engagement with those programs, the heartbeat of successful programs is:
- Identify top problem that your program is facing, based on user feedback, stakeholder asks and your domain expertise.
- Build reproducible, ongoing dataset that will support understanding and goal-setting for that problem. (It’s critical that it’s not one-off and not manual.)
- Set goal for how you want to impact that problem. You’ll also likely need to set goals around maintaining a healthy dataset.
- Make problem easily understandable with a dashboard that defines the problem with data.
- Create reusable playbook for teams to use when they identify they have a problem with their data contributions or contributions to the problem itself.
- Create nudges to generate awareness for teams that there are actions they should take to support the goal.
- If nudges aren’t enough, create goals for teams to support the goal.
- If goals aren’t enough, create escalation loop that allows teams to align with stakeholders on priorities.
- Go back to the beginning and find a new problem.
Building the dataset
PagerDuty’s Incident Response documentation
- study incidents
- design a reliability program
- start gathering data from incidents
- create tracking dashboard to inspect data quality and learn from it
- create nudges to promote data collection
- set goals to ensure data is being collected
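The data-quality steps above can be sketched as a completeness check over incident records, which is the kind of number a tracking dashboard or a data-collection goal could be built on. The field names, the `team` key and the sample records are hypothetical; a real dataset would come from your incident tooling rather than a literal list.

```python
from collections import defaultdict

# Hypothetical fields every incident record is expected to carry.
REQUIRED_FIELDS = ("started_at", "detected_at", "mitigated_at", "severity")


def missing_fields(record):
    """Return the required fields that are absent or None in a record."""
    return [field for field in REQUIRED_FIELDS if record.get(field) is None]


def completeness_by_team(records):
    """Fraction of each team's incident records with every required field."""
    complete = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        team = record.get("team", "unassigned")
        total[team] += 1
        if not missing_fields(record):
            complete[team] += 1
    return {team: complete[team] / total[team] for team in total}


# Made-up records for illustration only.
records = [
    {"team": "payments", "started_at": "10:00", "detected_at": "10:04",
     "mitigated_at": "10:30", "severity": 2},
    {"team": "payments", "started_at": "14:00", "detected_at": None,
     "mitigated_at": "14:20", "severity": 3},
    {"team": "search", "started_at": "09:00", "detected_at": "09:01",
     "mitigated_at": "09:15", "severity": 3},
]

print(completeness_by_team(records))  # {'payments': 0.5, 'search': 1.0}
```

A per-team completeness number like this is also a natural input for nudges: the teams below the goal are the ones that get the reminder.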
Data to themes, themes to patterns
- incident themes
- use that data to create themes
- mitigation patterns
- use those themes to create mitigation patterns
- create nudges to promote using mitigation patterns
- set goals to ensure mitigation patterns are being followed
- prevention patterns
- use those themes to create prevention patterns
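One way to picture the step from themes to patterns: count how often each theme shows up, then list the themes that still have no documented mitigation pattern. The theme tags and the pattern text below are invented for illustration, and the sketch assumes themes are assigned by hand during incident review.

```python
from collections import Counter

# Hypothetical theme tags assigned during incident review, one list per incident.
incident_themes = [
    ["bad-deploy", "missing-alert"],
    ["bad-deploy"],
    ["capacity"],
    ["bad-deploy", "capacity"],
]

# Hypothetical mapping from a theme to its documented mitigation pattern.
mitigation_patterns = {
    "bad-deploy": "roll back first, debug second",
}

theme_counts = Counter(tag for tags in incident_themes for tag in tags)

# Themes seen in incidents that still lack a mitigation pattern,
# most frequent first.
uncovered = [theme for theme, _ in theme_counts.most_common()
             if theme not in mitigation_patterns]

print(theme_counts.most_common(1))  # [('bad-deploy', 3)]
print(uncovered)                    # ['capacity', 'missing-alert']
```

The same shape works for prevention patterns: swap the mapping and the uncovered list tells you where to write the next one.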
Prevention pattern propagation
- extend dashboard to include prevention patterns
- measure the quality of the mitigation and prevention patterns
- document best practices for reliability and how to implement them
- create feedback loop to nudge folks towards reliability practices
- create goals to ensure reliability practices
- reliability into the architecture
Fewer than 99 problems
Chaos Engineering: the history, the principles and the practice
- injecting failure
- you want to make fixes earlier and with less impact
- inject failures: automation, game days
- start with the least reliable components
- great opportunity for pristine data, don’t have to wait for ambient error rate
- drain the stock of latent errors
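For a sense of what automated failure injection can look like at its simplest, here is a sketch of a decorator that makes a function fail at a configurable rate. The names `inject_faults`, `InjectedFault` and `fetch_profile` are hypothetical, and real chaos tooling typically injects failure at the network or infrastructure layer rather than inside application code.

```python
import random
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real failure during a fault-injection exercise."""


def inject_faults(failure_rate, rng=None):
    """Decorator that makes the wrapped function fail at the given rate.

    failure_rate is a probability between 0 and 1. Passing a seeded
    random.Random makes a game day reproducible.
    """
    if not 0 <= failure_rate <= 1:
        raise ValueError("failure_rate must be between 0 and 1")
    if rng is None:
        rng = random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # rng.random() returns a float in [0.0, 1.0).
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical dependency call wrapped for a game day.
@inject_faults(failure_rate=0.2, rng=random.Random(42))
def fetch_profile(user_id):
    return {"id": user_id}


failures = 0
for user_id in range(1000):
    try:
        fetch_profile(user_id)
    except InjectedFault:
        failures += 1
print(f"{failures} of 1000 calls failed")
```

Because you control the failure rate, you get clean data on how callers behave under failure without waiting for the ambient error rate to produce it.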
Breaking up the band
when would you get rid of the program?
- prevention patterns and fault injection are ratcheted in and working
- incident rates and user impact are very low
- this never happens if you’re a rapidly growing company, but could happen if you’re not
“Why not just use SRE?”
sigh, no, what do these words even mean
it’s much harder to introduce specialized roles like SRE than folks think
you do need subject matter experts, if that’s what you mean, then sure, but you can’t outsource this shit to a team that “implements reliability”
Closing thoughts
yup
Resources
This is a broad topic, and I’ve collected some related resources here that came up while writing or that I’ve found useful in general.
Related stuff from this blog:
- Writing a reliability strategy
- Writing strategies and visions
- Metrics for the unmeasurable
- Guiding broad change with metrics
- Programs: tips for owning the unownable
- Fostering program engagement
- Infrastructure planning: users, baselines and timeframes
- How to invest in technical infrastructure
- Introducing SREs, TPMs and other specialized roles
- Service cookbooks
Related stuff from elsewhere: