How to build a reliability program.

Published on June 30, 2019.

This is draft-y as f.

Reliability programs are an engine of learning that transmute incidents into reliability, and any company providing critical infrastructure should have one. Well, probably any company that believes they’re doing something valuable and has reached moderate scale should have one, even if you’re not providing infrastructure.

You see some companies try to avoid thinking through the details of a full reliability program by creating a Site Reliability Engineering (SRE) organization, but that’s a bit like creating a product organization because you don’t have a product strategy: the Levenshtein distance is small but the gap remains massive.

This is an area where each year I feel like I know twice as much as the year before, and even since February, I think I’ve learned enough to take another stab at describing an effective approach.

Precursor to value

yeah, this does matter, increasingly so as you grow

very few businesses start out prioritizing reliability: it’s easy enough at small scale for small customers, where reliability isn’t your value, it’s a prerequisite to delivering value

but for large users / enterprise, yeah, reliability is a big part of your core proposition (your users are extremely vulnerable to market, user and reputation impact)

now your reliability is the only thing that allows your users to capture value from your product so… now reliability is your core business value

this isn’t implicitly true for every product, but e.g. Twitter’s lack of reliability was crippling since it broke folks’ addiction … err “engagement loops” with the product

Start simple

A bit later we’ll get into how to cultivate reliability in large, complex organizations, but if you’re a smaller company, you don’t need to jump there directly.

Until you’re about one hundred engineers, you can get far with a few fundamentals:

alerting on core business flows
a mailing list that anyone at your company can email to trigger an alert
on-call rotation - https://increment.com/on-call/on-call-at-any-size/
metrics and goals on health of those core business flows and health of on-call rotation
review those metrics and goals periodically to kick off a reliability sprint
a couple senior engineers who schedule an incident retrospective to mitigate particularly painful incidents, roughly along the lines of this classic Allspaw writeup from 2012
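
To make the first bullet concrete, here’s a minimal sketch of alerting on a core business flow. The flow names, the 99% threshold and the minimum-traffic cutoff are all assumptions for illustration, not recommendations; in practice you’d pull these counts from whatever metrics system you already run.

```python
from dataclasses import dataclass


@dataclass
class FlowWindow:
    """Counts for one core business flow over a recent time window."""
    name: str
    attempts: int
    successes: int


def should_alert(window: FlowWindow, min_success_rate: float = 0.99,
                 min_attempts: int = 50) -> bool:
    """Return True when the flow's success rate drops below the threshold.

    Windows with very little traffic are skipped so a handful of failed
    requests in a quiet period doesn't page anyone.
    """
    if window.attempts < min_attempts:
        return False
    return window.successes / window.attempts < min_success_rate


# Hypothetical flows and counts, purely for illustration.
checkout = FlowWindow(name="checkout", attempts=1000, successes=972)
signup = FlowWindow(name="signup", attempts=12, successes=9)

alerts = [w.name for w in (checkout, signup) if should_alert(w)]
print(alerts)
```

The point is less the arithmetic and more that the alert is defined on the flow your users care about (checkout, signup) rather than on a host or a queue.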

Becoming consistent

the above approach isn’t enough, gotta go bigger
why not?
- decide it’s worth having a reliability program

the right proactive work has higher ROI than reactive work
the wrong proactive work has zero ROI

Useful goals

In Escaping the Build Trap, Melissa Perri says that “in product-led organizations, people are rewarded for learning and achieving goals”, and the aim of measuring a reliability program is quite similar: we want to understand our rate of learning and whether we’re becoming more reliable.

user-goal focus
don’t measure incidents
do measure severity of incidents, impact should go down
high dimensionality challenges
cohorts, cohorts, cohorts
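
As a rough sketch of measuring impact rather than counting incidents, the snippet below sums impact by severity and by user cohort. The record shape, the cohort names and the "impacted user minutes" unit are all assumptions I’ve made up for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Incident(NamedTuple):
    severity: str
    cohort: str
    impacted_user_minutes: int


# Hypothetical incident data, purely for illustration.
incidents = [
    Incident("sev1", "enterprise", 1200),
    Incident("sev2", "enterprise", 300),
    Incident("sev2", "self-serve", 4500),
    Incident("sev3", "self-serve", 50),
]


def impact_by(records, attribute):
    """Sum impacted user minutes, grouped by the given record attribute."""
    totals = defaultdict(int)
    for record in records:
        totals[getattr(record, attribute)] += record.impacted_user_minutes
    return dict(totals)


by_severity = impact_by(incidents, "severity")
by_cohort = impact_by(incidents, "cohort")
print(by_severity)
print(by_cohort)
```

Four incidents is the same count whether they hit a handful of users or all of them, which is why the totals are in impact and sliced by cohort.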

If you’re looking for more general thoughts on goals, I’ve previously written advice on goals and baselines as well as measuring areas that defy easy measurement.

Strategy

The strategy guiding your incident program is

Engine of learning (and impact)

https://melissaperri.com/blog/2015/07/22/the-product-kata

Based on my experience designing organizational programs and fostering engagement with those programs, the heartbeat of successful programs is:

Identify top problem that your program is facing, based on user feedback, stakeholder asks and your domain expertise.
Build reproducible, ongoing dataset that will support understanding and goal-setting for that problem. (It’s critical that it’s not one-off and not manual.)
Set goal for how you want to impact that problem. You’ll also likely need to set goals around maintaining a healthy dataset.
Make problem easily understandable with a dashboard that defines the problem with data.
Create reusable playbook for teams to use when they identify they have a problem with their data contributions or contributions to the problem itself.
Create nudges to generate awareness for teams that there are actions they should take to support the goal.
If nudges aren’t enough, create goals for teams to support the goal.
If goals aren’t enough, create escalation loop that allows teams to align with stakeholders on priorities.
Go back to the beginning and find a new problem.
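
The last few steps form a ladder: nudges first, then team goals, then an escalation loop. Here’s a small sketch of that ladder as code, assuming a hypothetical `next_step` helper that you call each review cycle; the names are mine, not part of any real tool.

```python
from enum import Enum
from typing import Optional


class Step(Enum):
    NUDGE = "nudge"
    TEAM_GOAL = "team goal"
    ESCALATION = "escalation"


# Ordered from the lightest-touch intervention to the heaviest.
LADDER = [Step.NUDGE, Step.TEAM_GOAL, Step.ESCALATION]


def next_step(current: Optional[Step], goal_met: bool) -> Optional[Step]:
    """Pick the next intervention for a team.

    Returns None when the goal is met (time to find a new problem),
    the first rung when nothing has been tried yet, and otherwise the
    next rung up, staying at escalation once it has been reached.
    """
    if goal_met:
        return None
    if current is None:
        return LADDER[0]
    index = LADDER.index(current)
    return LADDER[min(index + 1, len(LADDER) - 1)]


print(next_step(None, goal_met=False))
print(next_step(Step.NUDGE, goal_met=False))
```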

Building the dataset

PagerDuty’s Incident Response documentation

study incidents
- design a reliability program
- start gathering data from incidents
- create tracking dashboard to inspect data quality and learn from it
- create nudges to promote data collection
- set goals to ensure data is being collected
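
One way to inspect data quality is to track how often each field of an incident record actually gets filled in. This is a minimal sketch; the `IncidentRecord` fields are hypothetical, so swap in whatever your incident process really captures.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IncidentRecord:
    """Hypothetical incident record; every field but the id may be missing."""
    incident_id: str
    started_at: Optional[str] = None
    mitigated_at: Optional[str] = None
    severity: Optional[str] = None
    impacted_users: Optional[int] = None
    owning_team: Optional[str] = None


def completeness(records):
    """Return the fraction of records that have each field filled in."""
    if not records:
        return {}
    result = {}
    for field in fields(IncidentRecord):
        filled = sum(1 for r in records if getattr(r, field.name) is not None)
        result[field.name] = filled / len(records)
    return result


records = [
    IncidentRecord("INC-1", "2019-06-01T10:00", "2019-06-01T10:45",
                   "sev2", 340, "payments"),
    IncidentRecord("INC-2", "2019-06-03T02:10", "2019-06-03T02:30",
                   None, None, "search"),
]

quality = completeness(records)
print(quality)
```

A number like "severity is filled in on half of incidents" is the kind of thing you can put on the tracking dashboard and later set a goal against.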

Data to themes, themes to patterns

Service cookbooks

incident themes
- use that data to create themes
mitigation patterns
- use those themes to create mitigation patterns
- create nudges to promote using mitigation patterns
- set goals to ensure mitigation patterns are being followed
prevention patterns
- use those themes to create prevention patterns

Prevention pattern propagation

extend dashboard to include prevention patterns
measure the quality of the mitigation and prevention patterns
document best practices for reliability and how to implement them
create feedback loop to nudge folks towards reliability practices
create goals to ensure reliability practices
build reliability into the architecture
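
For the dashboard and nudges, a simple starting point is adoption per pattern across teams. This is a sketch under made-up assumptions: the team names, pattern names and the shape of the `adoption` mapping are all hypothetical.

```python
# Hypothetical record of which teams have adopted which prevention patterns.
adoption = {
    "payments": {"deploy canaries": True, "load shedding": False},
    "search": {"deploy canaries": True, "load shedding": True},
    "accounts": {"deploy canaries": False, "load shedding": False},
}


def adoption_rate(adoption, pattern):
    """Fraction of teams that have adopted the given pattern."""
    if not adoption:
        return 0.0
    adopted = sum(1 for patterns in adoption.values()
                  if patterns.get(pattern, False))
    return adopted / len(adoption)


def teams_to_nudge(adoption, pattern):
    """Teams that have not adopted the pattern yet, in alphabetical order."""
    return sorted(team for team, patterns in adoption.items()
                  if not patterns.get(pattern, False))


print(adoption_rate(adoption, "deploy canaries"))
print(teams_to_nudge(adoption, "load shedding"))
```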

Fewer than 99 problems

Chaos Engineering: the history, the principles and the practice

injecting failure
- you want to make fixes earlier and with less impact
- inject failures: automation, game days
- start with the least reliable components
- great opportunity for pristine data, don’t have to wait for ambient error rate
- drain the stock of latent errors
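
To give a feel for what injecting failure can look like at the smallest scale, here’s a sketch that wraps a dependency call so it fails some fraction of the time, then checks that the caller degrades gracefully. Everything here (`with_fault_injection`, `fetch_profile`, the fallback shape) is hypothetical and not taken from any chaos engineering tool.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real failure during a fault injection exercise."""


def with_fault_injection(func, failure_rate, rng=None):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_profile(user_id):
    """Stand-in for a call to a dependency such as a profile service."""
    return {"user_id": user_id, "degraded": False}


def fetch_profile_with_fallback(fetch, user_id):
    """Caller under test: fall back to a degraded response on failure."""
    try:
        return fetch(user_id)
    except InjectedFault:
        return {"user_id": user_id, "degraded": True}


always_failing = with_fault_injection(fetch_profile, failure_rate=1.0)
never_failing = with_fault_injection(fetch_profile, failure_rate=0.0)

print(fetch_profile_with_fallback(always_failing, 42))
print(fetch_profile_with_fallback(never_failing, 42))
```

A game day is the same idea with humans in the loop: you pick the failure, trigger it deliberately, and watch whether the fallback and the alerts behave the way you expected.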

Breaking up the band

when would you get rid of the program?
prevention patterns and fault injection are ratcheted in and working
incident rates and user impact are very low
this never happens if you’re a rapidly growing company, but could happen if you’re not

“Why not just use SRE?”

sigh, no, what do these words even mean

it’s much harder to introduce specialized roles like SRE than folks think

you do need subject matter experts, if that’s what you mean, then sure, but you can’t outsource this shit to a team that “implements reliability”

Closing thoughts

yup

Resources

This is a broad topic, and I’ve collected some related resources here that came up while writing or that I’ve found useful in general.

Related stuff from this blog:

Related stuff from elsewhere: