Fewer, happier incident heroes.

Published on October 8, 2022. management (218), reliability (4)

My wife was kind enough to review a piece I’m writing about incident response, and brought up a related topic that I couldn’t fit into that article but is infinitely fascinating: how should companies respond to a small group of engineers who become critical responders in almost all incidents?

This happens at many companies, usually along the lines of: a few long-tenured engineers, who happen to have the explicit access credentials to all key systems and implicit permission to use them, help respond to almost all incidents. Over time, these folks become increasingly load-bearing, as few others acquire the knowledge and access credentials to respond when they’re not available. Fast forward, and one of these key responders leaves the company, which creates more load on the remaining key responders. More and more depart, and eventually the company has a painful era of relearning how to effectively respond to incidents.

I’ve seen this happen a lot, and in most cases you can solve it by working through the following:

Identify the individuals filling this role. You can usually do this immediately, without any extended analysis
Ask the frequent responders to be less responsive when they’re not explicitly on-call. They can still watch, but wait at least ten or twenty minutes before responding as long as an incident isn’t impacting users too severely
Ask the folks on-call to be more deliberate about when they escalate to the key responders
As part of incident response where key responders are pulled in, ask them to document how they responded and train the broader on-call rotation with that documentation
For any tool or system where documentation can’t close the gap after a couple of training sessions, prioritize designing something that can be mitigated in simpler ways
This behavior will spread, particularly with a subset of new hires who will replicate the behavior (show up in every incident) without the key responders’ knowledge to be valuable when they respond. These folks create a lot of noise in their emulation attempt, and you should give them feedback immediately. If you don’t, it’s very hard to make progress on this issue
Promote and reward senior folks who run on-call rotations that succeed without their presence. Stop rewarding senior folks who have to personally show up to lead incident response
Educate other leaders within your company to similarly reward folks for building resilient on-call rotations rather than heroics

Nine times out of ten, that should resolve the issue.

Sometimes you’ll run into a values misalignment with other leaders, and in that case progress is going to be slow and painful. The very easy but very painful path is to just wait until a few key responders quit, which will prove your point. The harder but less painful path is to build data to support your case. For example, automatically record who is active in which incidents (e.g. logging incident response channels in Slack/Teams/whatever), correlate that with user impact / mean-time-to-mitigation / etc, and build a case that overreliance on key responders is making you less effective when they aren’t present, i.e. this approach is generating uncompensated corporate risk. It’s slow, but the data will always teach you something interesting, although it’ll rarely teach you what you initially anticipate.
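
To make that last idea a bit more concrete, here is a minimal Python sketch of the kind of analysis I mean. Everything in it is an assumption for illustration: the `Incident` record, its field names, the sample data, and the 50% participation threshold are all made up, and in practice you would populate the records from whatever your chat or incident tooling can export. It computes how often each person shows up in incidents, then compares mean time-to-mitigation for incidents with and without the most frequent responders.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import FrozenSet


@dataclass(frozen=True)
class Incident:
    """One incident record. All field names here are hypothetical."""
    incident_id: str
    responders: FrozenSet[str]   # people active in the incident channel
    minutes_to_mitigate: float


def participation_share(incidents):
    """Return the fraction of incidents each person was active in."""
    counts = Counter(p for inc in incidents for p in inc.responders)
    total = len(incidents)
    return {person: n / total for person, n in counts.items()}


def mttm_with_and_without(incidents, key_responders):
    """Mean time-to-mitigation, split by whether any key responder was active."""
    with_key = [i.minutes_to_mitigate for i in incidents
                if i.responders & key_responders]
    without_key = [i.minutes_to_mitigate for i in incidents
                   if not (i.responders & key_responders)]
    return (mean(with_key) if with_key else None,
            mean(without_key) if without_key else None)


# Made-up sample data, purely for illustration.
incidents = [
    Incident("inc-1", frozenset({"alice", "bob"}), 20.0),
    Incident("inc-2", frozenset({"alice", "carol"}), 30.0),
    Incident("inc-3", frozenset({"dave"}), 90.0),
    Incident("inc-4", frozenset({"alice"}), 10.0),
]

shares = participation_share(incidents)
# The 0.5 cutoff is an arbitrary assumption; pick whatever fits your data.
key_responders = {person for person, share in shares.items() if share > 0.5}
with_key, without_key = mttm_with_and_without(incidents, key_responders)

print("key responders:", sorted(key_responders))
print("mean minutes to mitigate with them:", with_key)
print("mean minutes to mitigate without them:", without_key)
```

A gap between the two averages is only a starting point for the conversation, not proof. With a handful of incidents the numbers will be noisy, and the key responders may simply be getting pulled into different kinds of incidents than everyone else.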