Don't follow the sun.

Published on July 3, 2019.

When I speak with engineering leaders, I sometimes get asked to endorse a plan, already underway, to spin up a “follow the sun” on-call rotation. Instead of one team taking pages for the full day, they’ll split the load into two twelve-hour shifts or three eight-hour shifts.

My advice is not what folks anticipate: please don’t.

The “follow the sun” model is an evolutionary dead end for engineering on-call rotations. Having run a “follow the sun” on-call rotation for several years at Uber, I’ve come to believe it’s the worst kind of ineffective solution: one that appears to be working but never delivers on the full requirements.

Motivations for this particular style of on-call rotation vary a bit, but the particularly common ones are:

incidents take too long to remediate and we want folks immediately available to start remediating to minimize downtime (cutting out the 5-10 minute delay of on-call connecting),
product engineering teams want to offload on-call to a reliability engineering team (and to remove themselves from on-call duties),
the on-call shift is too stressful to manage for a full day (and you’re afraid the team is going to quit if you can’t reduce their stress).

“Follow the sun” will appear to solve all of these problems, but won’t solve any of them effectively in the long term. My opposition can be broken into a few themes:

Decouple providing service from handling exceptions. On-call should be used to handle exceptions, not to provide _services_ (user value that scales with manual work). Some companies use an on-call rotation to provide service to their customers in different timezones, and that’s a noble and just use of “follow the sun”, and not the sort of thing I want to discourage.

However, exception handling is quite different from service providing: the former depends on context and ought to represent a novel problem, and the latter ought to be executing a repeatable, well-documented process.

The on-call rotations that I’ve experienced which are very stressful typically violate this rule, and should be split into exception and non-exception tiers (sometimes the latter category is called the “run” rotation).
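As a rough illustration of that split, here is a minimal sketch of routing pages into the two tiers. Everything in it is hypothetical: the `Page` type, the `has_runbook` flag, and the tier names are assumptions made for the example, not a description of any particular paging tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    summary: str
    # Hypothetical flag: True when a documented, repeatable fix already exists.
    has_runbook: bool


def route(page: Page) -> str:
    """Send repeatable work to the run tier and novel problems to the exception tier."""
    return "run" if page.has_runbook else "exception"


pages = [
    Page("rotate expiring TLS certificate", has_runbook=True),
    Page("unexplained latency spike after deploy", has_runbook=False),
]
for page in pages:
    print(f"{route(page)}: {page.summary}")
```

The point of the sketch is only that the decision is made on whether the work is repeatable, not on what time of day it arrives.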

Humans are too slow, anyway. If you’re providing a critical product or platform to your users, then human remediation in the moment is typically going to be too slow. Your actual remediation goals cannot be met by any number of humans, whether you have one shift or one hundred, and the approach to making those humans more effective doesn’t contribute to the work of making the automation more effective.

You have to push human remediation upstream into preemptive game days and architecture, where it is not user-impacting. You have to move towards fully automated solutions. This is very, very difficult, but as you build mastery, every incident is a novel incident, and consequently every incident is complex and requires subject matter expertise. (And the list of prevented incidents will line the streets behind you, growing rapidly and tilled by scripts rather than humans.)

This is one of the cruxes of why chaos engineering works: it moves human remediation out of the user-impacting flow by making failures happen frequently and with easier remediation (wouldn’t it be nice if you could always just turn off the chaos-generating event during an incident?).

Shift transitions magnify error rate. If you currently have a week-long on-call rotation, the shift to “follow the sun” takes you from one shift per week to fourteen with twelve-hour shifts, or twenty-one with eight-hour shifts. Shift transitions are well known to cause problems in the medical profession, likely outweighing the downsides of longer shifts, and this mirrors my experience in software on-call rotations as well.

Missing context. Because every exception ought to be novel, resolving them depends on having as much relevant context as possible, particularly about recent changes that may invalidate assumptions about cause and resolution for someone who doesn’t work in the system frequently.

Alignment and hierarchy. Finally, the industry is increasingly moving away from having an “operations” team that is responsible for production duties on behalf of a “product” team that develops the software. This division of roles simply does not work in my experience, as the value streams and feedback loops are misaligned for everyone involved. It likewise creates a hierarchy amongst your engineering teams that doesn’t reflect the sort of organization I want to represent.
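To make the shift-transition arithmetic above concrete, here is a small sketch that counts handoffs per week for back-to-back shifts of a given length. The function name `handoffs_per_week` is made up for this example, and it assumes shifts tile the week evenly with no overlap.

```python
HOURS_PER_WEEK = 7 * 24


def handoffs_per_week(shift_hours: int) -> int:
    """Count shifts (and so handoffs) in one week of back-to-back shifts."""
    # Assumption: shifts do not overlap and divide the week evenly.
    if shift_hours <= 0 or HOURS_PER_WEEK % shift_hours != 0:
        raise ValueError("shift length must evenly divide a week")
    return HOURS_PER_WEEK // shift_hours


for label, hours in [("week-long", 168), ("twelve-hour", 12), ("eight-hour", 8)]:
    print(f"{label}: {handoffs_per_week(hours)} per week")
```

Every one of those handoffs is a chance to drop context, which is the cost the longer rotation avoids.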

Annnd that’s my spiel about why I think “follow the sun” is an ineffective approach and shouldn’t be pursued.