July 3, 2019.
When I speak with engineering leaders, I sometimes get asked to endorse a plan, already underway, to spin up a “follow the sun” on-call rotation. Instead of one team taking pages for the full day, they'll split the load into two twelve-hour shifts or three eight-hour shifts.
My advice is not what folks anticipate: please don’t.
The "follow the sun" model is an evolutionary dead end for engineering on-call rotations. Having run a “follow the sun” on-call rotation for several years at Uber, I’ve come to believe it’s the worst kind of ineffective solution: one that appears to be working but never delivers the full requirements.
Motivations for this particular style of on-call rotation vary a bit, but the most common ones are:
"Follow the sun" will appear to solve all of these problems, but won't solve any of them effectively in the long term. My opposition can be broken into a few themes:
However, exception handling is quite different from service providing: the former depends on context and ought to represent a new, novel problem, and the latter ought to be executing a repeatable, well-documented process.
The on-call rotations that I’ve experienced which are very stressful typically violate this rule, and should be split into exception and non-exception tiers (sometimes the latter category is called the “run” rotation).
You have to push the human remediation upstream into preemptive game-days and architecture, where it is not user-impacting. You have to move towards fully automated solutions. This is very, very difficult, but as you build mastery, every incident becomes a novel incident, and consequently every incident is complex and requires subject matter expertise. (And the list of prevented incidents will line the streets behind you, growing rapidly and tilled by scripts rather than humans.)
This is one of the cruxes of why chaos engineering works: it moves human remediation out of the user-impacting flow by making failures happen frequently and with easier remediation (wouldn’t it be nice if you could always just turn off the chaos-generating event during an incident?).
Shift transitions magnify error rate. If you currently have a week-long on-call rotation, the move to “follow the sun” takes you from one shift per week to fourteen (with twelve-hour shifts) or twenty-one (with eight-hour shifts). Shift transitions are well known to cause problems in the medical profession, likely outweighing the downsides of longer shifts, and this mirrors my experience in software on-call rotations as well.
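To make the arithmetic concrete, here is a minimal Python sketch of how the number of shifts (and therefore handoffs) per week grows as shifts get shorter. The function name `shifts_per_week` is hypothetical, and the sketch assumes equal-length shifts that evenly divide the week.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def shifts_per_week(shift_hours: int) -> int:
    """Return how many back-to-back shifts of `shift_hours` cover one week.

    Assumption: every shift has the same length and the length evenly
    divides the 168-hour week; anything else is rejected.
    """
    if shift_hours <= 0 or HOURS_PER_WEEK % shift_hours != 0:
        raise ValueError("shift length must evenly divide the 168-hour week")
    return HOURS_PER_WEEK // shift_hours


# Week-long rotation, then the two "follow the sun" splits.
for hours in (168, 12, 8):
    print(f"{hours}-hour shifts: {shifts_per_week(hours)} per week")
```

Each shift boundary is a handoff, so a week-long rotation has one per week, twelve-hour shifts have fourteen, and eight-hour shifts have twenty-one.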
Missing context. Because every exception ought to be novel, resolving them depends on having as much relevant context as possible, particularly about recent changes that may invalidate assumptions about cause and resolution for someone who doesn’t work in the system frequently.
Alignment and hierarchy. Finally, the industry is increasingly moving away from having an “operations” team that is responsible for production duties on behalf of a “product” team that develops the software. This division of roles simply does not work in my experience, as the value streams and feedback loops are misaligned for everyone involved. It likewise creates a hierarchy amongst your engineering teams that doesn't reflect the sort of organization I want to represent.
Annnd that’s my spiel about why I think “follow the sun” is an ineffective approach and shouldn’t be pursued.