Uber, starting the SRE and platform engineering groups.
What we’ll cover today, in roughly five minute chunks, is:
Alright, let’s get started!
At jobs past, I’ve occasionally met folks who viewed “infrastructure” as “anything my team doesn’t want to do”, but that turns out to be a surprisingly fluid definition. Let’s do better.
Technical infrastructure is the software and systems we use to create, evolve and operate our businesses. The more precise definition that this talk will use is, “Tools used by many teams for critical workloads.”
Some examples of technical infrastructure are:
As you can see, we’re working with a broad definition.
This sort of infrastructure is a rare opportunity to do truly leveraged work. If you make builds five minutes faster, and do a hundred builds a day, then you’ve saved eight hours of time that can be invested into something more useful.
The consequences of doing infrastructure poorly are equally profound. If you build unreliable software, your users won’t be able to take advantage of what you’re built. You and your peers will spend your time mitigating and remediating incidents instead of building things that your users will love.
Certainly my perspective is colored by my experiences, but I truly believe that behind every successful, scaled company is an excellent infrastructure team doing remarkable things. Companies simply cannot succeed at scale otherwise.
Before we talk about how to invest and evolve, it’s helpful for us to have a shared vocabulary about types of work we’re asked to work on. Each company has its own euphemisms for these things, like “KTLO”, “research”, “prototype”, “tech debt”, “critical”, “most critical”, and so on.
Once we develop this vocabulary for a couple of minutes, we’re going to be able to start talking about the specific challenges and constraints Stripe’s infrastructure teams have run into over the years, and how we’ve adapted to and overcome each of them.
The first distinction that’s important to make is between forced and discretionary work. Forced work is stuff that simply must happen, and might be driven by a critical launch, a company-defining user, a security vulnerability, performance degredation, etc. Discretionary work is much more open ended, and is the sort of work where you’re able to pick from the entire universe of possibilities: new products, rewrites, introducing new technology, etc.
The second distinction is between short-term and long-term work. Short-term might be needing to patch all your servers this week for a new security vulnerability. Long-term might be building the automation and tooling to reduce the p99 age of your compute instances, which will make it quicker and safer to deploy patches in the future.
Then it starts to get really interesting when we combine these ideas.
Now we have this grid, with work existing within two continuums. You might argue that some of these combinations don’t exist, but I’m certain they all do, and that they’re all quite common:
You can plot each project somewhere into this grid, and you can also plot teams’ into the grid. To plot a team, you just sort of sum together all the vectors of the team’s projects.
This opens up an interesting question, where do you want your team or organization to be within this grid?
This is the sort of team that many folks initially think they want to join. However, the complete lack of forced and short-term work typically implies that there are few to zero folks using your software, or that the feedback loop between those users and the developers is entirely absent.
You often see this in teams that are described as “research teams.”
This is another common one, typically of teams which are underresourced to their purpose, hopping from fire to fire, trying to stabilize today with dreams of innovating in some distant tomorrow.
Not many folks want to join these teams, but sometimes folks regret leaving them, since these teams have a unique clarity of priority and leadership appreciation of their work.
This is where I think teams should strive to exist. You want some forced work, because they means that folks are using your system for something that matters. On the other hand, you also want some discretionary work, to ensure that you’re investing into building compounding technical leverage over time.
Alright, so we have a mental model, and now we can start to discuss the evolution of Stripe’s approach to infrastructure.
An important aspect of infrastructure engineering, often differentiating it from product engineering, is the percentage of work that simply must be done. The most common case is when a company is experiencing repeated outages, such as Twitter’s fail whale era.
I’m certain that most folks at Twitter didn’t want to be working on stability, but it got to the point where it simply wasn’t possible to prioritize anything else.
Within our framework, this is a team dominated by forced, short-term work, which goes by a variety of different names, firefighting being a popular one.
Examples for talk:
Once your team has fully dug out from fire fighting, you have a new problem: you can do anything! It seems wrong to describe this as a problem, but my experience is that most infrastructure teams really struggle when they make the shift from the very clear focus created by emergencies to the oppressively broad opportunity that comes from escaping the tyranny of forced, short-term work.
Indeed, it’s my belief that most infrastructure teams spend so little time operating in a long-term, discretionary that they stumble when they encounter it. These stumbles often land them back into the warm, clarity-filled hell of firefighting.
If you’re aware of this transition, you can manage it deliberately, and this is when things get really fun! You’ve gotta be intentional about it.
You’re used to having clear priorities without much work: it’s easy to spot a fire. That’s not going to work anymore now, you’re going to have be very intentional about determining what to work on and how to approach it.
Really, what you’re about to do is start applying the fundamentals of product management: discover problems, prioritize across them, validate your planned approach.
Discovery. When you have surplus engineering capacity, folks tend to have a long backlog of stuff they’d like to work on, and many teams immediately jump on those, but I think it’s useful to fight that instinct and to step back and do deliberate discovery.
Cast a very broad net and don’t filter out seemingly bad ideas, capture as many ideas as possible. Some techniques that work: user surveys, coffee chats, user group discussions, agreeing on SLAs, peer-company chats, academic and industry research.
If you don’t have folks clamoring for what you’re trying to build–if other teams are not frustrated that you can’t deliver it sooner–than it’s quite possible that you’re solving the wrong problem.
Prioritization. You’ve got to rank the problems. I prefer return on investment as the sort order, but there are a bunch of options. To understand long-term ROI though, you really need to have two things: a clear understanding of what your users need, and a long-term vision for what your team hopes to accomplish (maybe three years or so).
If you’re prioritizing without your users’ voice in the room, your priorities are wrong.
Validation. Once you’ve selected the problems to start working on, you should go through a deliberate phase of trying to disprove your approach. You want to identify the hardest issues first, and then you can adapt your approach with a minimum of investment. Finding problems late is much. much worse.
Some techniques to user here are: embedding with user teams/companies, building prototypes, solving for the hardest user first (not the easiest one as some do), having clear success metrics and checking against them often.
Have to maintain enough mass to your team to stay in innovation. If you start innovation teams too small or destaff them for other work, they’ll not work out. One of our success criteria for new teams is that they immediately enter the innovation model, not the fire fighting one. We facilitate that through our consistent approach to sizing teams.
As a corollary, if you don’t do a good job of picking problems and solutions, then a well run company will destaff your team. It’s an earned and maintained opportunity to have significant discretionary budget.
Consider getting product managers into your infrastructure organization. Not every team needs them all the time, but having even a few who can help navigate the transition from firefighting to innovating is super helpful.
Some random asides related to this topic:
So we’re getting to a point where we have a clear framework for getting teams out of firefighting, and a framework for helping teams succeed and remain in the innovation mode, so in theory everything should be great.
But, it’s often not.
Maybe it’s an obvious but ignored scalability problem bursting onto the scene. Or who didn’t hear about one company or another dropping everything to meet their GDPR obligations before the deadline? In both cases, I’ve found that it’s often the case that long-term, forced work goes unresourced becoming an avalanche of unexpected short-term, forced work.
We’ve adopted an approach that I call the “five principles of infrastructure” to help us think about these sorts of problems in a structured way.
So we’ve covered quite a bit so far! Firefighting, innovation, baselines, and so on. If you’re working with one team or a smaller organization, then this is probably enough to guide your infrastructure team for the next year or two.
However, things get rather complicated when you’re responsible for coordinating an organization with a couple dozen infrastructure teams, which is the problem we were facing towards the end of last year.
Then challenge then is to take all these ideas, and find a way to provide useful guidance for many different teams when, that accounts for the fact that some of those teams are going to be in innovation mode, others firefighting, and others somewhere inbetween.
What we’re about to talk about is how Stripe’s Foundation engineering team of about two hundred engineers does our planning on behalf of Stripe’s roughly six hundred engineers.
Ok, so let’s recap what we’ve discussed a bit.
There’s no single approach to investing into infrastructure that’s going to work for you all of the time. Identifying an effective approach depends on recognizing your current constraitns, and then accounting for them properly. While no one has solved this, but there are a bunch of good patterns to reuse:
Like any good strategy, the hard part here is really cultivating honest self-aware about your current situation. Once you understand your situation, design a solution just requires thorough thoughtfulness, rigorous studying of requirement, and continuing to refine the course.