Early on in your company’s lifetime, you’ll form the seed of your infrastructure organization: a small team of four to eight engineers. Maybe you’ll call it the infrastructure team. It’s very easy to route infrastructure requests, because they all go to that one team. Years later, you’ll instead have an infrastructure organization: many teams, each with a clear charter, owning its own slice of the overall problem.
Those are both stable organizational configurations, but the transition between them can be a difficult, unstable one to navigate, and that’s what I want to dig into here. I’ll start by surveying my experience helping to ramp Uber’s infrastructure organization, abstract that experience into a playbook, and end by discussing some arguments that folks raise against this approach.
When I joined Uber, the Infrastructure organization consisted of three teams (whose names were unhelpfully generic, so I’m renaming them a bit for clarity): developer productivity (~4 engineers), who worked on build and test; storage engineering (~6 engineers), who worked on scaling real-time storage; and operations (~5 engineers), who did everything else to support the company’s ~200 engineers, ~2,000 employees, and ~400% YoY growth in both usage and engineering headcount.
The first two teams were focused on acute, critical projects: keeping the engineering team productive and sharding our data to ensure we didn’t exhaust the disk space on the largest-we-could-buy hardware supporting our primary database cluster. The third team, the one I joined as its engineering manager, was responsible for keeping everything else going while the first two teams addressed their urgent focus areas.
On operations, our immediate challenges were significant. Our self-managed compute cluster ran out of capacity every Friday, leading to reduced availability (and at that point we were in a managed datacenter with limited capacity). Our Kafka cluster was experiencing significant challenges with load, and our Graphite cluster was frequently going down under load. The recently introduced move to a service-oriented architecture depended on our team doing one to two days of work for each additional service, with new service provisioning requests coming in daily. And we handled on-call for the entire company, with literally hundreds of alerts coming in during most on-call shifts (it was not unusual for your phone’s battery to die during the 12-hour, follow-the-sun shift).
This was, objectively, a pretty difficult situation. That said, we started to work the problem:
We reworked our interviewing process to accelerate hiring. We knew that if our hiring lagged the broader organization’s, we would fall even further behind, as engineering headcount was a major input into the volume of incoming requests. We grew from 5 to 70 engineers, all external hires, over a two-year period
We created a service cookbook so we could tag incoming requests to better understand where our time was going
We learned that service provisioning was our biggest source of time consumption, and it was a particularly consuming task because it required so much back and forth with the requesting team. We set up a request flow that required folks to supply all the necessary information along with their initial request. The volume was still overwhelming, so we hired an early-career engineer whose initial project was to handle all incoming provisioning requests. This reduced interruptions for the wider team so that they could better focus on building an automated solution, but it also served as a backstop for service provisioning: if that engineer fell behind the incoming request load, we just went slower. As the team continued to grow, we spun out a services engineering team that fully automated the provisioning flow. About 15 months after I started, no humans were involved in service provisioning, which had by then been migrated out of our initial data center into three new data centers
Three specific teams were placing significant and bespoke demands on us. When we supported one team’s requests, they were always followed by even more requests. When we prioritized one team, the other two would be increasingly upset that we hadn’t prioritized them; when we prioritized any of these three, the long tail of teams in the organization would be upset instead. To address this we spun up an embedded SRE function, where each of these high-demand teams got two SREs who exclusively supported their requests, but the teams had to prioritize tasks for those SREs themselves. This became a deliberate bottleneck on the amount of one-off support we provided to those teams, creating space for us to innovate on more scalable solutions
Graphite, our metrics system, was becoming overloaded: there were simply too many incoming metrics from too many machines. We started by guarding Graphite behind a small pool of servers running a C reimplementation of statsd, which aggregated thousands of servers’ worth of metrics down to four or five servers’ worth. We moved from TCP to UDP metric submission, and simply dropped the metrics we couldn’t process in a timely fashion. This bought a baseline of stability, admittedly without much accuracy, while we worked to scale up the broader backend system. Eventually we lost confidence in Graphite’s scalability and spun off a team to build M3, which solved the operational metrics problem for Uber
In our configuration, Kafka was only generally reliable at shipping logs, rather than providing the at-least-once delivery guarantee we required for some categories of logs. We did significant work stabilizing our Kafka cluster, and eventually spun out Kafka maintenance to a new team within the Data organization. That team invested heavily in Kafka, and our Kafka infrastructure became robust and reliable
We initially routed internal requests through an instance of HAProxy running on every server. As the number of servers grew, these distributed instances performing health checks became a DDoS of our own making. We reduced health check frequency, which bought us a few weeks of time. We added a health check cache running in Nginx on every host to intercept incoming checks. Eventually these solutions simply ran out of runway, and we spun off a team that built a tiered health checking infrastructure, which checked each host O(1) times rather than O(servers * avg-services-per-host) times. That tiered health checking solution solved service routing scalability for our needs
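To make the provisioning change above concrete, here’s a minimal sketch of validating that a request carries everything needed up front. All field names here are hypothetical; the point is rejecting incomplete requests at submission time instead of through days of back and forth:

```python
# Sketch of validating a service provisioning request at submission time.
# Field names are invented for illustration; the idea is surfacing every
# gap up front rather than discovering them one reply at a time.

REQUIRED_FIELDS = {
    "service_name",
    "owning_team",
    "oncall_rotation",
    "language_runtime",
    "expected_rps",
    "port",
}

def validate_provisioning_request(request: dict) -> list[str]:
    """Return a list of problems; an empty list means the request is complete."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - request.keys())]
    if "service_name" in request and not request["service_name"].islower():
        problems.append("service_name must be lowercase")
    return problems

# An incomplete request is bounced immediately, with every gap listed.
request = {"service_name": "trip-pricing", "owning_team": "marketplace"}
print(validate_provisioning_request(request))
```

A request that passes validation can then be queued for a single engineer (or, later, an automated pipeline) without any follow-up questions.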
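The statsd-style mitigation can be sketched in a few lines. This is a toy in-memory aggregator, not Uber’s C implementation, but it shows the core idea: many hosts fire small counter increments, and only a per-interval aggregate is forwarded to Graphite:

```python
# Sketch of statsd-style pre-aggregation: thousands of hosts fire UDP
# counter increments, a small pool sums them in memory, and only the
# aggregate reaches Graphite each flush interval. Losing a UDP packet
# costs a little accuracy, not availability.
from collections import defaultdict

class MetricAggregator:
    def __init__(self):
        self.counters = defaultdict(int)

    def handle_packet(self, packet: bytes) -> None:
        # statsd wire format for a counter: "<metric.name>:<value>|c"
        name, rest = packet.decode().split(":", 1)
        value, metric_type = rest.split("|", 1)
        if metric_type == "c":
            self.counters[name] += int(value)

    def flush(self) -> dict[str, int]:
        # One line per metric goes to the backend, regardless of how many
        # hosts contributed increments; counters reset for the next interval.
        snapshot, self.counters = dict(self.counters), defaultdict(int)
        return snapshot

agg = MetricAggregator()
for _ in range(1000):
    agg.handle_packet(b"api.requests:1|c")
print(agg.flush())  # {'api.requests': 1000}
```

Dropping packets under load (as the UDP switch allowed) degrades this gracefully: the counts undercount, but the backend stays up.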
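The tiered health checking idea can likewise be sketched. In this toy version (names invented for illustration), a local agent on each host checks its own services once, and a central tier polls one summary per host, so each host is probed O(1) times no matter how many services it runs:

```python
# Sketch of tiered health checking: instead of every client probing every
# service instance (O(servers * services-per-host) checks landing on each
# host), a local agent checks its own services once, and an aggregator
# polls a single summary endpoint per host -- O(1) probes per host.

def local_agent_summary(services_on_host: dict[str, bool]) -> dict[str, bool]:
    """Runs on each host: checks its own services, returns one summary."""
    return dict(services_on_host)  # stand-in for real per-service checks

def aggregate(host_summaries: dict[str, dict[str, bool]]) -> dict[str, set[str]]:
    """Runs centrally: one poll per host, merged into service -> healthy hosts."""
    healthy: dict[str, set[str]] = {}
    for host, summary in host_summaries.items():
        for service, ok in summary.items():
            if ok:
                healthy.setdefault(service, set()).add(host)
    return healthy

summaries = {
    "host-1": local_agent_summary({"pricing": True, "geofence": True}),
    "host-2": local_agent_summary({"pricing": False, "geofence": True}),
}
print(aggregate(summaries))
```

Routers then consult the aggregated view rather than probing hosts directly, which is what collapses the check volume from quadratic-ish to linear.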
That was a lot of work, which happened over the roughly two years that I worked at Uber, and we certainly did a bunch of other stuff as well: we also migrated out of our first data center, spun up (and down) two data centers in China, supported the deprecation of the original monolith, and so on.
The core organizational pattern was identifying the biggest emergency or largest source of incoming work, finding a way to provide a bounded quality of service, and focusing as much energy as possible on innovation cycles that solved the underlying problem. If the underlying problem was too large to solve in a few weeks, then, once we had the headcount, we would spin out a new team solely focused on solving it.
This wasn’t glamorous, and these were two very difficult years, but they illustrate how that core pattern of exchanging short-term low quality of service for long-term high quality of service can overcome remarkably challenging circumstances.
Rules of Scaling Infrastructure Organizations
Exchanging quality of service for investment bandwidth is a key tradeoff within an infrastructure organization, but it’s hardly the only one. Operating an infrastructure organization is maintaining a dynamic balance across many forces. You need to balance tech debt against morale. You need to balance iterating on the usability of your capabilities against delivering them before being crushed by an exponentially scaling problem tomorrow. You also need to balance your budget.
Working through those challenges, I’ve come to appreciate there are two fundamental rules (with two corollaries) to successfully operating this sort of organization:
Rule One: You must maintain service quality high enough that your leadership team doesn’t throw you out
Rule Two: You must maintain a sizable investment budget to prevent exponential problems from sinking your organization
Building on the two rules are these two corollaries:
Corollary One: If morale is too low, service quality and investment budget will both collapse (as folks leave with the essential context)
Corollary Two: If your budget is too high, it’ll get compressed (which makes everything else much harder)
If you can solve for all four of those, it’s a relatively easy job.
Trunk and Branches Model
The solution I’ve found effective for addressing the infrastructure organization rules is an approach I call the Trunk and Branches Model. You start with a “trunk team” that is effectively your original infrastructure team. The trunk is responsible for absolutely everything that other teams expect from infrastructure, and might be called something like “Infra Eng,” “Platform Eng,” or “Core Infra.”
As the team grows, you identify a particularly valuable narrow subset of the work. Valuable here means one of three things:
It’s an exponential problem that will overrun your entire organization if you don’t solve it soon; for example, test or build instability accelerating as you hire more engineers
It’s a recurring fire that is undermining your company’s credibility with users; for example, database instability causing site outages
It’s an internal workflow that’s starving your team’s ability to make investments; for example, a clunky process for manually spinning up new services at a company whose service adoption is accelerating
You then create a narrowly focused “branch team” that wholly takes responsibility for that subset of work. This might be a Storage team responsible for all real-time data storage and retrieval, or a Services team responsible for all service provisioning. The branch team is responsible for solving both the immediate and the long-term problems in its area of focus. Providing operational support within their vertical keeps them tightly connected to their users’ real problems, and sufficient staffing to support investment lets them solve those problems through platforms and automation rather than by linearly scaling the team.
Each time the trunk team grows beyond six to eight engineers, split off another branch team to focus on whatever your biggest problem or opportunity happens to be. Keep doing this for a few years of rapid growth, and your initial infrastructure team will have grown into an infrastructure organization.
Now that we’ve summarized the Trunk and Branches Model, it’s worth addressing how it handles the challenges highlighted in the _Rules of Scaling Infrastructure Organizations_ section above.
The first challenge is maintaining sufficiently high service quality at each point of growth that you keep the confidence of your peers and leadership. This model ensures there is always a clearly responsible team for incoming asks, and it facilitates spinning out the highest-burden asks into branch teams with enough staffing to solve the underlying need with sublinear staffing growth.
The second challenge is maintaining a sizable investment budget to prevent unchecked growth of exponential problems. This model spins off branch teams to consolidate investments on the most valuable problems.
The third challenge is maintaining sufficiently high team morale to retain your team. Branch team morale is driven by the focus and staffing to solve high-impact problems. Trunk team morale is driven by staffing it with folks who enjoy fighting fires, and by one-off solutions like bonuses, increased PTO, and so on. (Those solutions only need to be temporary, because the trunk team disappears as the organization grows sufficiently large.)
The final challenge is giving you the flexibility to maintain a reasonable budget. Headcount budget is maintained by restricting the number of branch teams. Infrastructure budget is maintained by spinning out an infrastructure efficiency team if operating costs begin to grow too quickly.
This isn’t easy, and it requires making bets on the right branches, but in my experience it does consistently work as long as your company views infrastructure as an essential contributor to its success rather than a cost-center to minimize.
Operating the Trunk and Branches Model
Now that we’ve dug into the model and how it solves the underlying dynamic balance, there are a few operational aspects worth expanding upon:
The combination of trunk and branches must be mutually exclusive and collectively exhaustive. Many infrastructure organizations think they can simply “unown” critical work, but this doesn’t work. You’re better off having the trunk team explicitly own an area with a reduced service commitment than having no official owner
Maintaining morale within the trunk team is an ongoing priority that requires active attention. The trunk team will eventually disappear as you build out branches, so you can do things that don’t work in the long run. Give team-specific bonuses for folks who stay on the trunk team for six months. Provide additional time off for the trunk team. Spend more time with them personally and celebrate them publicly
It’s ok for a given team to operate with significant intensity at a given point in time. I’ve consistently found that teams rise to meet temporary adversity. Where teams, and morale, suffer is in prolonged exposure to adversity for a given group. This model shifts adversity around: spinning out branch teams takes adversity off the trunk team, and staffing those branch teams lets them invest their way out of adverse conditions. If you pick and choose components from the model without ensuring that adversity rotates, it won’t work out very well
Only add branches when the team sizing math works. The trunk team must never shrink below six to eight engineers, the new branch team should have at least three engineers, and all existing branch teams should have at least five. If you can’t properly staff a new branch, it’s better to move work across teams (e.g. expand the scope of an existing branch) than to create a new one. Each branch needs to both operate existing infrastructure and invest in a replacement, which depends on a decent level of staffing; otherwise you’re not actually resourcing them to dig out, and this isn’t going to work
If you urgently need more branch teams than you can staff according to the above rules, then you have a headcount planning problem which you should address directly rather than by attempting to spin out understaffed teams
Inspect new branches to ensure they’re investing in a scalable solution rather than manually working through the problem. Each branch needs to scale its solution with a sublinear investment of headcount. Watch carefully to ensure that’s happening
You cannot replace the trunk team with a rotating on-call. This will sort of work early on, but eventually the number and complexity of the systems to maintain will be too high. You’ll end up with shadow on-call rotations (“Call Laura, she’s the only one who knows how PostgreSQL really works.”), prolonged incidents due to lack of context (“I thought we could just restart that!”), and no clarity about who is responsible for paying down the most urgent problems. This will cause you to under-deliver on service quality, violating the first rule of infrastructure organizations (“you must maintain sufficiently high service quality”)
You cannot replace the trunk team with a team staffed with a rotating membership. This works a bit better than only having a rotating on-call, but it struggles for all the same reasons
If you’re concerned you’ll need an unreasonable number of branch teams, then explore if you’re underutilizing vendors. This is your best tool for managing headcount growth to meet headcount budget expectations
The trunk is usually a single team, but in some cases you may find it easiest to run two: a centralized trunk team and an embedded trunk team that supports your heaviest consumers of capacity. In that case, the embedded model is about providing a higher perceived quality of service while bounding the support you provide and forcing the requesting team to self-prioritize their asks
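The team sizing math above can be written down as a simple check. The thresholds come straight from the rules (using six as the trunk floor, and assuming the new branch is carved out of the trunk); the function itself is purely illustrative:

```python
# Illustrative check of the team sizing rules: the trunk must keep at
# least six engineers after a split, the new branch needs at least three,
# and every existing branch should already have at least five.
# Assumes the new branch's engineers come out of the trunk team.

def can_spin_out_branch(trunk_size: int, new_branch_size: int,
                        existing_branch_sizes: list[int]) -> bool:
    return (
        trunk_size - new_branch_size >= 6  # trunk never shrinks below six
        and new_branch_size >= 3           # new branch can operate and invest
        and all(s >= 5 for s in existing_branch_sizes)  # existing branches stay healthy
    )

print(can_spin_out_branch(10, 3, [5, 6]))  # True: all three constraints hold
print(can_spin_out_branch(8, 3, [5, 6]))   # False: trunk would drop to five
```

If the check fails, the rules above point to the remedy: expand an existing branch’s scope, or fix the headcount plan, rather than launching an understaffed team.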
There are certainly more operational details worth considering, but if you start with these you’ll be on a good path.
Even Good Solutions Have Flaws
Having deployed the Trunk and Branches Model at both Uber and Stripe, I’ve run into a number of concerns from folks who believe it doesn’t work or that it’s an unreasonably painful way to operate. In this section, I want to address the most frequent of those concerns. I wholly agree with the identified problems (it’s a deeply imperfect model), but the proposed alternatives usually only superficially address the fundamental tradeoffs: all approaches have flaws, but good approaches work.
The most common concerns are:
“Working in the trunk team is too difficult to retain engineers.” I touched on this above, but this is a real challenge that requires leadership focus. Some folks love the lightly controlled chaos on a trunk team, but others hate it. For the latter, you may need to rotate them out of the team after six to twelve month stints. You may need to offer a bonus stipend to folks on the trunk team. You may need to offer increased time off. No matter what else you do, you’ll need to spend time communicating how valuable their work is directly to the trunk team and consistently in each of your wider communications to the organization. This is hard, but it’s doable with attention and creativity
“It’s inequitable to concentrate the burden on the trunk team.” I’m deeply sympathetic to the fact that it’s uncomfortable to ask the trunk team to absorb the long tail of obligations while allowing the new branch teams to focus. This does feel unfair. However, your obligation as an infrastructure leader is to guide the organization out of the unbalanced mode of operation. Preserving an unstable operating mode to maximize short-term equality is a short-sighted path that prefers “everyone is permanently in a difficult working scenario” over “everyone is permanently in a good working scenario,” all to avoid a fixed-length period of interim complexity. I just cannot understand that mentality! Commit to the transition, and then work to ameliorate the interim period’s challenges
“Innovation teams shouldn’t be burdened with operational concerns.” This concern is generally raised by folks who want to be on an innovation team that only does investment work. They view operational work as second-class work that would distract truly innovative engineers like themselves from the most rewarding, impactful work. My experience is that innovation teams that aren’t exposed to the operational concerns of real systems tend to build the wrong thing. Exposing branch teams to a concentrated set of operational concerns within their scope exposes them to their customers and their customers’ real problems. This significantly derisks execution and takes some burden off the trunk team. I understand how folks land on this perspective, but I continue to view it as a self-serving one rather than one that contributes to company, organization, or team success
“Just hire Site Reliability Engineers to solve this.” In modern companies, SRE is a software engineering role with specialized expertise in some aspect of running complex systems (reliability, scalability, etc.). Following that definition, SREs can be a critical part of both trunk and branch teams. In contrast, I find that folks who raise this concern tend to view SREs as operational capacity onto which manual work can be offloaded from “higher value” infrastructure engineers who can automate workloads. In some cases adding manual capacity to your team is a valuable strategy, but introducing a new role is a burdensome solution to what ought to be a temporary problem, provided you’re maintaining an appropriate investment budget
“This only works in a very fast growing organization.” One of the gifts of rapid growth is that it’s very easy to identify problems because they get so bad, so quickly. Slower growing companies go awry more gently, which can be harder to diagnose. This model does make a general assumption about headcount growth–that it goes up–and although it technically fits an organization without headcount growth (you spin off a fixed number of branch teams), it’s not particularly interesting, and you’ll need to introduce some mechanism for reprioritizing branch teams (and potentially for reconstituting their membership)
“This isn’t ambitious enough for an organization with slow growing technical challenges.” I generally agree with this critique, although with sufficiently slow growing technical problems, there’s little incentive for moving beyond the initial infrastructure team. Trunk and Branches doesn’t have much of anything to say about that scenario
Despite all those concerns, and having deployed the trunk and branches model twice, I still think it’s the best available option to operate with when you find yourself scaling a small infrastructure team into an infrastructure organization.