December 9, 2018.
Technical infrastructure is never complete. System processes can always run with less overhead or be bin-packed onto fewer machines. Data can be retrieved more quickly and stored at a cheaper cost per terabyte. System design can broaden the gap between failure and user impact. Transport layers can be more secure.
The sheer variety of investable projects is overwhelming. There are always new technologies to adopt or finish adopting: Docker, Kubernetes, Envoy, GKE, HTTP/2, GraphQL, gRPC, Spark, Flink, Rust, Go, Elixir are just the beginning of your options. Add cloud vendor competition, and the rate of change is pretty staggering.
With such a broad problem domain filled with so many possible solutions, I've sometimes found it difficult to provide guidance for infrastructure teams to prioritize their work. Originally, I thought this was because I lacked depth in some facets, but I slowly came to realize it was equally difficult for the teams themselves to prioritize their own work: there were simply too many options.
A couple of years ago I put together an infrastructure planning framework I referred to as the five properties of infrastructure. It worked well at a certain degree of complexity, but recently it hasn't been providing the degree of guidance we need. As such, over the past couple of months I've iterated onto a second framework: users, baselines and timeframes.
Let's dig into the original and new frameworks!
When we put together our first infrastructure planning framework, the issues we wanted to provide clearer guidance around were:
Our solution was a ranked list of infrastructure properties, all of which we'd do some work on, doing the most for those higher in the list:
This wasn't a perfect framework, but it was quite useful during planning sessions. In particular, it was good at balancing focus across internal and external users, and avoiding a common infrastructure pitfall of self-dealing during prioritization.
However, there were some important questions that this framework didn't help answer:
So we went back to the drawing board to build on what worked.
Before jumping into the next framework, a quick comment on creativity. Some of the most valuable work we do as leaders is to avoid accepting tradeoffs as inevitable, and instead treat them as accidental.
It's often possible, with subject matter expertise and creativity, to "have both." I think of this as getting paid twice, and this is one of the important arguments for why you should avoid teams becoming exhaustively busy: tired folks are not creative.
As we sat down to iterate on our infrastructure prioritization framework, the things we particularly wanted to improve were:
We spent some time reflecting on what did and didn't work well in the previous iteration, as well as those goals for improvement. Emerging from that reflection, the system we developed is:
Now we start the planning cycle with a list of user asks, and then merge in projects needed to maintain our baselines across appropriate timeframes. When users ask to understand why we're doing infrastructure work instead of an alternative project, we can explicitly connect our investments to business value.
While this isn't a perfect tool, I've found it to be a powerful and effective way to explain priorities and tradeoffs.
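To make the planning flow above a bit more concrete, here is a minimal sketch of starting from user asks and merging in baseline work by timeframe. Everything in it is an assumption made for illustration: the `Project` fields, the `short`/`medium`/`long` timeframe buckets, and the example projects are hypothetical, not the actual categories or tooling we used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    name: str
    source: str     # hypothetical labels: "user_ask" or "baseline"
    timeframe: str  # hypothetical buckets: "short", "medium", "long"
    rationale: str  # the business value this project connects to


# Assumed ordering of the hypothetical timeframe buckets.
TIMEFRAME_ORDER = {"short": 0, "medium": 1, "long": 2}


def build_plan(user_asks, baseline_projects):
    """Start from user asks, then merge in baseline projects not already covered."""
    plan = list(user_asks)
    seen = {project.name for project in plan}
    for project in baseline_projects:
        if project.name not in seen:
            plan.append(project)
            seen.add(project.name)
    # sorted() is stable, so within a timeframe user asks stay ahead of baseline work.
    return sorted(plan, key=lambda project: TIMEFRAME_ORDER[project.timeframe])


def explain(project):
    """Connect a planned project back to why it is in the plan."""
    return f"{project.name} ({project.source}, {project.timeframe}): {project.rationale}"


user_asks = [
    Project("faster CI builds", "user_ask", "short", "requested by product engineers"),
    Project("self-serve databases", "user_ask", "long", "requested by product engineers"),
]
baseline_projects = [
    Project("certificate rotation", "baseline", "short", "maintains the security baseline"),
    Project("self-serve databases", "baseline", "long", "already covered by a user ask"),
]

plan = build_plan(user_asks, baseline_projects)
for project in plan:
    print(explain(project))
```

The point of the `rationale` field is the same as in the prose: when someone asks why a piece of infrastructure work made the plan, each entry carries its own connection to a user ask or a baseline.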
One of the most interesting early questions that came up while developing these guidelines was whether infrastructure teams should list themselves as each other's customers.
For the overall organization, I believe the answer here is generally no. The measure of an infrastructure organization's impact is the discounted future product development velocity it will enable, and something this model emphasizes is that all self-investment is done on behalf of increased future product velocity.
Conversely, at the team level within an infrastructure organization, I think it's very important to recognize the other infrastructure teams as your users. This is particularly true because good infrastructure offerings are composable, building upon other infrastructure offerings to allow folks to pick appropriate productivity:property tradeoffs for their projects.
When folks read frameworks such as these, the—quite appropriate—first question is generally, "Do you actually do this?" In this case, we're winding down our latest planning process for the team I work with, and what I've written here is a fairly accurate, somewhat abstracted, and slightly idealized version of what we did.
The one omitted bit is that we also layer on a simple resourcing framework to ensure we're investing in sustainable, operable systems, and to support the kinds of long-term projects that will shift our baselines in ways that tactical work doesn't.
The current iteration of that resourcing framework is:
The framework is super simple, but I've found it works well. In particular, it ensures that the right questions get asked during the resourcing process, which is the ultimate goal of resourcing and planning discussions.
I'd be very curious to hear from y'all how your teams do infrastructure planning. What has worked well for you? What hasn't? What are your biggest challenges?