Measures of engineering impact.
My engineering leadership circle met yesterday, and as usual, we talked about our current challenges before segueing into a deeper discussion. This time, Jack Danger brought up the challenge of measuring engineering impact, which is a fascinating topic that most engineering leaders have to tackle.
Measures of engineering impact are not Accelerate’s measures of developer productivity: lead time, deployment frequency, change failure rate, and time to restore service. Those measures support understanding and optimizing your development process but aren’t very effective at grading business impact. That’s partially because there are many ways to score highly against those measures without creating much business impact, and partially because they don’t resonate much with folks outside of the engineering organization. (Conflating efficiency with impact is also my general frustration with the current crop of manager tooling attempting to measure developer meta-productivity.)
Some examples of how organizations measure engineering impact:
- Amazon: # press releases (h/t Jack Danger)
- Square: # new billable features (h/t Jack Danger)
- Gusto: # competitive advantages created, # competitive advantages improved, # table stake features (h/t Jack Danger)
- Uber: # projects shipped per quarter per team. Projects are not just product-focused: migrations and tech debt removal also count as projects, as long as they have an impact on the team (h/t Gergely)
Looking at the Amazon, Square, Gusto, and Uber impact measures, there are a few characteristics that make them effective:
- They’re straightforward: if you buy into the trope that leadership is mostly about repetition, these are the sorts of measures you can recite at the start of every org meeting without taking up too much time. They don’t take much explanation to convey
- They center on continued innovation, which is the implicit religion of growth-oriented businesses. There are many valid concerns with the innovation emphasis, but these goals correctly align with their companies’ values that emphasize creation over sustenance
- They are relatively difficult to “game” in dysfunctional ways as long as they’re used by a hands-on leadership team. Most dysfunction would be easy to detect over time: if you publish a mediocre press release, people will notice and take corrective action
They’re not perfect, but I’m pretty confident any engineering organization with more than one layer of management would be better off with something along these lines than without. This sort of goal is, to some extent, your minimum viable engineering strategy.
Personally, I’ve also spent a bunch of time thinking about this topic, and within Calm Engineering, we’ve been doing something similar for the past year, with a focus on:
- # big bets - these are new, differentiated features released to users. While some parts of this are a shared goal across pretty much all organizations within the company, we hold ourselves accountable as one of the core constraints (if not the core constraint) on delivery
- # experiments - how many features and optimizations have we tested, with a target distribution across winning, neutral, and losing outcomes. The set targets help balance quantity against ambition
- # fires - how many incidents have we had? We’re not trying to drive this to a target; rather, we’re trying to avoid a slope change. I do believe fires are an important countervailing measure to factor into overall productivity
- # technical investments (1-2) - how many technical investments or explorations are we doing? Organizations have limited bandwidth to absorb technical change, so we grade ourselves on having a few but not many (sketched below)
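To make this a bit more concrete, here’s a minimal sketch of how a team might record these measures as a quarterly scorecard. Everything here is hypothetical: the field names, thresholds, and the idea of flagging concerns are illustrative assumptions, not a description of Calm’s actual tooling.

```python
from dataclasses import dataclass


@dataclass
class QuarterlyScorecard:
    """Hypothetical quarterly scorecard; names and thresholds are illustrative."""
    big_bets_shipped: int
    experiment_outcomes: dict  # e.g. {"win": 4, "neutral": 3, "loss": 3}
    incidents: int
    incidents_last_quarter: int
    technical_investments: int

    def concerns(self) -> list:
        """Surface concerns for discussion rather than computing a single score."""
        notes = []
        experiments = sum(self.experiment_outcomes.values())
        if experiments < 10:  # assumed per-quarter experiment target
            notes.append("fewer experiments than targeted")
        win_rate = self.experiment_outcomes.get("win", 0) / max(experiments, 1)
        if not 0.2 <= win_rate <= 0.6:  # assumed target band for winning outcomes
            notes.append("win rate outside the target band: check ambition vs. quality")
        if self.incidents > 1.5 * max(self.incidents_last_quarter, 1):
            notes.append("incident count looks like a slope change, not noise")
        if not 1 <= self.technical_investments <= 2:
            notes.append("either no technical investment or more than the org can absorb")
        return notes


# Made-up numbers for illustration.
print(QuarterlyScorecard(
    big_bets_shipped=2,
    experiment_outcomes={"win": 8, "neutral": 1, "loss": 1},
    incidents=9,
    incidents_last_quarter=4,
    technical_investments=3,
).concerns())
```

The useful property of something this simple is that it stays cheap to fill in each quarter and prompts a conversation, rather than pretending to be an objective grade.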
Comparing Calm’s measures to those used by the other companies above, the obvious flaw is that there are simply too many of them and they’re too complex. I think this is mostly a matter of optics: Square and Amazon obviously track more than one measure of impact; it’s just a matter of different granularities.
Shifting topics a bit, it’s also interesting to think about how measuring engineering impact does or doesn’t simplify the problem of measuring infrastructure engineering impact. The challenge of measuring infrastructure impact consumed much of my last two years at Stripe, as I pursued the goal of right-sizing infrastructure investment relative to infrastructure impact.
That topic led to several explorations of how to measure impact within an organization whose primary contribution is enablement.
One way or another, all of those approaches have rough edges. Some folks argue that infrastructure organizations should measure themselves as if they were Heroku: user adoption, reliability, cost to serve, etc. I’ve tried that: it’s a beautiful dream and parts of it work. The biggest challenge is that infrastructure groups typically become their company’s vehicle for enforcing various global trade-offs like GDPR, data locality, security, high availability, and so on.
In organizations that reward or tolerate local optimization, infrastructure groups become the unwilling avatar of global optimization and are frequently the intermediary between company goals (e.g. GDPR compliance) and misaligned pockets of locally optimizing users. You can grade that infrastructure group on user satisfaction, but in practice these are users that you would fire as an independent company (or more likely would have never bought your services to begin with), so it’s a bit of an awkward exercise.