My skepticism towards current developer meta-productivity tools.

November 18, 2020. Filed under productivity

It’s hard to write about engineering leadership in 2020 and not mention the research from Accelerate and DORA. They provide a data-driven perspective on how to increase developer productivity, which is a pretty magical thing. Why aren’t they being used more widely?

There are three core problems I see:

  1. The nefarious trap of using productivity measurements to evaluate rather than learn
  2. Instrumenting the productivity pipeline requires operating across many different tools
  3. Most instrumentation and dashboarding tools force you to model the problem poorly

Together these create enough friction that most teams never get around to using these metrics, even the ones who know they should.

Learning over evaluation

There are a number of engineering productivity measurement startups out there, many of which came into existence in the past two to three years. Wouldn’t you, as an engineering manager, love to know who is falling behind on their commit velocity? Won’t stack ranking at your next calibration session be easy when you have everyone’s commits per month for the past quarter?

I’m personally convinced that these companies are selling products that harm the companies that use them rather than help them. Using productivity metrics to measure individuals this way is akin to incident retrospectives that identify human error as the root cause. It’s performative, and if you want to blame someone, just go ahead and blame them; don’t waste your time gathering arbitrary metrics to support it.

The real need here is capturing the data to support learning, and learning happens in batches. It’s useful to look at how the defect rate compares across teams, as long as you dig into understanding what those teams do differently. Maybe they have a different testing or code review process, maybe they have a different tenure or seniority mix.

As long as tooling keeps privileging the manager who wants to grade their team rather than learn from the development process, these tools will be actively distrusted by the engineers who instrument them and create false confidence in managers using them. The right tool here should be designed exclusively from a learning perspective.

Instrumenting so many tools

Resources like Buritica’s A primer on engineering delivery metrics help give the general shape of the approach, but there are so many details to work through. Productivity pipelines include so many tools that it’s unwieldy to instrument them all in a consistent way. A small startup might use Github, Docker, Terraform, Jenkins, and Kubernetes, each of which introduces complex questions to answer:

  * Is the unique identifier the pull request? What if our builds get backed up and we deploy two pull requests in a single build?
  * How do we distinguish between a Jenkins crash caused by an out-of-disk-space error versus failed tests?
  * Is a Kubernetes deployment finished when all pods are upgraded? What if one pod fails? What if that failure later turns out to be a node failure unrelated to the code?
  * How do we get insight into workflow before the pull request is created?

Each of these questions is answerable, but they take a lot of time to work through and tie together into a cohesive view of reality. It’s also easy to instrument them in ways that create subtle misunderstandings of what’s happening. This morass of details to work through discourages many teams from finishing the work.

The best tool will automatically integrate with most of these common tools while also offering a fully customizable client, something along the lines of Datadog’s integration strategy. Even just having a grounded recommendation on which unique identifiers to use throughout the process would be very helpful.
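To make the identifier recommendation concrete, here’s a minimal sketch of what a consistent event schema might look like, with every tool emitting the same shape keyed on the pull request number. All names here are hypothetical, not from any particular product:

```python
from dataclasses import dataclass, asdict
import time

# Hypothetical normalized event: every tool (Github, Jenkins, Kubernetes, ...)
# emits the same shape, using the pull request number as the shared identifier
# that ties the whole pipeline together.
@dataclass
class PipelineEvent:
    pr_id: int        # unique identifier threaded through every tool
    source: str       # "github", "jenkins", "kubernetes", ...
    stage: str        # "pr_opened", "build_started", "deploy_finished", ...
    status: str       # "success", "failure", "in_progress"
    timestamp: float

def emit(event: PipelineEvent, sink: list) -> None:
    """Append to whatever sink you use (queue, database, log)."""
    sink.append(asdict(event))

sink: list = []
emit(PipelineEvent(pr_id=1234, source="github", stage="pr_opened",
                   status="success", timestamp=time.time()), sink)
emit(PipelineEvent(pr_id=1234, source="jenkins", stage="build_started",
                   status="in_progress", timestamp=time.time()), sink)
```

The point isn’t the specific fields; it’s that one agreed-upon identifier and one event shape lets you join Github, Jenkins, and Kubernetes data without bespoke glue per tool.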

Need observability over monitoring

In the monitoring and observability space, Honeycomb and Lightstep have pushed a definition of observability centered on supporting ad-hoc rather than precomputed queries. Monitoring tools like Grafana/Graphite/Statsd might push you to emit measures that are pre-aggregated to support certain queries like “what’s the p50 latency by service, by datacenter?” Observability tools like Honeycomb and Lightstep push you to emit events which are then aggregated at query-time, which supports answering the same question you’d ask Grafana, but also questions like “what are the latencies for requests to servers canarying this new software commit?” or even “show me full traces for all requests running this commit.”
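The distinction can be sketched in a few lines. This is an illustration of the two styles, not any vendor’s actual API; the data is made up:

```python
from statistics import median

# Monitoring style: aggregate at emit time. Only the questions you planned
# for in advance can ever be answered.
p50_by_service = {"api": 112, "web": 87}   # precomputed p50 latency (ms)

# Observability style: keep the raw events and aggregate at query time.
events = [
    {"service": "api", "commit": "abc123", "latency_ms": 90,  "canary": True},
    {"service": "api", "commit": "abc123", "latency_ms": 140, "canary": True},
    {"service": "api", "commit": "def456", "latency_ms": 110, "canary": False},
]

# The planned question still works...
api_p50 = median(e["latency_ms"] for e in events if e["service"] == "api")

# ...but so does a question nobody planned for: latency of the canaried commit.
canary_p50 = median(e["latency_ms"] for e in events if e["canary"])
```

With pre-aggregation, the canary question is simply unanswerable after the fact; with raw events, it’s one more filter.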

Too many of these developer meta-productivity tools focus on monitoring-style solutions, which are a mediocre fit for measuring productivity and an even poorer fit for supporting learning. This is a shame, because the scale challenges that push infrastructure tooling towards pre-aggregated monitoring simply don’t exist when looking at human-scale events.

For example, if you’re measuring server response times across a fleet of Kubernetes nodes running ten pods per node, then you might be looking at 100 requests per second * 10 pods per node * 1000 nodes * 3 availability zones, which is three million measurements per second that you need to record. This drives tradeoffs towards reducing the quantity of data being stored. The observability infrastructure required to store all those events without aggregating, at a reasonable price, is complex and bespoke. (Roughly, my understanding is that most observability tools instead capture events in a ring buffer, and upon eviction from the ring buffer, interesting events and summarized data are transitioned to more durable storage.)

However, the scale that these developer meta-productivity tools operate at is so much smaller that there’s no need to solve the underlying infrastructure problems. Just write it to MySQL or Kafka (streaming to S3 for historical data) or something. There’s no reason the underlying events shouldn’t be available to support deeper understanding.
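At this scale a single ordinary table really is enough. A minimal sketch, using SQLite as a stand-in for MySQL; the schema and values are illustrative:

```python
import sqlite3
import json
import time

# One plain table of raw events; no bespoke observability storage required.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE pipeline_events (
        pr_id   INTEGER,
        source  TEXT,
        stage   TEXT,
        ts      REAL,
        payload TEXT
    )
""")
db.execute(
    "INSERT INTO pipeline_events VALUES (?, ?, ?, ?, ?)",
    (1234, "github", "pr_merged", time.time(), json.dumps({"author": "mira"})),
)

# Ad-hoc, query-time aggregation stays cheap at human scale.
(count,) = db.execute(
    "SELECT COUNT(*) FROM pipeline_events WHERE pr_id = 1234"
).fetchone()
```

Every question you might later want to ask stays answerable, because the raw events are still there.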

The fundamental workflow I’d like to see these systems offer is the same as a request trace. Each pull request has a unique identifier that’s passed through the developer productivity tooling, and you can see each step of its journey in your dashboarding tooling. Critically, this lets you follow the typical trace instrumentation workflow of starting broad (maybe just the PR being created, being merged, and then being deployed) and then adding more spans into that trace over time to increase insight into the problematic segments.
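That workflow can be sketched directly: one trace per pull request, one span per pipeline step, with finer-grained spans added later wherever the broad ones reveal a problem. All class and span names here are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    start: float  # seconds since the PR was opened
    end: float

@dataclass
class PullRequestTrace:
    pr_id: int
    spans: list = field(default_factory=list)

    def add_span(self, name: str, start: float, end: float) -> None:
        self.spans.append(Span(name, start, end))

trace = PullRequestTrace(pr_id=1234)
# The broad first pass: created -> merged -> deployed.
trace.add_span("pr_open", 0.0, 3600.0)
trace.add_span("merge", 3600.0, 3660.0)
trace.add_span("deploy", 3660.0, 4500.0)
# Later, drill into a problematic segment with a finer-grained span.
trace.add_span("deploy.canary", 3660.0, 4200.0)

# Query-time analysis: which step dominates the PR's journey?
slowest = max(trace.spans, key=lambda s: s.end - s.start)
```

The appeal of this shape is incremental adoption: the first three spans are enough to show value, and every additional span narrows in on where time actually goes.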

So, who cares?

I’m writing this (a) in the hope that there are folks out there working on this problem from this perspective, and (b) as a reusable explanation of what sort of developer meta-productivity tools I’m excited about when folks email me for feedback/angel investment/etc.

For what it’s worth, I’m not necessarily saying I think this will be a good business. I’m confident it’s the tool that engineering leadership teams need to more effectively invest in developer productivity, but I’m less confident it’s the sort of thing people want to pay for; figuring out the go-to-market and distribution strategy is probably the hardest part of this sort of product. This highlights another advantage of the observability/traces/spans approach: you can import the event history and show value immediately, instead of requiring folks to use the tool for a while to build out new metric aggregations.

Ultimately though, I think companies like Github are going to be the best positioned to make progress on this sort of thing, especially as Github Actions takes up more room in the developer workflow. (I’m not quite sure what Nicole Forsgren is working on in her new role at Github, but I have a dream that it’s at least related to this.)