I’m in the early stages of working with my friend, Rachael Stedman,
on an “Ask an engineering leadership” project where we try to answer folks’ challenging engineering leadership questions.
This is one of the questions that came in that wasn’t a perfect fit for that project
(we’re still calibrating a bit on what is a perfect fit), but still a question I wanted to take a stab at answering.
I work as an Engineering Manager at a large organization, supporting team managers across four teams who work on the shared infrastructure and platform for the various web products. The department is new, and is a result of a strategic shift to complement existing engineering departments which traditionally managed their own separate infrastructure and stacks. My joining this new department about nine months ago also coincided with me stepping up into the manager-of-managers role after a number of years as a team manager in another organization.
The nascent platform was initially developed from a ‘spike’, and I have inherited a set of teams, responsible for different layers in the platform stack. These teams were existing teams who have come together from various other departments. I’ve faced the challenge of building cohesion between these teams to ensure we are building a cohesive platform. We’re also now seeing the number of teams adopting our platform accelerate significantly, with projections of 250+ engineers developing on our platform in the coming months.
Our tenant teams are now facing into the frustrations of the shared platform, which inevitably constrains them more than their old world. I am confident our strategy lies in building self-service tooling, and we have the promise of being able to accelerate all of these teams. However, the rapid adoption means these benefits lie beyond the maturity of the platform as it stands just now and the frustrations are real and present. This frustration is starting to cause significant friction, with lots of door stepping for support requests and a sense amongst tenant teams of my teams acting as ‘the police’ due to more human intervention than we’d like.
The friction distracts my team from their strategic goal of maturing the platform. I fear that tenant teams will circumvent the platform in order to avoid this friction. If this happens, then we will never realise the potential benefits of the shared platform. How can we build buy-in from the wider organization and ensure we are seen as an accelerator amongst the tenant engineering teams - while also maturing the platform?
The first thing I’d say is that this is a fundamental tension that all internal platform teams experience.
It’s not just you; this is a common growing pain. With that said, there’s a lot you can do to reduce the pain.
It sounds like you’re in the midst of a prolonged migration,
and one of the core tenets of running a migration is to drive fit before you drive adoption.
Many platform teams have goals that incentivize migrating as many users onto the platform as possible,
even if it’s an unhappy adoption.
This will spike your early adoption numbers, but stall out fairly quickly.
Instead, the preferable approach is to bring on one or two challenging users
and iterate with them until the platform solves their needs. In parallel, bring on easier users at a fixed rate
with a focus on streamlining the onboarding process. (The goal there is absolutely not to drive up platform adoption.)
Until you have some complex teams who have successfully adopted the platform and are happy using it,
driving adoption will only lead to sorrow.
It sounds like you are in that sorrow right now, which is totally understandable: most platform teams are there at some point.
If all of your users are unhappy, it’s possible that you started adoption a bit too early and you’re just going to have a rough go of it,
but at least that means that there are only a few things you need to solve before most folks will be happier.
If only some of your users are unhappy, maybe you can fire them for the time being. Yes, it’s embarrassing to have them churn off your
platform, but your goal is maximizing long-term adoption, not maximizing short-term adoption, and this is a case where you risk sacrificing
your long-term goal if you are afraid of letting a few particularly challenging users churn off your platform (back to whatever they used before).
Often when folks are frustrated with lack of power, it turns out to be an interface design issue,
and I started to focus on the idea of providing pierceable abstractions
to allow users to bypass the abstractions as necessary to reach another layer of complexity (e.g. from container to VM).
It’s not that you want them to bypass the abstraction, just that it’s hard to prioritize every team’s specific needs,
and letting them bypass allows you to address their problem on your schedule rather than their schedule.
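To make the idea of a pierceable abstraction a bit more concrete, here’s a minimal sketch in Python. Everything in it is hypothetical and made up for illustration (the `ServiceSpec` type, the `overrides` escape hatch, the generated config fields); it’s not any particular platform’s API. The platform owns a small, opinionated spec, and a tenant can pierce it by supplying raw settings for the layer underneath.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ServiceSpec:
    """High-level, platform-owned description of a service (hypothetical)."""
    name: str
    image: str
    replicas: int = 2
    # Escape hatch: raw lower-layer settings merged over the generated config.
    overrides: dict[str, Any] = field(default_factory=dict)


def deep_merge(base: dict, patch: dict) -> dict:
    """Return a copy of base with patch merged in recursively."""
    merged = dict(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def render(spec: ServiceSpec) -> dict:
    """Generate the lower-layer config, then apply the tenant's overrides."""
    generated = {
        "name": spec.name,
        "replicas": spec.replicas,
        # Assumed platform default, chosen only for this example.
        "container": {"image": spec.image, "memory_mb": 512},
    }
    return deep_merge(generated, spec.overrides)


def teams_piercing(specs: list[ServiceSpec]) -> list[str]:
    """List services that use the escape hatch, as input for the roadmap."""
    return sorted(s.name for s in specs if s.overrides)
```

A team that needs more memory than the default can set `overrides={"container": {"memory_mb": 2048}}` today without waiting on the platform team, and something like `teams_piercing` gives you a list of where the abstraction is falling short so you can fix it on your own schedule.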
Once you identify a strategy to allow the teams to solve their own problems, whether it’s a pierceable abstraction
or the self-service tools you’ve described, then it’s a matter of scraping together enough engineering time to
actually get there. This is a case where I think what feels fair and what works are a bit different, and in my opinion
what works is actually the fairest thing you can do since it avoids the scenario of everyone being miserable together forever.
More concretely what this means is that I’d recommend having a subset of the team, it can be a rotation or whatever,
who simply soak up the incoming requests and do their best to support them. They’re going to be overloaded and going to
do a mediocre job. They’re going to be annoyed about it. The rest of the team has to then be completely focused
on the work that will relieve the load on the folks soaking up the incoming requests.
I’d recommend using a service cookbook to make tracking requests easier,
so you can ensure you’re prioritizing the right work.
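As a rough illustration of what tracking requests can buy you, here’s a small Python sketch that totals the time the support rotation spends per request category. The `SupportRequest` fields and the category names are assumptions invented for the example, not a real tool’s schema; the point is only that a simple tally tells the rest of the team which self-service work would relieve the most load.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SupportRequest:
    """One incoming request handled by the support rotation (hypothetical)."""
    team: str
    category: str
    minutes_spent: int


def toil_by_category(requests: list[SupportRequest]) -> list[tuple[str, int]]:
    """Total minutes spent per category, largest first."""
    totals: Counter[str] = Counter()
    for request in requests:
        totals[request.category] += request.minutes_spent
    return totals.most_common()


if __name__ == "__main__":
    log = [
        SupportRequest("checkout", "dns-change", 30),
        SupportRequest("search", "new-service", 90),
        SupportRequest("checkout", "new-service", 60),
        SupportRequest("ads", "dns-change", 15),
    ]
    for category, minutes in toil_by_category(log):
        print(f"{category}: {minutes} minutes")
```

In this made-up log, provisioning new services dominates, which would be the argument for automating that first.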
The above scenario is basically what I was hired into at Uber with a four engineer Core Operations team that was trying to
support 200 engineers and falling further behind every day. We dug out through ruthless prioritization, some pretty tough quarters,
and building a fantastic self-service platform that eventually offloaded all the manual work we were doing.
Then folks went off and built ten thousand services using it, which is a story for another day.
Some of the stuff I’ve written related to this topic: