Chatting with a friend recently, I learned their company was running into a common developer productivity pitfall. The company had mandated a migration away from its monolithic architecture and monorepo, but the migration was stalling out. To speed up the transition, the responsible infrastructure team decided to stop supporting the monolith and instead focus on the new service environment. Two years later, engineers were quitting to avoid working on either side of the migration: both the new, incomplete services ecosystem and the old, stagnant monolithic ecosystem. Engineers had started to resent the infrastructure team and their struggling migration. The infrastructure team was annoyed as well, particularly at engineering leadership, which they felt was sabotaging this critical migration by under-resourcing it.
Migration stories are particularly interesting to me because I believe large-scale migrations are the only scalable approach to technical debt. They’re also interesting because bad migrations can go so very wrong. In this case, their engineering organization had, in just two years, lost decades of engineering velocity, and accelerating attrition had them on track to lose decades more. Decades!
In most organizations, centralized infrastructure and developer productivity organizations take on the lion’s share of migrations. When a migration goes awry, the most frequent self-limiting belief I’ve encountered among the folks leading the work is that they’re failing due to the lack of headcount or resourcing. I consider blaming headcount a self-limiting belief because it’s the least helpful explanation for why something is going wrong; even if it’s partially true, there’s always a more interesting explanation to find.
Alright, let’s survey a few topics here:
Why lack of headcount isn’t a useful explanation
Finding better ways to diagnose struggling migrations
Your options as a leader responsible for a struggling migration
Let’s dig in.
“We need more headcount”
One tool I’ve found particularly helpful in leading large organizations is the business review template. Leaders write up a summary of their area, and contextualize their progress against the past and their current goals. Most versions of these documents include a section on, “What is slowing you down?”, and I’ve rolled out and operated this process frequently enough to know that area leaders very often initially answer this question by saying that they, “need more headcount to complete their goals.”
My advice is to say almost anything else. There are a few different reasons why I find headcount an unhelpful answer:
Headcount is just another word for budget, so this is roughly equivalent to saying, “if we just invested more money here, we wouldn’t be having problems.” This is only an interesting answer if you have good reason to believe that your current strategy and execution are excellent. Otherwise, investing more money is a highly inefficient approach.
Similarly, your planning process should have indexed your goals against your planned headcount. If you have the headcount that you planned your goals against, then adding more headcount is an admission that you’re not tracking to plan. This means your strategy or execution isn’t working for your current problem and headcount constraints. There must be a more interesting change to your approach, or to the problem itself, than simply forcing the current approach forward with more people working on it.
If you don’t do a good job of picking problems and solutions, then a well-run company will destaff your team. Significant discretionary budget is an opportunity that you earn and maintain.
If your infrastructure organization has too few resources, then you’re probably not showing enough value to whoever is setting the organizational budget. Do they know what you’re working on? Do they agree it’s important? How are you deliberately validating your plans against what they believe is important? How are you working to educate them on the organization’s current needs from your perspective? Spend most of your time driving impact towards the goals that leadership cares about, and budget problems get much easier.
Debugging a migration
The other issue with asking for headcount is that, even if it were approved, you won’t have more people helping out in the short term, and you’ll probably get slower while you do the additional work of interviewing and training those hires. Either way, you have to debug and improve the migration with the team you have.
If you’re not sure why the migration is going poorly, here are some questions to ask:
Where are folks getting stuck in the migration process? If you don’t have clear instrumentation on the migration funnel, then it’s very difficult to debug what’s going wrong. For any sufficiently large migration, you should have clarity on the total number of migration points, how many have attempted to start the migration, and progress along the entire funnel (scaffolded new repository, provisioned in dev environment, set up on-call rotation, provisioned in production, etc)
Are folks not starting the migration? If folks aren’t even starting the migration, then they don’t believe it’s useful to their needs, or they are getting counter-signal from somewhere that migrating is a bad idea (maybe from folks who previously tried to migrate and had a bad time of it)
What cohorts are and aren’t migrating successfully? You’ll often find cohorts of use cases where the migration is and isn’t working. For example, teams that have a strong habit of debugging live on production servers tend to resist any migration that forces them to instead debug through logs. As you understand the cohorts where your migration is succeeding and failing, you can better focus your efforts on the struggling areas (it may feel better to double down on areas where the migration is already going well, but that doesn’t get you any closer to finishing)
For teams that aren’t migrating, what are they doing instead? How is that going? When teams resist adopting a migration, they’ve often found a better solution to their specific problems. Maybe this was a solution that you initially dismissed when you began the migration, or maybe the actual problems your migration was based on aren’t as important as you initially believed. For example, maybe your monolith is actually the right solution for most product code and only high-volume, low-latency components should migrate into their own services. (This is a potential place to leverage your developer productivity survey to get a wide view across the organization.)
For teams who are migrating, where is the migration bottleneck, and what would resolve that bottleneck? At any given point, there will typically be one or two points in the migration funnel that are slowing folks down. Solve those and you’ll unlock the whole migration. For example, folks might be asking you to perform a migration step without supplying all the information you need to resolve it, leading to a lengthy back and forth that could be prevented with better documentation, structured ticket fields, and ensuring the documentation is visibly linked wherever requests are made
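To make the funnel and bottleneck questions above concrete, here’s a minimal sketch of what that instrumentation might look like. The stage names and data shape here are hypothetical; a real migration would pull this data from its provisioning and ticketing systems.

```python
# Hypothetical funnel stages, in order; real stage names will vary by migration.
STAGES = [
    "not_started",
    "scaffolded_repo",
    "provisioned_dev",
    "oncall_configured",
    "provisioned_prod",
]


def funnel_report(services: dict[str, str]) -> dict[str, int]:
    """Count how many services have reached at least each stage.

    `services` maps service name to the furthest stage it has reached.
    """
    index = {stage: i for i, stage in enumerate(STAGES)}
    return {
        stage: sum(1 for reached in services.values() if index[reached] >= i)
        for i, stage in enumerate(STAGES)
    }


def biggest_dropoff(report: dict[str, int]) -> str:
    """The stage with the largest decrease from the previous stage is the
    most likely bottleneck to focus the team's effort on."""
    drops = {
        STAGES[i + 1]: report[STAGES[i]] - report[STAGES[i + 1]]
        for i in range(len(STAGES) - 1)
    }
    return max(drops, key=drops.get)
```

Even a report this crude distinguishes between teams failing to start at all (a communication or incentives problem) and teams stalling at a specific step (a tooling problem), which call for very different fixes.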
As you identify the answer to these questions, you can make more nuanced requests for help: maybe you need the organization to approve a modified migration path, to slow down the migration, to only migrate certain kinds of workloads, or whatever. You might not even need the organization to approve anything, instead you might just need to shift your team’s week to week priorities to resolve the current bottlenecks.
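For the back-and-forth problem described above, one lightweight fix is to reject migration requests that are missing required information before they reach your queue. This is a sketch with hypothetical field names, not a real ticketing API:

```python
# Hypothetical fields a migration request must include; adjust per migration.
REQUIRED_FIELDS = {
    "service_name",
    "owning_team",
    "oncall_rotation",
    "target_environment",
}


def missing_fields(ticket: dict) -> list[str]:
    """Return the fields the requester still needs to supply, so they get
    one complete answer instead of a multi-round back and forth."""
    return sorted(REQUIRED_FIELDS - ticket.keys())
```

Wiring a check like this into the intake form, with the relevant documentation linked next to each field, converts a slow conversational loop into a single validation pass.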
When you’re responsible
If you’re running a migration, your only option is to diagnose the problem and adapt your strategy to your current constraints, which I think of as “working the problem.” If you’re the leader of an organization struggling with a stalled-out migration, you’ll have a few more options to weigh.
The good news is that failing migrations are so common that there’s a fairly clear set of playbooks to choose from. The three most common migration-recovery playbooks I’ve seen are:
“Working the problem.” When you start a migration you know a certain number of things, but you learn much, much more as you go. Keep adjusting your approach to incorporate what you’ve learned. This is how we worked through Uber’s self-service migration away from the monolith: we tried a “migration pilot” model where one person would soak all the migration requests for a week to protect the rest of the team’s time, we pushed a combined self-service and automation model that greatly reduced complexity, we created scaffolding that reduced errors (e.g. missing configs) in new services, we created linters that proactively surfaced common errors, and so on. We kept iterating rapidly on the migration tooling until the pressure receded. The core team working on this self-service provisioning and operating tooling grew only modestly, from two folks to about four, over two years.
“Don’t migrate.” I joined Stripe after leading Uber’s migration away from their monolith, and initially was anchored on the idea that a similar migration would be valuable for Stripe. However, I kept quiet about my ambitions while building an understanding of Stripe’s setup, and it became clear over several months that the context was very different. I ended up following a totally different migration approach: not migrating at all. While it certainly wasn’t my decision alone, I’m fairly confident I could have bullheadedly forced through a grand migration to services, but I’m even more confident that by pushing that forward I would have directly caused the company to lose decades of engineering productivity (for the record, that’s in no way intended to say that Stripe was wrong to later move towards services, just that it needed to happen at a time that met the magnitudes of exploration guideline)
“Increase headcount for the current migration strategy.” You could, of course, just staff the current migration strategy the way the owning team is asking to be staffed. I include this playbook to be logically complete, but I’ve never personally seen a struggling migration recover by following it. Fundamentally, the decision-making process that selected a migration dependent on non-existent resources usually extends beyond a pure headcount issue and into deeper technical issues as well.
While I can’t recommend the last playbook, the others are very effective after some initial discomfort. Stalled migrations are extraordinarily disruptive, and if you’re an engineering leader whose organization is running a failing migration, then it’s ultimately your responsibility to debug and resolve it. If you’re uncertain about what to do, I’ll leave you with advice I got early in my career: you’re almost always better off making a reasonable decision early than prolonging the decision.