Migrations: the sole scalable fix to tech debt.
The most interesting migration I ever participated in was Uber’s migration from Puppet-managed services to a fully self-service provisioning model where any engineer at the company could spin up a new service in two clicks. Not only could they, they did, provisioning multiple services each day by the time the service was complete, with every newly hired engineer spinning up a service from scratch on their first day.
What made this migration so interesting was the volume. When we started, provisioning a new service took about two weeks of clock time and about two days of engineering time, and we were falling further behind each day. At the time it was more than a little stressful, but it was also a perfect laboratory to learn how to run large-scale software migrations: it was large enough to see even small shifts, and long enough that we got to experiment with a number of approaches.
Migrations are both essential and frustratingly frequent as your codebase ages and your business grows: most tools and processes only support about one order of magnitude of growth before becoming ineffective, so rapid growth makes them a way of life. This isn’t because they’re bad processes or poor tools. Quite the opposite: the fact that something stops working at significantly increased scale is a sign that it was designed appropriately for the previous constraints rather than being overdesigned.
As a result you switch tools a lot, and your ability to migrate to new software can easily become the defining constraint for your overall velocity. Given their importance, we don’t talk about running migrations nearly often enough; let’s remedy that!
Why migrations matter
Migrations matter because they are usually the only available avenue to make meaningful progress on technical debt.
Engineers hate technical debt. If there is an easy project they can personally do to reduce tech debt, they’ll take it on themselves. Engineering managers hate technical debt, too. If there is an easy project their team can execute in isolation, they’ll get it scheduled. In aggregate, this leads to a dynamic where there is very little low-hanging fruit to reduce technical debt, and most remaining options require many teams working together to implement them: migrations.
Each migration aims to create technical leverage (“your indexes no longer have to fit on a single server!”) or reduce technical debt (“your acknowledged writes are guaranteed to persist across a master failover”). They occupy the awkward territory of reduced immediate contribution today in exchange for more capacity tomorrow. This makes them controversial to schedule, and as your systems become larger, they become more expensive.
Lore tells us that Googlers have a phrase, “Running to stand still”, to describe a team whose entire capacity is consumed in upgrading dependencies and patterns, such that it can’t make forward progress on the product/system they own. Spending all your time on migrations is extreme, but every mid-sized company has a long queue of migrations it can’t staff: moving from VMs to containers, rolling out circuit-breaking, moving to the new build tool; the list extends effortlessly into the sunset.
Migrations are the only mechanism to effectively manage technical debt as your company and code grow. If you don’t get effective at software and system migrations, you’ll end up languishing in technical debt. (And you’ll still have to do one later anyway; it’s just that it’ll probably be a full rewrite.)
Running good migrations
The good news is that while migrations are hard, there is a pretty standard playbook that works remarkably well: Derisk, Enable, and then Finish.
Derisk
The first phase of a migration is derisking it, as quickly and cheaply as possible. Write a design document and shop it with the teams that you believe will have the hardest time migrating. Iterate. Shop it with teams who have atypical patterns and edge cases. Iterate. Test it against the next six to twelve months of roadmap. Iterate.
After you’ve evolved the design, the next step is to embed into the most challenging one or two teams, and work side by side with those teams to build, evolve and migrate to the new system. Don’t start with the easiest migrations, which can lead to a false sense of security.
Effective derisking is essential, because each team that endorses a migration is making a bet on you that you’re going to get this damn thing done, and not leave them with a migration to an abandoned system that they have to revert. If you leave one migration partially finished, folks will be exceedingly suspicious of participating in the next.
Enable
Once you’ve validated that the solution solves the intended problem, it’s time to start sharpening your tools. Many folks start migrations by generating tracking tickets for teams to implement, but it’s better to slow down and build tooling to programmatically migrate the easy ninety percent. This radically reduces the migration’s cost to the broader organization, which increases its success rate and creates more future opportunities to migrate.
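As one illustration of what “programmatically migrate” can look like, here is a minimal sketch of a rewrite tool. The names legacy_client.fetch and new_client.get are hypothetical stand-ins, not anything from a real system, and a regex is only suitable for the simplest call patterns; a real tool would more likely work on a syntax tree.

```python
import re

# Hypothetical names for illustration: `legacy_client.fetch(` is the old call,
# `new_client.get(` is the replacement. The \b keeps us from touching
# identifiers that merely end in "legacy_client".
LEGACY_CALL = re.compile(r"\blegacy_client\.fetch\(")


def migrate_source(source: str) -> tuple[str, int]:
    """Rewrite legacy calls; return the new source and the number of rewrites."""
    return LEGACY_CALL.subn("new_client.get(", source)


def migrate_files(files: dict[str, str]) -> dict[str, str]:
    """Return only the files whose contents changed, keyed by path."""
    changed = {}
    for path, source in files.items():
        new_source, count = migrate_source(source)
        if count:
            changed[path] = new_source
    return changed
```

Returning only the changed files makes it easy to review the result as a diff before anything is written back, and whatever the tool cannot rewrite is what remains for tickets.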
Once you’ve handled as much of the migration programmatically as possible, figure out the self-service tooling and documentation you can provide to allow folks to make the necessary changes without getting stuck. The best migration tools are incremental and reversible: folks should be able to immediately return to previous behavior if something goes wrong, and have the necessary expressiveness to derisk their particular migration path.
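One common way to make a migration incremental and reversible is a percentage rollout keyed on a stable hash. The sketch below is an assumption about how that might look, not a description of any particular tool; the function names and the choice of a service name as the key are hypothetical.

```python
import hashlib


def bucket(key: str) -> int:
    """Map a key to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_new_system(service: str, rollout_percent: int) -> bool:
    """Decide whether a service takes the new path at this rollout level.

    Setting rollout_percent to 0 sends everything back to the old path,
    and raising it only ever adds services, it never removes any.
    """
    return bucket(service) < rollout_percent
```

Because the bucket depends only on the key, a service that has moved stays moved as the percentage rises, and dropping the percentage to zero is the immediate return to previous behavior.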
Documentation and self-service tooling are products, and thrive under the same regime: sit down with some teams and watch them follow your instructions, then improve them. Find another team, repeat. Spending an extra two days intentionally making your documentation clean and tools intuitive can save years in large migrations. Do it!
Finish
The last phase of a migration is deprecating the legacy system you’ve replaced. This requires getting to 100% adoption, and that can be quite challenging.
Start by stopping the bleeding, which is ensuring that all newly written code uses the new approach. That can be installing a ratchet in your linters, or updating your documentation and self-service tooling. This is always the first step, because it turns time into your friend. Instead of falling behind by default, you’re now making progress by default.
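A ratchet in a linter can be as small as a count that is allowed to go down but never up. This is a minimal sketch under assumed names: legacy_client is a hypothetical legacy API, and where the baseline is stored (a checked-in file, for instance) is left out.

```python
import re

# Hypothetical legacy API name used purely for illustration.
LEGACY_PATTERN = re.compile(r"\blegacy_client\b")


def count_legacy_usages(files: dict[str, str]) -> int:
    """Count references to the legacy API across all file contents."""
    return sum(len(LEGACY_PATTERN.findall(src)) for src in files.values())


def check_ratchet(current: int, baseline: int) -> tuple[bool, int]:
    """Return (passed, new_baseline).

    The check fails if usage grew past the baseline. When usage shrinks,
    the baseline tightens to the new, lower count so it cannot creep back.
    """
    if current > baseline:
        return False, baseline
    return True, current
```

Run in continuous integration, a check like this lets existing usages stand while blocking new ones, which is what makes progress the default.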
Ok, now you should start generating tracking tickets, and build a mechanism which pushes migration status to teams that need to migrate and to the general management structure. It’s important to give wider management context around migrations because they are the folks who need to prioritize the migrations; if a team isn’t working on a migration, it’s typically because their leadership has not prioritized it.
At this point you’re pretty close to complete, but you still have the long tail of weird or unstaffed work. Your tool now is to finish it yourself. It’s not necessarily fun, but getting to 100% is going to require the team leading the migration to dig into the nooks and crannies themselves.
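The status mechanism does not need to be elaborate. Here is a rough sketch that rolls per-service state up into a per-team summary; the (team, service, migrated) tuple shape and the report format are assumptions made for the example.

```python
from collections import defaultdict


def migration_status(services):
    """Summarize (team, service, migrated) tuples as team -> (done, total)."""
    totals = defaultdict(lambda: [0, 0])  # team -> [migrated, total]
    for team, _service, migrated in services:
        totals[team][1] += 1
        if migrated:
            totals[team][0] += 1
    return {team: (done, total) for team, (done, total) in totals.items()}


def render_report(status):
    """Render one line per team, least-migrated teams first."""
    ordered = sorted(status.items(), key=lambda kv: kv[1][0] / kv[1][1])
    return "\n".join(
        f"{team}: {done}/{total} migrated ({100 * done // total}%)"
        for team, (done, total) in ordered
    )
```

Sorting the least-migrated teams to the top puts the conversations that need to happen with leadership at the front of the report.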
My final tip for finishing migrations is around recognition. It’s important to celebrate migrations while they’re ongoing, but the majority of the celebration and recognition should be reserved for their successful completion. In particular, starting but not finishing migrations often incurs significant technical debt, so your recognition structure should be careful to avoid creating perverse incentives.
What have you seen make migrations more effective? What are some of the anti-patterns you’ve experienced?