Irrational Exuberance (https://lethain.com/), Will Larson, Thu, 17 Apr 2025 06:00:00 -0700

Why did Stripe build Sorbet? (~2017). https://lethain.com/stripe-sorbet/ (Thu, 17 Apr 2025 06:00:00 -0700)

<p>Many hypergrowth companies of the 2010s battled increasing complexity in their codebase by <a href="https://lethain.com/decompose-monolith-strategy/">decomposing their monoliths</a>. Stripe was somewhat of an exception, largely delaying decomposition until it had grown beyond three thousand engineers and had accumulated a decade of development in its core Ruby monolith. Even now, significant portions of their product are maintained in the monolithic repository, and it&rsquo;s safe to say this was only possible because of Sorbet&rsquo;s impact.</p> <p>Sorbet is a custom static type checker for Ruby that was initially designed and implemented by Stripe engineers on their Product Infrastructure team. Stripe&rsquo;s Product Infrastructure had similar goals to other companies&rsquo; Developer Experience or Developer Productivity teams, but it focused on improving productivity through changes in the internal architecture of the codebase itself, rather than relying solely on external tooling or processes.</p> <p>This strategy explains why Stripe chose to delay decomposition for so long, and how the Product Infrastructure team invested in developer productivity to deal with the challenges of a large Ruby codebase managed by a large software engineering team with low average tenure caused by rapid hiring.</p> <p>Before wrapping this introduction, I want to explicitly acknowledge that this strategy was spearheaded by Stripe&rsquo;s Product Infrastructure team, not by me. Although I ultimately became responsible for that team, I can&rsquo;t take credit for this strategy&rsquo;s thinking. 
Rather, I was initially skeptical, preferring an incremental migration to an existing strongly-typed programming language, either Java for library coverage or Golang for Stripe&rsquo;s existing familiarity. Despite my initial doubts, the Sorbet project eventually won me over with its indisputable results.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> </div> <h2 id="reading-this-document">Reading this document</h2> <p>To apply this strategy, start at the top with <em>Policy</em>. To understand the thinking behind this strategy, read sections in reverse order, starting with <em>Explore</em>.</p> <p>More detail on this structure in <a href="https://lethain.com/readable-engineering-strategy-documents">Making a readable Engineering Strategy document</a>.</p> <h2 id="policy--operation">Policy &amp; Operation</h2> <p>The Product Infrastructure team is investing in Stripe&rsquo;s developer experience by:</p> <ul> <li> <p>Every six months, Product Infrastructure will select its three highest-priority areas to focus on, and invest a significant majority of its energy into those. We will provide minimal support for other areas.</p> <p>We commit to refreshing our priorities every half after running the developer productivity survey. 
We will further share our results, and priorities, in each Quarterly Business Review.</p> </li> <li> <p>Our three highest priority areas for this half are:</p> <ol> <li>Add static typing to the highest value portions of our Ruby codebase, such that we can run the type checker locally and on the test machines to identify errors more quickly.</li> <li>Support selective test execution such that engineers can quickly determine and run the most appropriate tests on their machine rather than delaying until tests run on the build server.</li> <li>Instrument test failures such that we have better data to prioritize future efforts.</li> </ol> </li> <li> <p>Static typing is not a typical solution to developer productivity, so it requires some explanation when we say this is our highest priority area for investment. Doubly so when we acknowledge that it will take us 12-24 months of much of the team&rsquo;s time to get our type checker to an effective place.</p> <p>Our type checker, which we plan to name Sorbet, will allow us to continue developing within our existing Ruby codebase. It will further allow our product engineers to remain focused on developing new functionality rather than migrating existing functionality to new services or programming languages. Instead, our Product Infrastructure team will centrally absorb both the development of the type checker and the initial rollout to our codebase.</p> <p>It&rsquo;s possible for Product Infrastructure to take on both, despite its fixed size. We&rsquo;ll rely on a hybrid approach of deep-dives to add typing to particularly complex areas, and scripts to rewrite our code&rsquo;s Abstract Syntax Trees (AST) for less complex portions. 
In the relatively unlikely event that this approach fails, the cost to Stripe is of a small, known size: approximately six months of half the Product Infrastructure team, which is what we anticipate requiring to determine if this approach is viable.</p> <p>Based on our knowledge of Facebook&rsquo;s <a href="https://hacklang.org/">Hack</a> project, we believe we can build a static type checker that runs locally and significantly faster than our test suite. It&rsquo;s hard to make a precise guess now, but we think it will take less than 30 seconds to type check our entire codebase, despite it being quite large. This will allow for a highly productive local development experience, even if we are not able to speed up local testing. Even if we do speed up local testing, typing would help us eliminate one of the categories of errors that testing has been unable to eliminate: passing unexpected types across code paths that have been tested for expected scenarios but not for entirely unexpected ones.</p> <p>Once the type checker has been validated, we can incrementally prioritize adding typing to the highest value places across the codebase. We do not need to wholly type our codebase before we can start getting meaningful value.</p> </li> <li> <p>In support of these static typing efforts, we will advocate for product engineers at Stripe to begin development using the <a href="https://en.wikipedia.org/wiki/Command_Query_Responsibility_Segregation">Command Query Responsibility Segregation</a> (CQRS) design pattern, which we believe will provide high-leverage interfaces for incrementally introducing static typing into our codebase.</p> </li> <li> <p>Selective test execution will allow developers to quickly run appropriate tests locally. This will allow engineers to stay in a tight local development loop, speeding up development of high quality code.</p> <p>Given that our codebase is not currently statically typed, inferring which tests to run is rather challenging. 
With our very high test coverage, and the fact that all tests will still be run before deployment to the production environment, we believe that we can rely on statistically inferring which tests are likely to fail when a given file is modified.</p> </li> <li> <p>Instrumenting test failures is our third, and lowest priority, project for this half. Our focus this half is purely on annotating errors for which we have high conviction about their source, whether infrastructure or test issues.</p> </li> <li> <p>For escalations and issues, reach out in the #product-infra channel.</p> </li> </ul> <h2 id="diagnose">Diagnose</h2> <p>In 2017, Stripe is a company of about 1,000 people, including 400 software engineers. We aim to grow our organization by about 70% year-over-year to meet increasing demand for a broader product portfolio and to scale our existing products and infrastructure to accommodate user growth. As our production stability has improved over the past several years, we have now turned our focus towards improving developer productivity.</p> <p>Our current diagnosis of our developer productivity is:</p> <ul> <li> <p>We primarily fund developer productivity for our Ruby-authoring software engineers via our Product Infrastructure team. The Ruby-focused portion of that team has about ten engineers on it today, and is unlikely to significantly grow in the future. (If we do expand, we are likely to staff non-Ruby ecosystems like Scala or Golang.)</p> </li> <li> <p>We have two primary mechanisms for understanding our engineers&rsquo; developer experience. The first is standard productivity metrics around deploy time, deploy stability, test coverage, test time, test flakiness, and so on. The second is a twice-annual developer productivity survey.</p> </li> <li> <p>Looking at our productivity metrics, our test coverage remains extremely high, with coverage above 99% of lines, and tests are quite slow to run locally. 
They run quickly in our infrastructure because they are multiplexed across a large fleet of test runners.</p> </li> <li> <p>Tests have become slow enough to run locally that an increasing number of developers run an overly narrow subset of tests, or entirely skip running tests until after pushing their changes. They instead rely on our test servers to run against their pull request&rsquo;s branch, which works well enough, but significantly slows down developer iteration time because the merge, build, and test cycle takes twenty to thirty minutes to complete.</p> <p>By the time their build-test cycle completes, they&rsquo;ve lost their focus and may take several hours to return to addressing the results.</p> </li> <li> <p>There is significant disagreement about whether tests are becoming flakier due to test infrastructure issues, or due to quality issues of the tests themselves. At this point, there is no trustworthy dataset that allows us to attribute flakiness to one cause or the other.</p> </li> <li> <p>Feedback from the twice-annual developer productivity survey supports the above diagnosis, and adds some additional nuance. Most concerning, although long-tenured Stripe engineers find themselves highly productive in our codebase, we increasingly hear in the survey that newly hired engineers with long tenures at other companies find themselves unproductive in our codebase. Specifically, they find it very difficult to determine how to safely make changes in our codebase.</p> </li> <li> <p>Our product codebase is entirely implemented in a single Ruby monolith. There is one narrow exception, a Golang service handling payment tokenization, which we consider out of scope for two reasons. First, it is kept intentionally narrow in order to absorb our SOC1 compliance obligations. Second, developers in that environment have not raised concerns about their productivity.</p> <p>Our data infrastructure is implemented in Scala. 
While these developers have concerns&ndash;primarily slow build times&ndash;they manage their build and deployment infrastructure independently, and the group remains relatively small.</p> </li> <li> <p>Ruby is not a highly performant programming language, but we&rsquo;ve found it sufficiently efficient for our needs. Similarly, other languages are more cost-efficient from a compute resources perspective, but a significant majority of our spend is on real-time storage and batch computation. For these reasons alone, we would not consider replacing Ruby as our core programming language.</p> </li> <li> <p>Our Product Infrastructure team is about ten engineers, supporting about 250 product engineers. We anticipate this group growing modestly over time, but certainly sublinearly to the overall growth of product engineers.</p> </li> <li> <p>Developers working in Golang and Scala routinely ask for more centralized support, but it&rsquo;s challenging to prioritize those requests as we&rsquo;re forced to consider the return on improving the experience for 240 product engineers working in Ruby vs 10 in Golang or 40 data engineers in Scala.</p> <p>If we introduced more programming languages, this prioritization problem would become increasingly difficult, and we are already failing to support additional languages.</p> </li> </ul>

How to get better at strategy? https://lethain.com/how-to-get-better-at-strategy/ (Thu, 10 Apr 2025 05:00:00 -0700)

<p>One of the most memorable quotes in Arthur Miller&rsquo;s <em>Death of a Salesman</em> comes from Uncle Ben, who describes his path to becoming wealthy as, &ldquo;When I was seventeen, I walked into the jungle, and when I was twenty-one I walked out. And by God I was rich.&rdquo; I wish I could describe the path to learning engineering strategy in similar terms, but by all accounts it&rsquo;s a much slower path. Two decades in, I am still learning more from each project I work on. 
This book has aimed to accelerate your learning path, but my experience is that there&rsquo;s still a great deal left to learn, despite what this book has hoped to accomplish.</p> <p>This final chapter is focused on the remaining advice I have to give on how you can continue to improve at strategy long after reading this book&rsquo;s final page. Inescapably, this chapter has become advice on writing your own strategy for improving at strategy. You are already familiar with my general suggestions on creating strategy, so this chapter provides focused advice on creating your own plan to get better at strategy.</p> <p>It covers:</p> <ul> <li>Exploring strategy creation to find strategies you can learn from via public and private resources, and through creating learning communities</li> <li>How to diagnose the strategies you&rsquo;ve found, to ensure you learn the right lessons from each one</li> <li>Policies that will help you find ways to perform and practice strategy within your organization, whether or not you have organizational authority</li> <li>Operational mechanisms to hold yourself accountable to developing a strategy practice</li> <li>My final benediction to you as a strategy practitioner who has finished reading this book</li> </ul> <p>With that preamble, let&rsquo;s write this book&rsquo;s final strategy: your personal strategy for developing your strategy practice.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> </div> <h2 id="exploring-strategy-creation">Exploring strategy creation</h2> <p>Ideally, we&rsquo;d begin improving our engineering strategy skills by broadly reading publicly available examples. 
Unfortunately, there simply aren&rsquo;t many easily available works that let you learn from others&rsquo; experience. Nonetheless, resources do exist, and we&rsquo;ll discuss the three categories that I&rsquo;ve found most useful:</p> <ol> <li>Public resources on engineering strategy, such as companies&rsquo; engineering blogs</li> <li>Private and undocumented strategies available through your professional network</li> <li>Learning communities that you build together, including ongoing learning circles</li> </ol> <p>Each of these is explored in its own section below.</p> <h3 id="public-resources">Public resources</h3> <p>While there aren&rsquo;t as many public engineering strategy resources as I&rsquo;d like, I&rsquo;ve found that there are still a reasonable number available. This book collects a number of such resources in the appendix of <a href="https://lethain.com/strategy-notes/">engineering strategy resources</a>. That appendix also includes some individuals&rsquo; blog posts that are adjacent to this topic. You can go a long way by searching and prompting your way into these resources.</p> <p>As you read them, it&rsquo;s important to recognize that public strategies are often misleading, as <a href="https://lethain.com/distinguishing-good-vs-bad-strategy/">discussed previously in evaluating strategies</a>. Everyone writing in public has an agenda, and that agenda often means that they&rsquo;ll omit important details to make themselves, or their company, come off well. Make sure you read between the lines rather than taking things too literally.</p> <h3 id="private-resources">Private resources</h3> <p>Ironically, where public resources are hard to find, I&rsquo;ve found it much easier to find privately held strategy resources. 
While private recollections are still prone to inaccuracies, the incentives to massage the truth are less pronounced.</p> <p>The most useful sources I&rsquo;ve found are:</p> <ul> <li> <p><em>peers&rsquo; stories</em> &ndash; strategies are often oral histories, and they are shared freely among peers within and across companies. As you build out your professional network, you can usually get access to any company&rsquo;s engineering strategy on any topic by just asking.</p> <p>There are brief exceptions. Even a close peer won&rsquo;t share a sensitive strategy before its existence becomes obvious externally, but they&rsquo;ll be glad to after it does. People tend to overestimate how much information companies can keep private anyway. Even reading recent job postings can usually expose a surprising amount about a company.</p> </li> <li> <p><em>internal strategy archaeologists</em> &ndash; while surprisingly few companies formally collect their strategies into a repository, the stories are informally collected by the tenured members of the organization. These folks are the company&rsquo;s strategy archaeologists, and you can learn a great deal by explicitly consulting them.</p> </li> <li> <p><em>becoming a strategy archaeologist yourself</em> &ndash; whether or not you&rsquo;re a tenured member of your company, you can learn a tremendous amount by starting to build your own strategy repository. As you start collecting them, you&rsquo;ll interest others in contributing their strategies as well.</p> <p>As discussed in <em>Staff Engineer</em>&rsquo;s section on the <a href="https://staffeng.com/guides/engineering-strategy/">Write five then synthesize</a> approach to strategy, over time you can foster a culture of documentation where one didn&rsquo;t exist before. 
Even better, building that culture doesn&rsquo;t require any explicit authority, just an ongoing show of excitement.</p> </li> </ul> <p>There are other sources as well, ranging from attending the hallway track in conferences to organizing dinners where stories are shared with a commitment to privacy.</p> <h3 id="working-in-community">Working in community</h3> <p>My final suggestion for seeing how others work on strategy is to form a <a href="https://lethain.com/rough-notes-learning-circles/">learning circle</a>. I formed a <a href="https://lethain.com/crowdsourcing-cto-vpe-learning-circles/">learning circle when I first moved into an executive role</a>, and at this point have been running it for more than five years. What&rsquo;s surprised me the most is how much I&rsquo;ve learned from it.</p> <p>There are a few reasons why ongoing learning circles are exceptional for sharing strategy:</p> <ol> <li>Bi-directional discussion allows so much more learning and understanding than mono-directional communication like conference talks or documents.</li> <li>Groups allow you to learn from others&rsquo; experiences and others&rsquo; questions, rather than having to guide the entire learning yourself.</li> <li>Continuity allows you to see the strategy at inception, during the rollout, and after it&rsquo;s been in practice for some time.</li> <li>Trust is built slowly, and you only get the full details about a problem when you&rsquo;ve already successfully held trust about smaller things. An ongoing group makes this sort of sharing feasible where a transient group does not.</li> </ol> <p>Although putting one of these communities together requires a commitment, they are the best mechanism I&rsquo;ve found. As a final secret, many people get stuck on how they can get invited to an existing learning circle, but that&rsquo;s almost always the wrong question to be asking. If you want to join a learning circle, make one. 
That&rsquo;s how I got invited to mine.</p> <h2 id="diagnosing-your-prior-and-current-strategy-work">Diagnosing your prior and current strategy work</h2> <p>Collecting strategies to learn from is a valuable part of improving, but it&rsquo;s only the first step. You also have to determine what to take away from each strategy. For example, you have to determine whether Calm&rsquo;s approach to <a href="https://lethain.com/resourcing-eng-driven-projects/">resourcing Engineering-driven projects</a> is something to copy or something to avoid.</p> <p>What I&rsquo;ve found effective is to apply <a href="https://lethain.com/is-this-strategy-any-good/">the strategy rubric</a> we developed in the &ldquo;Is this strategy any good?&rdquo; chapter to each of the strategies you&rsquo;ve collected. Even by splitting a strategy into its various phases, you&rsquo;ll learn a lot. Applying the rubric to each phase will teach you more. Each time you do this to another strategy, you&rsquo;ll get a bit faster at applying the rubric, and you&rsquo;ll start to see interesting, recurring patterns.</p> <p>As you dig into a strategy that you&rsquo;ve split into phases and applied the evaluation rubric to, here are a handful of questions that I&rsquo;ve found interesting to ask myself:</p> <ul> <li>How long did it take to determine a strategy&rsquo;s initial phase could be improved? How high was the cost to fund that initial phase&rsquo;s discovery?</li> <li>Why did the strategy reach its final stage and get repealed or replaced? How long did that take to get there?</li> <li>If you had to pick only one, did this strategy fail in its approach to exploration, diagnosis, policy or operations?</li> <li>To what extent did the strategy outlive the tenure of its primary author? Did it get repealed quickly after their departure, did it endure, or was it perhaps replaced during their tenure?</li> <li>Would you generally repeat this strategy, or would you strive to avoid repeating it? 
If you did repeat it, what conditions seem necessary to make it a success?</li> <li>How might you apply this strategy to your current opportunities and challenges?</li> </ul> <p>It&rsquo;s not necessary to work through all of these questions for every strategy you&rsquo;re learning from. I often try to pick the two that I think might be most interesting for a given strategy.</p> <h2 id="policy-for-improving-at-strategy">Policy for improving at strategy</h2> <p>At a high level, there are just a few key policies to consider for improving your strategic abilities. The first is implementing strategy, and the second is practicing implementing strategy. While those are indeed the starting points, there are a few more detailed options worth consideration:</p> <ul> <li> <p>If your company has existing strategies that are not working, debug one and work to fix it. If you lack the authority to work at the company scope, then decrease altitude until you find an altitude you can work at. Perhaps setting Engineering organizational strategies is beyond your circumstances, but strategy for your team is entirely accessible.</p> </li> <li> <p>If your company has no documented strategies, document one to make it debuggable. Again, if operating at a high altitude isn&rsquo;t attainable for some reason, operate at a lower altitude that is within reach.</p> </li> <li> <p>If your company&rsquo;s or team&rsquo;s strategies are effective but have low adoption, see if you can iterate on operational mechanisms to increase adoption. Many such mechanisms require no authority at all, such as low-noise nudges or the <a href="https://lethain.com/model-document-share/">model-document-share</a> approach.</p> </li> <li> <p>If existing strategies are effective and have high adoption, see if you can build excitement for a new strategy. Start by mining for which problems Staff-plus engineers and senior managers believe are important. 
Once you find one, you have a valuable strategy vein to start mining.</p> </li> <li> <p>If you don&rsquo;t feel comfortable sharing your work internally, then try writing proposals while only sharing them with a few trusted peers.</p> <p>You can even go further to only share proposals with trusted external peers, perhaps within a learning circle that you create or join.</p> </li> </ul> <p>Trying all of these at once would be overwhelming, so I recommend picking one in any given phase. If you aren&rsquo;t able to gain traction, then try another approach until something works. It&rsquo;s particularly important to recognize in your diagnosis where things are not working. Perhaps you simply don&rsquo;t have the sponsorship you need to enforce strategy, so you need to switch towards suggesting strategies instead. Keep adjusting and you&rsquo;ll find something that works.</p> <h3 id="what-if-youre-not-allowed-to-do-strategy">What if you&rsquo;re not allowed to do strategy?</h3> <p>If you&rsquo;re looking to find one, you&rsquo;ll always unearth a reason why it&rsquo;s not possible to do strategy in your current environment.</p> <p>If you believe your current role prevents you from engaging in strategy work, I&rsquo;ve found two useful approaches:</p> <ol> <li> <p><em>Lower your altitude</em> &ndash; there&rsquo;s always a scale where you can perform strategy, even if it&rsquo;s just your team or even just yourself.</p> <p>Only you can forbid yourself from developing personal strategies.</p> </li> <li> <p><em>Practice rather than perform</em> &ndash; organizations can only absorb so much strategy development at a given time, so sometimes they won&rsquo;t be open to you doing more strategy. 
In that case, you should focus on <em>practicing</em> strategy work rather than directly performing it.</p> <p>Only you can stop yourself from practicing.</p> </li> </ol> <p>Don&rsquo;t believe the hype: you can always do strategy work.</p> <h2 id="operating-your-strategy-improvement-policies">Operating your strategy improvement policies</h2> <p>As the refrain goes, even the best policies don&rsquo;t accomplish much if they aren&rsquo;t paired with operational mechanisms to ensure the policies actually happen, and debug why they aren&rsquo;t happening. It&rsquo;s tempting to overlook operations for personal habits, but that would be a mistake. These habits profoundly impact us in the long term, yet they&rsquo;re easiest to neglect because others rarely inquire about them.</p> <p>The mechanisms I&rsquo;d recommend:</p> <ul> <li> <p>Clearly track the strategies you&rsquo;ve implemented, refined, documented, or read. Maintain these in a document, spreadsheet, or folder that makes it easy to monitor your progress.</p> </li> <li> <p>Review your tracked strategies every quarter: are you working on the expected number and in the expected way? If not, why not?</p> <p>Ideally, your review should be done in community with a peer or a learning circle. It&rsquo;s too easy to deceive yourself; it&rsquo;s much harder to trick someone else.</p> </li> <li> <p>If your periodic review ever discovers that you&rsquo;re simply not doing the work you expected, sit down for an hour with someone that you trust&ndash;ideally someone equally or more experienced than you&ndash;and debug what&rsquo;s going wrong. Commit to doing this <em>before</em> your next periodic review.</p> </li> </ul> <p>Tracking your personal habits can feel a bit odd, but it&rsquo;s something I highly recommend. 
I&rsquo;ve been setting and tracking personal goals for some time now (for example, in my <a href="https://lethain.com/2024-in-review/">2024 year in review</a>) and have benefited greatly from it.</p> <h3 id="too-busy-for-strategy">Too busy for strategy</h3> <p>Many companies convince themselves that they&rsquo;re in too much of a rush to make good decisions. I&rsquo;ve certainly gotten stuck in this view at times myself, although at this point in my career I find it increasingly difficult to ignore that I have a number of tools to create time for strategy, and an obligation to do strategy rather than inflict poor decisions on the organizations I work in. Here&rsquo;s my advice for creating time:</p> <ul> <li>If you&rsquo;re not tracking how often you&rsquo;re creating strategies, then start there.</li> <li>If you&rsquo;ve not worked on a single strategy in the past six months, then start with one.</li> <li>If implementing a strategy has been prohibitively time consuming, then focus on practicing a strategy instead.</li> </ul> <p>If you do try all those things and still aren&rsquo;t making progress, then accept your reality: you don&rsquo;t view doing strategy as particularly important. Spend some time thinking about why that is, and if you&rsquo;re comfortable with your answer, then maybe this is a practice you should come back to later.</p> <h2 id="final-words">Final words</h2> <p>At this point, you&rsquo;ve read everything I have to offer on drafting engineering strategy. I hope this has refined your view on what strategy can be in your organization, and has given you the tools to draft a more thoughtful future for your corner of the software engineering industry.</p> <p>What I&rsquo;d never ask is for you to wholly agree with my ideas here. 
They are my best thinking on this topic, but strategy is a topic where I&rsquo;m certain Hegel&rsquo;s world view is the correct one: even the best ideas here are wrong in interesting ways, and will be surpassed by better ones.</p>

Wardley mapping the service orchestration ecosystem (2014). https://lethain.com/wardley-compute-ecosystem/ (Thu, 10 Apr 2025 04:00:00 -0700)

<p>In <a href="https://lethain.com/uber-service-migration-strategy/">Uber&rsquo;s 2014 service migration strategy</a>, we explore how to navigate the move from a Python monolith to a services-oriented architecture while also scaling with user traffic that doubled every six months.</p> <p>This <a href="https://lethain.com/wardley-mapping/">Wardley map</a> explores how orchestration frameworks were evolving during that period to be used as an input into determining the most effective path forward for Uber&rsquo;s Infrastructure Engineering team.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> </div> <h2 id="reading-this-map">Reading this map</h2> <p>To quickly understand this Wardley Map, read from top to bottom. If you want to review how this map was <em>written</em>, then you should read section by section from the bottom up, starting with Users, then Value Chains, and so on.</p> <p>More detail on this structure in <a href="https://lethain.com/wardley-mapping/">Refining strategy with Wardley Mapping</a>.</p> <h2 id="how-things-work-today">How things work today</h2> <p>There are three primary internal teams involved in service provisioning. The Service Provisioning Team abstracts applications developed by Product Engineering from servers managed by the Server Operations Team. 
As more servers are added to support application scaling, this is invisible to the applications themselves, freeing Product Engineers to focus on what the company values the most: developing more application functionality.</p> <p><img src="https://lethain.com/static/blog/strategy/wardley-compute-v1.png" alt="Wardley map for service orchestration"></p> <p>The challenges within the current value chain are cost-efficient scaling, reliable deployment, and fast deployment. All three of those problems anchor on the same underlying problem of resource scheduling. We want to make a significant investment into improving our resource scheduling, and believe that understanding the industry&rsquo;s trend for resource scheduling underpins making an effective choice.</p> <h2 id="transition-to-future-state">Transition to future state</h2> <p>Most interesting cluster orchestration problems are anchored in cluster metadata and resource scheduling. Request routing, whether through DNS entries or allocated ports, depends on cluster metadata. Mapping services to a fleet of servers depends on resource scheduling managing cluster metadata. Deployment and autoscaling both depend on cluster metadata.</p> <p><img src="https://lethain.com/static/blog/strategy/wardley-compute-v2.png" alt="Pipeline showing progression of service orchestration over time"></p> <p>This is also an area where we see significant changes occurring in 2014.</p> <p>Uber initially solved this problem using Clusto, an open-source tool released by Digg with goals similar to HashiCorp&rsquo;s <a href="https://www.consul.io/">Consul</a> but with limited adoption. We also used <a href="https://www.puppet.com/">Puppet</a> for configuring servers, alongside custom scripting. This has worked, but has required custom, ongoing support for scheduling. The key question we&rsquo;re confronted with is whether to build our own scheduling algorithms (e.g. 
<a href="https://en.wikipedia.org/wiki/Bin_packing_problem">bin packing</a>) or adopt a different approach. It seems clear that the industry intends to directly solve this problem via two paths: relying on Cloud providers for orchestration (Amazon Web Services, Google Cloud Platform, etc) and through open-source scheduling frameworks such as Mesos and Kubernetes.</p> <p>Industry peers with more than five years of infrastructure experience are almost unanimously adopting open-source scheduling frameworks to better support their physical infrastructure. This will give them a tool to perform a bridged migration from physical infrastructure to cloud infrastructure.</p> <p>Newer companies with less existing infrastructure are moving directly to the cloud, and avoiding the orchestration problem entirely. The only companies not adopting one of these two approaches are extraordinarily large and complex (think Google or Microsoft) or allergic to making any technical change at all.</p> <p>From this analysis, it&rsquo;s clear that continuing our reliance on Clusto and Puppet is going to be an expensive investment that&rsquo;s not particularly aligned with the industry&rsquo;s evolution.</p> <h2 id="user--value-chains">User &amp; Value Chains</h2> <p>This map focuses on the orchestration ecosystem within a single company, with a focus on what did, and did not, stay the same from roughly 2008 to 2014. It focuses in particular on three users:</p> <ol> <li><strong>Product Engineers</strong> are focused on provisioning new services, and then deploying new versions of that service as they make changes. They are wholly focused on their own service, and entirely unaware of anything beneath the orchestration layer (including any servers).</li> <li><strong>Service Provisioning Team</strong> focuses on provisioning new services, orchestrating resources for those services, and routing traffic to those services. 
This team acts as the bridge between the Product Engineers and the Server Operations Team.</li> <li><strong>Server Operations Team</strong> is focused on adding server capacity to be used for orchestration. They work closely with the Service Provisioning Team, and have no contact with the Product Engineers.</li> </ol> <p>It&rsquo;s worth acknowledging that, in practice, these are artificial aggregates of multiple underlying teams. For example, routing traffic between services and servers is typically handled by a Traffic or Service Networking team. However, these omissions are intended to clarify the distinctions relevant to the evolution of orchestration tooling.</p>
Making images consistent for book.https://lethain.com/images-consistent-book/Sun, 06 Apr 2025 04:00:00 -0700https://lethain.com/images-consistent-book/ <p><strong>TODO: fix TODOs below</strong></p> <p>After working on diversifying the strategies I linked as examples in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>, the next problem I wanted to start working on was consistent visual appearances across all images included in the book. There are quite a few images, so I wanted to start by creating a tool to make a static HTML page of all included images, to facilitate reviewing all the images at once.</p> <p>To write the script, I decided to write a short prompt describing the problem, paste in <a href="https://lethain.com/links-script-book/">the script I&rsquo;d previously written for consistent linking</a>, and see what I&rsquo;d get.</p> <p><img src="https://lethain.com/static/blog/2025/images-llm-prompt.png" alt=""></p> <p>This worked on the first try, after which I made a few tweaks to include more information. That culminated in <a href="https://gist.github.com/lethain/c0048b1ae95ac01befa18311bd34a6c1">images.py</a>, which allowed me to review all images in the book.</p> <p>This screenshot gives a sense of the general problem.</p> <p><img src="https://lethain.com/static/blog/2025/ds-image-style-starting.png" alt="Screenshot of various images in my new book that I need to make more visually consistent"></p> <p>Reviewing the full set of images, I identified two categories of problems. 
First, I had one model image that was done via Figma instead of Excalidraw, and consequently looked very different.</p> <p><img src="https://lethain.com/static/blog/2025/inconsistent-models-image.png" alt="Inconsistent screenshot example"></p> <p>Then the question was whether to standardize on that style or on the Excalidraw style.</p> <p><img src="https://lethain.com/static/blog/2025/inconsistent-models-image-excal.png" alt="Inconsistent screenshot example"></p> <p>There was only one sequence diagram in Figma style, so ultimately it was the easier choice to make the Figma one follow the Excalidraw style.</p> <p><strong>TODO: add image of updated image using Excalidraw style</strong></p> <p>The second problem was deciding how to represent Wardley maps consistently. My starting point was two very inconsistent varieties of Wardley maps, neither of which was ideal for including in a book.</p> <p>The output from <a href="https://mapkeep.com/">Mapkeep</a>, which is quite good overall but not optimized for printing (too much empty whitespace).</p> <p><img src="https://lethain.com/static/blog/2025/inconsistent-models-image-wardley-1.png" alt="Inconsistent screenshot example"></p> <p>Then I had Figma versions I&rsquo;d made as well.</p> <p><img src="https://lethain.com/static/blog/2025/inconsistent-models-image-wardley-2.png" alt="Inconsistent screenshot example"></p> <p>In the Figma versions that I&rsquo;d made, I <em>had</em> tried to make better use of whitespace, and I think I succeeded. That said, they looked pretty bad altogether. In this case I was pretty unhappy with both options, so I decided to spend some time thinking about it.</p> <p>For inspiration, I decided to review how maps were represented in two printed books. 
First in Simon Wardley&rsquo;s book.</p> <p><strong>TODO: example from the wardley mapping book and</strong></p> <p>Then in <strong>TODO: remember name&hellip;</strong></p> <p><strong>TODO: example from other mapping book</strong></p> <p>Reflecting on both of those.. <strong>TODO: finish</strong></p> <p><strong>TODO: actually finish making them consistent, lol</strong></p> <p><strong>TODO: conclusion about this somehow</strong></p> <p>Finally, this is another obvious script that I should have written for <em>Staff Engineer</em>. Then again, that is a significantly less image heavy book, so it probably wouldn&rsquo;t have mattered too much.</p>Script for consistent linking within book.https://lethain.com/links-script-book/Sun, 06 Apr 2025 04:00:00 -0700https://lethain.com/links-script-book/ <p>As part of my work on <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>, I&rsquo;ve been editing a bunch of stuff. This morning I wanted to work on two editing problems. First, I wanted to ensure I was referencing strategies evenly across chapters (and not relying too heavily on any given strategy). Second, I wanted to make sure I was making references to other chapters in a consistent, standardized way.</p> <p>Both of these involve collecting Markdown links from files, grouping those links by either file or url, and then outputting the grouped content in a useful way. I decided to experiment with writing a one-shot prompt to write the script for me rather than writing it myself. The prompt and output (from ChatGPT 4.5) are <a href="https://gist.github.com/lethain/34187be3090a12b74f4bdaba8f4fd796">available in this gist</a>.</p> <p>That worked correctly! The output was a bit ugly, so I tweaked the output slightly by hand, and also adjusted the regular expression to capture less preceding content, which resulted in <a href="https://gist.github.com/lethain/20ae58ce576670f245920a4ab1993056">this script</a>. 
I did it by hand, though I&rsquo;m sure it would have been faster to just ask ChatGPT to fix the script itself; either way these are very minor tweaks.</p> <p>Now I can call the script in either standard or <code>--grouped</code> mode. Example of <code>./scripts/links.py &quot;content/posts/strategy-book/*.md&quot;</code> output:</p> <p><img src="https://lethain.com/static/blog/2025/links-output-standard.png" alt="Output of script extracting links from chapters and representing them cleanly"></p> <p>Example of <code>./scripts/links.py &quot;content/posts/strategy-book/*.md&quot; --grouped</code> output:</p> <p><img src="https://lethain.com/static/blog/2025/links-output-grouped.png" alt="Second format of output from script extracting links, this time grouping by link instead of file"></p> <p>Altogether, this is a super simple script that I could have written in thirty minutes or so, but this allowed me to write it in less than ten minutes, and get back to actually editing with the remaining twenty.</p> <p>It&rsquo;s also quite helpful for solving the intended problem of imbalanced references to strategies. Here you can see I initially had 17 references to the Uber migration strategy, which was one of the first strategies I documented for the book.</p> <p><img src="https://lethain.com/static/blog/2025/uber-migration-links.png" alt="17 references to the Uber service migration strategy"></p> <p>On the other hand, the strategy for Stripe&rsquo;s Sorbet only had three links because it was one of the last two chapters I finished writing.</p> <p><img src="https://lethain.com/static/blog/2025/stripe-sorbet-links.png" alt="3 references to the Stripe Sorbet strategy"></p> <p>It&rsquo;s natural that I referenced existing strategies more frequently than unwritten strategies over the course of drafting chapters, but it makes the book feel a bit lopsided when read, and this script has helped me address the imbalance. 
This is something I didn&rsquo;t do in <em>Staff Engineer</em>, but wish I had, as I ended up leaning a bit too heavily on early stories and mentioned later stories less frequently.</p>How to resource Engineering-driven projects at Calm? (2020)https://lethain.com/resourcing-eng-driven-projects/Thu, 03 Apr 2025 05:00:00 -0700https://lethain.com/resourcing-eng-driven-projects/ <p>One of the recurring challenges in any organization is how to split your attention across long-term and short-term problems. Your software might be struggling to scale with ramping user load while you also know that you have a series of meaningful security vulnerabilities that need to be closed sooner rather than later. How do you balance across them?</p> <p>These sorts of balance questions occur at every level of an organization. A particularly frequent format is the debate between Product and Engineering about how much time goes towards developing new functionality versus improving what&rsquo;s already been implemented. In 2020, Calm was growing rapidly as we navigated the COVID-19 pandemic, and the team was struggling to make improvements, as they felt saturated by incoming new requests. This strategy for resourcing Engineering-driven projects was our attempt to solve that problem.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> </div> <h2 id="reading-this-document">Reading this document</h2> <p>To apply this strategy, start at the top with <em>Policy</em>. 
To understand the thinking behind this strategy, read sections in reverse order, starting with <em>Explore</em>.</p> <p>More detail on this structure in <a href="https://lethain.com/readable-engineering-strategy-documents">Making a readable Engineering Strategy document</a>.</p> <h2 id="policy--operation">Policy &amp; Operation</h2> <p>Our policies for resourcing Engineering-driven projects are:</p> <ul> <li>We will protect one Eng-driven project per product engineering team, per quarter. These projects should represent a maximum of 20% of the team&rsquo;s bandwidth. Each project must advance a measurable metric, and execution must be designed to show progress on that metric within 4 weeks.</li> <li>These projects must adhere to <a href="https://lethain.com/calm-product-eng-company/">Calm&rsquo;s existing Engineering strategies</a>.</li> <li>We resource these projects first in the team&rsquo;s planning, rather than last. However, only concrete projects are resourced. If there are no concrete proposals, then the team won&rsquo;t have time budgeted for Engineering-driven work.</li> <li>Team&rsquo;s engineering manager is responsible for deciding on the project, ensuring the project is valuable, and pushing back on attempts to defund the project.</li> <li>Project selection does not require CTO approval, but you should escalate to the CTO if there&rsquo;s friction or disagreement.</li> <li>CTO will review Engineering-driven projects each quarter to summarize their impact and provide feedback to teams&rsquo; engineering managers on project selection and execution. They will also review teams that did <em>not</em> perform a project to understand why not.</li> </ul> <p>As we&rsquo;ve communicated this strategy, we&rsquo;ve frequently gotten conceptual alignment that this sounds reasonable, coupled with uncertainty about what sort of projects should actually be selected. 
At some level, this ambiguity is an acknowledgment that we believe teams will identify the best opportunities bottoms-up. However, we also wanted to give two concrete examples of projects we&rsquo;re greenlighting in the first batch:</p> <ul> <li> <p><em>Code-free media release</em>: historically, we&rsquo;ve needed to make a number of pull requests to add, organize, and release new pieces of media. This is high urgency work, but Engineering doesn&rsquo;t exercise much judgment while doing it, and manual steps often create errors. We aim to track and eliminate these pull requests, while also increasing the number of releases that can be facilitated without scaling the content release team.</p> </li> <li> <p><em>Machine-learning content placement</em>: developing new pieces of media is often a multi-week or month process. After content is ready to release, there&rsquo;s generally a debate on where to place the content. This matters for the company, as this drives engagement with our users, but it matters even more to the content creator, who is generally evaluated in terms of their content&rsquo;s performance.</p> <p>This often leads to Product and Engineering getting caught up in debates about how to surface particular pieces of content. This project aims to improve user engagement by surfacing the best content for their interests, while also giving the Content team several explicit positions to highlight content without Product and Engineering involvement.</p> </li> </ul> <p>Although these projects are similar, it&rsquo;s not intended that <em>all</em> Engineering-driven projects are of this variety. Instead it&rsquo;s happenstance based on what the teams view as their biggest opportunities today.</p> <h2 id="diagnosis">Diagnosis</h2> <p>Our assessment of the current situation at Calm is:</p> <ul> <li> <p>We are spending a high percentage of our time on urgent but low engineering value tasks. 
Most significantly, about one-third of our time is going into launching, debugging, and changing content that we release into our product. Engineering is involved due to implementation limitations, not because our involvement adds inherent value. (We mostly just make releases slowly and inadvertently introduce bugs of our own.)</p> </li> <li> <p>We have a bunch of fairly clear ideas around improving the platform to empower the Content team to speed up releases, and to eliminate the Engineering involvement. However, we&rsquo;ve struggled to find time to implement them, or to validate that these ideas will work.</p> </li> <li> <p>If we don&rsquo;t find a way to prioritize, and succeed at implementing, a project to reduce Engineering involvement in Content releases, we will struggle to support our goals to release more content and to develop more product functionality this year.</p> </li> <li> <p>Our Infrastructure team has been able to plan and make these kinds of investments stick. However, when we attempt these projects within our Product Engineering teams, things don&rsquo;t go that well. We are good at getting them onto the initial roadmap, but then they get deprioritized due to pressure to complete other projects.</p> </li> <li> <p>Our Engineering team of 20 engineers is not very fungible, largely due to specialization across roles like iOS, Android, Backend, Frontend, Infrastructure, and QA. 
We would like to staff these kinds of projects onto the Infrastructure team, but in practice that team does not have the product development experience to implement this kind of project.</p> </li> <li> <p>We&rsquo;ve discussed spinning up a Platform team, or moving product engineers onto Infrastructure, but that would either (1) break our goal to maintain joint pairs between Product Managers and Engineering Managers, or (2) be indistinguishable from prioritizing within the existing team because it would still have the same Product Manager and Engineering Manager pair.</p> </li> <li> <p>Company planning is organic, occurring across many discussions with limited structured process. If we make a decision to invest in one project, it&rsquo;s easy for that project to get deprioritized in a side discussion missing context on why the project is important.</p> <p>These reprioritization discussions happen both in executive forums and in team-specific forums. There&rsquo;s imperfect awareness across these two sorts of forums.</p> </li> </ul> <h2 id="explore">Explore</h2> <p>Prioritization is a deep topic with a wide variety of <a href="https://en.wikipedia.org/wiki/Requirement_prioritization">popular solutions</a>. For example, many software companies rely on &ldquo;RICE&rdquo; scoring, calculating priority as (Reach times Impact times Confidence) divided by Effort. At the other extreme are complex methodologies like <a href="https://en.wikipedia.org/wiki/Scaled_agile_framework">Scaled Agile Framework</a>.</p> <p>In addition to generalized planning solutions, many companies carve out special mechanisms to solve for particular prioritization gaps. Google historically offered <a href="https://en.wikipedia.org/wiki/Side_project_time">20% time</a> to allow individuals to work on experimental projects that didn&rsquo;t align directly with top-down priorities. 
Stripe&rsquo;s Foundation Engineering organization developed the concept of Foundational Initiatives to prioritize cross-pillar projects with long-term implications, which otherwise struggled to get prioritized within the team-led planning process.</p> <p>All these methods have clear examples of succeeding, and equally clear examples of struggling. Where these initiatives have succeeded, they had an engaged executive sponsoring the practice&rsquo;s rollout, including triaging escalations when the rollout inconvenienced supporters of the prior method. Where they lacked a sponsor, or were misaligned with the company&rsquo;s culture, these methods have consistently failed despite the fact that they&rsquo;ve previously succeeded elsewhere.</p>Systems model of API deprecationhttps://lethain.com/api-deprecation-model/Tue, 01 Apr 2025 05:00:00 -0700https://lethain.com/api-deprecation-model/

<p>In <a href="https://lethain.com/api-deprecation-strategy/">How should Stripe deprecate APIs?</a>, the diagnosis depends on the claim that deprecating APIs is a significant cause of customer churn. While there is internal data that can be used to correlate deprecation with churn, it&rsquo;s also valuable to build a model to help us decide if we believe that correlation and causation are aligned in this case.</p> <p>In this chapter, we&rsquo;ll cover:</p> <ol> <li>What we learn from modeling API deprecation&rsquo;s impact on user retention</li> <li>Developing a system model using the <a href="https://github.com/lethain/systems">lethain/systems</a> package on GitHub. That model <a href="https://github.com/lethain/eng-strategy-models/blob/main/APIDeprecationModel.ipynb">is available in the lethain/eng-strategy-models</a> repository</li> <li>Exercising that model to learn from it</li> </ol> <p>Time to investigate whether it&rsquo;s reasonable to believe that API deprecation is a major influence on user retention and churn.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in</em> <em><a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> </div> <h2 id="learnings">Learnings</h2> <p>In an initial model that has 10% baseline for customer churn per round, reducing customers experiencing API deprecation from 50% to 10% per round only increases the steady state of integrated customers by about 5%.</p> <p><img src="https://lethain.com/static/blog/strategy/api-deprecation-model-2.png" alt="Impact of 10% and 50% API deprecation on integrated customers"></p> <p>However, if we eliminate the baseline for customer churn entirely, then we see a massive difference between a 10% and 50% rate of API deprecation.</p> <p><img 
src="https://lethain.com/static/blog/strategy/api-deprecation-model-4.png" alt="Impact of rates of API deprecation with zero baseline churn"></p> <p>The biggest takeaway from this model is that eliminating API-deprecation churn alone won&rsquo;t significantly increase the number of integrated customers. However, we also can&rsquo;t fully benefit from reducing baseline churn without simultaneously reducing API deprecations. Meaningfully increasing the number of integrated customers requires lowering both sorts of churn in tandem.</p> <h2 id="sketch">Sketch</h2> <p>We&rsquo;ll start by sketching the model&rsquo;s happiest path: potential customers flowing into engaged customers and then becoming integrated customers. This represents a customer who decides to integrate with Stripe&rsquo;s APIs, and successfully completes that integration process.</p> <p><img src="https://lethain.com/static/blog/strategy/api-deprecation-simple.png" alt="Happiest path for Stripe API integration"></p> <p>Business would be good if that were the entire problem space. Unfortunately, customers do occasionally churn. 
This churn is represented in two ways:</p> <ol> <li><code>baseline churn</code> where integrated customers leave Stripe for any number of reasons, including things like dissolution of their company</li> <li><code>experience deprecation</code> followed by <code>deprecation-influenced churn</code>, which represent the scenario where a customer decides to leave after an API they use is deprecated</li> </ol> <p>There is also a flow for <code>reintegration</code>, where a customer impacted by API deprecation can choose to update their integration to comply with the API changes.</p> <p>Pulling things together, the final sketch shows five stocks and six flows.</p> <p><img src="https://lethain.com/static/blog/strategy/api-deprecation-full.png" alt="Final version of systems model for API deprecation"></p> <p>You could imagine modeling additional dynamics, such as recovery of churned customers, but it seems unlikely that would significantly influence our understanding of how API deprecation impacts churn.</p> <h2 id="reason">Reason</h2> <p>In terms of acquiring customers, the most important flows are customer acquisition and initial integration with the API. Optimizing those flows will increase the number of existing integrations.</p> <p>The flows driving churn are baseline churn, and the combination of API deprecation and deprecation-influenced churn. It&rsquo;s difficult to move baseline churn for a payments API, as many churning customers leave due to company dissolution. From a revenue-weighted perspective, baseline churn is largely driven by non-technical factors, primarily pricing. In either case, it&rsquo;s challenging to impact this flow without significantly lowering margin.</p> <p>Engineering decisions, on the other hand, have a significant impact on both the number of API deprecations, and on the ease of reintegration after a migration. 
Because the same work to support reintegration also supports the initial integration experience, that&rsquo;s a promising opportunity for investment.</p> <h2 id="model">Model</h2> <p>You can find the <a href="https://github.com/lethain/eng-strategy-models/blob/main/APIDeprecationModel.ipynb">full implementation of this model on GitHub</a> if you want to see the full model rather than these emphasized snippets.</p> <p>Now that we have identified the most interesting avenues for experimentation, it&rsquo;s time to develop the model to evaluate which flows are most impactful.</p> <p>Our initial model specification is:</p> <pre><code># User Acquisition Flow
[PotentialCustomers] &gt; EngagedCustomers @ 100

# Initial Integration Flow
EngagedCustomers &gt; IntegratedCustomers @ Leak(0.5)

# Baseline Churn Flow
IntegratedCustomers &gt; ChurnedCustomers @ Leak(0.1)

# Experience Deprecation Flow
IntegratedCustomers &gt; DeprecationImpactedCustomers @ Leak(0.5)

# Reintegrated Flow
DeprecationImpactedCustomers &gt; IntegratedCustomers @ Leak(0.9)

# Deprecation-Influenced Churn
DeprecationImpactedCustomers &gt; ChurnedCustomers @ Leak(0.1)
</code></pre> <p>Whether these are <em>reasonable</em> values depends largely on how we think about the length of each round. If a round was a month, then assuming half of integrated customers would experience an API deprecation would be quite extreme. If we assumed it was a year, then it would still be high, but there are certainly some API providers that routinely deprecate at that rate. 
(From my personal experience, I can say with confidence that Facebook&rsquo;s Ads API deprecated at least one important field on a quarterly basis in the 2012-2014 period.)</p> <p>Admittedly, for a payments API this would be a high rate, and is intended primarily as a contrast with more reasonable values in the exercise section below.</p> <h2 id="exercise">Exercise</h2> <p>Our goal with exercising this model is to understand how much API deprecation impacts customer churn. We&rsquo;ll start by charting the initial baseline, then move to compare it with a variety of scenarios until we build an intuition for how the lines move.</p> <p><img src="https://lethain.com/static/blog/strategy/api-deprecation-model-1.png" alt="Initial model stabilizing integrated customers around 1,000 customers"></p> <p>The initial chart stabilizes in about forty rounds, maintaining about 1,000 integrated customers and 400 customers dealing with deprecated APIs. Now let&rsquo;s change the experience deprecation flow to impact significantly fewer customers:</p> <pre><code># Initial setting with 50% experiencing deprecation per round
IntegratedCustomers &gt; DeprecationImpactedCustomers @ Leak(0.5)

# Less deprecation, only 10% experiencing per round
IntegratedCustomers &gt; DeprecationImpactedCustomers @ Leak(0.1)
</code></pre> <p>After those changes, we can compare the two scenarios.</p> <p><img src="https://lethain.com/static/blog/strategy/api-deprecation-model-2.png" alt="Impact of 10% and 50% API deprecation on integrated customers"></p> <p>Lowering the deprecation rate significantly reduces the number of companies dealing with deprecations at any given time, but it has a relatively small impact on increasing the steady state for integrated customers. This must mean that another flow is significantly impacting the size of the integrated customers stock.</p> <p>Since there&rsquo;s only one other flow impacting that stock, baseline churn, that&rsquo;s the one to exercise next. 
Let&rsquo;s set the baseline churn flow to zero to compare that with the initial model:</p> <pre><code># Initial Baseline Churn Flow
IntegratedCustomers &gt; ChurnedCustomers @ Leak(0.1)

# Zeroed out Baseline Churn Flow
IntegratedCustomers &gt; ChurnedCustomers @ Leak(0.0)
</code></pre> <p>These results make a compelling case that baseline churn is dominating the impact of deprecation. With no baseline churn, the number of integrated customers stabilizes at around 1,750, as opposed to around 1,000 for the initial model.</p> <p><img src="https://lethain.com/static/blog/strategy/api-deprecation-model-3.png" alt="Impact of eliminating baseline churn from model"></p> <p>Next, let&rsquo;s compare two scenarios without baseline churn, where one has high API deprecation (50%) and the other has low API deprecation (10%).</p> <p><img src="https://lethain.com/static/blog/strategy/api-deprecation-model-4.png" alt="Impact of rates of API deprecation with zero baseline churn"></p> <p>In the case of two scenarios without baseline churn, we can see having an API deprecation rate of 10% leads to about 6,000 integrated customers, as opposed to 1,750 for a 50% rate of API deprecation. More importantly, in the 10% scenario, the integrated customers line shows no sign of flattening, and continues to grow over time rather than stabilizing.</p> <p>The takeaway here is that significantly reducing either baseline churn or API deprecation magnifies the benefits of reducing the other. These results also reinforce the value of treating churn reduction as a system-level optimization, not merely a collection of discrete improvements.</p>
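<p>If you want to poke at the interaction between the two churn flows without installing anything, the following is a minimal sketch in plain Python. It is not the <code>systems</code> package and not the notebook&rsquo;s implementation: the <code>simulate</code> function, its parameter names, and its update rule (every flow is computed from the stocks at the start of the round, using fractional customers) are all assumptions made for illustration. Because of that, the absolute values and the size of the gaps between scenarios will differ from the charts above; treat it as a way to explore how the two churn rates compound rather than as a way to reproduce the figures.</p>

```python
def simulate(baseline_churn=0.1, deprecation=0.5, rounds=200):
    """Hypothetical stand-in for the model spec; not the systems package."""
    acquisition = 100        # PotentialCustomers > EngagedCustomers, per round
    integration = 0.5        # EngagedCustomers > IntegratedCustomers
    reintegration = 0.9      # DeprecationImpactedCustomers > IntegratedCustomers
    deprecation_churn = 0.1  # DeprecationImpactedCustomers > ChurnedCustomers

    engaged = integrated = impacted = churned = 0.0
    for _ in range(rounds):
        # Assumption: every flow is sized from the start-of-round stocks,
        # and all flows are then applied together.
        integrating = engaged * integration
        leaving = integrated * baseline_churn
        deprecated = integrated * deprecation
        reintegrating = impacted * reintegration
        leaving_after_deprecation = impacted * deprecation_churn

        engaged += acquisition - integrating
        integrated += integrating + reintegrating - leaving - deprecated
        impacted += deprecated - reintegrating - leaving_after_deprecation
        churned += leaving + leaving_after_deprecation

    return {
        "engaged": engaged,
        "integrated": integrated,
        "impacted": impacted,
        "churned": churned,
    }


if __name__ == "__main__":
    # Compare the four scenarios discussed in the Exercise section.
    for baseline in (0.1, 0.0):
        for rate in (0.5, 0.1):
            result = simulate(baseline_churn=baseline, deprecation=rate)
            print(
                f"baseline={baseline} deprecation={rate} "
                f"integrated={result['integrated']:.0f}"
            )
```

<p>Even under this simplified update rule, the gain from cutting deprecation is far larger once baseline churn is removed, which is the same compounding effect described in the takeaway above.</p>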
Is this strategy any good?https://lethain.com/is-this-strategy-any-good/Thu, 27 Mar 2025 05:00:00 -0700https://lethain.com/is-this-strategy-any-good/ <p>We&rsquo;ve read a lot of strategy at this point in the book. We can judge a strategy&rsquo;s format, and its construction: both are useful things. However, format is a predictor of quality, not quality itself. The remaining question is, how should we assess whether a strategy is any good?</p> <p><a href="https://lethain.com/uber-service-migration-strategy/">Uber&rsquo;s service migration strategy</a> unlocked the entire organization to make rapid progress. It also led to a sprawling architecture problem down the line. Was it a great strategy or a terrible one? Folks can reasonably disagree, but it&rsquo;s worthwhile developing our point of view on why we should prefer one interpretation or the other.</p> <p>This chapter will focus on:</p> <ul> <li>The various ways that are frequently suggested for evaluating strategies, such as input-only evaluation, output-only evaluation, and so on</li> <li>A rubric for evaluating strategies, and why a useful rubric has to recognize that strategies have to be evaluated in phases rather than as a unified construct</li> <li>Why ending a strategy is often a sign of a good strategist, and sometimes the natural reaction to a new phase in a strategy, rather than a judgment on prior phases</li> <li>How missing context is an unpierceable veil for evaluating other companies' strategies with high-conviction, and why you&rsquo;ll end up attempting to evaluate them anyway</li> <li>Why you can learn just as much from bad strategies as from good ones, even in circumstances where you are missing much of the underlying context</li> </ul> <p>Time to refine our judgment about strategy quality a bit.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in <a 
href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> </div> <h2 id="how-are-strategies-graded">How are strategies graded?</h2> <p>Before suggesting my own rubric, I want to explore how the industry appears to grade strategies in practice. That&rsquo;s not because I particularly agree with them&ndash;I generally find each approach misses an important nuance&ndash;but because understanding their flaws is a foundation to build on.</p> <p>Grading strategy on its outputs is by far the most prevalent approach I&rsquo;ve found in industry. This is an appealing approach, because it does make sense that a strategy&rsquo;s results are more important than anything else. However, this line of thinking can go awry. We saw massive companies like Google move to service architectures, and we copied them because if it worked for Google, it would likely work for us. As discussed in the <a href="https://lethain.com/decompose-monolith-strategy/">monolith decomposition strategy</a>, it did not work particularly well for most adopters.</p> <p>The challenge with grading outputs is that it doesn&rsquo;t distinguish between &ldquo;alpha&rdquo;, how much better your results are because of your strategy, and &ldquo;beta&rdquo;, the expected outcome if you hadn&rsquo;t used the strategy. For example, the <a href="https://lethain.com/pos-acquisition-integration/">acquisition of Index</a> allowed Stripe to build a point-of-sale business line, but they were also on track to internally build that business. Looking <em>only</em> at outputs can&rsquo;t distinguish whether it would have been better to build the business via acquisition or internally. But one of those paths must have been the better strategy.</p> <p>Similarly, there are also strategies that succeed, but do so at unreasonably high costs. 
<a href="https://lethain.com/api-deprecation-strategy/">Stripe&rsquo;s API deprecation strategy</a> is a good example of a strategy that was <em>extremely</em> well worth the cost for the company&rsquo;s first decade, but eventually became too expensive to maintain as the evolving regulatory environment created more overhead. Fortunately, Stripe modified their strategy to allow some deprecations, but you can imagine an alternate scenario where they attempted to maintain their original strategy, which would have likely failed due to its accumulating costs.</p> <p>Confronting these problems with judging on outputs, it&rsquo;s compelling to switch to the opposite lens and evaluate strategy purely on its inputs. In that approach, as long as the sum of the strategy&rsquo;s parts makes sense, it&rsquo;s a good strategy, even if it didn&rsquo;t accomplish its goals. This approach is very appealing, because it appears to focus <em>purely</em> on the strategy&rsquo;s alpha.</p> <p>Unfortunately, I find this view similarly deficient. For example, the <a href="https://lethain.com/llm-adoption-strategy/">strategy for adopting LLMs</a> offers a cautious approach to adopting LLMs. If that company is outcompeted by competitors in the incorporation of LLMs, to the loss of significant revenue, I would argue that strategy isn&rsquo;t a great one, even if it&rsquo;s rooted in a proper diagnosis and effective policies. Doing good strategy requires reconciling the theoretical with the practical, so we can&rsquo;t argue that inputs alone are enough to evaluate strategy work. If a strategy is conceptually sound, but struggling to make an impact, then its authors should continue to <a href="https://lethain.com/refining-eng-strategy/">refine it</a>. 
If its authors take a single pass and ignore subsequent information that it&rsquo;s not working, then it&rsquo;s a failed strategy, regardless of how thoughtful the first pass was.</p> <p>While I find these mechanisms to be incomplete, they&rsquo;re still instructive. By incorporating bits of each of these observations, we&rsquo;re surprisingly close to a rubric that avoids each of these particular downfalls.</p> <h2 id="rubric-for-strategy">Rubric for strategy</h2> <p>Balancing the strengths and flaws of the previous section&rsquo;s ideas, the rubric I&rsquo;ve found effective for evaluating strategy is:</p> <ol> <li><strong>How quickly is the strategy refined?</strong> If a strategy starts out bad, but improves quickly, that&rsquo;s a better strategy than a mostly right strategy that never evolves. Strategy thrives when its practitioners understand it is a living endeavor.</li> <li><strong>How expensive is the strategy&rsquo;s refinement for implementing and impacted teams?</strong> Just as culture eats strategy for breakfast, good policy loses to poor operational mechanisms every time. Especially early on, good strategy is validated cheaply. Expensive strategies are discarded before they can be validated, let alone improved.</li> <li><strong>How well does the current iteration solve its diagnosis?</strong> Ultimately, strategy does have to address the diagnosis it starts from. Even if you&rsquo;re learning quickly and at a low cost, at some point you do have to actually get to impact. Strategy must eventually be graded on its impact.</li> </ol> <p>With this rubric in hand, we can finally assess <a href="https://lethain.com/uber-service-migration-strategy/">Uber&rsquo;s service migration strategy</a>. It refined rapidly as we improved our tooling, minimized costs because we had to rely on voluntary adoption, and solved its diagnosis extremely well. 
So this was a great strategy, but how do we think about the fact that its diagnosis missed out on the consequences of a wide-spread service architecture on developer productivity?</p> <p>This brings me to the final component of the strategy quality rubric: the recognition that strategy exists across multiple phases. Each phase is defined by new information&ndash;whether or not this information is known by the strategy&rsquo;s authors&ndash;that renders the diagnosis incomplete.</p> <p>The Uber strategy can be thought of as existing across two phases:</p> <ul> <li>Phase 1 used service provisioning to address developer productivity challenges in the monolith.</li> <li>Phase 2 engaged with the consequences of a sprawling service architecture.</li> </ul> <p>All the good grades I gave the strategy are appropriate to the first phase. However, the second phase was ushered in by the negative impacts to developer productivity exposed by the initial rollout. The second phase&rsquo;s grades on the rate of iteration, the cost, and the outcomes were reasonable, but a bit lower than the first phase&rsquo;s. In the subsequent years, the second phase was succeeded by a third phase that aimed to address the second&rsquo;s challenges.</p> <h2 id="does-stopping-mean-a-strategys-bad">Does stopping mean a strategy&rsquo;s bad?</h2> <p>Now that we have a rubric, we can use it to evaluate one of the important questions of strategy: does giving up on a strategy mean that the strategy is a bad one?</p> <p>The vocabulary of strategy phases helps us here, and I think it&rsquo;s uncontroversial to say that a new phase&rsquo;s evolution of your prior diagnosis might make it appropriate to abandon a strategy. For example, Digg owned its own servers in 2010, but would certainly <em>not</em> buy its own servers if it started ten years later. Circumstances change.</p> <p>Sometimes I also think that aborting a strategy in its first phase is a good sign. 
That&rsquo;s generally true when the rate of learning is outpaced by the cost of learning. I recently sponsored a developer productivity strategy that had some impact, but less than we&rsquo;d intended. We immortalized a few of the smaller pieces, and returned further exploration to a <a href="https://lethain.com/when-write-down-engineering-strategy/">lower altitude strategy</a> owned by the teams rather than the high altitude strategy that I owned as an executive.</p> <p>Essentially all strategies are competing with strategies at other altitudes, so I think giving up on strategies, especially high altitude strategies, is almost always a good idea.</p> <h2 id="the-unpierceable-veil">The unpierceable veil</h2> <p>Working within our industry, we are often called upon to evaluate strategies from afar. As other companies rolled out LLMs in their products or microservices for their architectures, our companies pushed us on why we weren&rsquo;t making these changes as well. The <a href="https://lethain.com/exploring-for-strategy/">exploration step</a> of strategy helps determine where a strategy might be useful for you, but even that doesn&rsquo;t really help you evaluate whether the strategy or the strategists were effective.</p> <p>There are simply too many dimensions of the rubric that you cannot evaluate when you&rsquo;re far away. For example, how many phases occurred before the idea that became the external representation of the strategy came into existence? How much did those early stages cost to implement? Is the <em>real</em> mastery in the operational mechanisms that are never reported on? Did the external representation of the strategy ever happen at all, or is it the logical next phase that solves the reality of the internal implementation?</p> <p>With all that in mind, I find that it&rsquo;s generally impossible to accurately evaluate strategies happening in other companies with much conviction. 
Even if you want to, the missing context is an impenetrable veil. That&rsquo;s not to say that you shouldn&rsquo;t try to evaluate their strategies; that&rsquo;s something that you&rsquo;ll be forced to do in your own strategy work. Instead, it&rsquo;s a reminder to keep a low confidence score in those appraisals: you&rsquo;re guaranteed to be missing something.</p> <h2 id="learning-despite-quality-issues">Learning despite quality issues</h2> <p>Although I believe it&rsquo;s quite valuable for us to judge the quality of strategies, I want to caution against going a step further and concluding that you can&rsquo;t learn from poor strategies. As long as you are aware of a strategy&rsquo;s quality, I believe you can learn just as much from failed strategies as from great strategies.</p> <p>Part of this is because often even failed strategies have early phases that work extremely well. Another part is because strategies tend to fail for interesting reasons. I learned just as much from Stripe&rsquo;s failed rollout of agile, which struggled due to missing operational mechanisms, as I did from Calm&rsquo;s successful transition to focus primarily on product engineering. Without a clear point of view on which of these worked, you&rsquo;d be at risk of learning the wrong lessons, but with forewarning you don&rsquo;t run that risk.</p> <p>Once you&rsquo;ve determined a strategy was unsuccessful, I find it particularly valuable to determine the strategy&rsquo;s phases and understand which phase and where in the <a href="https://lethain.com/components-of-eng-strategy/">strategy steps</a> things went wrong. Was it a lack of operational mechanisms? Was the policy itself a poor match for the diagnosis? Was the diagnosis willfully ignorant of a truculent executive? 
Answering these questions will teach you more about strategy than only studying successful strategies, because you&rsquo;ll develop an intuition for which parts truly matter.</p> <h2 id="summary">Summary</h2> <p>Finishing this chapter, you now have a structured rubric for evaluating a strategy, moving beyond &ldquo;good strategy&rdquo; and &ldquo;bad strategy&rdquo; to a nuanced assessment. This assessment is not just useful for grading strategy, but makes it possible to specifically improve your strategy work.</p> <p>Maybe your approach is sound, but your operational mechanisms are too costly for the rate of learning they facilitate. Maybe you&rsquo;ve treated strategy as a single iteration exercise, rather than recognizing that even excellent strategy goes stale over time. Keep those ideas in mind as we head into the final chapter on <a href="https://lethain.com/how-to-get-better-at-strategy/">how you personally can get better at strategy work</a>.</p>Steps to build an engineering strategy.https://lethain.com/components-of-eng-strategy/Thu, 27 Mar 2025 04:00:00 -0700https://lethain.com/components-of-eng-strategy/ <p>Often you&rsquo;ll see a disorganized collection of ideas labeled as a &ldquo;strategy.&rdquo; Even when they&rsquo;re dense with ideas, such documents can be hard to parse, and are a major reason why most engineers will claim their company doesn&rsquo;t have a clear strategy even though in my experience, <em>all</em> companies follow some strategy, even if it&rsquo;s undocumented.</p> <p>This chapter lays out a repeatable, structured approach to drafting strategy. It introduces each step of that approach, which are then detailed further in their respective chapters. 
Here we&rsquo;ll cover:</p> <ul> <li>How these five steps fit together to facilitate creating strategy, especially by preventing practitioners from skipping steps that feel awkward or challenging.</li> <li>Step 1: Exploring the wider industry&rsquo;s ideas and practices around the strategy you&rsquo;re working on. Exploration is understanding what recent research might change your approach, and how the state of the art has changed since you last tackled a similar problem.</li> <li>Step 2: Diagnosing the details of your problem. It&rsquo;s hard to slow down to understand your problem clearly before attempting to solve it, but it&rsquo;s even more difficult to solve anything well without a clear diagnosis.</li> <li>Step 3: Refinement is taking a raw, unproven set of ideas and testing them against reality. Three techniques are introduced to support this validation process: strategy testing, systems modeling, and Wardley mapping.</li> <li>Step 4: Policy makes the tradeoffs and decisions to solve your diagnosis. These can range from specifying how software is architected, to how pull requests are reviewed, to how headcount is allocated within an organization.</li> <li>Step 5: Operations are the concrete mechanisms that translate policy into an active force within your organization. 
These can be nudges that remind you about code changes without associated tests, or weekly meetings where you study progress on a migration.</li> <li>Whether these steps are sacred or are open to adaptation and experimentation, including when you personally should persevere in attempting steps that don&rsquo;t feel effective.</li> </ul> <p>From this chapter&rsquo;s starting point, you&rsquo;ll have a high-level summary of each step in strategy creation, and can decide where you want to read further.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> </div> <h2 id="how-the-steps-become-strategy">How the steps become strategy</h2> <p>Creating effective strategy is not the rote incantation of a formula. You can’t merely follow these steps to guarantee that you&rsquo;ll create a great strategy. However, what I’ve consistently found is that strategies fail more often due to avoidable errors than from fundamentally unsound thinking. Busy people skip steps. Especially steps they dislike or have failed at before.</p> <p>These steps are the scaffolding to avoid those errors. By practicing routinely, you&rsquo;ll build powerful habits and intuition around which approach is most appropriate for the current strategy you&rsquo;re working on. They also help turn strategy into a community practice that you, your colleagues, and the wider engineering ecosystem can participate in together.</p> <p>Each step is an input that flows into the next step. Your exploration is the foundation of a solid diagnosis. Your diagnosis helps you search the infinite space of policy for what you currently need. 
Operational mechanisms help you turn policy into an active force supporting your strategy rather than an abstract treatise.</p> <p>If you&rsquo;re skeptical of the steps, you should certainly maintain your skepticism, but do give them a few tries before discarding them entirely. You may also appreciate the discussion in the chapter on <a href="https://lethain.com/bridging-eng-strategy-theory-and-practice/">bridging between theory and practice when doing strategy</a>.</p> <h2 id="explore">Explore</h2> <p>Exploration is the deliberate practice of searching through a strategy’s problem and solution spaces before allowing yourself to commit to a given approach. It&rsquo;s understanding how other companies and teams have approached similar questions, and whether their approaches might also work well for you. It&rsquo;s also learning why what brought you so much success at your former employer isn&rsquo;t necessarily the best solution for your current organization.</p> <p>The <a href="https://lethain.com/uber-service-migration-strategy/">Uber service migration strategy</a> used exploration to understand the service ecosystem by reading industry literature:</p> <blockquote> <p>As a starting point, we find it valuable to read <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf">Large-scale cluster management at Google with Borg</a> which informed some elements of the approach to Kubernetes, and <a href="https://people.eecs.berkeley.edu/~alig/papers/mesos.pdf">Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center</a> which describes the Mesos/Aurora approach.</p></blockquote> <p>It also used a <a href="https://lethain.com/wardley-mapping/">Wardley map</a> to explore the cloud compute ecosystem.</p> <p><img src="https://lethain.com/static/blog/strategy/wardley-compute-v2.png" alt="Evolution of service orchestration in 2014"></p> <p>For more detail, read the <a 
href="https://lethain.com/exploring-for-strategy/">Exploration chapter</a>.</p> <h2 id="diagnose">Diagnose</h2> <p>Diagnosis is your attempt to correctly recognize the context that the strategy needs to solve before deciding on the policies to address that context. Starting from your exploration&rsquo;s learnings, and your understanding of your current circumstances, building a diagnosis forces you to delay thinking about solutions until you fully understand your problem&rsquo;s nuances.</p> <p>A diagnosis can be largely data driven, such as the <a href="https://lethain.com/private-equity-strategy/">navigating a Private Equity ownership transition strategy</a>:</p> <blockquote> <p>Our Engineering headcount costs have grown by 15% YoY this year, and 18% YoY the prior year. Headcount grew 7% and 9% respectively, with the difference between headcount and headcount costs explained by salary band adjustments (4%), a focus on hiring senior roles (3%), and increased hiring in higher cost geographic regions (1%).</p></blockquote> <p>It can also be less data driven, instead aiming to summarize a problem, such as the <a href="https://lethain.com/pos-acquisition-integration/">Index acquisition strategy</a>&rsquo;s summary of the known and unknown elements of the technical integration prior to the acquisition closing:</p> <blockquote> <p>We will need to rapidly integrate the acquired startup to meet this timeline. We only know a small number of details about what this will entail. We do know that point-of-sale devices directly operate on payment details (e.g. the point-of-sale device knows the credit card details of the card it reads).</p> <p>Our compliance obligations restrict such activity to our “tokenization environment”, a highly secured and isolated environment with direct access to payment details. 
This environment converts payment details into a unique token that other environments can utilize to operate against payment details without the compliance overhead of having direct access to the underlying payment details.</p></blockquote> <p>The approach, and challenges, of developing a diagnosis are detailed in the <a href="https://lethain.com/diagnosis-for-strategy/">Diagnosis chapter</a>.</p> <h2 id="refine-test-map--model">Refine (Test, Map &amp; Model)</h2> <p>Strategy refinement is a toolkit of methods to identify which parts of your diagnosis are most important, and verify that your approach to solving the diagnosis actually works. This chapter delves into the details of using three methods in particular: <a href="https://lethain.com/testing-strategy-iterative-refinement/">strategy testing</a>, <a href="https://lethain.com/strategy-systems-modeling/">systems modeling</a>, and <a href="https://lethain.com/wardley-mapping/">Wardley mapping</a>.</p> <p><img src="https://lethain.com/static/blog/strategy/QualityMentalModels.png" alt="Requests succeeding and failing between a user, load balancer, and server"></p> <p class="tc"><em>An example of a systems modeling diagram.</em></p> <p>These techniques are also demonstrated in the strategy case studies, such as the <a href="https://lethain.com/wardley-llm-ecosystem/">Wardley map of the LLM ecosystem</a>, or the <a href="https://lethain.com/engineering-cost-model/">systems model of backfilling roles without downleveling them</a>.</p> <p>For more detail, read the <a href="https://lethain.com/refining-eng-strategy/">Refinement chapter</a>.</p> <div class="bg-light-gray br4 ph3 pv1"> <h3 id="why-isnt-refinement-earlier-or-later">Why isn&rsquo;t refinement earlier (or later)?</h3> <p>A frequent point of disagreement is that refinement should occur before the diagnosis. Another is that mapping and modeling are two distinct steps, and mapping should occur before diagnosis, and modeling should occur after policy. 
A third is that refinement ought to be the final step of strategy, turning the steps into a looping cycle. These are all reasonable observations, so let me unpack my rationale for this structure.</p> <p>By <em>far</em> the biggest risk for most strategies is not that you model too early, or map too late, but instead that you simply skip both steps entirely. My foremost concern is minimizing the required investment into mapping and modeling such that more folks do these steps at all. Refining after exploring and diagnosing allows you to concentrate your efforts on a smaller number of load-bearing areas.</p> <p>That said, it&rsquo;s common to refine many places in your strategy creation. You&rsquo;re just as likely to have three small refinement steps as one bigger one.</p> </div> <h2 id="policy">Policy</h2> <p>Policy is interpreting your diagnosis into a concrete plan. This plan also needs to work, which requires careful study of what&rsquo;s worked within your company, and what new ideas you&rsquo;ve discovered while exploring the current problem.</p> <p>Policies can range from providing directional guidance, such as the <a href="https://lethain.com/user-data-access-strategy/">user data controls strategy</a>&rsquo;s guidance:</p> <blockquote> <p><strong>Good security discussions don’t frame decisions as a compromise between security and usability.</strong> We will pursue multi-dimensional tradeoffs to simultaneously improve security and efficiency. Whenever we frame a discussion on trading off between security and utility, it’s a sign that we are having the wrong discussion, and that we should rethink our approach.</p> <p>We will prioritize mechanisms that can both automatically authorize and automatically document the rationale for accesses to customer data. The most obvious example of this is automatically granting access to a customer support agent for users who have an open support ticket assigned to that agent. 
(And removing that access when that ticket is reassigned or resolved.)</p></blockquote> <p>To committing not to make a decision until later, as practiced in the <a href="https://lethain.com/pos-acquisition-integration/">Index acquisition strategy</a>:</p> <blockquote> <p>Defer making a decision regarding the introduction of Java to a later date: the introduction of Java is incompatible with our existing engineering strategy, but at this point we’ve also been unable to align stakeholders on how to address this decision. Further, we see attempting to address this issue as a distraction from our timely goal of launching a joint product within six months.</p> <p>We will take up this discussion after launching the initial release.</p></blockquote> <p>This chapter further goes into evaluating policies, overcoming ambiguous circumstances that make it difficult to decide on an approach, and developing novel policies.</p> <p>For full detail, read the <a href="https://lethain.com/policy-for-strategy/">Policy chapter</a>.</p> <h2 id="operations">Operations</h2> <p>Even the best policies have to be interpreted. There will be new circumstances their authors never imagined, and the policies may be in effect long after their authors have left the organization. Operational mechanisms are the concrete implementation of your policy.</p> <p>The simplest mechanisms are an explicit escalation path, as shown in <a href="https://lethain.com/calm-product-eng-company/">Calm&rsquo;s product engineering strategy</a>:</p> <blockquote> <p>Exceptions are granted by the CTO, and must be in writing. The above policies are deliberately restrictive. Sometimes they may be wrong, and we will make exceptions to them. However, each exception should be deliberate and grounded in concrete problems we are aligned both on solving and how we solve them. 
If we all scatter towards our preferred solution, then we’ll create negative leverage for Calm rather than serving as the engine that advances our product.</p></blockquote> <p>From that starting point, the mechanisms can get far more complex. This chapter works through evaluating mechanisms, composing an operational plan, and the most common sorts of operational mechanisms that I&rsquo;ve seen across strategies.</p> <p>For more detail, read the <a href="https://lethain.com/operations-for-strategy/">Operations chapter</a>.</p> <h2 id="is-the-structure-sacrosanct">Is the structure sacrosanct?</h2> <p>When someone&rsquo;s struggling to write a strategy document, one of the first tools others will often recommend is a strategy template. Templates are great: they reduce the ambiguity in an already broad project into something more tractable. If you&rsquo;re wondering if you should use a template to craft strategy: sure, go ahead!</p> <p>However, I find that well-meaning, thoughtful templates often turn into lumbering, callous documents that serve no one well. The secret to a good template is that someone has to own it, and that person has to care about the template writer first and foremost, rather than the various constituencies that want to insert requirements into the strategy creation process. The security, compliance and cost of your plans matter a great deal, but many organizations start to layer in more and more requirements into these sorts of documents until the idea of writing them becomes prohibitively painful.</p> <p>The best advice I can give someone attempting to write strategy is that you should discard every element of strategy that gets in your way <em>as long as</em> you can explain what that element was intended to accomplish. For example, if you&rsquo;re drafting a strategy and you don&rsquo;t find any operational mechanisms that fit, that&rsquo;s fine: discard that section. 
Ultimately, the structure is not sacrosanct; it&rsquo;s the thinking behind the sections that really matters.</p> <p>This topic is explored in more detail in the chapter on <a href="https://lethain.com/readable-engineering-strategy-documents/">Making engineering strategies more readable</a>.</p> <h2 id="summary">Summary</h2> <p>Now you know the foundational steps to conducting strategy. From here, you can dive into the details with the strategy case studies like <a href="https://lethain.com/llm-adoption-strategy/">How should you adopt LLMs?</a> or you can maintain a high altitude starting with how <a href="https://lethain.com/exploring-for-strategy/">exploration creates the foundation for an effective strategy</a>.</p> <p>Whichever you start with, I encourage you to eventually work through both to get the full perspective.</p>Operational mechanisms for strategy.https://lethain.com/operations-for-strategy/Thu, 20 Mar 2025 04:00:00 -0700https://lethain.com/operations-for-strategy/

<p>Even the best policies fail if they aren&rsquo;t adopted by the teams they&rsquo;re intended to serve. Can we persistently change our company&rsquo;s behaviors with a one-time announcement? No, probably not.</p> <p>I refer to the art of making policies work as &ldquo;operations&rdquo; or &ldquo;strategy operations.&rdquo; The good news is that effectively operating a policy is two-thirds avoiding common practices that simply don&rsquo;t work. The other one-third takes some repetition, but can be practiced in any engineering role: there&rsquo;s no need to wait until you&rsquo;re an executive to start building mastery.</p> <p>This chapter will dig into those mechanisms, with particular focus on:</p> <ul> <li>How policies are supported by operations, and how operations are composed of mechanisms that ensure they work well</li> <li>Evaluating operational mechanisms to select between different options, and determine which mechanisms are unlikely to be an effective choice</li> <li>Composing an operational plan for the specific set of policies that you are looking to support</li> <li>Common varieties of effective mechanisms such as approval forums, inspection mechanisms, nudges, and so on. 
We&rsquo;ll also explore the sorts of mechanisms that tend to work poorly</li> <li>How to adjust your approach to operations if you are in an engineering role rather than an executive role</li> <li>How cargo-culting remains the largest threat to effective strategy operations</li> </ul> <p>Let&rsquo;s unpack the details of turning your <em>potentially</em> good policy into an impactful policy.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> </div> <h2 id="what-are-operational-mechanisms">What are operational mechanisms?</h2> <p>Operations are how a policy is implemented and reinforced. Effective operations ensure that your policies actually accomplish something. They can range from a recurring weekly meeting, to an alert that notifies the team when a threshold is exceeded, to a promotion rubric requiring a certain behavior to be promoted.</p> <p>In the strategy for <a href="https://lethain.com/private-equity-strategy/">working with new private equity ownership</a>, we introduce a policy to backfill hires at a lower level, and also limit the maximum number of principal engineers:</p> <blockquote> <p><strong>We will move to an “N-1” backfill policy</strong>, where departures are backfilled with a less senior level. We will also institute a strict maximum of one Principal Engineer per business unit, with any exceptions approved in writing by the CTO–this applies for both promotions and external hires.</p></blockquote> <p>That introduces an explicit operational mechanism of escalations going to the CTO, but it also introduces an implicit and undefined mechanism: how do we ensure the backfills are actually down-leveled as the policy instructs? 
It might be a group chat with engineering recruiting where the CTO approves the level of backfilled roles. Instead, it might be the responsibility of recruiting to enforce that downleveling. In a third approach, it might be taken on trust that hiring managers will do the right thing. Each of those three scenarios is a potential operational solution to implementing this policy. Operations is picking the right one for your circumstances, and then tweaking it as you learn from running it.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><strong>Operations in government</strong></p> <p>For another interesting take on how critical operations are, <em><a href="https://www.recodingamerica.us/">Recoding America</a></em> by Jennifer Pahlka is well worth the read. It explores how well-intended government legislation often isn&rsquo;t implementable, which results in policies that require massive IT investments but provide little benefit to constituents.</p> </div> <h2 id="how-to-evaluate-mechanisms">How to evaluate mechanisms</h2> <p>In order to determine the most effective operational mechanisms for the problems you&rsquo;re working on, it&rsquo;s useful to have a standardized rubric for evaluating mechanisms. While this rubric isn&rsquo;t perfectly universal&ndash;customize it for your needs&ndash;having any rubric will make it easier to evaluate your options consistently.</p> <p>The rubric I use to evaluate whether an operational mechanism will be effective is:</p> <ol> <li><strong>Measurability</strong>: Can you measure both leading and lagging indicators to <a href="https://lethain.com/inspection/">inspect</a> the mechanism&rsquo;s impact? If you have to choose between the two, measuring leading indicators allows much quicker evaluation and iteration on your mechanisms.</li> <li><strong>Adoption cost</strong>: How much work will <a href="https://lethain.com/migrations/">migrating</a> to this mechanism require? 
Can this work be done incrementally or does it require a major, coordinated shift?</li> <li><strong>User ease (or burden)</strong>: After adopting this policy, how much easier (or harder) will it be for users to perform their work? If things will be harder, are those users able to tolerate the additional time?</li> <li><strong>Provider ease (or burden)</strong>: How much additional ongoing maintenance will this mechanism require from the centralized or platform team providing it? For example, if every new architecture proposal requires a thorough review by your Security team, does the Security team have the actual ability to support those reviews?</li> <li><strong>Reliance on authority</strong>: How much does this mechanism depend on a top-down authority&rsquo;s active support? If the sponsoring executive departs, will this mechanism remain effective? Is that an effective tradeoff in this case?</li> <li><strong>Culturally aligned</strong>: Is this something that your organization is going to do, or something that they are going to fight against each step? Is there a way you can adjust the framing to make it more acceptable to your organization&rsquo;s culture?</li> </ol> <p>Generally, I find folks are good at evaluating mechanisms against these criteria, but somewhat worse at accepting the consequences of their evaluation. 
For example, it&rsquo;s common to fall in love with a particular mechanism and then try to force the organization to accept it even though its adoption cost is unbearably high, or to impose a mechanism that creates significant user burden on a team, such as a customer support team, that is already struggling with tight efficiency goals.</p> <p>Self-awareness helps here, but so does consulting others to point out the errors in your reasoning, which is a core part of how I&rsquo;ve found success in adopting operational mechanisms.</p> <h2 id="composing-an-operational-plan">Composing an operational plan</h2> <p>Your operational plan is the sum of the mechanisms used to support your policies. While evaluating each individual mechanism in isolation is part of creating an operational plan, it&rsquo;s also valuable to consider how the mechanisms will work together:</p> <ol> <li> <p><strong>Review the policies you&rsquo;ve developed.</strong> What sort of mechanisms seem most likely to support these policies? How might these mechanisms be pooled together to avoid redundancy?</p> </li> <li> <p><strong>Review the operational mechanisms that have worked in your organization.</strong> What mechanisms have been used to best effect, and which have left a sufficiently bad taste in the organization&rsquo;s collective memory that they&rsquo;ll be hard to reuse effectively?</p> </li> <li> <p><strong>Which new mechanisms showed up in your <a href="https://lethain.com/exploring-for-strategy/">exploration</a>?</strong> In your exploration phase, you&rsquo;ll frequently encounter mechanisms that your organization hasn&rsquo;t previously tried. 
If any of them seem particularly well-suited to the policies you&rsquo;re considering, and none of your organization&rsquo;s frequently used mechanisms are good fits, then consider testing a new one.</p> </li> <li> <p><strong>Evaluate mechanisms against the evaluation rubric.</strong> For each of the mechanisms you&rsquo;re considering using, apply the rubric from the <em>How to evaluate mechanisms</em> section above to validate that they&rsquo;re good fits.</p> </li> <li> <p><strong>Consolidate into an operational plan.</strong> Now that you&rsquo;ve determined the mechanisms you want to consider, work on fitting the full set of mechanisms into one coherent plan. Be particularly mindful of the ease, or burden, the integrated plan creates for both your users and platform providers.</p> </li> <li> <p><strong>Validate plan with users and providers.</strong> Many plans make sense from afar, but fail because they impose an unreasonable burden. Or the burden might be acceptable, but the actual workflow simply won&rsquo;t work at all.</p> </li> <li> <p><strong>Consider validating via <a href="https://lethain.com/testing-strategy-iterative-refinement/">strategy testing</a>.</strong> If you run the above process, and can&rsquo;t come to an agreement with stakeholders on your proposed plan, then simply commit to running a strategy testing process including the plan. This will create space for everyone to build confidence in the approach before they feel forced to make a commitment to following it long-term.</p> <p>Even if you don&rsquo;t use strategy testing for your plan, at least commit to scheduling a review in three months to reflect on how things have worked out.</p> </li> </ol> <p>Your operational plan is the vehicle that delivers your policies to your organization. 
It&rsquo;s extremely tempting to skip refining the details here, but it&rsquo;s a relatively quick step and will completely change your strategy&rsquo;s outcomes.</p> <h2 id="common-mechanisms">Common mechanisms</h2> <p>Most companies have a handful of frequently used operational mechanisms. Some of those mechanisms are company specific, such as <a href="https://forum.commoncog.com/t/the-amazon-weekly-business-review-commoncog/1958">Amazon&rsquo;s weekly business review</a>, and others repeat across companies like requiring executive approval. Across the many mechanisms you&rsquo;ll encounter, you can generally cluster them into recurring categories. This section covers the mechanisms I&rsquo;ve found consistently effective.</p> <h3 id="approval-and-advice-forums">Approval and advice forums</h3> <p>At a high level, new policies are obvious, simple and apply cleanly to the problem they are intended to solve. However, when you apply those policies to detailed, complex circumstances, it&rsquo;s often ambiguous how to stay loyal to a policy&rsquo;s intentions. Approval and advice forums are a common solution to that problem.</p> <p><a href="https://lethain.com/calm-product-eng-company/">Calm&rsquo;s product engineering strategy</a> shows what the simplest, and most common, approval forum looks like in practice:</p> <blockquote> <p><strong>Exceptions are granted by the CTO, and must be in writing.</strong> The above policies are deliberately restrictive. Sometimes they may be wrong, and we will make exceptions to them. However, each exception should be deliberate and grounded in concrete problems we are aligned both on solving and how we solve them. If we all scatter towards our preferred solution, then we’ll create negative leverage for Calm rather than serving as the engine that advances our product.</p> <p>All exceptions must be written. If they are not written, then you should operate as if it has not been granted. 
Our goal is to avoid ambiguity around whether an exception has, or has not, been approved. If there’s no written record that the CTO approved it, then it’s not approved.</p></blockquote> <p>This example also has several weaknesses that happen in many approval forums. Most importantly, it doesn&rsquo;t make it clear how to get approvals. It would be stronger if it explicitly explained how to get an approval (perhaps go ask in <code>#cto-approvals</code>), and where to find prior approvals to help someone considering requesting an exception to calibrate their request.</p> <p>Approvals don&rsquo;t necessarily need to come from senior leadership. Instead, the senior leadership can loan their authority on a topic to another group. The <a href="https://lethain.com/llm-adoption-strategy/">LLM adoption strategy</a> provides a good example of this:</p> <blockquote> <p>Start with Anthropic. We use Anthropic models, which are available through our existing cloud provider via AWS Bedrock. To avoid maintaining multiple implementations, where we view the underlying foundational model quality to be somewhat undifferentiated, we are not looking to adopt a broad set of LLMs at this point. This is anchored in our Wardley map of the LLM ecosystem.</p> <p>Exceptions will be reviewed by the Machine Learning Review in #ml-review</p></blockquote> <p>In a more community-minded organization, the approval forums might not require senior leadership involvement at all. 
Instead, the culture might create an environment where the forums&rsquo; feedback is taken seriously on its own merits.</p> <p>Every company does approval forums a bit differently, ranging from our experiments at <a href="https://lethain.com/navigators/">Carta with Navigators</a>, granting executive authority for technical decisions to named engineers in each area, to Andrew Harmel-Law&rsquo;s discussion of this topic in <em><a href="https://www.amazon.com/Facilitating-Software-Architecture-Empowering-Architectural-ebook/dp/B0DMHGWCPN/">Facilitating Software Architecture</a></em>. You can spend a lot of time arguing the details here, but my experience is that having the right participants and a good executive sponsor matters a lot, and the other pieces matter a lot less.</p> <h3 id="inspection">Inspection</h3> <p>While even the best policies can fail, the more common scenario is that a policy will sort-of work, and need some modest adjustments to make it more successful. An <a href="https://lethain.com/inspection/">inspect</a> mechanism allows you to evaluate whether your policy is succeeding and if you need to make adjustments.</p> <p>The <a href="https://lethain.com/user-data-access-strategy/">user-data access strategy</a> provides an example:</p> <blockquote> <p><strong>Measure progress on percentage of customer data access requests justified by a user-comprehensible, automated rationale.</strong> This will anchor our approach on simultaneously improving the security of user data and the usability of our colleagues’ internal tools. If we only expand requirements for accessing customer data, we won’t view this as progress because it’s not automated (and consequently is likely to encourage workarounds as teams try to solve problems quickly). 
Similarly, if we only improve usability, charts won’t represent this as progress, because we won’t have increased the number of supported requests.</p> <p>As part of this effort, we will create a private channel where the security and compliance team has visibility into all manual rationales for user-data access, and will directly message the manager of any individual who relies on a manual justification for accessing user data.</p></blockquote> <p>This example is a good start, but fully realizing an inspection mechanism requires concretely specifying where and how the data will be tracked. A better version of this would include a link to the dashboard you&rsquo;ll look at, and a commitment to reviewing the data on a certain frequency.</p> <p>For a recent inspection mechanism, I created a recurring invite with a link to the relevant data dashboard, and a specific chat channel for discussion, and invited the working group who had agreed to review the data on that cadence. This wasn&rsquo;t a synchronous meeting, but rather a commitment to independently review, and discuss anything that felt surprising.</p> <p>Your particular mechanisms could be threshold-triggered alerts, something you fold into an existing metrics review meeting, a script you commit to running and reviewing periodically, or something else. The most important thing is that it cannot silently fail.</p> <h3 id="nudges">Nudges</h3> <p>While it&rsquo;s common to hear complaints about how a team isn&rsquo;t following a new policy, as if it were a deliberate choice they&rsquo;d made, I find it more common that people want to do things the new way, but rarely take time to learn how to do it. Nudges are providing individuals with context to inform them about a better way they might do something, and they are an exceptionally effective mechanism.</p> <p>Grounding this in an example, at Stripe we had a policy of allowing teams to self-authorize introducing new cloud hosting costs. 
This worked well almost all the time. However, sometimes teams would accidentally introduce large cost increases without realizing it, and teams that introduced those spikes almost never had any awareness that they had caused the problem. Even if we&rsquo;d told them they must not introduce unapproved spending spikes, they simply didn&rsquo;t perceive they&rsquo;d done it.</p> <p>We had a choice: prevent all teams from introducing new spend, or try using a nudge. The nudge we added informed teams when their cloud spend accelerated month over month, directed them to charts that explained the acceleration, and told them where to go to ask questions. Nudges pair well with inspections, and there was also a monthly review by the Efficiency Engineering team to examine any spikes and reach out where necessary.</p> <p>Maybe we could have forced all teams to review new spend, but this nudge approach didn&rsquo;t require an authoritative mandate to implement. It also meant we only spent time advising teams that <em>actually</em> spent too much, instead of having to discuss with every team that <em>might</em> spend too much.</p> <p>As another example making that point, a working group at Carta added a nudge to inform managers of untested pull requests merged by their team. Some managers had previously said they simply didn&rsquo;t know when and why their team had merged untested pull requests, and this nudge made it easy to detect. The nudge also respected their attention by not sending a notification at all if there wasn&rsquo;t a new, untested pull request.</p> <p>With poor ergonomics, nudges can be an overwhelming assault on your colleagues&rsquo; attention, but done well, I continue to believe they are the most effective operational mechanism.</p> <h3 id="documentation">Documentation</h3> <p>Policies can&rsquo;t be enforced by people who don&rsquo;t know they exist, or by people who don&rsquo;t know how to follow those policies. 
In my experience, nudges are the most effective way of solving both of those problems, because nudges bring information to people at exactly the moment that information would be useful. At most companies, well-done nudges are relatively uncommon, and the far more common solution to lack of information is documentation and training.</p> <p>There are so many approaches to both of these topics, and I&rsquo;ve not found my own approaches here particularly effective. Consequently, I am hesitant to give much advice on what will work best for you. The best I can offer is that following standard practices for your company, even if the outcomes seem imperfect, is probably your best bet. Internal knowledge bases tend to rot quickly, and introducing yet another knowledge base is almost always the illusion of progress rather than real progress. Even when you really don&rsquo;t like the current one.</p> <p>Finally, remember that success for documentation and training is not necessarily that everyone in the company knows how a new policy works. Instead, as discussed in <a href="https://lethain.com/is-engineering-strategy-useful/">the chapter on whether strategy is useful</a>, a more useful goal is informational herd immunity: as long as someone on each team understands your policy, the team will generally be capable of following it.</p> <h3 id="automation">Automation</h3> <p>Relying on humans to respond is slow, and the quality of human response is highly varied. In many cases, automation provides the most effective and most scalable mechanism to support your policies&rsquo; rollout.</p> <p>Automation was key in the <a href="https://lethain.com/uber-service-migration-strategy/">Uber service migration strategy</a>, moving us out of a manual, slow process that was taking up a great deal of user and provider time:</p> <blockquote> <p>Move to structured requests, and out of tickets. Missing or incorrect information in provisioning requests create significant delays in provisioning. 
Further, collecting this information is the first step of moving to a self-service process. As such, we can get paid twice by reducing errors in manual provisioning while also creating the interface for self-service workflows.</p></blockquote> <p>In that case, better automation allowed us to eliminate a series of back-and-forth negotiations to collect data, and to instead get the necessary information in a single step. Occasionally we still ran into users who couldn&rsquo;t fill in the form, but now we could focus on providing a good manual experience for those rare exceptions.</p> <p>As you use automation as a core strategy mechanism, it&rsquo;s important to recognize that designing an effective user experience is a prerequisite to automation having a positive impact. If you view the user experience of your automation as a secondary concern, then you are unlikely to make much impact with automation.</p> <h3 id="deferment-to-future-work">Deferment to future work</h3> <p>Sometimes there&rsquo;s something you really want a policy to do, but you also know that you have no reasonable mechanism to do it. In that case, you may find explicitly deferring action on the topic useful.</p> <p>The strategy for <a href="https://lethain.com/pos-acquisition-integration/">integration of the Index acquisition at Stripe</a> uses this mechanism:</p> <blockquote> <p>Defer making a decision regarding the introduction of Java to a later date: the introduction of Java is incompatible with our existing engineering strategy, but at this point we’ve also been unable to align stakeholders on how to address this decision. 
Further, we see attempting to address this issue as a distraction from our timely goal of launching a joint product within six months.</p> <p>We will take up this discussion after launching the initial release.</p></blockquote> <p>As did the strategy for <a href="https://lethain.com/private-equity-strategy/">working with a private equity acquirer</a>:</p> <blockquote> <p>We believe there are significant opportunities to reduce R&amp;D maintenance investments, but we don’t have conviction about which particular efforts we should prioritize. We will kickoff a working group to identify the features with the highest support load.</p></blockquote> <p>There&rsquo;s no shame in deferral. As much as you want to make progress on a certain area, it&rsquo;s better to explicitly acknowledge that you can&rsquo;t make progress on it&ndash;and clarify when you will be able to&ndash;than to allow the organization to churn on an intractable problem.</p> <h3 id="meetings">Meetings</h3> <p>Meetings are the final mechanism, and you can fit any and all of the above mechanisms into a meeting. They are a universal mechanism, although frequently overused because they can do an adequate job of operating almost any policy.</p> <p>The most common mechanism is a reporting meeting, such as reporting progress in the Executive Weekly Meeting as <a href="https://lethain.com/llm-adoption-strategy/">suggested in the LLM adoption strategy</a>:</p> <blockquote> <p><strong>Develop an LLM-backed process for reactivating departed and suspended drivers in mature markets.</strong> Through modeling our driver lifecycle, we determined that improving onboarding time will have little impact on the total number of active drivers. 
Instead, we are focusing on mechanisms to reactivate departed and suspended drivers, which is the only opportunity to meaningfully impact active drivers.</p> <p>Report on progress monthly in Exec Weekly Meeting, coordinated in #exec-weekly</p></blockquote> <p>The other common meeting archetype is the <a href="https://lethain.com/testing-strategy-iterative-refinement/">weekly working meeting</a> introduced in the chapter on strategy testing. Meetings are almost always the most expensive mechanism you can find to solve a problem, but they are easy to suggest, run, and iterate on.</p> <p>If you can&rsquo;t find any other mechanism you believe in, then a meeting is a decent starting point. Just don&rsquo;t get too fond of them, and try to iterate your way to canceling every meeting that you start.</p> <h2 id="anti-patterns">Anti-patterns</h2> <p>In addition to the effective operational methods discussed above, there are a number of additional mechanisms that are frequently used, but which I consider anti-patterns. They can provide some value, but there&rsquo;s almost always a better alternative.</p> <ol> <li> <p><strong>Top-down pronouncements</strong>: Sometimes a policy will be operationalized by simply declaring it must be followed. It&rsquo;s common to see a leader declare that a policy is now in effect, assuming that the announcement is a useful way to implement the new policy.</p> <p>For example, some &ldquo;return to office&rdquo; policies dictate that the team must work from their office, but driving a real change requires motivating those individuals to actually return.</p> </li> <li> <p><strong>Education-as-announcements rollouts</strong>: The default way that many companies roll out policies is through one-time &ldquo;education,&rdquo; often as an all-company announcement for existing employees. They might follow up by updating training for onboarding new hires. 
Education sounds great, but a couple of trainings will never change organizational behavior.</p> <p>Changing behavior requires ongoing reminders, visible role models, inspection to understand why some teams are <em>not</em> adopting the behavior, and so on. Education can be a good component of operationalizing a policy, but it cannot stand on its own.</p> </li> <li> <p><strong>Mandatory recurring trainings:</strong> These are a staple of compliance-driven policies, generally because of laws which require providing a certain number of hours of relevant training each year.</p> <p>There are two deep challenges with mandatory trainings. First, because attendance is <em>required</em>, people tend to make little effort to make the content good. Second, many folks don&rsquo;t pay attention because they expect the content to be low quality. It&rsquo;s not uncommon to hear people say that they&rsquo;ve never heard of a policy that they&rsquo;ve performed annual training on for multiple years.</p> <p>It&rsquo;s possible to overcome these barriers, but in a situation where you&rsquo;re accountable for changing outcomes, as opposed to shifting legal obligations away from the company, these tend to work poorly.</p> </li> <li> <p><strong>Just change the culture.</strong> Some leaders frame most problems as cultural problems, which is a reasonable frame: most things can be usefully viewed as a cultural problem. Unfortunately, it&rsquo;s common for those who rely heavily on the cultural frame to also have a simplistic view about how culture is changed.</p> <p>Changing an organization&rsquo;s culture is tricky, and requires a combination of many techniques to create visible leaders role modeling the new behavior, and reinforcement mechanisms to ensure pockets of dissent are weeded out. 
Anyone who frames culture change as a simple or instant change is living in an imaginary world.</p> </li> </ol> <p>If you&rsquo;re using one of these approaches, it isn&rsquo;t necessarily a bad choice. Instead, you should just make sure you can explain why you&rsquo;re using it, and then you need to also make sure you believe that explanation. If you don&rsquo;t, look for a mechanism from the earlier list of common mechanisms.</p> <h2 id="what-if-youre-not-an-executive">What if you&rsquo;re not an executive?</h2> <p>It&rsquo;s easy to get discouraged when you think about which operational mechanisms are available to you as a non-executive. So many of the frequently seen mechanisms, like running mandatory recurring meetings or a binding architecture review process, are not accessible to you.</p> <p>That is true: they&rsquo;re not accessible to you. However, there&rsquo;s always a related mechanism that can be implemented with less authority. The binding architecture process can be replaced with an architectural advice process. The mandatory review of pull requests can be replaced with a nudge.</p> <p>Although it may be more common to see the authoritative mechanisms in the companies you work in, my experience working as an executive is that these authoritative mechanisms don&rsquo;t work particularly well. They do a great job of technically shifting accountability to the wider organization, but they often don&rsquo;t change behavior at all. So, instead of getting frustrated by what you can&rsquo;t do, focus instead on the mechanisms that are available to you today. Add nudges, focus on the real dynamics of how colleagues do work in your organization, and build a real dataset.</p> <p>It&rsquo;s very hard to get an executive to support your initiative before the mechanisms and data exist to support it, and very easy to get their support once they do. 
Once you&rsquo;ve done what you can without authority to build confidence, if you really do need more authority, then you&rsquo;re in a good place to escalate to get an executive to support your policies.</p> <h2 id="beware-cargo-culting">Beware cargo-culting</h2> <p>The longer that I am in the industry, the more I am surprised by how few strategists seem to care if their approach actually works. Instead, they seem focused on doing something that <em>might</em> work, offloading accountability to either the organization or some team, and then moving on to the next problem.</p> <p>Perhaps this is driven by an unfortunate reality that leaders are often evaluated by how they appear, rather than by what they accomplish. Whether or not that&rsquo;s the underlying reason for why it happens, it does make it surprisingly difficult to know which patterns to borrow from strategy rollouts and implementations.</p> <p>The best advice, unfortunately, is to remain skeptically optimistic. Collect ideas widely, but force the ideas to prove their merit.</p> <h2 id="summary">Summary</h2> <p>Now that you&rsquo;ve finished this chapter, you&rsquo;re significantly more qualified to write a complete, useful strategy than I was a decade into my career. Often skipped, the operations behind your strategy are at least as essential as any other step, and any strategy without them will fade quietly into your organization&rsquo;s history.</p> <p>In addition to being able to roll out a strategy of your own, this chapter also provides a useful rescue toolkit you can use to put an existing, floundering strategy back on track. If you don&rsquo;t see an opportunity to write new strategy within your organization, then there&rsquo;s still probably room to flex your operational skill.</p>