<h1><a href="https://lethain.com/how-to-get-more-headcount/">How to get more headcount.</a></h1> <p>One of the recurring challenges that teams face is getting headcount to support their initiatives. A similar problem is the idea that a team can&rsquo;t get a favored project into their roadmap. In both cases, teams often create a story about how clueless executives don&rsquo;t understand why their work is important.</p> <p>I understand why dumb executives are such an appealing explanation for problems: it fits perfectly into the <a href="https://en.wikipedia.org/wiki/Karpman_drama_triangle">Karpman drama triangle</a> by making executives the villain and the team the victim, but I generally find that these sorts of misalignments are the result of basic communication challenges rather than something more exciting.</p> <p>When there&rsquo;s significant misalignment between a team and an executive, my experience is that it often manifests in discussion about a particular project, but it&rsquo;s often rooted in a much broader topic rather than whatever is currently being discussed. Because the disagreement is about the larger topic, there&rsquo;s no way to resolve it while discussing the narrow project at hand, and teams struggle to make progress because <a href="https://lethain.com/layers-of-context/">they&rsquo;re arguing on the wrong layer of context</a>.</p> <p>To resolve disagreement, the general hierarchy of alignment is:</p> <ol> <li> <p><strong>Do we agree on the problem to be solved?</strong></p> <p><em>e.g. We&rsquo;re having too many incidents and it&rsquo;s impacting user perception and developer productivity.</em></p> </li> <li> <p><strong>Do we agree on the general approach to solving that problem?</strong></p> <p><em>e.g. We&rsquo;re increasing end-to-end coverage and rejecting PRs that reduce coverage.</em></p> </li> <li> <p><strong>What evidence do we have that the team is executing well today?</strong></p> <p><em>e.g. Here are metrics on: how many user-impacting incidents we&rsquo;ve had over time, how we&rsquo;ve increased end-to-end coverage over time, and developer survey feedback on incidents from the last three quarterly internal surveys.</em></p> </li> <li> <p><strong>Alignment on the particular topic at hand: headcount, roadmapping, prioritization of a specific project, and so on.</strong></p> <p><em>e.g. To speed up further work on this, we&rsquo;re requesting two more engineers.</em></p> </li> </ol> <p>If you are misaligned on any of the first three topics, addressing the fourth is folly. For example, if the executive believes that your team is not executing your current project well, then they won&rsquo;t believe that giving you more headcount is useful, because you&rsquo;re already screwing up. To convince them to approve a headcount request, you need to first find evidence that your team is doing good work today.</p> <p>Similarly, if the executive doesn&rsquo;t agree with you on your problem or general approach, your headcount request is dead in the water. This is one of the reasons that I see bottoms-up &ldquo;team mission&rdquo; initiatives fail so frequently.
Teams define their mission, and then tell executives what they&rsquo;ve decided to focus on, but that&rsquo;s a general misunderstanding of why teams exist: teams exist to solve a company need, not to solve the problem that the team itself wants to solve. When teams lean on their self-selected problem or approach to defend to an executive why they won&rsquo;t do something, they dig in on a foundational misalignment that prevents addressing more nuanced discussions like project prioritization.</p> <p>The solution here is obvious: always make sure you agree on the problem and general solution, and provide evidence that the team is working well. These can be an appendix of a document or appendix slides, and should take little to no time to prepare, as the first two are core decisions for your team, and the latter is a set of metrics or plans that you should already be maintaining as part of operating your team.</p> <p>If you refuse to engage on the first three topics, and skip directly to the fourth topic in aligning with an executive, then you are generally falling back to relying on social dynamics and executives&rsquo; general view of your prior work&ndash;what some might call politics&ndash;rather than having a joint problem solving session together. If you&rsquo;re complaining about politics, and not taking the time to answer the first three, then perhaps you are inadvertently contributing to the political environment that you dislike.</p> <p>As always when discussing <a href="https://lethain.com/extract-the-kernel/">challenges communicating with executives</a>, it&rsquo;s true that executives should get better at explaining where they&rsquo;re confused or struggling with the rationale. However, it&rsquo;s a lot more useful to simply get better at this yourself than to spend time bemoaning how executives could, as a universal group, improve their communication. I know a lot of people who improved their company or moved into more senior roles by improving their own communication. I know zero folks who did either of those by complaining that executives are bad communicators, although they certainly weren&rsquo;t wrong about it!</p> <h1><a href="https://lethain.com/private-equity-strategy/">Navigating Private Equity ownership.</a></h1> <p>In 2020, you could credibly argue that <a href="https://www.readmargins.com/p/zirp-explains-the-world">ZIRP explains the world</a>, but that&rsquo;s an impossible argument to make in 2024 when zero-interest rate policy is only a fond memory. Instead, we&rsquo;re seeing a number of companies designed for rapid expansion learning to adapt to a world that expects immediate free cash flow rather than accepting the sweet promise of discounted future cash flow.</p> <p>This chapter tackles that problem head-on, taking the role of an engineering organization attempting to navigate new ownership by a private equity group. It&rsquo;s an increasingly frequent scenario: after many years of learning to operate under the direction of its original founders, and the brief excitement of going public, now there&rsquo;s a short runway to change operating models. Let&rsquo;s call this company Fungible Ecommerce Company.
It&rsquo;s a platform for supporting online commerce, and this is their Engineering Leadership team&rsquo;s attempt to think through their options while waiting for new ownership to provide concrete guideposts.</p> <hr> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> <h2 id="reading-this-document">Reading this document</h2> <p>To apply this strategy, start at the top with <em>Policy</em>. To understand the thinking behind this strategy, read sections in reverse order, starting with <em>Explore</em>, then <em>Diagnose</em> and so on. Relative to the default structure, this document has been refactored in two ways to improve readability: first, <em>Operation</em> has been folded into <em>Policy</em>; second, <em>Refine</em> has been embedded in <em>Diagnose</em>.</p> <p>More detail on this structure in <a href="https://lethain.com/readable-engineering-strategy-documents">Making a readable Engineering Strategy document</a>.</p> <h2 id="policy">Policy</h2> <p>Our policy for managing our new ownership structure is:</p> <ul> <li> <p>We believe our new ownership will provide a specific target for Research and Development (R&amp;D) operating expenses during the upcoming financial year planning. <strong>We will revise these policies again once we have explicit targets</strong>, and will delay planning around reductions until we have those numbers to avoid running two overlapping processes.</p> <p>That said, looking at our R&amp;D investment relative to a comparably growing peer set, we believe that we&rsquo;ll get pressure to moderately reduce our spend. We aim to accomplish that reduction through a series of policies and one-off infrastructure projects, without requiring a major reduction in headcount spend.</p> </li> <li> <p><strong>We will move to an &ldquo;N-1&rdquo; backfill policy</strong>, where departures are backfilled with a less senior level. <strong>We will also institute a strict maximum of one Principal Engineer per business unit</strong>, with any exceptions approved in writing by the CTO&ndash;this applies for both promotions and external hires. These policies are effective immediately, and are based on our <a href="https://lethain.com/engineering-cost-model/">model of engineering-org seniority-mix</a>.</p> <p>We commit to this policy reducing headcount costs by approximately 5% YoY every year for the foreseeable future.</p> </li> <li> <p>We evaluated a number of potential changes to our geographical hiring strategy, but we believe that staffing engineers alongside cross-functional partners (Product, Marketing, Sales, and so on) is a priority. We have not been able to reach an agreement cross-functionally, and as such <strong>we are not changing our geographical hiring strategy at this time</strong>.</p> <p>If we can agree on a policy here, we could accomplish a 10-20% reduction in cost over 2-3 years, but the details matter a great deal, so we cannot commit to a specific outcome until we get more cross-functional alignment.</p> </li> <li> <p>Our infrastructure spend has grown significantly more slowly than revenue for the past two years, meaning that we&rsquo;ve successfully implemented our infrastructure spend strategy of <a href="https://infraeng.dev/efficiency/">growing infrastructure costs more slowly than revenue</a>.
<strong>We will continue our current infrastructure efficiency strategy</strong>, and believe there are relatively few high-impact efficiency opportunities remaining at this point.</p> <p>We commit to growing infrastructure spend at no more than 5% YoY, significantly lower than our projected revenue increase of 25% YoY.</p> </li> <li> <p>There are two narrow infrastructure spend opportunities, both related to the integration of prior acquisitions into our shared infrastructure and away from one-off approaches. <strong>We will prioritize the post-acquisition integration work next quarter</strong>, with the goal of fully standardizing all infrastructure across the company onto the stack maintained by our centralized Infrastructure Engineering team.</p> <p>We commit to a one-time reduction in infrastructure spend of 3% YoY.</p> </li> <li> <p>We believe there are significant opportunities to reduce R&amp;D maintenance investments, but we don&rsquo;t have conviction about which particular efforts we should prioritize. <strong>We will kick off a working group to identify the features with the highest support load.</strong></p> </li> </ul> <h2 id="diagnose">Diagnose</h2> <p>We&rsquo;ve diagnosed Fungible Ecommerce Company&rsquo;s current state as:</p> <ul> <li> <p>Fungible Ecommerce Company&rsquo;s revenue has grown 20-25% YoY for the past two years, and our target for next year is 25% YoY revenue growth. While this is not a guarantee (we grew slower than 25% last year), it&rsquo;s a defensible goal that we have a good chance of achieving.</p> </li> <li> <p>Our Engineering headcount costs have grown by 15% YoY this year, and 18% YoY the prior year. Headcount grew 7% and 9% respectively, with the difference between headcount and headcount costs explained by salary band adjustments (4%), a focus on hiring senior roles (3%), and increased hiring in higher cost geographic regions (1%).</p> </li> <li> <p>Based on general practice, it seems likely that our new Private Equity ownership will expect us to reduce R&amp;D headcount costs through a reduction. However, we don&rsquo;t have any concrete details to make a structured decision on this, and our approach would vary significantly depending on the size of the reduction.</p> </li> <li> <p>Infrastructure engineering spend (including vendors) has grown by 4-5% YoY for the past three years. We made a significant push on reducing costs three years ago, and have grown significantly slower than revenue since then.</p> <p>There are few remaining opportunities to significantly reduce infrastructure costs, but we&rsquo;ve made several acquisitions since our prior infrastructure consolidation that represent significant potential savings: roughly one-time 1.5% YoY reductions for each of the two largest opportunities.</p> </li> <li> <p>A significant portion of our current R&amp;D spend goes into maintaining our existing functionality, particularly functionality related to earlier geo-expansion efforts that now applies only narrowly to a few small markets. We suspect there&rsquo;s an opportunity to reduce maintenance overhead here.</p> <p>However, we lack believable metrics on both (1) time spent maintaining the software and (2) time that would be saved by these cleanup efforts. As a result, it&rsquo;s hard to pitch projects of this sort as cost saving with much conviction.</p> </li> </ul> <h2 id="explore">Explore</h2> <p>Financial markets evaluate companies in comparison to their peers.
This is most obvious in public markets, where there&rsquo;s significant information transparency about business performance, and sufficient liquidity to allow markets to revalue companies in something approaching real-time. While private equity firms generally take controlling interests in private businesses, or buy public businesses with the intent of taking them private, they value businesses in the same way.</p> <p>In this exploration, we&rsquo;re going to dig into two particular questions. First, we&rsquo;re going to examine a dataset on the performance of public technology companies, and second we&rsquo;re going to look into the concrete example of Zendesk, <a href="https://www.reuters.com/markets/deals/zendesk-goes-private-10-bln-deal-2022-11-22/">who were taken private in 2022</a> after being bought by two private equity firms.</p> <h3 id="comparable-companies">Comparable companies</h3> <p>Exploring the benchmarking question first, most investors evaluate engineering within the context of the overall Research &amp; Development (R&amp;D) investment. They generally judge that spend by constructing a scatterplot of R&amp;D spend versus year-over-year revenue growth for a cohort of similar companies. Perfectly similar companies don&rsquo;t exist, so this cohort is generally constructed from companies in similar industries, with similar revenue, and operating in the same regions.</p> <p>We have reached out to our investors to see if they can provide the internal datasets they use for this analysis, but in the meantime we&rsquo;ve developed a directionally useful dataset using the <a href="https://iri.jrc.ec.europa.eu/scoreboard/2023-eu-industrial-rd-investment-scoreboard">2023 R&amp;D Investment Scoreboard</a>, with some <a href="https://docs.google.com/spreadsheets/d/1IwO3XWDd1inVXLBw4FhkaQh5OuUlYQf0NsX95nPiOtA/edit?gid=943277176#gid=943277176">rough cutting of the data</a> to remove outliers. (If we repeat this process, we will use the <a href="https://www.sec.gov/search-filings">SEC&rsquo;s EDGAR database</a> to pull a more specifically helpful dataset, but this has been a useful starting point.)</p> <p><img src="https://lethain.com/static/blog/strategy/rd-opincome-2022.png" alt="Scatterplot of R&amp;D investment versus operating profit growth at public companies."></p> <p>This isn&rsquo;t a perfect dataset&ndash;we would prefer revenue growth over growth in operating profit&ndash;but it&rsquo;s the best option within the dataset that we were able to quickly pull down. Nonetheless, there&rsquo;s a clear strong-performer quadrant in the top-left that we can plot ourselves into to understand our general performance, which is discussed further in the diagnosis section above.</p> <h3 id="zendesk">Zendesk</h3> <p>The second topic of exploration we dug into is understanding the general sequence of steps taken by private equity ownership after taking ownership of a company.
For an example with available public documentation, we focused on <a href="https://www.reuters.com/markets/deals/zendesk-goes-private-10-bln-deal-2022-11-22/">the purchase of Zendesk in 2022</a>.</p> <p>To start, we pulled Zendesk&rsquo;s <a href="https://www.sec.gov/ix?doc=/Archives/edgar/data/0001463172/000146317222000236/zen-20220630.htm">final 10-Q before going private</a>.</p> <p><img src="https://lethain.com/static/blog/strategy/zendesk-pl-2022.png" alt="Zendesk&rsquo;s P&amp;L from their 2022 10-Q"></p> <p>Taking those values, we can reformat them into a chart focusing on the year-over-year changes in the six-month period ending in 2022 versus the same period in 2021.</p> <p><img src="https://lethain.com/static/blog/strategy/zendesk-yoy-6m-2022.png" alt="Zendesk&rsquo;s P&amp;L from their 2022 10-Q, reformatted to show year-over-year changes"></p> <p>The changes are a bit concerning. Sales and Marketing costs have grown more slowly than revenue, which is positive, but Research and Development (R&amp;D) expenses have grown about 50% faster than revenue, and General and Administration (G&amp;A) charges have grown more than twice as quickly as revenue.</p> <p>From those growth rates, we would assume that the new ownership might push to aggressively reduce spend in those two areas, which is indeed what history suggests happened, with a <a href="https://www.zendesk.com/newsroom/articles/company-announcement/">November 2022 reduction</a>, followed some months later by a <a href="https://www.zendesk.com/newsroom/articles/zendesk-workforce-reduction/">May 2023 reduction</a>. It&rsquo;s hard to get precise data here, but it&rsquo;s our impression that these reductions focused on areas where expenses were growing quickly, with particular focus on G&amp;A functions.</p> <h1><a href="https://lethain.com/strategy-systems-modeling/">Using systems modeling to refine strategy.</a></h1> <p>While I was probably late to learn the concept of <a href="https://lethain.com/testing-strategy-iterative-refinement/">strategy testing</a>, I might have learned about systems modeling too early in my career, stumbling on Donella Meadows&rsquo; <em><a href="https://www.amazon.com/Thinking-Systems-Donella-H-Meadows-ebook/dp/B005VSRFEA/">Thinking in Systems: A Primer</a></em> before I began my career in software.
Over the years, I&rsquo;ve discovered a number of ways to misuse systems modeling, but it remains the most effective, flexible tool I&rsquo;ve found for debugging complex problems.</p> <p>In this chapter, we&rsquo;ll work through:</p> <ul> <li>when systems modeling is a useful technique, and when it&rsquo;s better to rely on other refinement techniques like Wardley mapping or strategy testing instead</li> <li>a two minute primer on the basics of systems modeling, along with resources for those looking for a deeper exploration of the foundational topics</li> <li>a discussion on systems modeling tooling, why there&rsquo;s no perfect systems modeling tool out there, and how I recommend picking the tool that you build proficiency with</li> <li>the steps to build a systems model for a problem you&rsquo;re engaging with</li> <li>how to document your learnings from a systems model to maximize the chance that others will pay attention to it rather than ignoring it due to the unfamiliarity or complexity of the tooling</li> <li>what systems modeling can&rsquo;t do, even if you really want to believe it can</li> </ul> <p>After working through this chapter&rsquo;s overview of systems modeling, you can see the approaches implemented in a number of system models created to refine the strategies throughout this book. The theory of systems modeling is certainly interesting, but hopefully seeing real models in support of concrete engineering strategies will be even more useful.</p> <hr> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> <h2 id="when-is-systems-modeling-useful">When is systems modeling useful?</h2> <p>Although <a href="https://lethain.com/refining-eng-strategy/">refinement</a> is an important step of developing any strategy, some refinement techniques work better for any given strategy. Systems modeling is extremely useful in three distinct scenarios:</p> <ol> <li>When you&rsquo;re unsure where leverage points might be in a complex system, modeling allows you to cheaply test which levers might be meaningful. For example, <a href="https://lethain.com/driver-onboarding-model/">modeling onboarding drivers in a ride-sharing app</a> showed that improving onboarding was less important than reengaging departed drivers.</li> <li>When you have significant data to compare against, which allows you to focus in on the places where the real data and your model are in tension. For example, I was able to <a href="https://lethain.com/productivity-in-the-age-of-hypergrowth/">model the impact of hiring on Uber&rsquo;s engineering productivity</a>, and then compare that with internal data.</li> <li>When stakeholder disagreements are based in their unstated intuitions, models can turn those intuitions into something structured that can be debated more effectively.</li> </ol> <p>In all three categories, modeling makes it possible to iterate your thinking much faster than running a live process or technology experiment with your team. I sometimes hear concerns that modeling slows things down, but this is just an issue of familiarity. With practice, modeling can be faster than asking industry peers for advice. The actual models I&rsquo;ve developed for this book took less than an hour.
(With one notable exception: <a href="https://lethain.com/dx-llm-model/">modeling Large Language Models (LLMs) impacts on developer experience</a>, which took much longer because I deliberately used an impractical tool to reveal the importance of good tooling.)</p> <p>Additionally, systems modeling will often expose counter-intuitive dimensions to the problem you&rsquo;re working on. For example, the model I mentioned above on LLMs&rsquo; impact on developer experience suggests that effective LLMs might cause us to spend <em>more</em> time writing and testing code (but less fixing issues discovered post-production). This is a bit unexpected, as you might imagine they&rsquo;d reduce testing time, but reducing testing time is only valuable to the extent that issues identified in production remain&ndash;at worst&ndash;constant; if issues found in production increase, then reduced testing time does not contribute to increased productivity.</p> <p>Modeling without praxis creates unsubstantiated conviction. However, in combination with praxis, I&rsquo;ve encountered few other techniques that can similarly accelerate learning.</p> <p>That doesn&rsquo;t mean that it&rsquo;s always the ideal refinement technique. If you already have conviction on the general approach, and want to refine the narrow details, then <a href="https://lethain.com/testing-strategy-iterative-refinement/">strategy testing</a> is a better option. If you&rsquo;re trying to understand the evolution of a wider ecosystem, then you may prefer Wardley mapping.</p> <h2 id="two-minute-primer">Two minute primer</h2> <p>If you want an exceptional introduction to systems thinking, there&rsquo;s no better place to go than Donella Meadows&rsquo; <a href="https://www.amazon.com/dp/1603580557">Thinking in Systems</a>. If you want a worse, but shorter, introduction, I wrote a short <a href="https://lethain.com/systems-thinking/">Introduction to systems thinking</a> available online and in <em>An Elegant Puzzle</em>.</p> <p>If you want something <em>even shorter</em>, then here&rsquo;s the briefest that I can manage.</p> <p><img src="https://lethain.com/static/blog/strategy/QualityMentalModels.png" alt="Systems model of requests succeeding and failing between a user, load balancer, and server."></p> <p>Accumulations are called <em>stocks</em>. For example, each of the boxes (<code>Requests</code>, <code>Server</code>, etc) in the above diagram is a stock. Changes to stocks are called <em>flows</em>. Every arrow (<code>OK</code>, <code>Error in server</code>, etc) between stocks in the diagram is a flow.</p> <p>Systems modeling is the practice of using various configurations of stocks and flows to understand circumstances that might otherwise have surprising behavior or are too slow to understand from measurement.</p> <p>For example, we can use the above model to explore the tradeoffs between a load balancer that does and does not cap throughput to a load-sensitive service behind it.</p> <p><img src="https://lethain.com/static/blog/strategy/two-min-primer-chart.png" alt="Chart showing the number of successful and errored requests in two different scenarios."></p> <p>Without a model, you might get into a philosophical debate about how ridiculous it is that the downstream server is load-sensitive. With the model, it&rsquo;s immediately obvious that it&rsquo;s worthwhile protecting it, even if it certainly is concerning that it is so sensitive. This is what models do: they create a cheap way to understand reality when fully understanding reality is cumbersome.</p>
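<p>To make the notation concrete, here&rsquo;s a rough sketch of the request model above in the textual format used by <a href="https://github.com/lethain/systems">lethain/systems</a>, the toolchain discussed in the next section. Treat it as a hedged illustration: the stock names and rates are assumptions of mine, not values taken from the diagram or chart.</p> <pre tabindex="0"><code># hypothetical sketch of the request model from the primer
# an infinite pool of users issues 10 requests per round
[Users] &gt; Requests @ 10
# most requests reach the server, some fail at the load balancer
Requests &gt; Server @ Leak(0.9)
Requests &gt; Errors @ Leak(0.1)
# the load-sensitive server succeeds or errors
Server &gt; OK @ Leak(0.8)
Server &gt; Errors @ Leak(0.2)
</code></pre><p>Even a sketch this small is enough to experiment with the capping question: constrain the flow into <code>Server</code>, rerun, and compare how the <code>OK</code> and <code>Errors</code> stocks accumulate.</p>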
<div class="bg-light-gray br4 ph3 pv1"> <p><strong>More systems thinking resources</strong></p> <p><em><a href="https://www.amazon.com/Thinking-Systems-Donella-H-Meadows-ebook/dp/B005VSRFEA/">Thinking in Systems: A Primer</a></em> by Donella Meadows</p> <p><em><a href="https://www.amazon.com/Business-Dynamics-Systems-Thinking-Modeling/dp/007238915X">Business Dynamics: Systems Thinking and Modeling for a Complex World</a></em> by John D. Sterman</p> <p><em><a href="https://www.amazon.com/Introduction-Systems-Thinking-Richmond-2004-11-15/dp/B01FGPA45Y/">An Introduction to Systems Thinking</a></em> by Barry Richmond</p> </div> <h2 id="tooling">Tooling</h2> <p>For an idea that&rsquo;s quite intuitive, the tools of systems modeling are a real obstacle to wider adoption. Perhaps as a downstream consequence of many early, popular systems modeling tools being quite expensive, the tooling ecosystem for systems modeling has remained fragmented for some time. There also appears to be a mix of complex requirements, patent consolidation, and perceived small market size that&rsquo;s discouraged modern solutions from consolidating the tooling market.</p> <p>Earlier, I mentioned that systems modeling is extremely quick, but many folks find it a slow, laborious process. Part of that is an issue of practice, but I suspect that the quality of modeling tooling is at least as big a part of the challenge. In the <a href="https://lethain.com/dx-llm-model/">LLMs impact on developer experience model</a>, I walk through the steps of building the model in an increasingly messy spreadsheet. This was slow, challenging, and extremely brittle. Even after finishing the model, I couldn&rsquo;t extend it effectively to test new ideas, and I inadvertently introduced a number of bugs into the implementation.</p> <p>Going in the opposite direction, I explored using a handful of tools, such as <a href="https://sagemodeler.concord.org/">Sagemodeler</a> or <a href="https://insightmaker.com/">InsightMaker</a>, which seemed like potentially simpler toolchains than the one I typically rely on. There are many of these introductory toolchains for systems modeling, but I generally find that they&rsquo;re either constrained in their capabilities, have a fairly high learning curve, or make it difficult to share your model with others.</p> <p>In the end, I wound up back at the toolchain that I use, which happens to be one that I wrote some years ago, <a href="https://github.com/lethain/systems">lethain/systems</a>. This is far from a perfect toolchain, but I think it&rsquo;s a relatively effective mechanism for demonstrating systems modeling for a few reasons:</p> <ol> <li>quick to create models and iterate on those models</li> <li>easy to share those models with others for inspection and their own exploration</li> <li>relatively low surface area for bugs in your models</li> <li>free, open-source, self-hosted toolchain that integrates well with the Jupyter ecosystem for diagramming, modeling and so on</li> </ol> <p>You should absolutely pick <em>any</em> tool that feels right to you, and practice with it until you feel confident quickly modeling scenarios. Afterwards, I wouldn&rsquo;t recommend spending too much time thinking about tools at all: the most important thing is to build models and learn from them quickly, and almost any tool will be sufficient to that goal with some deliberate practice.</p>
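<p>For a hedged sense of the workflow with <a href="https://github.com/lethain/systems">lethain/systems</a>: per the project&rsquo;s README at the time of writing (treat the exact commands as an assumption, and check the repository for current usage), you install the package with <code>pip install systems</code>, write your model as plain text, and run it for a fixed number of rounds with something like <code>cat model.txt | systems-run -r 10</code>, which prints each stock&rsquo;s value per round. A minimal, hypothetical model file looks like:</p> <pre tabindex="0"><code># minimal hypothetical model: an infinite pool of visitors
# sign up at a fixed rate, then half activate each round
[Visitors] &gt; Signups @ 10
Signups &gt; ActiveUsers @ Leak(0.5)
</code></pre>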
<h2 id="how-to-model">How to model</h2> <p>Learning to model systems takes some practice, so we&rsquo;ll approach the details of learning to model from two directions. First, by documenting a general structure for approaching modeling, and second by providing breadcrumbs to the models developed in this book for deeper exploration of particular modeling ideas.</p> <p>The structure to systems modeling that I find effective is:</p> <ol> <li> <p><strong>Sketch</strong> the stocks and flows on paper or in a diagramming application (e.g. <a href="https://excalidraw.com/">Excalidraw</a>, Figma, Whimsical, etc). Use whatever you&rsquo;re comfortable with.</p> </li> <li> <p><strong>Reason</strong> about how you would expect a potential change to shift the flows through the diagram. Which flows do you expect to go up, and which down, and how would that movement help you evaluate whether your strategy is working?</p> </li> <li> <p><strong>Model</strong> the stocks and flows in your spreadsheet tool of choice. Start by modeling the flows from left to right (e.g. the happy path flows). Once you have that fully working, then start modeling the right to left flows (e.g. the exception path flows).</p> <p>See the <a href="https://lethain.com/dx-llm-model/">Modeling impact of LLMs on Developer Experience</a> model for a deep dive into the particulars of creating a model.</p> </li> <li> <p><strong>Exercise</strong> the model by experimenting with a number of different starting values and determining which rates really impact the model&rsquo;s values. This is essentially performing <a href="https://www.investopedia.com/terms/s/sensitivityanalysis.asp">sensitivity analysis</a>.</p> </li> <li> <p><strong>Document</strong> the work done in the above sections into a standalone writeup. You can then link to that writeup from strategies that benefit from a given model&rsquo;s insights. You might link to it from any <a href="https://lethain.com/components-of-eng-strategy/">section of your strategy</a>, depending on what topic the particular model explores. I recommend decoupling models from specific strategies, as <em>generally</em> the details of any given model are a distraction from understanding a strategy, and it&rsquo;s best to avoid that distraction unless a reader is surprised by the conclusion, in which case the link out supports drilling into the details.</p> </li> </ol> <p>As always, this is the sequence of steps that I&rsquo;d encourage you to follow, and the sequence that I generally follow, but you should adapt them to solve the particular problems at hand. Over time, my experience is that most of these steps&ndash;excluding documentation&ndash;turn into a single iterative process, and that I document everything after several iterations.</p>
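<p>To make the <em>Exercise</em> step less abstract, here&rsquo;s a hedged sketch of a single sensitivity probe, borrowing the driver re-engagement flow from a model later in this book; the variant rate is an arbitrary assumption, chosen only to test how sensitive the outcome is:</p> <pre tabindex="0"><code># baseline: 5% of departed drivers re-engage each round
DepartedDrivers &gt; ActiveDrivers @ Leak(0.05)

# variant: rerun the model with the rate doubled, then compare
# the ActiveDrivers stock across the two runs
# DepartedDrivers &gt; ActiveDrivers @ Leak(0.10)
</code></pre><p>If doubling a rate barely moves the stocks you care about, that flow probably isn&rsquo;t a leverage point; if it dominates the outcome, it deserves closer attention in your diagnosis.</p>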
<h2 id="breadcrumbs-for-deeper-exploration">Breadcrumbs for deeper exploration</h2> <p>Now that we&rsquo;ve covered the overarching approach to system modeling, here are the breadcrumbs to specific models that go deeper on particular elements:</p> <ul> <li><a href="https://lethain.com/driver-onboarding-model/">Modeling driver onboarding</a> explores how the driver lifecycle at Theoretical Ride Sharing might be improved with LLMs, and introduces using the <a href="https://github.com/lethain/systems">lethain/systems</a> library for modeling</li> <li><a href="https://lethain.com/dx-llm-model/">Modeling impact of LLMs on Developer Experience</a> looks at how LLMs might impact developer experience at Theoretical Ride Sharing, and demonstrates (the downsides of) modeling with a spreadsheet</li> <li><a href="https://lethain.com/engineering-cost-model/">Modeling engineering backfill strategy</a> studies the financial consequences of various policies for how we backfill departed engineers in an engineering organization, and introduces further <a href="https://github.com/lethain/systems">lethain/systems</a> features</li> </ul> <p>Beyond these models, you can find other systems models that I&rsquo;ve written on my blog&rsquo;s <a href="https://lethain.com/tags/systems-thinking/">systems-thinking category</a>, and there are numerous great examples in the materials referenced in the systems modeling primer section above.</p> <h2 id="how-to-document-a-model">How to document a model</h2> <p>Much like <a href="https://lethain.com/readable-engineering-strategy-documents/">documenting strategy is challenging</a>, communicating with models in a professional setting is challenging. The core problem is that there are many distinct groups of model readers. Some will lack familiarity with the tooling you use to develop models. Others will try to refine, or invalidate, your model by digging into the details.</p> <p>I navigate those mismatches by focusing first on the audience who is least likely to dig into the model. I still want to keep all the details handy, ideally in the rawest form possible to allow others to manipulate the model themselves, but it&rsquo;s very much my second goal when documenting a model.</p> <p>From experience, I recommend this order (it&rsquo;s also the order used in the models in this book, so you&rsquo;ll see it in practice a number of times):</p> <ul> <li>start with a learning section, with charts showing what the model has taught you</li> <li>sketch and explain the stocks and flows</li> <li>reason about what the sketch itself teaches you</li> <li>explain how you developed the model, with an emphasis on any particularly complex portions</li> <li>exercise the model by testing how changing the flows and stocks leads to different outcomes</li> </ul> <p>If you remember nothing else, your document should reflect the reality that most people don&rsquo;t care how you built the model, and just want the insights. Give them the insights early, and assume no one will trust your model nearly as much as you do.
Models are an input into a strategy, never a reliable sole backing for one.</p> <h2 id="what-systems-modeling-isnt">What systems modeling isn&rsquo;t</h2> <p>Although I find systems modeling a uniquely powerful way to accelerate learning, I&rsquo;ve also encountered many practitioners who believe that their models <em>are</em> reality rather than <em>reflecting</em> reality. Over time, I&rsquo;ve developed a short list of cautions to help would-be modelers avoid overcommitting to their model&rsquo;s insights:</p> <ol> <li><strong>When your model and reality conflict, reality is always right.</strong> At Stripe, we developed <a href="https://lethain.com/modeling-reliability/">a model to guide our reliability strategy</a>. The model was intuitively quite good, but its real-world results were mixed. Attachment to our early model distracted us (too much time on collecting and classifying data) and we were slow to engage with the most important problems (maximizing impact of scarce mitigation bandwidth, and growing mitigation bandwidth). We&rsquo;d have been more impactful if we had engaged directly with what reality was teaching us rather than looking for reasons to disregard reality&rsquo;s lessons.</li> <li><strong>Models are immutable, but reality isn&rsquo;t.</strong> I once joined an organization investing tremendous energy into hiring but nonetheless struggling to hire. Their intuitive model pushed them to spend years investing in top of funnel optimization, and later steered them to improving the closing process. What they weren&rsquo;t able to detect was that <a href="https://lethain.com/getting-to-yes/">misalignment in interviewer expectations</a> was the largest hurdle in hiring.</li> <li><strong>Every model omits information; some omit critical information.</strong> The service migration at Uber is a great example: modeling clarified that we <em>had</em> to adopt a more aggressive approach to our service migration in order to succeed. Subsequently, we did succeed at the migration, but the model didn&rsquo;t study the consequences of completing the migration, which included a very challenging development environment. The model captured everything my team cared about, as the team responsible for running the migration, but did nothing to evaluate whether the migration was a good idea overall.</li> </ol> <p>In each of those situations, two things are true: the model was extremely valuable, and the model subtly led us astray. We would have been led astray even without a model, so the key thing to remember isn&rsquo;t that models are inherently misleading; instead, the risk is being overly confident about your model. Models are a powerful tool to use in tandem with judgment, not a replacement for it.</p> <h2 id="summary">Summary</h2> <p>Systems modeling isn&rsquo;t perfect. If you&rsquo;ve already determined your strategy and want to refine the details, then strategy testing is probably a better choice. If you&rsquo;re trying to understand the dynamics of an evolving ecosystem, then Wardley mapping is a more appropriate tool.</p> <p>However, if you have the general shape, but lack conviction on how the pieces fit together, systems modeling is a remarkable tool. After this chapter, you know how to select appropriate tooling, and how to use that tooling to model your problem at hand.
Next, we&rsquo;ll work through systems modeling <a href="https://lethain.com/tags/systems-thinking/">a handful of detailed problems</a> to provide concrete examples of applying this technique.</p> <h1><a href="https://lethain.com/engineering-cost-model/">Eng org seniority-mix model.</a></h1> <p>One of the trademarks of private equity ownership is the expectation that either the company maintains its current margin and grows revenue at 25-30%, or it instead grows slower and increases its free cash flow year over year. In many organizations, engineering costs have a major impact on their free cash flow. There are many costs to reduce, cloud hosting and such, but inevitably part of the discussion is addressing engineering headcount costs directly.</p> <p>One of the largest contributors to engineering headcount costs is your organization&rsquo;s seniority mix: more senior engineers are paid quite a bit more than earlier career engineers. This model looks at how various policies impact an organization&rsquo;s seniority mix.</p> <p>In this chapter, we&rsquo;ll work to:</p> <ol> <li>Summarize this model&rsquo;s learnings about policy impact on seniority mix</li> <li>Sketch the model&rsquo;s stocks and flows</li> <li>Use <a href="https://github.com/lethain/systems">lethain/systems</a> to iteratively build and exercise the full model</li> </ol> <p>Time to start modeling.</p> <hr> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in</em> <em><a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> <h2 id="learnings">Learnings</h2> <p>An organization without a &ldquo;backfill at N-1&rdquo; hiring policy, e.g. an organization that hires a SWE2 to replace a departed SWE2, will have an increasingly top-heavy organization over time.</p> <p><img src="https://lethain.com/static/blog/strategy/eng-costs-model-2.png" alt="Systems model for engineering promotions and backfill policy."></p> <p>However, even introducing the &ldquo;backfill at N-1&rdquo; hiring policy is insufficient, as our representation in senior levels will become far too high, even if we stop hiring externally into our senior-most levels.</p> <p><img src="https://lethain.com/static/blog/strategy/eng-costs-model-4.png" alt="Systems model for engineering promotions and backfill policy."></p> <p>To fully accomplish our goal of a healthy seniority mix, we must stop hiring at the senior-most level, implement a &ldquo;backfill at N-1&rdquo; policy, and cap the maximum number of individuals at the senior-most level.</p> <p><img src="https://lethain.com/static/blog/strategy/eng-costs-model-5.png" alt="Systems model for engineering promotions and backfill policy."></p> <p>Any collection of lower-powered policies simply will not impact the model&rsquo;s outcome.</p> <h2 id="sketch">Sketch</h2> <p>We&rsquo;ll start by sketching this system in <a href="https://excalidraw.com/">Excalidraw</a>.
It&rsquo;s always fine to use whatever tool you prefer, but in general the lack of complexity in simple sketching tools focuses you on iterating on the stocks and flows&ndash;without getting distracted by tuning settings&ndash;much like a designer starting with messy wireframes rather than pixel-perfect designs.</p> <p>We&rsquo;ll start with sketching the junior-most level: SWE1.</p> <p><img src="https://lethain.com/static/blog/strategy/eng-costs-sketch-1.png" alt="Sketch of systems diagram showing eng promotions and backfill model."></p> <p>We hire external candidates to become SWE1s. Some get promoted to SWE2, some depart, and we then backfill those departures with new SWE1s.</p> <p><img src="https://lethain.com/static/blog/strategy/eng-costs-sketch-2.png" alt="Sketch of systems diagram showing eng promotions and backfill model."></p> <p>As we start sketching the full stocks and flows for SWE2, we also introduce the idea of backfilling at the prior level. As we replicate this pattern for two more career levels&ndash;SWE3 and SWE4&ndash;we get the complete model.</p> <p><img src="https://lethain.com/static/blog/strategy/eng-costs-sketch-4.png" alt="Sketch of systems diagram showing eng promotions and backfill model."></p> <p>The final level, SWE4, is simplified relative to the prior levels, as it&rsquo;s no longer possible to get promoted to a further level. We could go further than this, but the model would simply get increasingly burdensome to work with, so let&rsquo;s stop with four levels.</p> <h2 id="reason">Reason</h2> <p>Reviewing the sketched system, a few interesting conclusions come out:</p> <ol> <li>If promotion rates at any level exceed the rate of hiring at that level plus the rate of N-1 backfill at that level, then the proportion of engineers at that level will grow over time</li> <li>If you are not hiring much, then this problem simplifies to promotion rate versus departure rate. A company that does little hiring and has high retention cannot afford to promote frequently. Promotion into senior roles will become financially constrained, even if the policy is explained by some other mechanism</li> <li>Many companies use the &ldquo;<a href="https://lethain.com/career-levels-and-more/">career level</a>&rdquo; policy as the mechanism to identify a level where promotions <em>generally</em> stop happening. The rationale is often not explicitly described, but we can conclude it&rsquo;s likely a financial constraint that typically incentivizes this policy</li> </ol> <p>With those starter insights, we can get into modeling the details.</p> <h2 id="model--exercise">Model &amp; Exercise</h2> <p>We&rsquo;re going to build this model using <a href="https://github.com/lethain/systems">lethain/systems</a>. The first version will be relatively simple, albeit with a number of stocks given the size of the model, and then we&rsquo;ll layer on a number of additional features as we iteratively test out a number of different scenarios.</p> <p>I&rsquo;ve chosen to combine the Model and Exercise steps to showcase how each version of the model can inspire new learnings that prompt new questions, which require a new model to answer.</p> <p>If you&rsquo;d rather view the full model and visualizations, each iteration is <a href="https://github.com/lethain/eng-strategy-models/blob/main/BackfillPolicy.ipynb">available on GitHub</a>.</p> <h2 id="backfill-at-level">Backfill-at-level</h2> <p>The first policy we&rsquo;re going to explore is backfilling a departure at the same level.
For example, if a SWE2 departs, then you backfill them with another SWE2. This intuitively makes sense: you needed a SWE2 to perform the work before, so why would you hire someone less senior?</p> <p>There are two new <code>systems</code> concepts introduced in this model:</p> <ol> <li>For easier iteration, we&rsquo;re going to use the systems modeling concept of an &ldquo;information link&rdquo;, which is basically using a stock as a variable to define a flow. Specifically, we&rsquo;ll create a stock named <code>HiringRate</code> with a size of two. Then we&rsquo;ll use that stock&rsquo;s size to define the hiring flows at each career level. In programming terms, you can think of this as defining a reusable variable, but you can use any stock&rsquo;s size to define flows.</li> <li>There are effectively an infinite number of potential candidates for your company, so we&rsquo;re going to use an infinite stock, represented by initializing a new stock surrounded by <code>[</code> and <code>]</code>&ndash;specifically, in this case, <code>[Candidates]</code>. If we wanted a fixed-size stock with 100 people in it, we could have initialized it as <code>Candidates(100)</code>. Depending on what you&rsquo;re modeling, both options are useful.</li> </ol> <p>With those in mind, our initial model is defined as:</p> <pre tabindex="0"><code>HiringRate(2)

[Candidates] &gt; SWE1(10) @ HiringRate
SWE1 &gt; DepartedSWE1 @ Leak(0.1)
DepartedSWE1 &gt; SWE1 @ Leak(0.5)

Candidates &gt; SWE2(10) @ HiringRate
SWE1 &gt; SWE2 @ Leak(0.1)
SWE2 &gt; DepartedSWE2 @ Leak(0.1)
DepartedSWE2 &gt; SWE2 @ Leak(0.5)

Candidates &gt; SWE3(10) @ HiringRate
SWE2 &gt; SWE3 @ Leak(0.1)
SWE3 &gt; DepartedSWE3 @ Leak(0.1)
DepartedSWE3 &gt; SWE3 @ Leak(0.5)

Candidates &gt; SWE4(0) @ HiringRate
SWE3 &gt; SWE4 @ Leak(0.1)
SWE4 &gt; DepartedSWE4 @ Leak(0.1)
DepartedSWE4 &gt; SWE4 @ Leak(0.5)
</code></pre><p>To confirm that we&rsquo;ve done something reasonable, we can render this model using Graphviz.</p> <p><img src="https://lethain.com/static/blog/strategy/eng-costs-model-1.png" alt="Systems model for engineering promotions and backfill policy."></p> <p>That looks like the same model we sketched before, without the downlevel backfill flows that we haven&rsquo;t yet added to the model, so we&rsquo;re in a good spot.</p> <p>With that confirmed, let&rsquo;s inspect the four distinct flows happening for the SWE2 stock. In order they are:</p> <ol> <li>External candidates being hired at the SWE2 level, at the fixed <code>HiringRate</code> defined here as 2 hires per round</li> <li>SWE1s being promoted to SWE2 at a 10% rate. This is a leak because someone being promoted to SWE2 doesn&rsquo;t mean the other SWE1s disappear</li> <li>SWE2s who are leaving the company at a 10% rate</li> <li>Backfill hires of departed SWE2s, who are rehired at the same level</li> </ol> <p>Running that model, we can see how the populations of the various levels grow over time.</p> <p><img src="https://lethain.com/static/blog/strategy/eng-costs-model-2.png" alt="Systems model for engineering promotions and backfill policy."></p> <p>Alright, so we can tell that this backfill-at-level policy is pretty inefficient, because our organization just becomes more and more top-heavy with SWE4s over time. Something needs to change.</p> <h2 id="backfill-at-n-1">Backfill at N-1</h2> <p>To reduce the number of SWE4s in our company, let&rsquo;s update the model to backfill all hires at the level below the departed employee. For example, a departing SWE2 would result in hiring a SWE1.
This specifically means replacing each backfill line of this form:</p> <pre tabindex="0"><code>DepartedSWE2 &gt; SWE2 @ Leak(0.5)
</code></pre><p>with one that instead hires into the prior level:</p> <pre tabindex="0"><code>DepartedSWE2 &gt; SWE1 @ Leak(0.5)
</code></pre><p>The one exception is that SWE1s are still backfilled as SWE1s: as it&rsquo;s the junior-most level, there&rsquo;s no lower level to backfill into.</p> <p>Running this updated model, we get a better-looking organization.</p> <p><img src="https://lethain.com/static/blog/strategy/eng-costs-model-3.png" alt="Systems model for engineering promotions and backfill policy."></p> <p>We&rsquo;re still top-heavy, but we&rsquo;ve turned an exponential growth problem into a linear growth problem, so that&rsquo;s an improvement. However, this is still a very expensive engineering organization to run, and certainly <em>not</em> an organization that&rsquo;s reducing costs.</p> <h2 id="no-hiring">No hiring</h2> <p>One reason our model shows so many SWE4s is that we&rsquo;re hiring at an even rate across all levels, which isn&rsquo;t particularly realistic. Also, to the extent that we&rsquo;re aiming to reduce our engineering costs over time, it&rsquo;s unlikely that we&rsquo;re growing headcount at all.</p> <p>We can model this by setting a <code>HiringRate</code> of zero, and then setting more representative initial values for each cohort of engineers (note that I&rsquo;m only showing the changed lines; check <a href="https://github.com/lethain/eng-strategy-models/blob/main/BackfillPolicy.ipynb">on GitHub</a> for the full model):</p> <pre tabindex="0"><code>HiringRate(0)

[Candidates] &gt; SWE1(100) @ HiringRate
Candidates &gt; SWE2(100) @ HiringRate
Candidates &gt; SWE3(100) @ HiringRate
Candidates &gt; SWE4(10) @ HiringRate
</code></pre><p>Now we&rsquo;re starting out with 100 SWE1s, SWE2s, and SWE3s. We have a smaller cohort of SWE4s, with just ten initially. Running the model gives us an updated perspective.</p> <p><img src="https://lethain.com/static/blog/strategy/eng-costs-model-4.png" alt="Systems model for engineering promotions and backfill policy."></p> <p>We can see that eliminating hiring <em>improves</em> the ratio of SWE4s to the other levels, but it&rsquo;s still just too high. We&rsquo;re ending up with roughly 1.25 SWE1s for each SWE4, when the ratio should be closer to five to one.</p> <h2 id="capped-size-of-swe4s">Capped size of SWE4s</h2> <p>Finally, we&rsquo;re going to introduce a stock with a maximum size. No matter what flows <em>want</em> to accomplish, they cannot grow a stock over that maximum. In this case, we&rsquo;re defining <code>SWE4</code> as a stock with an initial size of 10, and a maximum size of 20.</p> <pre tabindex="0"><code>SWE4(10, 20)
Candidates &gt; SWE4 @ HiringRate
</code></pre><p>This could also be combined into a one-liner, although it&rsquo;s potentially easy to miss in that case:</p> <pre tabindex="0"><code>Candidates &gt; SWE4(10, 20) @ HiringRate
</code></pre><p>With that one change, we&rsquo;re getting close to an engineering organization that works how we want.</p> <p><img src="https://lethain.com/static/blog/strategy/eng-costs-model-5.png" alt="Systems model for engineering promotions and backfill policy."></p> <p>The ratio of SWE4s to the other levels is right, although we can see that the backpressure means that we have a surplus of SWE3s in this organization. You could imagine other policy work that might improve that as well: e.g. presumably more SWE3s depart than SWE2s, because the SWE3s see that their ability to be promoted is capped by the departure rate of existing SWE4s.</p>
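<p>As a hedged sketch of that follow-on idea&ndash;the rates here are illustrative assumptions, not something the model above validates&ndash;you could encode level-dependent departure rates directly and rerun the model to see whether it meaningfully shrinks the SWE3 surplus:</p> <pre tabindex="0"><code># hypothetical: SWE3s depart faster than SWE2s once promotion
# into the capped SWE4 level stalls
SWE2 &gt; DepartedSWE2 @ Leak(0.10)
SWE3 &gt; DepartedSWE3 @ Leak(0.15)
</code></pre>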
<p>However, I think we&rsquo;ve already learned quite a bit from this model, so I&rsquo;m going to end modeling here.</p> <h1><a href="https://lethain.com/driver-onboarding-model/">Modeling driver onboarding.</a></h1> <p>The <a href="https://lethain.com/llm-adoption-strategy/">How should you adopt LLMs?</a> strategy explores how Theoretical Ride Sharing might adopt LLMs. It builds on several models; the first is about <a href="https://lethain.com/dx-llm-model/">LLMs impact on Developer Experience</a>. The second model, documented here, looks at whether LLMs might help with a core product and business problem: maximizing active drivers on their ridesharing platform.</p> <p>In this chapter, we&rsquo;ll cover:</p> <ol> <li>Where the model of ridesharing drivers identifies opportunities for LLMs</li> <li>How the model was sketched and developed using the <a href="https://github.com/lethain/systems">lethain/systems</a> package on GitHub</li> <li>How to exercise this model to learn from it</li> </ol> <p>Let&rsquo;s get started.</p> <hr> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in</em> <em><a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> <h2 id="learnings">Learnings</h2> <p>An obvious assumption is that making driver onboarding faster would increase the long-term number of drivers in a market. However, this model shows that even doubling the rate at which we qualify applicant drivers as eligible has little impact on active drivers over time.</p> <p><img src="https://lethain.com/static/blog/strategy/llm-ride-results-2.png" alt="Line chart showing a faster and a slower onboarding strategy examples."></p> <p>Conversely, it&rsquo;s clear that efforts to reengage departed drivers have a significant impact on active drivers. We believe that there are potential LLM applications that could encourage departed drivers to return to active driving; for example, mapping their rationale for departing against our recent product changes and driver retention promotions could generate high quality, personalized emails.</p> <p><img src="https://lethain.com/static/blog/strategy/llm-ride-results-4.png" alt="Line chart showing a faster and a slower onboarding strategy examples."></p> <p>Finally, the model shows that increasing either the reactivation of departed drivers or that of suspended drivers is significantly less impactful than increasing both. If either rate is low, we lose an increasingly large number of drivers over time.</p> <p><img src="https://lethain.com/static/blog/strategy/llm-ride-results-5.png" alt="Line chart showing a faster and a slower onboarding strategy examples."></p> <p>The only meaningful opportunities for us to increase active drivers with LLMs are improving those two reactivation rates.</p> <h2 id="sketch">Sketch</h2> <p>The first step in modeling a system is sketching it (using <a href="https://excalidraw.com/">Excalidraw</a> here).
Here we&rsquo;re developing a model for onboarding and retaining drivers for a ridesharing application in one city.</p> <p><img src="https://lethain.com/static/blog/strategy/llm-ride-model-1.png" alt="Systems model for onboarding drivers onto a ride-sharing application."></p> <p>The stocks are:</p> <ol> <li><code>City Population</code> is the total population of a city</li> <li><code>Applied Drivers</code> are the number of people who&rsquo;ve applied to be drivers</li> <li><code>Eligible Drivers</code> are the number of applied drivers who meet eligibility criteria (e.g. provided a current drivers license, etc)</li> <li><code>Onboarded Drivers</code> are eligible drivers who have successfully gone through an onboarding program</li> <li><code>Active Drivers</code> are onboarded drivers who are actually performing trips on a weekly basis</li> <li><code>Departed Drivers</code> were active drivers, but voluntarily stopped performing trips (e.g. took a different job)</li> <li><code>Suspended Drivers</code> were active drivers, but involuntarily stopped performing trips (e.g. are no longer allowed to drive on platform)</li> </ol> <p>Looking at the left-to-right flows, there is a flow from each of those stocks to the following stock in the pipeline. These are all simple one-to-one flows, with the exception of <code>Active Drivers</code>, which leads to two distinct stocks: <code>Departed Drivers</code> and <code>Suspended Drivers</code>. These represent voluntary and involuntary departures.</p> <p>There are a handful of right-to-left, exception path flows to consider as well:</p> <ol> <li><code>Request missing information</code> represents a driver who can&rsquo;t be moved from <code>Applied Drivers</code> to <code>Eligible Drivers</code> because their provided information proved insufficient in a review process</li> <li><code>Re-engage</code> tracks <code>Departed Drivers</code> who have decided to start driving again, perhaps because of a bonus program for drivers who start driving again</li> <li><code>Remove suspension</code> refers to drivers who were involuntarily removed, but who are now allowed to return to driving</li> </ol> <p>This is a fairly basic model, but let&rsquo;s see what we can learn from it.</p> <h2 id="reason">Reason</h2> <p>Now that we&rsquo;ve sketched the system, we can start thinking about which flows are going to have the largest impact, and where an LLM might increase those flows. Some observations from reasoning about it:</p> <ol> <li>If a city&rsquo;s population is infinite, then what really matters in this model is how many new drivers we can encourage to join the system. On the other hand, if a city&rsquo;s population is finite, then onboarding new drivers will be essential in the early stages of coming online in any particular city, but long-term reengaging departed drivers is probably at least as important.</li> <li>LLM tooling could speed up validating eligible drivers. If we speed that process up enough, we could greatly reduce the rate of the <code>Request missing information</code> flow by identifying missing information in real-time rather than requiring a human to review the information later.</li> <li>We could potentially develop LLM tooling to craft personalized messaging to <code>Departed Drivers</code> that explains which of our changes since their departure might be most relevant to their reasons for stopping.
This could increase the rate of the <code>Re-engage</code> flow</li> <li>While we likely wouldn&rsquo;t want an LLM approving the removal of suspensions, we could have it look at requests to be revalidated, and identify promising requests, focusing human attention on those with the highest potential for approval.</li> <li>We could build LLM-powered tooling that helps a city resident decide whether they should apply to become a driver by answering questions they might have.</li> </ol> <p>As we exercise the model later, we know that our assumptions about whether this city has already exhausted potential drivers will quickly steer us towards a specific subset of these potential options. If all potential drivers are already tapped, only work to reactivate prior drivers will matter. If there are more potential drivers, then activating them will likely be a better focus.</p> <h2 id="model">Model</h2> <p>We&rsquo;ll build this model using the <a href="https://github.com/lethain/systems">lethain/systems</a> library that I wrote. For a more detailed introduction, I recommend working through <a href="https://github.com/lethain/systems/blob/master/README.md">the tutorial in the repository</a>, but I&rsquo;ll introduce the basics here as well. While <code>systems</code> is far from a perfect tool, as you experiment with different modeling techniques like <a href="https://lethain.com/dx-llm-model/">spreadsheet-based modeling</a> and <a href="https://sagemodeler.concord.org/">SageModeler</a>, I think this approach&rsquo;s emphasis on rapid development and reproducible, sharable models is somewhat unique.</p> <p>If you want to see the finished model, you can find the model and visualizations in <a href="https://github.com/lethain/eng-strategy-models/blob/main/DriverOnboarding.ipynb">the Jupyterhub notebook in lethain:eng-strategy-models</a>. Here we&rsquo;ll work through the steps behind implementing that model.</p> <p>We&rsquo;ll start by creating a stock for the city&rsquo;s population, with an initial size of 10,000.</p> <pre tabindex="0"><code># City population is 10,000
CityPop(10000)
</code></pre><p>Next, we want to initialize the applied drivers stock, and specify a constant rate of 100 people in the city applying to become drivers each round. This will only happen until the 10,000 potential drivers in the city are exhausted, at which point there will be no one left to apply.</p> <pre tabindex="0"><code># 100 folks apply to become drivers per round
# the @ 100 format is called a &#34;rate&#34; flow
CityPop &gt; AppliedDrivers @ 100
</code></pre><p>Now we want to initialize the eligible drivers stock, and specify that 25% of the folks in applied drivers will advance to become eligible each round.</p> <p>Before, we used <code>@ 100</code> to specify a fixed rate. Here we&rsquo;re using <code>@ Leak(0.25)</code> to specify that 25% of the folks in applied drivers advance into eligible drivers.</p> <pre tabindex="0"><code># 25% of applied drivers become eligible each round
AppliedDrivers &gt; EligibleDrivers @ Leak(0.25)
</code></pre><p>You could write this as <code>@ 0.25</code>, but you&rsquo;d actually get different behavior. That&rsquo;s because <code>@ 0.25</code> is actually short-hand for <code>@ Conversion(0.25)</code>, which is similar to a leak but destroys the unconverted portion.</p> <p>Using an example to show the difference, let&rsquo;s imagine that we have 100 applied drivers and 100 eligible drivers, and then see the consequences of applying a leak versus a conversion:</p> <ul> <li><code>Leak(0.25)</code> would end with 75 applied drivers and 125 eligible drivers</li> <li><code>Conversion(0.25)</code> would end with 0 applied drivers and 125 eligible drivers</li> </ul> <p>Depending on what you are modeling, you might need leaks, conversions, or both.</p>
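<p>To make that difference concrete, here&rsquo;s a minimal sketch you could paste into the same tooling, using the starting values from the example above. Toggle between the two flows (the second is commented out) to see each behavior; the expected results after one round are noted in the comments.</p> <pre tabindex="0"><code># starting stocks from the example above
AppliedDrivers(100)
EligibleDrivers(100)
# Leak(0.25): one round later, 75 applied and 125 eligible
AppliedDrivers &gt; EligibleDrivers @ Leak(0.25)
# Conversion(0.25): one round later, 0 applied and 125 eligible,
# because the unconverted 75 drivers are destroyed
# AppliedDrivers &gt; EligibleDrivers @ Conversion(0.25)
</code></pre>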
<p>Moving on, next we model our first right-to-left flow. Specifically, the request missing information flow, where some eligible drivers end up not being eligible because they need to provide more information.</p> <pre tabindex="0"><code># This is &#34;Request missing information&#34;, with 10%
# of folks moving backwards each round
EligibleDrivers &gt; AppliedDrivers @ Leak(0.1)
</code></pre><p>Note that the syntax for left-to-right and right-to-left flows is identical; the notation makes no distinction between them.</p> <p>Now, 25% of eligible drivers become onboarded drivers each round.</p> <pre tabindex="0"><code># 25% of eligible drivers onboard each round
EligibleDrivers &gt; OnboardedDrivers @ Leak(0.25)
</code></pre><p>Then 50% of onboarded drivers become active drivers, actually providing rides.</p> <pre tabindex="0"><code># 50% of onboarded drivers become active
OnboardedDrivers &gt; ActiveDrivers @ Leak(0.50)
</code></pre><p>The active drivers stock is drained by two flows: drivers who voluntarily depart become departed drivers, and drivers who are suspended become suspended drivers. Both flows take 10% of active drivers each round.</p> <pre tabindex="0"><code># 10% of active drivers depart voluntarily and involuntarily
ActiveDrivers &gt; DepartedDrivers @ Leak(0.10)
ActiveDrivers &gt; SuspendedDrivers @ Leak(0.10)
</code></pre><p>Finally, we also see 5% of departed drivers returning to driving each round. Similarly, we unsuspend 1% of suspended drivers.</p> <pre tabindex="0"><code># 5% of DepartedDrivers become active
DepartedDrivers &gt; ActiveDrivers @ Leak(0.05)
# 1% of SuspendedDrivers are reactivated
SuspendedDrivers &gt; ActiveDrivers @ Leak(0.01)
</code></pre>
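<p>Putting those pieces together, the complete model is short enough to read in one place. This listing is simply the snippets above assembled in order; per the repository&rsquo;s tutorial, you can run a specification like this from the notebook linked above.</p> <pre tabindex="0"><code># complete driver onboarding and retention model
CityPop(10000)
CityPop &gt; AppliedDrivers @ 100
AppliedDrivers &gt; EligibleDrivers @ Leak(0.25)
EligibleDrivers &gt; AppliedDrivers @ Leak(0.1)
EligibleDrivers &gt; OnboardedDrivers @ Leak(0.25)
OnboardedDrivers &gt; ActiveDrivers @ Leak(0.50)
ActiveDrivers &gt; DepartedDrivers @ Leak(0.10)
ActiveDrivers &gt; SuspendedDrivers @ Leak(0.10)
DepartedDrivers &gt; ActiveDrivers @ Leak(0.05)
SuspendedDrivers &gt; ActiveDrivers @ Leak(0.01)
</code></pre>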
<p>We already sketched this model out earlier, but it&rsquo;s worth noting that <code>systems</code> will allow you to export models via <a href="https://graphviz.org/">Graphviz</a>. These diagrams are generally harder to read than a custom drawn one, but it&rsquo;s certainly possible to use this toolchain to combine sketching and modeling into a single step.</p> <p><img src="https://lethain.com/static/blog/strategy/llm-ride-model-2.png" alt="Systems model for onboarding drivers onto a ride-sharing application. Modeling via graphviz."></p> <p>Now that we have the model, we can exercise it to learn its secrets.</p> <h2 id="exercise">Exercise</h2> <p>Base model:</p> <p><img src="https://lethain.com/static/blog/strategy/llm-ride-results-1.png" alt="Line chart showing a faster and a slower onboarding strategy examples."></p> <p>Now let&rsquo;s imagine that our LLM-powered tool can speed up eligibility review, doubling the rate at which we move applied drivers to eligible drivers. Instead of 25% of applied drivers becoming eligible each round, we&rsquo;ll instead see 50%.</p> <pre tabindex="0"><code># old
AppliedDrivers &gt; EligibleDrivers @ Leak(0.25)
# new
AppliedDrivers &gt; EligibleDrivers @ Leak(0.50)
</code></pre><p>Unfortunately, we can see that even doubling the rate at which applied drivers become eligible has a minimal impact.</p> <p><img src="https://lethain.com/static/blog/strategy/llm-ride-results-2.png" alt="Line chart showing a faster and a slower onboarding strategy examples."></p> <p>To finish testing this hypothesis, we can eliminate the <code>Request missing information</code> flow entirely and see if this changes things meaningfully, commenting out that line.</p> <p><img src="https://lethain.com/static/blog/strategy/llm-ride-results-3.png" alt="Line chart showing a faster and a slower onboarding strategy examples."></p> <p>Unfortunately, even eliminating the missing information rate has little impact on the number of active drivers. So it seems like our LLM solutions are going to need to focus on reactivating existing drivers if they&rsquo;re to increase active drivers.</p> <p>Specifically, let&rsquo;s go from 5% of departed drivers reactivating to 20%.</p> <pre tabindex="0"><code># 20% of DepartedDrivers become active
# DepartedDrivers &gt; ActiveDrivers @ Leak(0.05)
DepartedDrivers &gt; ActiveDrivers @ Leak(0.2)
</code></pre><p>For the first time, we&rsquo;re seeing a significant shift in impact. We reach a much higher percentage of drivers at peak, and even after we exhaust all drivers in a city, the total number of active drivers reaches a higher equilibrium.</p> <p><img src="https://lethain.com/static/blog/strategy/llm-ride-results-4.png" alt="Line chart showing a faster and a slower onboarding strategy examples."></p> <p>Presumably, increasing the rate that we reactivate suspended drivers from 1% to 2.5% would have a similar, meaningful but smaller impact on active drivers over time. So let&rsquo;s model that change.</p> <pre tabindex="0"><code># 2.5% of SuspendedDrivers are reactivated
# SuspendedDrivers &gt; ActiveDrivers @ Leak(0.01)
SuspendedDrivers &gt; ActiveDrivers @ Leak(0.025)
</code></pre><p>However, surprisingly, the impact of increasing the reactivation of suspended drivers is actually much higher than that of reengaging departed drivers.</p> <p><img src="https://lethain.com/static/blog/strategy/llm-ride-results-5.png" alt="Line chart showing a faster and a slower onboarding strategy examples."></p> <p>This is an interesting, and somewhat counter-intuitive, result. Increasing both the suspended and departed reactivation rates is more impactful than increasing either alone, because otherwise a growing population of drivers accumulates in whichever stock deflates more slowly. This means, surprisingly, that a tool that helps us quickly determine which drivers could be unsuspended might matter more than the small size of the flow indicates.</p>
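<p>For completeness, here&rsquo;s the combined change, raising both reactivation flows at once using the same values we just tested individually:</p> <pre tabindex="0"><code># reactivate 20% of departed and 2.5% of suspended drivers each round
DepartedDrivers &gt; ActiveDrivers @ Leak(0.2)
SuspendedDrivers &gt; ActiveDrivers @ Leak(0.025)
</code></pre>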
<p>At this point, we&rsquo;ve probably found the primary story that this model wants to tell us: we should focus efforts on reactivating departed and suspended drivers. Changes elsewhere might reduce the operational costs of our business, but they won&rsquo;t solve the problem of increasing active drivers.</p>Modeling impact of LLMs on Developer Experience.https://lethain.com/dx-llm-model/Sun, 06 Oct 2024 04:00:00 -0700https://lethain.com/dx-llm-model/<p>In <a href="https://lethain.com/llm-adoption-strategy/">How should you adopt Large Language Models?</a> (LLMs), we considered how LLMs might impact a company&rsquo;s developer experience. To support that exploration, I&rsquo;ve developed a <a href="https://lethain.com/strategy-systems-modeling/">system model</a> of the software development process at the company.</p> <p>In this chapter, we&rsquo;ll work through:</p> <ol> <li>Summary results from this model</li> <li>How the model was developed, both sketching and building the model in a spreadsheet. (As discussed in <a href="https://lethain.com/strategy-systems-modeling/">the overview of systems modeling</a>, I generally would recommend against using spreadsheets to develop most models, but it&rsquo;s educational to attempt doing so once or twice.)</li> <li>Exercising the model to see what it has to teach us</li> </ol> <p>Let&rsquo;s get into it.</p> <hr> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in</em> <em><a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> <h2 id="learnings">Learnings</h2> <p>This model&rsquo;s insights can be summarized in three charts. First, the baseline chart, which shows an eventual equilibrium between errors discovered in production and tickets that we&rsquo;ve closed by shipping to production. This equilibrium is visible because tickets continue to get opened, but the total number of closed tickets stops increasing.</p> <p><img src="https://lethain.com/static/blog/strategy/dx-chart-1.png" alt="Chart showing systems modeling"></p> <p>Second, we show that we can shift that equilibrium by reducing the error rate in production. Specifically, the first chart models 25% of closed tickets in production experiencing an error, whereas the second chart models only a 10% error rate. The equilibrium returns, but at a higher value of shipped tickets.</p> <p><img src="https://lethain.com/static/blog/strategy/dx-chart-2.png" alt="Chart showing systems modeling"></p> <p>Finally, we can see that even tripling the rate that we start and test tickets doesn&rsquo;t meaningfully change the total number of completed tickets, as modeled in this third chart.</p> <p><img src="https://lethain.com/static/blog/strategy/dx-chart-4.png" alt="Chart showing systems modeling"></p> <p>The constraint on this system is errors discovered in production, and any technique that changes something else doesn&rsquo;t make much of an impact. Of course, this is just <em>a model</em>, not reality.
There are many nuances that models miss, but this helps us focus on what probably matters the most, and in particular highlights that any approach that increases development velocity while also increasing the production error rate is likely net-negative.</p> <h2 id="sketch">Sketch</h2> <p>Modeling in a spreadsheet is labor-intensive, so we want to iterate as much as possible in the sketching phase, before we move to the spreadsheet. In this case, we&rsquo;re working with <a href="https://excalidraw.com/">Excalidraw</a>.</p> <p><img src="https://lethain.com/static/blog/strategy/llm-dx-model-1.png" alt="Systems model with five stages of development, with numerous lines where discovered errors require moving backwards in flow."></p> <p>I sketched five stocks to represent a developer&rsquo;s workflow:</p> <ol> <li><code>Open Tickets</code> is tickets opened for an engineer to work on</li> <li><code>Start Coding</code> is tickets that an engineer is working on</li> <li><code>Tested Code</code> is tickets that have been tested</li> <li><code>Deployed Code</code> is tickets that have been deployed</li> <li><code>Closed Ticket</code> is tickets that are closed after reaching production</li> </ol> <p>There are four flows representing tickets progressing through this development process from left to right. Additionally, there are three exception flows that move from right to left:</p> <ol> <li><code>Testing found error</code> represents a ticket where testing finds an error, moving the ticket backwards to <code>Start Coding</code></li> <li><code>Deployment exposed error</code> represents a ticket encountering an error during deployment, where it&rsquo;s moved backwards to <code>Start Coding</code></li> <li><code>Error found in production</code> represents a ticket encountering a production error, which causes it to move all the way back to the beginning as a new ticket</li> </ol> <p>One of your first concerns seeing this model might be that it&rsquo;s embarrassingly simple. To be honest, that was my reaction when I first looked at it, too. However, it&rsquo;s important to recognize that feeling and then dig into whether it matters.</p> <p>This model is quite simple, but in the next section we&rsquo;ll find that it reveals several counter-intuitive insights into the problem that will help us avoid erroneously viewing the tooling as a failure if time spent testing increases. The value of a model is in refining our thinking, and simple models are usually more effective at refining thinking across a group than complex models, simply because complex models are fairly difficult to align a group around.</p> <h2 id="reason">Reason</h2> <p>As we start to look at this sketch, the first question to ask is: how might LLM-based tooling show an improvement? The most obvious options are:</p> <ol> <li> <p>Increasing the rate that tasks flow from <code>Starting coding</code> to <code>Tested code</code>. Presumably these tools might reduce the amount of time spent on implementation.</p> </li> <li> <p>Increasing the rate that <code>Tested code</code> follows <code>Testing found errors</code> to return to <code>Starting code</code>, because more comprehensive tests are more likely to detect errors.
This is probably the first interesting learning from this model: if the adopted tool works well, it&rsquo;s likely that we&rsquo;ll spend <em>more</em> time in the testing loop, with a long-term payoff of spending less time solving problems in production, where it&rsquo;s more expensive. This means that slower testing might be a successful outcome rather than the failure it might first appear to be.</p> <p>A skeptic of these tools might argue the opposite: that LLM-based tooling will cause more issues to be identified &ldquo;late,&rdquo; after deployment, rather than early in the testing phase. In either case, we now have a clear goal to measure to evaluate the effectiveness of the tool: reducing the <code>Error found in production</code> flow. We also know <em>not</em> to focus on the <code>Testing found error</code> flow, which should probably increase.</p> </li> <li> <p>Finally, we can also zoom out and measure the overall time from <code>Start Coding</code> to <code>Closed Ticket</code> for tasks that don&rsquo;t experience the <code>Error found in production</code> flow for at least the first 90 days after being completed.</p> </li> </ol> <p>These observations capture what I find remarkable about systems modeling: even a very simple model can expose counter-intuitive insights. In particular, the sort of insights that build conviction to push back on places where intuition might lead you astray.</p> <h2 id="model">Model</h2> <p>We&rsquo;ll build this model directly in a spreadsheet, specifically Google Sheets. The completed spreadsheet model <a href="https://docs.google.com/spreadsheets/d/1YAego3JiNCUE15GeL_3GQfYmrE1jG9dVF6yzu-mAxLw/edit?gid=1325089804#gid=1325089804">is available here</a>. As discussed in <a href="https://lethain.com/strategy-systems-modeling/">Systems modeling to refine strategy</a>, spreadsheet modeling is brittle, slow, and hard to iterate on. I generally recommend that folks attempt to model something in a spreadsheet to get an intuitive sense of the math happening in their models, but I would almost always choose any tool other than a spreadsheet for a complex model.</p> <p>This example is fairly tedious to follow, and you&rsquo;re entirely excused if you decide to pull open the sheet itself, look around a bit, and then skip the remainder of this section. If you are hanging around, it&rsquo;s time to get started.</p> <p>The spreadsheet we&rsquo;re creating has three important worksheets:</p> <ul> <li><em>Model</em> represents the model itself</li> <li><em>Charts</em> holds charts of the model</li> <li><em>Config</em> holds configuration values separately from the model to ease exercising the model after we&rsquo;ve built it</li> </ul> <p>Going to the model worksheet, we want to start out by initializing each of the columns to its starting value.</p> <p><img src="https://lethain.com/static/blog/strategy/dx-model-screeshot-0.png" alt="Screenshot of spreadsheet showing initial values of a systems model"></p> <p>While we&rsquo;ll use formulae for subsequent rows, the first row should contain literal values. I often start with a positive value in the first column and zeros in the other columns, but that isn&rsquo;t required. You can start with whatever starting values are most useful for studying the model that you&rsquo;re building.</p> <p>With the initial values set, we&rsquo;re now going to implement the model in two passes. First, we&rsquo;ll model the left-to-right flows, which represent the standard development process.
Second, we&rsquo;ll model the right-to-left flows, which represent exceptions in the process.</p> <h3 id="modeling-left-to-right">Modeling left-to-right</h3> <p>We&rsquo;ll start by modeling the interaction between the first two nodes: <code>Open Tickets</code> and <code>Started Coding</code>. We want open tickets to increase over time at a fixed rate, so let&rsquo;s add a value in the config worksheet for <code>TicketOpenRate</code>, starting with <code>1</code>.</p> <p>Moving to the second stock, we want to start work on open tickets as long as we have fewer than <code>MaxConcurrentCodingNum</code> tickets in progress. If we already have <code>MaxConcurrentCodingNum</code> tickets that we&rsquo;re working on, then we don&rsquo;t start working on any new tickets. To do this, we actually need to create an intermediate value (represented using an italicized column name) that determines how many tickets to start, by checking whether the current number of started tickets is at the maximum (another value in the config sheet) or whether we should increment it by the start rate.</p> <p>That looks like:</p> <pre><code>// Config!$B$3 is max started tickets
// Config!$B$2 is rate to increment started tickets
// $ before a row or column, e.g. $B$3 means that the row or column
// always stays the same -- not incrementing -- even when filled
// to other cells
= IF(C2 &gt;= Config!$B$3, 0, Config!$B$2)
</code></pre> <p>This also means that our first column, for <code>Open Tickets</code>, is decremented by the number of tickets where we&rsquo;ve started coding:</p> <pre><code>// This is the definition of `Open Tickets`
=A2 + Config!$B$1 - B2
</code></pre> <p>Leaving us with these values.</p> <p><img src="https://lethain.com/static/blog/strategy/dx-model-screeshot-1.png" alt="Screenshot of spreadsheet showing three columns of systems modeling"></p> <p>Now we want to determine the number of tickets being tested at each step in the model. To do this, we create a calculation column, <code>NumToTest?</code>, which is defined as:</p> <pre><code>// Config$B$4 is the rate we can start testing tickets
// Note that we can only start testing tickets if there are tickets
// in `Started Coding` that we're able to start testing
=MIN(Config!$B$4, C3)
</code></pre> <p>We then add that value to the previous number of tickets being tested.</p> <pre><code>// E2 is prior size of the Tested Code stock
// D3 is the value of `NumToTest?`
// F2 is the number of tested tickets to deploy
=E2 + D3 - F2
</code></pre> <p><img src="https://lethain.com/static/blog/strategy/dx-model-screeshot-2.png" alt="Screenshot of spreadsheet showing three columns of systems modeling"></p> <p>Moving on to deploying code, let&rsquo;s keep things simple and start out by assuming that every tested change is going to get deployed. That means the calculation for <code>NumToDeploy?</code> is quite simple:</p> <pre><code>// E3 is the number of tested changes
=E3
</code></pre> <p>Then the value for the <code>Deployed Code</code> stock is simple as well:</p> <pre><code>// G2 is the prior size of Deployed Code
// F3 is NumToDeploy?
// H2 is the number of deployed changes in prior round
=G2+F3-H2
</code></pre> <p><img src="https://lethain.com/static/blog/strategy/dx-model-screeshot-3.png" alt="Screenshot of spreadsheet showing three columns of systems modeling"></p> <p>Now we&rsquo;re on to the final stock.
We add the <code>NumToClose?</code> calculation, which assumes that all deployed changes are now closed.</p> <pre><code>// G3 is the number of deployed changes
=G3
</code></pre> <p>This makes the calculation for the <code>Closed Tickets</code> stock:</p> <pre><code>// I2 is the prior value of Closed Tickets
// H3 is the NumToClose?
=I2 + H3
</code></pre> <p>With that, we&rsquo;ve now modeled the entire left-to-right flows.</p> <p><img src="https://lethain.com/static/blog/strategy/dx-model-screeshot-4.png" alt="Screenshot of spreadsheet showing three columns of systems modeling"></p> <p>The left-to-right flows are simple, a mix of rate-constrained flows and fully scalable ones, and overall we see things progressing through the pipeline evenly. All that is about to change!</p> <h3 id="modeling-right-to-left">Modeling right-to-left</h3> <p>We&rsquo;ve now finished modeling the happy path from left to right. Next we need to model all the exception paths where things flow right to left. For example, an issue found in production would cause a flow from <code>Closed Ticket</code> back to <code>Open Ticket</code>. This tends to be where models get interesting.</p> <p>There are three right-to-left flows that we need to model:</p> <ol> <li><code>Closed Ticket</code> to <code>Open Ticket</code> represents a bug discovered in production.</li> <li><code>Deployed Code</code> to <code>Start Coding</code> represents a bug discovered during deployment.</li> <li><code>Tested Code</code> to <code>Start Coding</code> represents a bug discovered in testing.</li> </ol> <p>To start, we&rsquo;re going to add configurations defining the rates of those flows. These are going to be percentage flows, with a certain percentage of the target stock triggering the error condition rather than proceeding. For example, perhaps 25% of the <code>Closed Tickets</code> are discovered to have a bug each round.</p> <p><img src="https://lethain.com/static/blog/strategy/dx-model-screeshot-5.png" alt="Screenshot of spreadsheet showing three columns of systems modeling"></p> <p>These are fine starter values, and we&rsquo;ll experiment with how adjusting them changes the model in the <em>Exercise</em> section below.</p> <p>Now we&rsquo;ll start by modeling errors discovered in production, by adding a column to model the flow from <code>Closed Tickets</code> to <code>Open Tickets</code>, the <code>ErrorsFoundInProd?</code> column.</p> <pre><code>// I3 is the number of Closed Tickets
// Config!$B$5 is the rate of errors
=FLOOR(I3 * Config!$B$5)
</code></pre> <p>Note the usage of <code>FLOOR</code> to avoid moving partial tickets. Feel free to skip that entirely if you&rsquo;re comfortable with the concept of fractional tickets, fractional deploys, and so on. This is an aesthetic consideration, and generally only impacts your model if you choose overly small starting values.</p> <p>This means that our calculation for <code>Closed Ticket</code> needs to be updated as well to reduce by the prior row&rsquo;s result for <code>ErrorsFoundInProd?</code>:</p> <pre><code>// I2 is the prior value of ClosedTicket
// H3 is the current value of NumToClose?
// J2 is the prior value of ErrorsFoundInProd?
=I2 + H3 - J2
</code></pre> <p>We&rsquo;re not quite done, because we <em>also</em> need to add the prior row&rsquo;s value of <code>ErrorsFoundInProd?</code> into <code>Open Tickets</code>, which represents the errors&rsquo; flow from closed to open tickets.
Based on this change, the calculation for <code>Open Tickets</code> becomes:</p> <pre><code>// A2 is the prior value of Open Tickets
// Config!$B$1 is the base rate of ticket opening
// B2 is prior row's StartCodingMore?
// J2 is prior row's ErrorsFoundInProd?
=A2 + Config!$B$1 - B2 + J2
</code></pre> <p>Now we have the full errors in production flow represented in our model.</p> <p><img src="https://lethain.com/static/blog/strategy/dx-model-screeshot-6.png" alt="Screenshot of spreadsheet showing three columns of systems modeling"></p> <p>Next, it&rsquo;s time to add the <code>Deployed Code</code> to <code>Start Coding</code> flow. Start by adding the <code>ErrorsFoundInDeploy?</code> calculation:</p> <pre><code>// G3 is deployed code
// Config!$B$6 is deployed error rate
=FLOOR(G3 * Config!$B$6)
</code></pre> <p>Then we need to update the calculation for <code>Deployed Code</code> to decrease by the calculated value in <code>ErrorsFoundInDeploy?</code>:</p> <pre><code>// G2 is the prior value of Deployed Code
// F3 is NumToDeploy?
// H2 is prior row's NumToClose?
// I2 is ErrorsFoundInDeploy?
=G2 + F3 - H2 - I2
</code></pre> <p>Finally, we need to increase the size of <code>Started Coding</code> by the same value, representing the flow of errors discovered in deployment:</p> <pre><code>// C2 is the prior value of Started Coding
// B3 is StartCodingMore?
// D2 is prior value of NumToTest?
// I2 is prior value of ErrorsFoundInDeploy?
=C2 + B3 - D2 + I2
</code></pre> <p>We now have the working flow representing errors in deployment.</p> <p><img src="https://lethain.com/static/blog/strategy/dx-model-screeshot-7.png" alt="Screenshot of spreadsheet showing three columns of systems modeling"></p> <p>Finally, we can add the <code>Tested Code</code> to <code>Started Coding</code> flow. This is pretty much the same as the prior flow we added, starting with adding an <code>ErrorsFoundInTest?</code> calculation:</p> <pre><code>// E3 is tested code
// Config!$B$7 is the testing error rate
=FLOOR(E3 * Config!$B$7)
</code></pre> <p>Then we update <code>Tested Code</code> to reduce by this value:</p> <pre><code>// E2 is prior value of Tested Code
// D3 is NumToTest?
// G2 is prior value of NumToDeploy?
// F2 is prior value of ErrorsFoundInTest?
=E2 + D3 - G2 - F2
</code></pre> <p>And update <code>Started Coding</code> to increase by this value:</p> <pre><code>// C2 is prior value of Started Coding
// B3 is StartCodingMore?
// D2 is prior value of NumToTest?
// J2 is prior value of ErrorsFoundInDeploy?
// F2 is prior value of ErrorsFoundInTest?
= C2 + B3 - D2 + J2 + F2
</code></pre> <p>Now this last flow is instrumented.</p> <p><img src="https://lethain.com/static/blog/strategy/dx-model-screeshot-8.png" alt="Screenshot of spreadsheet showing three columns of systems modeling"></p> <p>With that, we now have a complete model that we can start exercising! This exercise demonstrated both that it&rsquo;s <em>quite possible</em> to represent a meaningful model in a spreadsheet and the challenges of doing so.</p>
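<p>For contrast, here&rsquo;s roughly how the same model might look in the <a href="https://github.com/lethain/systems">lethain/systems</a> notation used in the driver onboarding chapter. This is a sketch rather than an exact translation: the rates mirror the config values above where the text specifies them, the backlog stock is an arbitrarily large source of tickets, and the spreadsheet&rsquo;s concurrency cap and &ldquo;deploy everything&rdquo; behavior don&rsquo;t map one-to-one onto simple rate and leak flows.</p> <pre><code># a large source stock stands in for future tickets
Backlog(1000)
# one ticket opened per round (TicketOpenRate)
Backlog &gt; OpenTickets @ 1
# one ticket started per round (StartCodingRate); the
# MaxConcurrentCodingNum cap is omitted from this sketch
OpenTickets &gt; StartedCoding @ 1
# one ticket tested per round (TicketTestRate)
StartedCoding &gt; TestedCode @ 1
# most tested and deployed work moves forward each round...
TestedCode &gt; DeployedCode @ Leak(0.9)
DeployedCode &gt; ClosedTickets @ Leak(0.9)
# ...while the erroring share moves backwards (these 10% test and
# deploy error rates are illustrative assumptions)
TestedCode &gt; StartedCoding @ Leak(0.1)
DeployedCode &gt; StartedCoding @ Leak(0.1)
# 25% of closed tickets reopen as production errors (ErrorsInProd)
ClosedTickets &gt; OpenTickets @ Leak(0.25)
</code></pre>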
<p>While developing this model, a number of errors became evident. Some of them I was able to fix relatively easily, and even more I left unfixed, because fixing them makes the model <em>even harder</em> to reason about. This is a good example of why I encourage developing one or two models in a spreadsheet, but I ultimately don&rsquo;t believe it&rsquo;s the right mechanism to work in for most people: even very smart people make errors in their spreadsheets, and catching those errors is exceptionally challenging.</p> <h2 id="exercise">Exercise</h2> <p>Now that we&rsquo;re done building this model, we can finally start the fun part: exercising it. We&rsquo;ll start by creating a simple bar chart showing the size of each stock at each step. We are going to expressly <em>not</em> show the intermediate calculation columns such as <code>NumToTest?</code>, because those are implementation details rather than particularly interesting.</p> <p>Before we start tweaking the values, let&rsquo;s look at the baseline chart.</p> <p><img src="https://lethain.com/static/blog/strategy/dx-chart-1.png" alt="Chart showing systems modeling"></p> <p>The most interesting thing to notice is that our current model doesn&rsquo;t actually increase the number of closed tickets over time. We just get further and further behind, which isn&rsquo;t too exciting.</p> <p>So let&rsquo;s start modeling the first way that LLMs might help: reducing the error rate in production. Let&rsquo;s shift <code>ErrorsInProd</code> from <code>0.25</code> down to <code>0.1</code>, and see how that impacts the chart.</p> <p><img src="https://lethain.com/static/blog/strategy/dx-chart-2.png" alt="Chart showing systems modeling"></p> <p>We can see that this allows us to make more progress on closing tickets, although at some point equilibrium is established between closed tickets and the error rate in production, preventing further progress. This does validate that reducing the error rate in production matters. It also suggests that as long as the error rate is a function of everything we&rsquo;ve previously shipped, we are eventually in trouble.</p>
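<p>A rough back-of-the-envelope calculation shows why the equilibrium forms where it does. This approximation ignores the <code>FLOOR</code> rounding and the ramp-up period, but it matches the direction of both charts:</p> <pre><code>// At equilibrium, tickets closed per round equals errors
// discovered per round. With steady closing throughput T and
// production error rate p, the closed stock plateaus near:
//   p * ClosedTickets = T  =&gt;  ClosedTickets = T / p
// e.g. with T = 1: a plateau of ~4 closed tickets at p = 0.25,
// versus ~10 at p = 0.1. Halving the error rate doubles the plateau.
</code></pre>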
<p>Next let&rsquo;s experiment with the idea that LLMs allow us to test more quickly, tripling <code>TicketTestRate</code> from <code>1</code> to <code>3</code>. It turns out, increasing the testing rate doesn&rsquo;t change anything at all, because the current constraint is in starting tickets.</p> <p><img src="https://lethain.com/static/blog/strategy/dx-chart-3.png" alt="Chart showing systems modeling"></p> <p>So, let&rsquo;s test that. Maybe LLMs let us start tickets faster because the <em>overall</em> speed of development goes up. Let&rsquo;s model that by increasing <code>StartCodingRate</code> from <code>1</code> to <code>3</code> as well.</p> <p><img src="https://lethain.com/static/blog/strategy/dx-chart-4.png" alt="Chart showing systems modeling"></p> <p>This is a fascinating result, because tripling development and testing velocity has changed how much work we start, but ultimately the real constraint in our system is the error discovery rate in production.</p> <p>By exercising this model, we find an interesting result. To the extent that our error rate is a function of the volume of things we&rsquo;ve shipped in production, shipping faster doesn&rsquo;t increase our velocity at all. The only meaningful way to increase productivity in this model is to reduce the error rate in production.</p> <p>Models are imperfect representations of reality, but this one gives us a clear sense of what matters the most: if we want to increase our velocity, we have to reduce the rate that we discover errors in production. That might be reducing the error rate as implied in this model, or it might be ideas that exist outside of this model. For example, the model doesn&rsquo;t represent this well, but perhaps we&rsquo;d be better off iterating more on fewer things to avoid this scenario. If we make multiple changes to one area, it still just represents one implemented feature, not many implemented features, and the overall error rate wouldn&rsquo;t increase.</p>Testing strategy: avoid the waterfall strategy trap with iterative refinement.https://lethain.com/testing-strategy-iterative-refinement/Wed, 25 Sep 2024 17:00:00 -0700https://lethain.com/testing-strategy-iterative-refinement/<p>If I could only popularize one idea about technical strategy, it would be that prematurely applying pressure to a strategy&rsquo;s rollout prevents evaluating whether the strategy is effective. Pressure changes behavior in profound ways, and many of those changes are intended to make you believe your strategy is working while minimizing change to the status quo (if you&rsquo;re an executive) or get your strategy repealed (if you&rsquo;re not an executive). Neither is particularly helpful.</p> <p>While some strategies are obviously wrong from the beginning, it&rsquo;s much more common to see reasonable strategies that fail because they didn&rsquo;t get the small details right. Premature pressure is one common cause of a more general phenomenon: most strategies are developed in a waterfall model, finalizing their approach before incorporating the lessons that reality teaches when you attempt the strategy in practice.</p> <p>One effective mechanism to avoid the waterfall strategy trap is explicitly testing your strategy to refine the details. This chapter describes the mechanics of testing strategy:</p> <ul> <li>when it&rsquo;s important to test strategy (and when it isn&rsquo;t)</li> <li>how to test strategy</li> <li>when you should stop testing</li> <li>roles in testing strategy: sponsor vs guide</li> <li>metrics and meetings to run a testing strategy</li> <li>how to identify a strategy that skipped testing</li> <li>what to do when a strategy has progressed too far without testing</li> </ul> <p>Let&rsquo;s get into the details.</p> <hr> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> <p><em>Many of the ideas here came together while working with <a href="https://www.linkedin.com/in/shawnamartell/">Shawna Martell</a>, <a href="https://www.linkedin.com/in/danfike/">Dan Fike</a>, <a href="https://www.linkedin.com/in/madhurisarma/">Madhuri Sarma</a>, and many others in Carta Engineering.</em></p> <h2 id="when-to-test-strategy">When to test strategy</h2> <p>Strategy testing is ensuring that a strategy will accomplish its intended goal at a cost that you&rsquo;re willing to pay.
This means it needs to happen prior to implementing a strategy, usually in a strategy&rsquo;s early development stages.</p> <p>A few examples of when to test common strategy topics:</p> <ul> <li>Integrating a recent acquisition might focus on getting a single API integration working before finalizing the overall approach.</li> <li>A developer productivity strategy focused on requiring typing in a Python codebase might start by having an experienced team member type an important module.</li> <li>A service migration might attempt migrating both a simple component (to test migration tooling) and a highly complex component (to test integration complexity) before moving to a broader rollout.</li> </ul> <p>In every case, the two most important pieces are testing before finalizing the strategy, and testing narrowly with a focus on the underlying mechanics of the approach rather than getting caught up in solving broad problems like motivating adoption and addressing conflicting incentives.</p> <p>That&rsquo;s not to say that you need to test every strategy. A few of the common cases where you might not want to test a strategy are:</p> <ul> <li>When you&rsquo;re dealing with a <a href="https://lethain.com/when-write-down-engineering-strategy/">permissive strategy</a> that&rsquo;s very cheap to apply, testing is often not too important; indeed, you can consider most highly-permissive strategies as a test of whether it&rsquo;s effective to implement a similar, but less permissive, strategy in the future.</li> <li>Where testing isn&rsquo;t viable for some reason. For example, a hiring strategy where you shift hiring into certain regions isn&rsquo;t something you can test in most cases; it&rsquo;s something you might need to run for several years to get meaningful signal on results.</li> <li>There are also cases where you have such high conviction in a given strategy that it&rsquo;s not worth testing, perhaps because you&rsquo;ve done something nearly identical at the same company before. Hubris comes before the fall, so I&rsquo;m generally skeptical of this category.</li> </ul> <p>That said, my experience is that you should try very hard to find a way to test every strategy. You certainly should not try hard to convince yourself testing a strategy isn&rsquo;t worthwhile. Testing is so, so much cheaper than implementing a bad strategy that it&rsquo;s almost always a good investment of time and energy.</p> <h2 id="how-to-test-strategy">How to test strategy</h2> <p>For a valuable step that&rsquo;s so often skipped, testing strategy is relatively straightforward.
The approach I&rsquo;ve found effective is:</p> <ol> <li> <p>Identify the narrowest, deepest available slice of your strategy, and iterate on applying your strategy to that slice until you&rsquo;re confident the approach works well.</p> <p>For example, if you&rsquo;re testing a new release strategy for your Product Engineering organization, choose exactly one important release to follow the new approach.</p> </li> <li> <p>As you iterate, identify metrics that help you verify the approach is working; note that these aren&rsquo;t metrics that measure adoption; instead, they measure the impact of the change.</p> <p>For example, metrics that show the new release process reduces customer impact, or drives more top-of-funnel visitors.</p> </li> <li> <p>Operate from the belief that people are well-meaning, and strategy failures are due to excess friction and poor ergonomics.</p> <p>For example, assume the release tooling is too complex if people aren&rsquo;t using it. (Definitely don&rsquo;t assume that people are too resistant to change.)</p> </li> <li> <p>Keep refining until you have conviction that your strategy&rsquo;s details work in practice, or that the strategy needs to be approached from a new direction.</p> <p>For example, when the metrics you identified earlier show that the new release process has significantly reduced customer impact.</p> </li> </ol> <p>The most important details are the things <em>not</em> to do. Don&rsquo;t go broad where impact <em>feels</em> higher but iteration cycles are slower. Don&rsquo;t get caught up on <em>forcing</em> adoption such that you&rsquo;re distracted from improving the underlying mechanics. Finally, don&rsquo;t get so attached to your current approach that you can&rsquo;t accept that it might not be working. Testing strategy is only valuable because many strategies don&rsquo;t work as intended, and it&rsquo;s much cheaper to learn that early.</p> <h2 id="testing-roles-sponsors-and-guides">Testing roles: sponsors and guides</h2> <p>Sometimes the strategy testing process is led by one individual who is able to sponsor the work (a principal engineer at a smaller company, an executive, etc) and also coordinate the day-to-day work of validating the approach (a principal engineer at a larger company, an engineering manager, a technical program manager, etc).
It&rsquo;s even more common for these responsibilities to be split between two roles: a <strong>sponsor</strong> and a <strong>guide</strong>.</p> <p>The <strong>sponsor</strong> is responsible for:</p> <ol> <li>serving as an escalation point to make quick decisions to avoid getting stuck in development stages</li> <li>pushing past historical decisions and beliefs that prevent meaningful testing</li> <li>marshalling cross-organizational support</li> <li>telling the story to stakeholders, especially the executive team, to avoid getting defunded</li> <li>preventing overloading of strategy (where people want to make the strategy solve <em>their</em> semi-related problem)</li> <li>setting pace to avoid stalling out</li> <li>identifying when energy is dropping and it&rsquo;s time to change the strategy&rsquo;s phase (from development to implementation)</li> </ol> <p>The <strong>guide</strong> is responsible for:</p> <ol> <li>translating the strategy into particulars where testing gets stuck</li> <li>identifying slowdowns and blockers</li> <li>escalating frequently to the sponsor</li> <li>tracking goals and workstreams</li> <li>maintaining the pace set by the sponsor</li> </ol> <p>In terms of filling these roles, there are a few lessons that I&rsquo;ve learned over time. For sponsors, what matters the most is that they&rsquo;re genuinely authorized by the company to make the decision they&rsquo;re making, and that they care enough about the impact that they&rsquo;re willing to make difficult decisions quickly. A sponsor is only meaningful to the extent that the guide can escalate to the sponsor <em>and</em> they rapidly resolve those escalations. If they aren&rsquo;t available for escalations or don&rsquo;t resolve them quickly, they&rsquo;re a poor sponsor.</p> <p>For guides, you need someone who can execute at pace without getting derailed by various organizational messes, and has good, nuanced judgment relevant to the strategy being tested. The worst guides are ideological (they reject the very feedback created by testing) or easily derailed (you&rsquo;re likely testing <em>because</em> there&rsquo;s friction somewhere, so someone who can&rsquo;t navigate friction is going to fail by default).</p> <h2 id="meetings--metrics">Meetings &amp; Metrics</h2> <p>The only absolute requirement for the strategy testing phase is that the sponsor, guide, and any key folks working on the strategy <strong>must meet together every single week</strong>. Within that meeting, you&rsquo;ll iterate on which metrics capture the current areas you&rsquo;re trying to refine, discuss what you&rsquo;ve learned from prior metrics or data, and schedule one-off follow-ups to ensure you&rsquo;re making progress.</p> <p>The best version of this meeting is debugging-heavy and presentation-light. Any week that you&rsquo;re not learning something that informs subsequent testing, or making a decision that modifies your approach to testing, should be viewed with some suspicion. It might mean that you&rsquo;ve under-resourced the testing effort, or that your testing approach is too ambitious, but it&rsquo;s a meaningful signal that testing is converging too slowly to maintain attention.</p> <p>If all of this seems like an overly large commitment, I&rsquo;d push you to consider your <a href="https://lethain.com/when-write-down-engineering-strategy/">strategy altitude</a> to adjust the volume or permissiveness of the strategy you&rsquo;re working on.
If a strategy isn&rsquo;t worth testing, then it&rsquo;s either already quite good (which should be widely evident beyond its authors) or it&rsquo;s probably only worth rolling out in a highly permissive format.</p> <h2 id="identifying-strategies-that-skipped-testing">Identifying strategies that skipped testing</h2> <p>While not all strategies <em>must</em> be refined by a testing phase, essentially all failing strategies skip the testing phase to move directly into implementation. Strategies that skip testing <em>sound right</em>, but don&rsquo;t accomplish much. Fully standardizing authorization and authentication across your company on one implementation <em>sounds right</em>, but can still fail if e.g. each team is responsible for its own approach to determining the standard.</p> <p>One particularly obvious pattern is something I describe as &ldquo;pressure without a plan.&rdquo; This is a strategy that is <em>only</em> the &ldquo;sounds right&rdquo; aspect with none of the details. Service migrations are particularly prone to this, perhaps due to apocryphal descriptions of Amazon&rsquo;s service migration in the 2000s, which is often summarized as a top-down zero-details mandate to switch away from the monolith.</p> <p>Identification comes down to understanding two things:</p> <ol> <li> <p>Are there numbers that show the strategy is driving the desired impact? For example, API requests made into the new authentication service as a percentage of all authentication requests is more meaningful than a spreadsheet tracking teams&rsquo; commitments to move to the new service.</p> <p>Try to avoid proxy metrics when possible; instead, look at the actual thing that matters.</p> </li> <li> <p>If the numbers aren&rsquo;t moving, is there a clear mechanism for debugging and solving those issues, and is this team actually making progress? For example, a team that helps debug integrations with the new authentication service to understand where limitations are preventing effective adoption, and that is shipping working code.</p> <p>Because the numbers aren&rsquo;t moving, you need to find a different source of meaningful evidence to validate that progress is happening. Generally the best bet is new software running in a meaningful environment (e.g. production for product code). It&rsquo;s also useful to talk with skeptics or failed integrations, but be cautious of debugging exclusively with skeptics. They&rsquo;re almost always right, but often out-of-date: they&rsquo;re describing real problems that are no longer the current ones.</p> </li> </ol> <p>Unless one of these two signals is <em>obviously true</em>, then it&rsquo;s very likely that you&rsquo;ve found a strategy that skipped testing.</p> <h2 id="recovering-from-skipped-testing">Recovering from skipped testing</h2> <p>Once you&rsquo;ve recognized a strategy that skipped testing and is now struggling, the next question is what to do about it. <a href="https://lethain.com/decompose-monolith-strategy/">Should we decompose our monolith?</a> looks at recovering from a failing service migration, and is lightly based on my experience dealing with similar, stuck service migrations at both Calm and Carta. The answer to a stuck strategy is always: write a new strategy, and make sure <em>not</em> to skip testing this time.</p> <p>Typically, the first step of this new strategy is explicitly pausing the struggling strategy while a new testing phase occurs.
This is painful to do, because the folks invested in the current strategy will be upset with you, but there are always going to be people who disagree with any change. Long-term, the only thing that makes most people happy is a successful strategy, and anything that delays progress towards one is a poor investment.</p> <p>Sometimes it is difficult to officially pause a struggling strategy, in which case you have to look for an indirect mechanism to implicitly pause without acknowledging it. For example, delaying new services while you take a month to invest into improving service provisioning might give you enough breathing room to test the missing mechanisms from your strategy, without requiring anyone to lose face by acknowledging that the migration is failing. It would be nice to always be able to say these things out loud, but managing personalities is an enduring leadership challenge; even when you&rsquo;re an executive, you just have a different set of messy stakeholders.</p> <h2 id="summary">Summary</h2> <p>Testing doesn&rsquo;t determine whether a strategy might be good. It exposes the missing details required to translate a directionally accurate strategy into a strategy that works. After reading this chapter, you know how to lead that translation process as both a sponsor and a guide. You can set up and run the necessary meetings to test a strategy, and also put together the bank of metrics to determine if the strategy is ready to leave refinement and move to a broader rollout.</p>Should we decompose our monolith?https://lethain.com/decompose-monolith-strategy/Sun, 15 Sep 2024 06:00:00 -0700https://lethain.com/decompose-monolith-strategy/<p>Since their <a href="https://en.wikipedia.org/wiki/Microservices">first introduction in 2005</a>, the choice between adopting a microservices architecture, a monolithic service architecture, or a hybrid of the two has become one of the least-reversible decisions that most engineering organizations make. Even migrating to a different database technology is <em>generally</em> a less expensive change than moving from monolith to microservices or from microservices to monolith.</p> <p>The industry has in many ways gone full circle on that debate, from most hyperscalers in the 2010s undertaking multi-year monolith-to-microservices migrations, to <a href="https://x.com/kelseyhightower/status/940259898331238402">Kelsey Hightower&rsquo;s iconic tweet on the perils of distributed monoliths</a>:</p> <blockquote> <p>2020 prediction: Monolithic applications will be back in style after people discover the drawbacks of distributed monolithic applications. - @KelseyHightower</p> </blockquote> <p>Even as popular sentiment has generally turned away from microservices, many engineering organizations have a bit of both, often the remnants of one or more earlier but incomplete migration efforts.
This strategy looks at a theoretical organization stuck with a bit of both approaches and looking to determine its path forward; let&rsquo;s call it Theoretical Compliance Company.</p> <p>Here is Theoretical Compliance Company&rsquo;s service architecture strategy.</p> <hr> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> <h2 id="reading-this-document">Reading this document</h2> <p>To apply this strategy, start at the top with <em>Policy</em>. To understand the thinking behind this strategy, read sections in reverse order, starting with <em>Explore</em>, then <em>Diagnose</em> and so on. Relative to the default structure, this document has been refactored in two ways to improve readability: first, <em>Operation</em> has been folded into <em>Policy</em>; second, <em>Refine</em> has been embedded in <em>Diagnose</em>.</p> <p>More detail on this structure in <a href="https://lethain.com/readable-engineering-strategy-documents">Making a readable Engineering Strategy document</a>.</p> <h2 id="policy">Policy</h2> <p>Our policy for service architecture is documented here. All exceptions to this policy <strong>must</strong> escalate <em>to</em> a local Staff-plus engineer for their approval, and then escalate <em>with</em> that Staff-plus engineer to the CTO. If you have questions about the policies, ask in <code>#eng-strategy</code>.</p> <p>Our policy is:</p> <ol> <li> <p><strong>Business units should always operate in their own code repository and monolith.</strong> They should not provision many different services. They should rarely work in other business units&rsquo; monoliths. There are nuances in the details: make decisions that bring us closer to the preceding sentence being true.</p> </li> <li> <p><strong>New integrations across business unit monoliths should be done using gRPC.</strong> The emphasis here is on <em>new</em> integrations; it&rsquo;s desirable but not urgent to migrate existing integrations that use other implementations (HTTP/JSON, etc).</p> <p>When the decision is subtle (e.g. changes to an existing endpoint), optimize for business velocity rather than technical purity. When the decision is far from subtle (e.g. brand new endpoint), comply with the policy.</p> </li> <li> <p><strong>Except for new business unit monoliths, we don&rsquo;t allow new services.</strong> You should work within the most appropriate business unit monolith or within the existing infrastructure repositories. Provisioning a new service, unless it corresponds with a new business unit, always requires approval from the CTO in <code>#eng-strategy</code>.</p> <p>That approval generally will <em>not</em> be granted, unless the new service has significantly different non-functional requirements than an existing monolith. For example, if it requires significantly more compliance review prior to changes, such as our existing payments service, or if it requires radically higher requests per second, and so on.</p> </li> <li> <p><strong>Merge existing services into business-unit monoliths where you can.</strong> We believe that each choice to move existing services back into a monolith should be made &ldquo;in the details&rdquo; rather than from a top-down strategy perspective.
Consequently, we generally encourage teams to wind down their existing services outside of their business unit&rsquo;s monolith, but defer to teams to make the right decision for their local context.</p> </li> </ol> <h2 id="diagnose">Diagnose</h2> <p>Theoretical Compliance Company has a complex history with decomposing our monolith. We are also increasing our number of business units, while limiting our investment into our core business unit. These are complex times, with a lot of constraints to juggle. To improve readability, we&rsquo;ve split the diagnosis into two sections: &ldquo;business constraints&rdquo; and &ldquo;engineering constraints.&rdquo;</p> <p>Our business constraints are:</p> <ol> <li> <p>We sell business-to-business compliance solutions to other companies on an annual subscription. There is one major, established business line, and two smaller, partially-validated business lines that are intended to attach to the established business line to increase average contract value.</p> </li> <li> <p>There are 2,000 people at the company. About 500 of those are in the engineering organization. Within that 500, about 150 work on the broadest definition of &ldquo;infrastructure engineering,&rdquo; things like developer tools, compute and orchestration, networking, security engineering, and so on.</p> </li> <li> <p>The business is profitable, but revenue growth has been 10-20% YoY, creating persistent pressure on spend from our board, based on mild underperformance relative to public market comparables. <strong>Unless we can increase YoY growth by 5-10%, they expect us to improve free cash flow by 5-10% each year</strong>, which jeopardizes our ability to maintain long-term infrastructure investments.</p> </li> <li> <p><strong>Growth in the primary business line is shrinking.</strong> The company&rsquo;s strategy includes spinning up more adjacent business units to increase average contract value with new products. <strong>We need to fund these business units without increasing our overall budget</strong>, which means budget for the new business units must be pulled away from either our core business or our platform teams.</p> <p>In addition to needing to fund our new business units, <strong>there&rsquo;s ongoing pressure to make our core business more efficient</strong>, which means either accelerating growth or reducing investment. It&rsquo;s challenging to accelerate growth while reducing investment, which suggests that most improvement will come from reducing our investment.</p> </li> <li> <p>Our methodology for allocating platform costs against business units does so proportionally to the revenue created by each business unit. <strong>Our core business generates the majority of our revenue, which means it is accountable for the majority of our platform costs</strong>, even if those costs are motivated by new business lines.</p> <p>This means that, even as the burden placed on platform teams increases due to spinning up multiple business units, there&rsquo;s significant financial pressure to reduce our platform spend because it&rsquo;s primarily represented as a cost to the core business whose efficiency we have to improve.
<h2 id="diagnose">Diagnose</h2> <p>Theoretical Compliance Company has a complex history with decomposing our monolith. We are also increasing our number of business units, while limiting our investment into our core business unit. These are complex times, with a lot of constraints to juggle. To improve readability, we&rsquo;ve split the diagnosis into two sections: &ldquo;business constraints&rdquo; and &ldquo;engineering constraints.&rdquo;</p> <p>Our business constraints are:</p> <ol> <li> <p>We sell business-to-business compliance solutions to other companies on an annual subscription. There is one major, established business line, and two smaller, partially validated business lines that are intended to attach to the established business line to increase average contract value.</p> </li> <li> <p>There are 2,000 people at the company. About 500 of those are in the engineering organization. Within those 500, about 150 work on the broadest definition of &ldquo;infrastructure engineering,&rdquo; things like developer tools, compute and orchestration, networking, security engineering, and so on.</p> </li> <li> <p>The business is profitable, but revenue growth has been 10-20% YoY, creating persistent pressure on spend from our board, based on mild underperformance relative to public market comparables. <strong>Unless we can increase YoY growth by 5-10%, they expect us to improve free cash flow by 5-10% each year</strong>, which jeopardizes our ability to maintain long-term infrastructure investments.</p> </li> <li> <p><strong>Growth in the primary business line is slowing.</strong> The company&rsquo;s strategy includes spinning up more adjacent business units to increase average contract value with new products. <strong>We need to fund these business units without increasing our overall budget</strong>, which means budget for the new business units must be pulled away from either our core business or our platform teams.</p> <p>In addition to needing to fund our new business units, <strong>there&rsquo;s ongoing pressure to make our core business more efficient</strong>, which means either accelerating growth or reducing investment. It&rsquo;s challenging to accelerate growth while reducing investment, which suggests that most improvement will come from reducing our investment.</p> </li> <li> <p>We allocate platform costs across business units in proportion to the revenue each generates (e.g. a business unit producing 80% of revenue is charged 80% of platform costs). <strong>Our core business generates the majority of our revenue, which means it is accountable for the majority of our platform costs</strong>, even if those costs are motivated by new business lines.</p> <p>This means that, even as the burden placed on platform teams increases due to spinning up multiple business units, there&rsquo;s significant financial pressure to reduce our platform spend, because it&rsquo;s primarily represented as a cost to the core business whose efficiency we have to improve. This means we have little tolerance for anything that increases infrastructure overhead.</p> </li> </ol> <p>Our engineering constraints are:</p> <ol> <li> <p>Our infrastructure engineering team is 150 engineers supporting 350 product engineers, and it&rsquo;s certain that <strong>infrastructure will not grow significantly in the foreseeable future</strong>.</p> </li> <li> <p>We spun up two new business units in the past six months, and <strong>plan to spin up an additional two new business units</strong> in the next year. Each business unit is led by a general manager, with engineering and product within that business unit principally accountable to that general manager. Our CTO and CPO still set practice standards, but it&rsquo;s situationally specific whether those standards or direction from the general manager is the last word in any given debate.</p> <p>For example, one business unit has been unwilling to support an on-call rotation for their product, because their general manager insists it is a wasteful practice. Consequently, that team often doesn&rsquo;t respond to pages, even when their service is impacting the stability of shared functionality.</p> </li> <li> <p>We&rsquo;ve modeled <a href="https://lethain.com/services-overhead-model/">how services and monoliths create overhead for both product and infrastructure organizations over time</a>, and have conviction that, in general, <strong>it&rsquo;s more overhead for infrastructure to support more services</strong>. We also found that, in our organization, the rate at which service ownership changes due to team reorganizations counteracts much of the initial productivity gains from leaving the monolith.</p> </li> <li> <p>There is some tension between the two preceding observations: it&rsquo;s generally more overhead to have more services, but it&rsquo;s <em>even more</em> overhead to have irresponsible business units breaking a shared monolithic service. For example, we can much more easily rate-limit usage from a misbehaving service than constrain a misbehaving codepath within a shared service. There&rsquo;s a sketch of this distinction after this list.</p> </li> <li> <p>We also have a payments service that moves money from customers to us. <strong>Our compliance and security requirements for changes to this service are significantly higher</strong> than for the majority of our software, because the blast radius is essentially infinite.</p> </li> <li> <p>Our primary programming language is Ruby, which generally relies on blocking IO, and service-oriented architectures generally spend more time on blocking IO than monoliths do. Similarly, Ruby is <em>relatively</em> inefficient at serializing and deserializing JSON payloads, which our service architecture requires as part of cross-service communication.</p> </li> <li> <p>We&rsquo;ve previously attempted to decompose, and have <strong>a number of lingering partial migrations that don&rsquo;t align cleanly with our current business unit ownership structure</strong>. The number of these orphaned services continues to grow over time, creating more burden on infrastructure today and on product teams in the future, as they try to maintain these services through various team reorganizations.</p> </li> </ol>
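<p>To illustrate the fourth engineering constraint, here&rsquo;s a minimal sketch, in Python with invented names, of why throttling a misbehaving service is tractable: every calling service arrives with a stable identity that limits can be keyed on, whereas a misbehaving codepath inside a shared monolith offers no equivalent handle.</p> <pre><code>import time
from collections import defaultdict


class CallerRateLimiter:
    """Token bucket keyed by calling service identity.

    Each caller is identified by a stable service name (for example,
    taken from an mTLS certificate or a request header), so one
    misbehaving business unit can be throttled at the boundary
    without touching any shared code paths.
    """

    def __init__(self, rate_per_sec: float, burst: float) -> None:
        self.rate = rate_per_sec                      # tokens refilled per second
        self.burst = burst                            # maximum bucket size
        self.tokens = defaultdict(lambda: burst)      # current tokens per caller
        self.last_seen = defaultdict(time.monotonic)  # last refill time per caller

    def allow(self, caller: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_seen[caller]
        self.last_seen[caller] = now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens[caller] = min(self.burst, self.tokens[caller] + elapsed * self.rate)
        if self.tokens[caller] >= 1:
            self.tokens[caller] -= 1
            return True
        return False


# Hypothetical usage at a service boundary:
limiter = CallerRateLimiter(rate_per_sec=50, burst=100)
if not limiter.allow("new-business-unit-svc"):
    ...  # reject the request, e.g. gRPC RESOURCE_EXHAUSTED or HTTP 429
</code></pre> <p>Inside a shared monolith, by contrast, the equivalent of <code>caller</code> doesn&rsquo;t exist: a hot codepath is just one more function call in code you also own, which is exactly why unaccountable neighbors in a monolith are so costly.</p>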
<a href="https://x.com/kelseyhightower/status/940259898331238402">Kelsey Hightower&rsquo;s iconic tweet on the perils of distributed monoliths</a> in 2020 captured the beginning of a reversal, with more companies recognizing the burden of operating service-oriented architectures.</p> <p>In addition to the wider recognition of those burdens, many of the cloud infrastructure challenges that originally motivated service architectures began to mellow. Most infrastructure engineers today <em>only</em> know how to operate with cloud-native patterns, rather than starting from machine oriented approaches. Standard database technologies like PostgreSQL have significantly improved capabilities. Cloud providers have fast local caches for quickly retrieving verified upstream packages. Supply and cost of cloud compute is affordable. Slow programming languages are faster than they were a decade ago. Untyped languages have reasonable incremental paths to typed codebases.</p> <p>As a result of this shift, if you look at a new, emerging company it&rsquo;s particularly likely to have a monolith in one backend and one frontend programming language. However, if you look at a five-plus year old company, you might find almost anything. One particularly common case is a monolith with most functionality, and an inconsistent constellation of team-scoped macroservices scattered across the organization.</p> <p>The shift away from <a href="https://en.wikipedia.org/wiki/Zero_interest-rate_policy">zero interest-rate policy</a> has also impacted trends, as service-oriented architectures tend to require more infrastructure to operate efficiently, such as service meshes, service provisioning and deprovisioning, etc. Properly tuned, service-oriented architectures ought to be cost competitive, and potentially superior in complex workloads, but it&rsquo;s hard to maintain the required investment in infrastructure teams when in a cost-cutting environment. This has encouraged new companies to restrict themselves to monolithic approaches, and pushed existing companies to <em>attempt</em> to reverse their efforts to decompose their prior monoliths, with mixed results.</p>Executive translation.https://lethain.com/executive-translation/Sat, 07 Sep 2024 08:00:00 -0700https://lethain.com/executive-translation/<p>One of my most unexpectedly controversial posts is <a href="https://lethain.com/extract-the-kernel/">Extract the Kernel</a>, which argues that executives are generally directionally correct but specifically wrong, and it&rsquo;s your job to understand the overarching direction without getting distracted by the narrow errors in their idea.</p> <p>Some executives are skeptical of this idea because they don&rsquo;t like the implication that they&rsquo;re usually wrong, but they weren&rsquo;t the audience that was offended. But the folks who got particularly upset were non-executives who felt it was unfair for them to have to debug the executives&rsquo; communication. The fair solution, some argued, is for the executives to become better communicators rather than requiring others around them to become better listeners. 
For what it&rsquo;s worth, I agree with them that it would be more fair, but I&rsquo;ve always found it much more productive to focus on how I can improve my own approach than to document ways that others could improve theirs.</p> <p>Recently I&rsquo;ve been repeating a similar idea to &ldquo;Extract the Kernel,&rdquo; but rather than focusing on <em>understanding</em> executives, it&rsquo;s focused on leading change when working with executives. Often you&rsquo;ll hear an executive say something that you disagree with, and in that moment you have to determine whether you can directly steer the executive towards a better decision. If you can, then do that! If you can&rsquo;t, then focus on translating the executive&rsquo;s idea into something useful!</p> <p>Many high-agency managers try to prevent executives from doing silly things, but it&rsquo;s almost always more effective to translate their energy for a silly thing into energy for a useful thing. It also leaves the executive feeling supported by your work rather than viewing you as an obstacle to their progress.</p> <p>Some examples:</p> <ul> <li>An executive is obsessed with adopting LLMs in your product. Translate that into a useful application rather than fighting over whether LLMs are useful in most products.</li> <li>An executive wants to expand into a new business unit without any additional hiring. Use a hackathon to build a sample concept that you can use to validate the business with users.</li> <li>An executive wants to do a giant rewrite of your product. Translate that into a narrow test rewrite of a small feature to support quickly refining the approach.</li> </ul> <p>In each of these cases, the executive&rsquo;s idea is <em>more likely</em> to succeed because of your actions. The idea is <em>also</em> more likely to fail quickly. In both cases, you&rsquo;ll have worked to support the executive and the company in a thoughtful, effective way. (These are all good examples of making <a href="https://lethain.com/multi-dimensional-tradeoffs/">effective multi-dimensional tradeoffs</a>!)</p>Video of Developing Eng Leadership Styles.https://lethain.com/video-developing-leadership-styles/Sat, 07 Sep 2024 07:00:00 -0700https://lethain.com/video-developing-leadership-styles/<p>The last chapter I wrote for <em>The Engineering Executive&rsquo;s Primer</em> was <a href="https://lethain.com/developing-leadership-styles/">this one about developing engineering leadership styles</a>. It&rsquo;s an interesting chapter to me personally, precisely because it&rsquo;s not something I would have agreed with or written five years ago.</p> <p>This past Friday I gave a conference talk on this topic at <a href="https://leaddev.com/leadingeng-new-york/program">LeadingEng New York, 2024</a>. If you&rsquo;re interested, you can watch a recording of an earlier practice session from a few days before the talk, and can <a href="https://docs.google.com/presentation/d/1mwn4VF36oxeEQxoa9O7cWi7nYUsGWCylV3amNYYqVV0/edit#slide=id.g272560649f6_0_60">review the slides</a>.
I think the practice session is quite a bit worse than the final talk, which I believe is restricted to LeadingEng attendees.</p> <iframe width="560" height="315" src="https://www.youtube.com/embed/_WmcyiWM57A?si=y18mkiYwfIooSz6d" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe> <p>You can also <a href="https://www.youtube.com/watch?v=_WmcyiWM57A">watch the recording directly on YouTube</a>. The content is similar to the chapter, but takes a bit of a different angle in exploring it.</p> <p><img src="https://lethain.com/static/blog/2024/leadeng-2024.jpg" alt="Picture of me presenting on stage at LeadingEng 2024 in NYC."></p> <p>(Photograph from my former colleague <a href="https://x.com/shidoshi/status/1832049324034830636">@shidoshi</a> on X.)</p> <p>Regarding the conference itself, this was my first LeadDev conference, and it was extremely well put together. I personally enjoyed the mix of 20-minute talks (one of which I gave) and 60-minute deliberate working sessions (e.g. <a href="https://www.linkedin.com/in/ashley-miller-1905757/">Ashley Miller</a> guided us through an exercise on honing commercial awareness, and <a href="https://www.linkedin.com/in/catherine-miller-0177142/">Cat Miller</a> steered a session on refining technical strategy). The working sessions created a nice, ongoing peer-learning experience that reminded me of a transient version of the <a href="https://lethain.com/crowdsourcing-cto-vpe-learning-circles/">CTO circle</a> that I run with Uma Chingunde. The working sessions all had written prompts in high-quality printed notebooks, which I thought was quite nice as well, and makes me want to run my next offsite with printed prompts.</p>