Irrational Exuberancehttps://lethain.com/Recent content on Irrational ExuberanceHugo -- gohugo.ioen-usWill LarsonThu, 13 Feb 2025 04:00:00 -0700Exploring for strategy.https://lethain.com/exploring-for-strategy/Thu, 13 Feb 2025 04:00:00 -0700https://lethain.com/exploring-for-strategy/<p>A surprising number of strategies are doomed from inception because their authors get attached to one particular approach without considering alternatives that would work better for their current circumstances. This happens when engineers want to pick tools solely because they are trending, and when executives insist on adopting the tech stack from their prior organization where they felt comfortable.</p> <p>Exploration is the antidote to early anchoring, forcing you to consider the problem widely <em>before</em> evaluating any of the paths forward. Exploration is about updating your priors before assuming the industry hasn&rsquo;t evolved since you last worked on a given problem. Exploration is continuing to believe that things can get better when you&rsquo;re not watching.</p> <p>This chapter covers:</p> <ul> <li>The goals of the exploration phase of strategy creation</li> <li>When to explore (always first!) 
and when it makes sense to stop exploring</li> <li>How to explore a topic, including discussion of the most common mechanisms: mining for internal precedent, reading industry papers and books, and leveraging your external network</li> <li>Why avoiding judgment is an essential part of exploration</li> </ul> <p>By the end of this chapter, you&rsquo;ll be able to conduct an exploration for the current or next strategy that you work on.</p> <h2 id="what-is-exploration">What is exploration?</h2> <p>One of the frequent senior leadership anti-patterns I&rsquo;ve encountered in my career is <a href="https://lethain.com/grand-migration/">The Grand Migration</a>, where a new leader declares that a massive migration to a new technology stack&ndash;typically the stack used by their former employer&ndash;will solve every pressing problem. What&rsquo;s distinguishing about the Grand Migration is not the initially bad selection, but the single-minded ferocity with which the senior leader pushes for their approach, even when it becomes abundantly clear to others that it doesn&rsquo;t solve the problem at hand.</p> <p>These senior leaders are very intelligent, but have allowed themselves to be framed in by their initial thinking from prior experiences. Accepting those early thoughts as the foundation of their strategy, they build the entire strategy on top of those ideas, and eventually there is so much weight standing on those early assumptions that it becomes impossible to acknowledge the errors.</p> <p>Exploration is the deliberate practice of searching through a strategy&rsquo;s problem and solution spaces before allowing yourself to commit to a given approach. It&rsquo;s understanding how others have approached the same problem recently and in the past. 
It&rsquo;s doing this both in trendy companies you admire, and in practical companies that actually resemble yours.</p> <p>Most exploration will be external to your team, but depending on your company, much of your exploration might be internal to the company. If you&rsquo;re in a massive engineering organization of 100,000, there are likely existing internal solutions to your problem that you&rsquo;ve never heard of. Conversely, if you&rsquo;re in an organization of 50 engineers, it&rsquo;s likely that much of your exploration will be external.</p> <h2 id="when-to-explore">When to explore</h2> <p>Exploration is the first step of good strategy work. Even when you want to skip it, you will always regret skipping it, because you&rsquo;ll inadvertently frame yourself into whatever approach you focus on first. Especially when it comes to problems that you&rsquo;ve solved previously, exploration is the only thing preventing you from over-indexing on your prior experiences.</p> <p>Try to continue exploration until you know how three similar teams within your company and three similar companies have recently solved the same problem. Further, make sure you are able to explain the thinking behind those decisions. At that point, you should be ready to stop exploring and move on to the <a href="https://lethain.com/diagnosis-for-strategy/">diagnosis step</a> of strategy creation.</p> <p>Exploration should always come with a minimum and maximum timeframe: less than a few hours is very suspicious, and more than a week is generally questionable as well.</p> <h2 id="how-to-explore">How to explore</h2> <p>While the details of each exploration will differ a bit, the overarching approach tends to be pretty similar across strategies. 
After I open up the draft strategy document I&rsquo;m working on, my general approach to exploration is:</p> <ol> <li> <p>Start throwing in every resource I can think of related to that problem.</p> <p>For example, in the <a href="https://lethain.com/uber-service-migration-strategy/">Uber service provisioning strategy</a>, I started by collecting recent papers on Mesos, Kubernetes, and Aurora to understand the state of the industry on orchestration.</p> </li> <li> <p>Do some web searching, foundational model prompting, and checking with a few current and prior colleagues about what topics and resources I might be missing.</p> <p>For example, for the <a href="https://lethain.com/calm-product-eng-company/">Calm engineering strategy</a>, I focused on talking with industry peers on tools they&rsquo;d used to focus a team with diffuse goals.</p> </li> <li> <p>Summarize the list of resources I&rsquo;ve gathered, organizing them by which I want to explore, and which I won&rsquo;t spend time on but are worth mentioning.</p> <p>For example, the <a href="https://lethain.com/llm-adoption-strategy/">Large Language Model adoption strategy</a>&rsquo;s exploration section documents the variety of resources the team explored before completing it.</p> </li> <li> <p>Work through the list one by one, continuing to collect notes in the strategy document. When you&rsquo;re done, synthesize those into a concise, readable summary of what you&rsquo;ve learned.</p> <p>For example, the <a href="https://lethain.com/decompose-monolith-strategy/">monolith decomposition strategy</a> synthesizes the exploration of a broad topic into four paragraphs, with links out to references.</p> </li> <li> <p>Stop once I generally understand how a handful of similar internal and external teams have recently approached this problem.</p> </li> </ol> <p>Of all the steps in strategy creation, exploration is inherently open-ended, and you may find a different approach works better for you. 
If you&rsquo;re not sure what to do, try following the above steps closely. If you have a different approach that you&rsquo;re confident in&ndash;as long as it&rsquo;s not skipping exploration!&ndash;then go ahead and try that instead.</p> <div class="bg-light-gray br4 ph3 pv1"> <p>While not discussed in this chapter, you can also use some techniques like <a href="wardley-mapping/">Wardley mapping</a>, covered in the <a href="https://lethain.com/refining-eng-strategy/">Refinement chapter</a>, to support your exploration phase. Wardley mapping is a strategy tool designed within a different strategy tradition, and consequently categorizing it as either solely an exploration tool or a refinement tool ignores some of its potential uses.</p> <p>There&rsquo;s no perfect way to do strategy: take what works for you and use it.</p> </div> <h2 id="mine-internal-precedent">Mine internal precedent</h2> <p>One of the most powerful forms of strategy is simply documenting how similar decisions have been made internally: often this is enough to steer how similar future decisions are made within your organization. This approach, documented in <em>Staff Engineer</em>&rsquo;s <a href="https://staffeng.com/guides/engineering-strategy/">Write five, then synthesize</a>, is also the most valuable step of exploration for those working in established companies.</p> <p>If you are a tenured engineer within your organization, then it&rsquo;s somewhat safe to assume that you are aware of the typical internal approaches. Even then, it&rsquo;s worth poking around to see if there are any related skunkworks projects happening internally. This is doubly true if you&rsquo;ve joined the organization recently, or are distant from the codebase itself. 
In that case, it&rsquo;s almost always worth poking around to see what already exists.</p> <p>Sometimes the internal approach isn&rsquo;t ideal, but it&rsquo;s still superior because it&rsquo;s already been implemented and there&rsquo;s someone else maintaining it. In the long-run, your strategy can ride along as someone else addresses the issues that aren&rsquo;t perfect fits.</p> <h2 id="using-your-network">Using your network</h2> <p><a href="https://lethain.com/user-data-access-strategy/">How should we control access to user data</a>&rsquo;s exploration section begins with:</p> <blockquote> <p>Our experience is that best practices around managing internal access to user data are widely available through our networks, and otherwise hard to find. The exact rationale for this is hard to determine,</p></blockquote> <p>While there are many topics with significant public writing out there, my experience is that there are many topics where there&rsquo;s very little you can learn without talking directly to practitioners. This is especially true for security, compliance, operating at truly large scale, and competitive processes like optimizing advertising spend.</p> <p>Further, it&rsquo;s surprisingly common to find that how people publicly describe solving a problem and how they actually approach the problem are largely divorced.</p> <p>This is why having a broad personal network is exceptionally powerful, and makes it possible to quickly understand the breadth of possible solutions. It also provides access to the practical downsides to various approaches, which are often omitted when talking to public proponents.</p> <p>In a recent strategy session, a proposal came up that seemed off to me, and I was able to text&ndash;and get answers to those texts&ndash;industry peers before the meeting ended, which invalidated the room&rsquo;s assumptions about what was and was not possible. 
A disagreement that might have taken weeks to resolve was instead resolved in a few minutes, and we were able to figure out next steps in that meeting rather than waiting a week for the next meeting when we&rsquo;d realized our mistake.</p> <p>Of course, it&rsquo;s <em>also</em> important to hold information from your network with skepticism. I&rsquo;ve certainly had my network be wrong, and your network never knows how your current circumstances differ from theirs, so blindly accepting guidance from your network is never the right decision either.</p> <div class="bg-light-gray br4 ph3 pv1"> <p>If you&rsquo;re looking for more detailed coverage of building your network, this topic has also come up in <em>Staff Engineer</em>&rsquo;s chapter on <a href="https://staffeng.com/guides/network-of-peers/">Build a network of peers</a>, and <em>The Engineering Executive&rsquo;s Primer</em>&rsquo;s chapter on <a href="https://lethain.com/building-exec-network/">Building your executive network</a>. It feels silly to cover the same topic a third time, but it&rsquo;s a foundational technique for effective decision making.</p> </div> <h2 id="read-widely-read-narrowly">Read widely; read narrowly</h2> <p>Reading has always been an important part of my strategy work. There are two distinct motions to this approach: read widely on an ongoing basis to broaden your thinking, and read narrowly on the specific topic you&rsquo;re working on.</p> <p>Starting with reading widely, I make an effort each year to read ten to twenty industry-relevant works. These are not necessarily new releases, but are new releases <em>for me</em>. Importantly, I try to read things that I don&rsquo;t know much about or that I initially disagree with. 
Some of my recent reads were <em><a href="https://www.amazon.com/Chip-War-Worlds-Critical-Technology/dp/1982172002">Chip War</a></em>, <em><a href="https://www.amazon.com/Building-Green-Software-Sustainable-Development/dp/1098150627">Building Green Software</a></em>, <em><a href="https://learning.oreilly.com/library/view/tidy-first/9781098151232/">Tidy First?</a></em>, and <em><a href="https://www.amazon.com/How-Big-Things-Get-Done-ebook/dp/B0B3HS4C98/">How Big Things Get Done</a></em>. From each of these books, I learned something, and over time they&rsquo;ve built a series of bookmarks in my head about ideas that might apply to new problems.</p> <p>On the other end of things is reading narrowly. When I recently started working on an AI agents strategy, the first thing I did was read through Chip Huyen&rsquo;s <em><a href="https://www.amazon.com/AI-Engineering-Building-Applications-Foundation/dp/1098166302">AI Engineering</a></em>, which was an exceptionally helpful survey. Similarly, when we started thinking about <a href="https://lethain.com/uber-service-migration-strategy/">Uber&rsquo;s service migration</a>, we read a number of industry papers, including <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf">Large-scale cluster management at Google with Borg</a> and <a href="https://people.eecs.berkeley.edu/~alig/papers/mesos.pdf">Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center</a>.</p> <p>None of these readings had all the answers to the problems I was working on, but they did an excellent job at helping me understand the range of options, as well as identifying other references to consult in my exploration.</p> <p>I&rsquo;ll mention two nuances that will help a lot here. First, I highly encourage getting comfortable with skimming books. 
Even tightly edited books will have a lot of content that isn&rsquo;t particularly relevant to your current goals, and you should skip that content liberally. Second, what you read doesn&rsquo;t have to be books. It can be blog posts, essays, or interview transcripts, and it can certainly still be books.</p> <div class="bg-light-gray br4 ph3 pv1"> <p>In this context, &ldquo;reading&rdquo; doesn&rsquo;t even have to actually be reading. There are conference talks that contain just as much as a blog post, and conferences that cover as much breadth as a book. There are also conference talks without a written equivalent, such as Dan Na&rsquo;s excellent <a href="https://blog.danielna.com/talks/pushing-through-friction">Pushing Through Friction</a>.</p> </div> <h2 id="each-job-is-an-education">Each job is an education</h2> <p>Experience is frequently disregarded in the technology industry, and there are ways to misuse experience by copying too liberally the solutions that worked in different circumstances, but the most effective, and the slowest, mechanism for exploring is continuing to work in the details of meaningful problems.</p> <p>You probably won&rsquo;t <a href="https://lethain.com/forty-year-career/">choose every job to optimize for learning</a>, but the way accumulated experience allows you to instantly explore more complex problems over time&ndash;recognizing that a bit of your data will have become stale each time&ndash;is uniquely valuable.</p> <h2 id="save-judgment-for-later">Save judgment for later</h2> <p>As I&rsquo;ve mentioned several times, the point of exploration is to go broad with the goal of understanding approaches you might not have considered, and invalidating things you initially think are true. 
Both of those things are only possible if you save judgment for later: if you&rsquo;re passing judgment about whether approaches are &ldquo;good&rdquo; or &ldquo;bad&rdquo;, then your exploration is probably going astray.</p> <p>As a soft rule, I&rsquo;d argue that if no one involved in a strategy has changed their mind about something they believed when you started the exploration step, then you&rsquo;re not done exploring. This is <em>especially</em> true when it comes to strategy work by senior leaders. Their beliefs are often well-justified by years of experience, but it&rsquo;s unclear which parts of their experience have become stale over time.</p> <h2 id="summary">Summary</h2> <p>At this point, I hope you feel comfortable exploring as the first step of your strategy work, and understand the likely consequences of skipping this step. It&rsquo;s not an overstatement to say that every one of the worst strategic failures I&rsquo;ve encountered would have been prevented by its primary author taking a few days to explore the space before anchoring on a particular approach.</p> <p>A few days of feeling slow are always worth avoiding years of misguided efforts.</p>How should we control access to user data?https://lethain.com/user-data-access-strategy/Fri, 07 Feb 2025 06:00:00 -0700https://lethain.com/user-data-access-strategy/<p>At some point in a startup&rsquo;s lifecycle, they decide that they need to be ready to go public in 18 months, and a flurry of IPO-readiness activity kicks off. This strategy focuses on a company working on IPO readiness, which has identified a gap in their internal controls for managing access to their users&rsquo; data. 
It&rsquo;s a company that <em>wants</em> to meaningfully improve their security posture around user data access, but which has had a number of failed security initiatives over the years.</p> <p>Most of those initiatives have failed because they significantly degraded internal workflows for teams like customer support, such that the initial progress was reverted and subverted over time, to little long-term effect. This strategy represents the Chief Information Security Officer&rsquo;s (CISO) attempt to acknowledge and overcome those historical challenges while meeting their IPO readiness obligations, and&ndash;most importantly&ndash;doing right by their users.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> </div> <h2 id="reading-this-document">Reading this document</h2> <p>To apply this strategy, start at the top with <em>Policy</em>. To understand the thinking behind this strategy, read sections in reverse order, starting with <em>Explore</em>, then <em>Diagnose</em> and so on. 
Relative to the default structure, this document has been refactored in two ways to improve readability: first, <em>Operation</em> has been folded into <em>Policy</em>; second, <em>Refine</em> has been embedded in <em>Diagnose</em>.</p> <p>More detail on this structure in <a href="https://lethain.com/readable-engineering-strategy-documents">Making a readable Engineering Strategy document</a>.</p> <h2 id="policy--operations">Policy &amp; Operations</h2> <p>Our new policies, and the mechanisms to operate them are:</p> <ul> <li> <p><strong>Controls for accessing user data must be significantly stronger prior to our IPO.</strong> Senior leadership, legal, compliance and security have decided that we are not comfortable accepting the status quo of our user data access controls as a public company, and must meaningfully improve the quality of resource-level access controls as part of our pre-IPO readiness efforts.</p> <p>Our Security team is accountable for the exact mechanisms and approach to addressing this risk.</p> </li> <li> <p><strong>We will continue to prioritize a hybrid solution to resource-access controls.</strong> This has been our approach thus far, and the fastest available option.</p> </li> <li> <p><strong>Directly expose the log of our resource-level accesses to our users.</strong> We will build towards a user-accessible log of all company accesses of user data, and ensure we are comfortable explaining each and every access. In addition, it means that each rationale for access must be comprehensible and reasonable from a user perspective.</p> <p>This is important because it aligns our approach with our users&rsquo; perspectives. 
They will be able to evaluate how we access their data, and make decisions about continuing to use our product based on whether they agree with our use.</p> </li> <li> <p><strong>Good security discussions don&rsquo;t frame decisions as a compromise between security and usability.</strong> We will pursue <a href="https://lethain.com/multi-dimensional-tradeoffs/">multi-dimensional tradeoffs</a> to simultaneously improve security and efficiency. Whenever we frame a discussion on trading off between security and utility, it&rsquo;s a sign that we are having the wrong discussion, and that we should rethink our approach.</p> <p>We will prioritize mechanisms that can both automatically authorize <em>and</em> automatically document the rationale for accesses to customer data. The most obvious example of this is automatically granting access to a customer support agent for users who have an open support ticket assigned to that agent. (And removing that access when that ticket is reassigned or resolved.)</p> </li> <li> <p><strong>Measure progress on percentage of customer data access requests justified by a user-comprehensible, automated rationale.</strong> This will anchor our approach on simultaneously improving the security of user data and the usability of our colleagues&rsquo; internal tools. If we only expand requirements for accessing customer data, we won&rsquo;t view this as progress because it&rsquo;s not automated (and consequently is likely to encourage workarounds as teams try to solve problems quickly). 
Similarly, if we only improve usability, charts won&rsquo;t represent this as progress, because we won&rsquo;t have increased the number of supported requests.</p> <p>As part of this effort, we will create a private channel where the security and compliance team has visibility into all manual rationales for user-data access, and will directly message the manager of any individual who relies on a manual justification for accessing user data.</p> </li> <li> <p><strong>Expire unused roles to move towards principle of least privilege.</strong> Today we have a number of roles granted in our role-based access control (RBAC) system to users who do not use the granted permissions. To address that issue, we will automatically remove roles from colleagues after 90 days of not using the role&rsquo;s permissions.</p> <p>Engineers in an active on-call rotation are the exception to this automated permission pruning.</p> </li> <li> <p><strong>Weekly reviews until we see progress; monthly access reviews in perpetuity.</strong> Starting now, there will be a weekly sync between the security engineering team, teams working on customer data access initiatives, and the CISO. This meeting will focus on rapid iteration and problem solving.</p> <p>This is explicitly a forum for ongoing <a href="https://lethain.com/testing-strategy-iterative-refinement/">strategy testing</a>, with CISO serving as the meeting&rsquo;s sponsor, and their Principal Security Engineer serving as the meeting&rsquo;s guide. It will continue until we have clarity on the path to 100% coverage of user-comprehensible, automated rationales for access to customer data.</p> <p>Separately, we are also starting a monthly review of sampled accesses to customer data to ensure the proper usage and function of the rationale-creation mechanisms we build. 
This meeting&rsquo;s goal is to review access rationales for quality and appropriateness, both by reviewing sampled rationales in the short-term, and identifying more automated mechanisms for identifying high-risk accesses to review in the future.</p> </li> <li> <p><strong>Exceptions must be granted in writing by CISO.</strong> While our overarching Engineering Strategy states that we follow an advisory architecture process as described in <em><a href="https://www.amazon.com/Facilitating-Software-Architecture-Empowering-Architectural-ebook/dp/B0DMHGWCPN/">Facilitating Software Architecture</a></em>, the customer data access policy is an exception and must be explicitly approved, with documentation, by the CISO. Start that process in the <code>#ciso</code> channel.</p> </li> </ul> <h2 id="diagnose">Diagnose</h2> <ul> <li> <p>We have a strong baseline of role-based access controls (RBAC) and audit logging. However, we have limited mechanisms for ensuring assigned roles follow the <a href="https://en.wikipedia.org/wiki/Principle_of_least_privilege">principle of least privilege</a>. This is particularly true in cases where individuals change teams or roles over the course of their tenure at the company: some individuals have collected numerous unused roles over five-plus years at the company.</p> <p>Similarly, our audit logs are durable and pervasive, but we have limited proactive mechanisms for identifying anomalous usage. Instead they are typically used to understand what occurred after an incident is identified by other mechanisms.</p> </li> <li> <p>For resource-level access controls, we rely on a hybrid approach between a 3rd-party platform for incoming user requests, and approval mechanisms within our own product. 
Providing a rationale for access across these two systems requires manual work, and those rationales are later manually reviewed for appropriateness in a batch fashion.</p> <p>There are two major ongoing problems with our current approach to resource-level access controls. First, the teams making requests view them as a burdensome obligation without much benefit to them or on behalf of the user. Second, because the rationale review steps are manual, there is no verifiable evidence of the quality of the review.</p> </li> <li> <p>We&rsquo;ve found no evidence of misuse of user data. When colleagues do access user data, we have uniformly and consistently found that there is a clear, and reasonable rationale for that access. For example, a ticket in the user support system where the user has raised an issue.</p> <p>However, the quality of our documented rationales is consistently low because it depends on busy people manually copying over significant information many times a day. Because the rationales are of low quality, the verification of these rationales is somewhat arbitrary. From a literal compliance perspective, we do provide rationales and auditing of these rationales, but it&rsquo;s unclear if the majority of these audits increase the security of our users&rsquo; data.</p> </li> <li> <p>Historically, we&rsquo;ve made significant security investments that caused temporary spikes in our security posture. However, looking at those initiatives a year later, in many cases we see a pattern of increased scrutiny, followed by a gradual repeal or avoidance of the new mechanisms.</p> <p>We have found that most of them involved increased friction for essential work performed by other internal teams. In the natural order of performing work, those teams would subtly subvert the improvements because it interfered with their immediate goals (e.g. 
supporting customer requests).</p> </li> <li> <p>As such, we have high conviction from our track record that our historical approach can create optical wins internally. We have limited conviction that it can create long-term improvements outside of significant, unlikely internal changes (e.g. colleagues are markedly less busy a year from now than they are today). It seems likely we need a new approach to meaningfully shift our stance on these kinds of problems.</p> </li> </ul> <h2 id="explore">Explore</h2> <p>Our experience is that best practices around managing internal access to user data are <a href="https://lethain.com/exploring-for-strategy/">widely available through our networks</a>, and otherwise hard to find. The exact rationale for this is hard to determine, but it seems possible that it&rsquo;s a topic that folks are generally uncomfortable discussing in public on account of potential future liability and compliance issues.</p> <p>In our exploration, we found two standardized dimensions (role-based access controls, audit logs), and one highly divergent dimension (resource-specific access controls):</p> <ul> <li> <p><strong>Role-based access controls</strong> (RBAC) are a highly standardized approach at this point. The core premise is that users are mapped to one or more roles, and each role is granted a certain set of permissions. For example, a role representing the customer support agent might be granted permission to deactivate an account, whereas a role representing the sales engineer might be able to configure a new account.</p> </li> <li> <p><strong>Audit logs</strong> are similarly standardized. All access and mutation of resources should be tied in a durable log to the human who performed the action. 
These logs should be accumulated in a centralized, queryable solution.</p> <p>One of the core challenges is determining how to utilize these logs proactively to detect issues rather than reactively when an issue has already been flagged.</p> </li> <li> <p><strong>Resource-level access controls</strong> are significantly less standardized than RBAC or audit logs. We found three distinct patterns adopted by companies, with little consistency across companies on which is adopted.</p> </li> </ul> <p>Those three patterns for resource-level access control were:</p> <ol> <li> <p><strong>3rd-party enrichment</strong> where access to resources is managed in a 3rd-party system such as Zendesk. This requires enriching objects within those systems with data and metadata from the product(s) where those objects live. It also requires implementing actions on the platform, such as archiving or configuration, allowing them to live entirely in that platform&rsquo;s permission structure.</p> <p>The downside of this approach is tight coupling with the platform vendor, any limitations inherent to that platform, and the overhead of maintaining engineering teams familiar with both your internal technology stack and the platform vendor&rsquo;s technology stack.</p> </li> <li> <p><strong>1st-party tool implementation</strong> where all activity, including creation and management of user issues, is managed within the core product itself. This pattern is most common in earlier stage companies or companies whose customer support leadership &ldquo;grew up&rdquo; within the organization without much exposure to the approach taken by peer companies.</p> <p>The advantage of this approach is that there is a single, tightly integrated and infinitely extensible platform for managing interactions. 
The downside is that you have to build and maintain all of that work internally rather than pushing it to a vendor that ought to be able to invest more heavily into their tooling.</p> </li> <li> <p><strong>Hybrid solutions</strong> where a 3rd-party platform is used for most actions, and is further used to permit resource-level access within the 1st-party system. For example, you might be able to access a user&rsquo;s data only while there is an open ticket created by that user, and assigned to you, in the 3rd-party platform.</p> <p>The advantage of this approach is that it allows supporting complex workflows that don&rsquo;t fit within the platform&rsquo;s limitations, and allows you to avoid complex coupling between your product and the vendor platform.</p> </li> </ol> <p>Generally, our experience is that all companies implement RBAC, audit logs, and one of the resource-level access control mechanisms. Most companies pursue either 3rd-party enrichment with a sizable, long-standing team owning the platform implementation, or rely on a hybrid solution where they are able to avoid a long-standing dedicated team by lumping that work into existing teams.</p>Our own agents with their own tools.https://lethain.com/our-own-agents-our-own-tools/Tue, 04 Feb 2025 04:00:00 -0700https://lethain.com/our-own-agents-our-own-tools/<p>Entering 2025, I decided to spend some time exploring the topic of agents. I started reading Anthropic&rsquo;s <a href="https://www.anthropic.com/research/building-effective-agents">Building effective agents</a>, followed by Chip Huyen&rsquo;s <em><a href="https://www.amazon.com/AI-Engineering-Building-Applications-Foundation/dp/1098166302">AI Engineering</a></em>. I kicked off a major workstream at work on using agents, and I also decided to do a personal experiment of sorts. 
This is a general commentary on building that project.</p> <p>What I wanted to build was a simple chat interface where I could write prompts, select models, and have the model use tools as appropriate. My side goal was to build this using Cursor and generally avoid writing code directly as much as possible, but I found that generally slower than writing code in emacs while relying on <code>4o-mini</code> to provide working examples to pull from.</p> <p>Similarly, while I initially envisioned building this in fullstack TypeScript via Cursor, I ultimately bailed into a stack that I&rsquo;m more comfortable with, and ended up using Python3, FastAPI, PostgreSQL, and SQLAlchemy with the async psycopg3 driver. It&rsquo;s been a&hellip; while&hellip; since I started a brand new Python project, and I used this project as an opportunity to get comfortable with Python3&rsquo;s async/await mechanisms and Python3&rsquo;s typing with <a href="https://mypy.readthedocs.io/">mypy</a>. Finally, I also wanted to experiment with <a href="https://tailwindcss.com/">Tailwind</a>, and ended up using <a href="https://tailwindui.com/components">TailwindUI&rsquo;s components</a> to build the site.</p> <p>The working version supports everything I wanted: creating chats with models, and allowing those models to use function calling to use tools that I provide. The models are allowed to call any number of tools in pursuit of the problem they are solving. The tool usage is the most interesting part here for sure. The simplest tool I created was a <code>get_current_weather</code> tool that provided a fake temperature for your location. 
This allowed me to ask questions like &ldquo;What should I wear tomorrow in San Francisco, CA?&rdquo; and get a useful response.</p> <p><img src="https://lethain.com/static/blog/2025/agent-temp.png" alt="Example of an agent responding to query about weather."></p> <p>The code to add this function to my project was pretty straightforward, just three lines of Python and 25 lines of metadata to pass to the OpenAI API.</p>
<pre class="prettyprint">import random

# FUNCTION_REGISTRY and TOOL_USAGE_REGISTRY are assumed to be dicts defined elsewhere in the project.
def tool_get_current_weather(location: str|None=None, format: str|None=None) -> str:
    "Simple proof of concept tool."
    # Fake a temperature in a plausible range for the requested unit.
    temp = random.randint(40, 90) if format == 'fahrenheit' else random.randint(10, 25)
    return f"It's going to be {temp} degrees {format} tomorrow."

FUNCTION_REGISTRY['get_current_weather'] = tool_get_current_weather
TOOL_USAGE_REGISTRY['get_current_weather'] = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "format": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use. Infer this from the user's location.",
                },
            },
            "required": ["location", "format"],
        },
    }
}</pre>
<p>After getting this tool, the next tool I added was a simple URL retriever tool, which allowed the agent to grab a URL and use the content of that URL in its prompt.</p> <p><img src="https://lethain.com/static/blog/2025/agent-url.png" alt="An agent using a tool to retrieve the contents of a URL."></p> <p>The implementation for this tool was similarly quite simple.</p>
<pre class="prettyprint">import requests
from bs4 import BeautifulSoup
from markdownify import markdownify

def tool_get_url(url: str|None=None) -> str:
    if url is None:
        return ''
    url = str(url)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Prefer the main content area, falling back to the whole body.
    content = soup.find('main') or soup.find('article') or soup.body
    if not content:
        # Nothing usable was parsed, so return the decoded response text.
        return response.text
    markdown = markdownify(str(content), heading_style="ATX").strip()
    return str(markdown)

FUNCTION_REGISTRY['get_url'] = tool_get_url
TOOL_USAGE_REGISTRY['get_url'] = {
    "type": "function",
    "function": {
        "name": "get_url",
        "description": "Retrieve the contents of a website via its URL.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The complete URL, including protocol to retrieve. For example: \"https://lethain.com\"",
                }
            },
            "required": ["url"],
        },
    }
}</pre>
<p>What&rsquo;s pretty amazing is how much power you can add to your agent by adding such a trivial tool as retrieving a URL. 
You can similarly imagine adding tools for retrieving and commenting on GitHub pull requests and so on, which could allow a very simple agent tool like this to become quite useful.</p> <p>Working on this project gave me a moderately compelling view of a near-term future where most engineers have a simple application like this running that they can pipe events into from various systems (email, text, GitHub pull requests, calendars, etc), create triggers that map events to templates that feed into prompts, and execute those prompts with tool-aware agents.</p> <p>Combine that with the ability for other agents to register themselves with you and expose the tools that they have access to (e.g. schedule an event with the tool&rsquo;s owner), and a bunch of interesting things become very accessible with a very modest amount of effort:</p> <ul> <li>You could schedule events between two busy people&rsquo;s calendars, as if both of them had an assistant managing their calendar</li> <li>Reply to your own pull requests with new blog posts, providing feedback on typos and grammatical issues</li> <li>Crawl websites you care about and identify posts you might be interested in</li> <li>Ask the model to generate a system model using <a href="https://github.com/lethain/systems">lethain:systems</a>, run that model, then chart the responses</li> <li>Add a &ldquo;planning tool&rdquo; which allows the model to generate a plan to guide subsequent steps in a complex task. (e.g. 
getting my calendar, getting a friend&rsquo;s calendar, suggesting a time we could meet)</li> </ul> <p>None of these are exactly lifesaving, but each is somewhat useful, and I imagine there are many more fairly obvious ideas that become easy once you have the necessary scaffolding in place.</p> <p>Altogether, I think that I am convinced at this point that agents, using current foundational models, are going to create a number of very interesting experiences that improve our day to day lives in small ways that are, in aggregate, pretty transformational. I&rsquo;m less convinced that this is the way <em>all software</em> should work going forward though, but more thoughts on that over time. (A bunch of fun experiments happening at work, but early days on those.)</p>Is engineering strategy useful?https://lethain.com/is-engineering-strategy-useful/Thu, 30 Jan 2025 04:00:00 -0700https://lethain.com/is-engineering-strategy-useful/<p>While I frequently hear engineers bemoan a missing strategy, they rarely complete the thought by articulating why the missing strategy matters. Instead, it serves as more of a truism: the economy used to be better, children used to respect their parents, and engineering organizations used to have an engineering strategy.</p> <p>This chapter starts by exploring something I believe quite strongly: there&rsquo;s <em>always</em> an engineering strategy, even if there&rsquo;s nothing written down. 
From there, we&rsquo;ll discuss why strategy, especially written strategy, is such a valuable opportunity for organizations that take it seriously.</p> <p>We&rsquo;ll dig into:</p> <ul> <li>Why there&rsquo;s always a strategy, even when people say there isn&rsquo;t</li> <li>How strategies have been impactful across my career</li> <li>How inappropriate strategies create significant organizational pain without much compensating impact</li> <li>How written strategy drives organizational learning</li> <li>The costs of not writing strategy down</li> <li>How strategy supports personal learning and development, even in cases where you&rsquo;re not empowered to &ldquo;do strategy&rdquo; yourself</li> </ul> <p>By this chapter&rsquo;s end, hopefully you will agree with me that strategy is an undertaking worth investing your&ndash;and your organization&rsquo;s&ndash;time in.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in</em> <em><a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> </div> <h2 id="theres-always-a-strategy">There&rsquo;s always a strategy</h2> <p>I&rsquo;ve never worked somewhere where people didn&rsquo;t claim there was no strategy. In many of those companies, they&rsquo;d say there was no engineering strategy. Once I became an executive and was able to document and distribute an engineering strategy, accusations of missing strategy didn&rsquo;t go away, they just shifted to focus on a missing product or company strategy.</p> <p>This even happened at companies that definitively had engineering strategies, like Stripe in 2016, which had numerous pillars to a clear engineering strategy such as:</p> <ul> <li>Maintain backwards API compatibility, at almost any cost (e.g. 
force an upgrade from TLS 1.2 to TLS 1.3 to retain PCI compliance, but don&rsquo;t force upgrades from the <a href="https://docs.stripe.com/api/charges/create">/v1/charges</a> endpoint to the <a href="https://docs.stripe.com/api/payment_intents">/v1/payment_intents</a> endpoint)</li> <li>Work in Ruby in a monorepo, unless it&rsquo;s the PCI environment, data processing, or data science work</li> <li>Engineers are fully responsible for the usability of their work, even when there are product or engineering managers involved</li> </ul> <p>Working there it was generally clear what the company&rsquo;s engineering strategy was on any given topic. That said, it sometimes required asking around, and over time certain decisions became sufficiently contentious that it became hard to definitively answer what the strategy was. For example, the adoption of Ruby versus Java became contentious enough that I distributed a strategy attempting to mediate the disagreement, <a href="https://lethain.com/magnitudes-of-exploration/">Magnitudes of exploration</a>, although it wasn&rsquo;t a particularly successful effort (for reasons that are obvious in hindsight, particularly the lack of any enforcement mechanism).</p> <p>In the same sense that William Gibson said &ldquo;The future is already here – it’s just not very evenly distributed,&rdquo; there is always a strategy embedded into an organization&rsquo;s decisions, although in many organizations that strategy is only visible to a small group, and may be quickly forgotten.</p> <p>If you ever find yourself thinking that a strategy doesn&rsquo;t exist, I&rsquo;d encourage you to instead ask yourself where the strategy lives if you can&rsquo;t find it. 
Once you do find it, you may also find that the strategy is quite ineffective, but I&rsquo;ve simply never found that it doesn&rsquo;t exist.</p> <h2 id="strategy-_is_-impactful">Strategy <em>is</em> impactful</h2> <p>In <a href="https://lethain.com/calm-product-eng-company/">&ldquo;We are a product engineering company!&rdquo;</a>, we discuss Calm&rsquo;s engineering strategy to address pervasive friction within the engineering team. The core of that strategy is clarifying how Calm makes major technology decisions, along with documenting the motivating goal steering those decisions: maximizing time and energy spent on creating their product.</p> <p>That strategy reduced friction by eliminating the cause of ongoing debate. It was successful in resetting the team&rsquo;s focus. It also caused several engineers to leave the company, because it was incompatible with their priorities. It&rsquo;s easy to view that as a downside, but I don&rsquo;t think it was. A clear, documented strategy made it clear to everyone involved what sort of game we were playing, the rules for that game, and for the first time let them accurately decide if they wanted to be part of that game with the wider team.</p> <p>Creating alignment is one of the ways that strategy makes an impact, but it&rsquo;s certainly not the only way. 
Some of the ways that strategies support the organization are:</p> <ul> <li> <p><strong>Concentrating company investment into a smaller space.</strong></p> <p>For example, <a href="https://lethain.com/decompose-monolith-strategy/">deciding not to decompose a monolith</a> allows you to invest the majority of your tooling efforts on one language, one test suite, and one deployment mechanism.</p> </li> <li> <p><strong>Many interesting properties are only available through universal adoption.</strong></p> <p>For example, moving to an <a href="https://lethain.com/engineering-cost-model/">&ldquo;N-1 policy&rdquo; on backfilled roles</a> is a significant opportunity for managing costs, but only works if consistently adopted. As another example, many strategies for disaster recovery or multi-region are only viable if all infrastructure has a common configuration mechanism.</p> </li> <li> <p><strong>Focusing execution on what truly matters.</strong></p> <p>For example, <a href="https://lethain.com/uber-service-migration-strategy/">Uber&rsquo;s service migration</a> strategy allowed a four engineer team to migrate a thousand services operated by two thousand engineers to a new provisioning and orchestration platform in less than a year. This was an extraordinarily difficult project, and was only possible because of clear thinking.</p> </li> <li> <p><strong>Creating a knowledge repository of how your organization thinks.</strong> Onboarding new hires, particularly senior new hires, is much more effective with documented strategy.</p> <p>For example, most industry professionals today have a strongly held opinion on <a href="https://lethain.com/llm-adoption-strategy/">how to adopt large language models</a>. 
New hires will have a strong opinion as well, but they&rsquo;re unlikely to share your organization&rsquo;s opinion unless there&rsquo;s a clear document they can read to understand it.</p> </li> </ul> <p>There are some things that a strategy, even a cleverly written one, cannot do. However, it&rsquo;s always been my experience that developing a strategy creates progress, even if the progress is understanding the inherent disagreement preventing agreement.</p> <h2 id="inappropriate-strategy-is-especially-impactful">Inappropriate strategy is especially impactful</h2> <p>While good strategy can accomplish many things, it sometimes feels that inappropriate strategy is far more impactful. Of course, impactful in all the wrong ways. <a href="https://lethain.com/digg-v4/">Digg V4</a> remains the worst-considered strategy I&rsquo;ve personally participated in. It was a complete rewrite of the Digg V3.5 codebase from a PHP monolith to a PHP frontend with a backend of a dozen Python services. It also moved the database from sharded MySQL to an early version of Cassandra. Perhaps worst, it replaced the nuanced algorithms developed over a decade with a hack implemented a few days before launch.</p> <p>Although it&rsquo;s likely Digg would have struggled to become profitable due to its reliance on search engine optimization for traffic, and Google&rsquo;s frequently changing search algorithm of that era, the engineering strategy ensured we died fast rather than having an opportunity to dig our way out.</p> <p>Importantly, it&rsquo;s not just Digg. Almost every engineering organization you drill into will have its share of unused platform projects that consumed decades of engineer-years to the detriment of an important opportunity. 
A shocking number of senior leaders join new companies and initiate a <a href="https://lethain.com/grand-migration/">grand migration</a> that attempts to entirely rewrite the architecture, switch programming languages, or otherwise shift their new organization to resemble a prior organization where they understood things better.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><strong>Inappropriate versus bad</strong></p> <p>When I first wrote this section, I just labeled this sort of strategy as &ldquo;bad.&rdquo; The challenge with that term is that the same strategy might well be very effective in a different set of circumstances. For example, if Digg had been a three person company with no revenue, rewriting from scratch could have been the right decision!</p> <p>As a result, I&rsquo;ve tried to prefer the term &ldquo;inappropriate&rdquo; rather than &ldquo;bad&rdquo; to avoid getting caught up on whether a given approach <em>might</em> work in other circumstances. Every approach undoubtedly works in <em>some</em> organization.</p> </div> <h2 id="written-strategy-drives-organizational-learning">Written strategy drives organizational learning</h2> <p>When I joined Carta, I noticed we had an inconsistent approach to a number of important problems. Teams had distinct standard kits for how they approached new projects. Adoption of existing internal platforms was inconsistent, as was decision making around funding new internal platforms. 
There was widespread agreement that we were <a href="https://lethain.com/decompose-monolith-strategy/">decomposing our monolith</a>, but no agreement on how we were doing it.</p> <p>Coming into such a <a href="https://lethain.com/when-write-down-engineering-strategy/">permissive strategy</a> environment, with strong, differing perspectives on the ideal path forward, one of my first projects was writing down an explicit engineering strategy along with our newly formed Navigators team, itself a part of our new engineering strategy.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><strong>Navigators at Carta</strong></p> <p>As discussed in <a href="https://lethain.com/navigators/">Navigators</a>, we developed a program at Carta to have explicitly named individual contributor, technical leaders to represent key parts of the engineering organization. This representative leadership group made it possible to iterate on strategy with a small team of about ten engineers that represented the entire organization, rather than take on the impossible task of negotiating with 400 engineers directly.</p> </div> <p>This written strategy made it possible to explicitly describe the problems we saw, and how we wanted to navigate those problems. Further, it was an artifact that we were able to iterate on in a small group, but then share widely for feedback from teams we might have missed.</p> <p>After initial publishing, we shared it widely and talked about it frequently in engineering all-hands meetings. Then we came back to it each year, or when things stopped making much sense, and revised it. As an example, our initial strategy didn&rsquo;t talk about artificial intelligence at all. A few months later, we extended it to mention a very conservative approach to using Large Language Models. 
Most recently, we&rsquo;ve revised the artificial intelligence portion again, as we dive deeply into <a href="https://huyenchip.com//2025/01/07/agents.html">agentic workflows</a>.</p> <p>A lot of people have disagreed with parts of the strategy, which is great: that&rsquo;s one of the key benefits of a written strategy, it&rsquo;s possible to precisely disagree. From that disagreement, we&rsquo;ve been able to evolve our strategy. Sometimes because there&rsquo;s new information like the current rapid evolution of artificial intelligence practices, and other times because our initial approach could be improved like in how we gated membership of the initial Navigators team.</p> <p>New hires are able to disagree too, and do it from an informed place rather than coming across as attached to their prior company&rsquo;s practices. In particular, they&rsquo;re able to understand the historical thinking that motivated our decisions, even when that context is no longer obvious. At the time we paused decomposition of our monolith, there was significant friction in service provisioning, but that&rsquo;s far less true today, which makes the decision seem a bit arbitrary. Only the written document can consistently communicate that context across a growing, shifting, and changing organization.</p> <p>With oral history, what you believe is highly dependent on who you talk with, which shapes your view of history and the present. With written history, it&rsquo;s far more possible to agree at scale, which is the prerequisite to growing at scale rather than isolating growth to small pockets of senior leadership.</p> <h2 id="the-cost-of-implicit-strategy">The cost of implicit strategy</h2> <p>We just finished talking about written strategy, and this book spends a lot of time on this topic, including <a href="https://lethain.com/readable-engineering-strategy-documents/">a chapter on how to structure strategies to maximize readability</a>. 
It&rsquo;s not <em>just</em> because of the positives created by written strategy, but also because of the damage unwritten strategy creates.</p> <ul> <li> <p><strong>Vulnerable to misinterpretation.</strong></p> <p>Information flow in verbal organizations depends on an individual being in a given room for a decision, and then accurately repeating that information to the others who need it. However, it&rsquo;s common to see those individuals fail to repeat that information elsewhere. Sometimes their interpretation is also faulty to some degree. Both of these create significant problems in operating strategy.</p> </li> </ul> <div class="bg-light-gray br4 ph3 pv1"> <p><strong>Two-headed organizations</strong></p> <p>Some years ago, I started moving towards a model where most engineering organizations I worked with have two leaders: one who&rsquo;s a manager, and another who is a senior engineer. This was partially to ensure engineering context was included in senior decision making, but it was also to reduce communication errors.</p> <p>Errors in point-to-point communication are so prevalent when done one-to-one that the only solution I could find for folks who weren&rsquo;t reading-oriented communicators was ensuring I had communicated strategy (and other updates) to at least two people.</p> </div> <ul> <li> <p><strong>Inconsistency across teams.</strong></p> <p>At one company I worked in, promotions to Staff-plus roles happened at a much higher rate in the infrastructure engineering organization than in the product engineering team. This created a constant drain out of product engineering to work on infrastructure-shaped problems, even if those problems weren&rsquo;t particularly valuable to the business.</p> <p>New leaders had no idea this informal policy existed, and they would routinely run into trouble in <a href="https://lethain.com/perf-management-system/">calibration discussions</a>. 
They <em>also</em> weren&rsquo;t aware they needed to go argue for a better policy. Worse, no one was sure if this was a real policy or not, so it was ultimately random whether this perspective was represented for any given promotion: sometimes good promotions would be blocked, sometimes borderline cases would be approved.</p> </li> <li> <p><strong>Inconsistency over time.</strong></p> <p>Implementing a new policy tends to be a mix of persistent and one-time actions. For example, let&rsquo;s say you wanted to standardize all HTTP operations to use the same library across your codebase. You might add a linter check to reject known alternatives, and you&rsquo;ll probably do a one-time pass across your codebase standardizing on that library.</p> <p>However, two years later there are another three random HTTP libraries in your codebase, creeping into the cracks surrounding your linting. If the policy is written down, and a few people read it, then there&rsquo;s a number of ways this could be nonetheless prevented. If it&rsquo;s not written down, it&rsquo;s much less likely someone will remember, and much more likely they won&rsquo;t remember the rationale well enough to argue about it.</p> </li> <li> <p><strong>Hazard to new leadership.</strong></p> <p>When a new Staff-plus engineer or executive joins a company, it&rsquo;s common to blame them for failing to understand the existing context behind decisions. That&rsquo;s fair: a big part of senior leadership is uncovering and understanding context. It&rsquo;s also unfair: explicit documentation of prior thinking would have made this much easier for them.</p> <p>Every particularly bad new-leader onboarding that I&rsquo;ve seen has involved a new leader coming into an unfilled role, that the new leader&rsquo;s manager didn&rsquo;t know how to do. 
In those cases, success is entirely dependent on that new leader&rsquo;s ability and interest in learning.</p> </li> </ul> <p>In most ways, the practice of documenting strategy has a lot in common with <a href="https://lethain.com/succession-planning/">succession planning</a>, where the full benefits accrue to the organization rather than to the individual doing it. It&rsquo;s possible to maintain things while the original authors are present, but appreciating the value requires stepping outside yourself for a moment to value things that will matter most to the organization when you&rsquo;re no longer a member.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><strong>Information herd immunity</strong></p> <p>A frequent objection to written strategy is that no one reads anything. There&rsquo;s some truth to this: it&rsquo;s extremely hard to get everyone in an organization to know something. However, I&rsquo;ve never found that goal to be particularly important.</p> <p>My view of information dispersal in an organization is the same as <a href="https://en.wikipedia.org/wiki/Herd_immunity">herd immunity</a>: you don&rsquo;t need everyone to know something, just to have enough people who know something that confusion doesn&rsquo;t propagate too far.</p> <p>So, it may be impossible for all engineers to know strategy details, but you certainly can have every Staff-plus engineer and engineering manager know those details.</p> </div> <h2 id="strategy-supports-personal-learning">Strategy supports personal learning</h2> <p>While I believe that the largest benefits of strategy accrue to the organization, rather than the individual creating it, I also believe that strategy is an underrated avenue for self-development.</p> <p>The ways that I&rsquo;ve seen strategy support personal development are:</p> <ul> <li> <p><strong>Creating strategy builds self-awareness.</strong></p> <p>Starting with a concrete example, I&rsquo;ve worked with several engineers who viewed themselves as 
extremely senior, but frequently demanded that projects be implemented using new programming languages or technologies because they personally wanted to learn about the technology. Their internal strategy was clear&ndash;they wanted to work on something fun&ndash;but following <a href="https://lethain.com/components-of-eng-strategy/">the steps to build an engineering strategy</a> would have created a strategy that even they would have agreed didn&rsquo;t make sense.</p> </li> <li> <p><strong>Strategy supports situational awareness in new environments.</strong></p> <p><a href="https://lethain.com/wardley-mapping/">Wardley mapping</a> talks a lot about situational awareness as a prerequisite to good strategy. This means ensuring you understand the realities of your circumstances; failing to do so is the most destructive mistake of new senior engineering leaders. Explicitly stating the diagnosis where the strategy applied makes it easier for you to debug why reusing a prior strategy in a new team or company might not work.</p> </li> <li> <p><strong>Strategy as your personal archive.</strong></p> <p>Just as documented strategy is institutional memory, it also serves as personal memory to understand the impact of your prior approaches. Each of us is an archivist of our prior work, pulling out the most valuable pieces to address the problem at hand. Over a long career, memory fades&ndash;and motivated reasoning creeps in&ndash;but explicit documentation doesn&rsquo;t.</p> </li> </ul> <p>Indeed, part of the reason I started working on this book <em>now</em> rather than later is that I realized I was starting to forget the details of the strategy work I did earlier in my career. If I wanted to preserve the wisdom of that era, and ensure I didn&rsquo;t have to relearn the same lessons in the future, I had to write it now.</p> <h2 id="summary">Summary</h2> <p>We&rsquo;ve covered why strategy can be a valuable learning mechanism for both your engineering organization and for you. 
We&rsquo;ve shown how strategies have helped organizations deal with service migrations, monolith decomposition, and right-sizing backfilling. We&rsquo;ve also discussed how inappropriate strategy contributed to Digg&rsquo;s demise.</p> <p>However, if I had to pick two things to emphasize as this chapter ends, it wouldn&rsquo;t be any of those things. Rather, it would be two themes that I find are the most frequently ignored:</p> <ol> <li>There&rsquo;s always a strategy, even if it isn&rsquo;t written down.</li> <li>The single biggest act you can take to further strategy in your organization is to write down strategy so it can be debated, agreed upon, and explicitly evolved.</li> </ol> <p>Discussions around topics like strategy often get caught up in high prestige activities like making controversial decisions, but the most effective strategists I&rsquo;ve seen make more progress by actually performing the basics: writing things down, exploring widely to see how other companies solve the same problem, accepting feedback into their draft from folks who disagree with them. Strategy <em>is</em> useful, and doing strategy can be simple, too.</p>"We're a product engineering company!" -- Engineering strategy at Calm.https://lethain.com/calm-product-eng-company/Thu, 23 Jan 2025 06:00:00 -0700https://lethain.com/calm-product-eng-company/<p>In my career, the majority of the strategy work I&rsquo;ve done has been in non-executive roles, things like <a href="https://lethain.com/uber-service-migration-strategy/">Uber&rsquo;s service migration</a>. Joining Calm was my first executive role, where I was able to not just propose, but also mandate, strategy.</p> <p>Like almost all startups, the engineering team was scattered when I joined. Was our most important work creating more scalable infrastructure? Was our greatest risk the failure to adopt leading programming languages? 
How did we rescue the stuck <a href="https://lethain.com/decompose-monolith-strategy/">service decomposition initiative</a>?</p> <p>This strategy is where the engineering team and I aligned after numerous rounds of iteration, debate, and inevitably some disagreement. As a strategy, it&rsquo;s both basic and also unambiguous about what we valued, and I believe it&rsquo;s a reasonably good starting point for any <a href="https://lethain.com/quality/">low scalability-complexity</a> consumer product.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> </div> <h2 id="reading-this-document">Reading this document</h2> <p>To apply this strategy, start at the top with <em>Policy</em>. To understand the thinking behind this strategy, read sections in reverse order, starting with <em>Explore</em>, then <em>Diagnose</em> and so on. Relative to the default structure, this document has one tweak, folding the <em>Operation</em> section in with <em>Policy</em>.</p> <p>More detail on this structure in <a href="https://lethain.com/readable-engineering-strategy-documents">Making a readable Engineering Strategy document</a>.</p> <h2 id="policy--operation">Policy &amp; Operation</h2> <p>Our new policies, and the mechanisms to operate them are:</p> <ul> <li> <p><strong>We are a product engineering company.</strong> Users write in every day to tell us that our product has changed their lives for the better. Our technical infrastructure doesn&rsquo;t get many user letters&ndash;and this is unlikely to change going forward as our infrastructure is relatively low-scale and low-complexity. 
Rather than attempting to change that, we want to devote the absolute maximum possible attention to product engineering.</p> </li> <li> <p><strong>We exclusively adopt new technologies to create valuable product capabilities.</strong> We believe our technology stack as it exists today can solve the majority of our current and future product roadmaps. In the rare case where we adopt a new technology, we do so because a product capability is inherently impossible without adopting a new technology.</p> <p>We do not adopt new technologies for other reasons. For example, we would not adopt a new technology because someone is interested in learning about it. Nor would we adopt a technology because it is 30% <em>better suited</em> to a task.</p> </li> <li> <p><strong>We write all code in the monolith.</strong> It has been ambiguous if new code (especially new application code) should be written in our JavaScript monolith, or if all new code <em>must</em> be written in a new service outside of the monolith. This is no longer ambiguous: all new code must be written in the monolith.</p> <p>In the rare case that there is a functional requirement that makes writing in the monolith implausible, then you should seek an exception as described below.</p> </li> <li> <p><strong>Exceptions are granted by the CTO, and must be in writing.</strong> The above policies are deliberately restrictive. Sometimes they may be wrong, and we will make exceptions to them. However, each exception should be deliberate and grounded in concrete problems we are aligned both on solving and how we solve them. If we all scatter towards our preferred solution, then we&rsquo;ll create negative leverage for Calm rather than serving as the engine that advances our product.</p> <p>All exceptions must be written. If they are not written, then you should operate as if it has not been granted. Our goal is to avoid ambiguity around whether an exception has, or has not, been approved. 
If there&rsquo;s no written record that the CTO approved it, then it&rsquo;s not approved.</p> </li> </ul> <p>Proving the point about exceptions, there are two confirmed exceptions to the above strategy:</p> <ol> <li> <p><strong>We are incrementally migrating to TypeScript.</strong> We have found that static typing can prevent a number of our user-facing bugs. TypeScript provides a clean, incremental migration path for our JavaScript codebase, and we aim to migrate the entirety over the next six months.</p> <p>Our Web engineering team is leading this migration.</p> </li> <li> <p><strong>We are evaluating Postgres Aurora as our primary database.</strong> Many of our recent production incidents are caused by index scans for tables with high write velocity such as tracking customer logins. We believe Aurora will perform better under these workloads.</p> <p>Our Infrastructure engineering team is leading this initiative.</p> </li> </ol> <h2 id="diagnose">Diagnose</h2> <p>The current state of our engineering organization:</p> <ul> <li> <p><strong>Our product is not limited by missing infrastructure capabilities.</strong> Reviewing our roadmap, there&rsquo;s nothing that we are trying to build today or over the next year that is constrained by our technical infrastructure.</p> </li> <li> <p><strong>Our uptime, stability and latency are OK but not great.</strong> We have semi-frequent stability and latency issues in our application, all of which are caused by one of two issues. First, deploying new code with a missing index because it performed well enough in a test environment. Second, writes to a small number of extremely large, skinny tables have become expensive in combination with scans over those tables&rsquo; indexes.</p> </li> <li> <p><strong>Our infrastructure team is split between supporting monolith and service workflows.</strong> One way to measure technical debt is to understand how much time the team is spending propping up the current infrastructure. 
Today, that is meaningful but not overwhelming work for our team of three infrastructure engineers supporting 30 product engineers.</p> <p>However, we <em>are</em> finding infrastructure engineers increasingly pulled into debugging incidents for components moved out of the central monolith into our service architecture. This is partially due to increased inherent complexity, but it&rsquo;s more due to exposing a lack of monitoring and ambiguous accountability in services&rsquo; production incidents.</p> </li> <li> <p><strong>Our product and executive stakeholders experience us as competing factions.</strong> Engineering exists to build and operate software in the company. Part of that is being easy to work with. We should not necessarily support every ask from Product if we believe it is misaligned with Engineering&rsquo;s goals (e.g. maintaining security), but we should generally provide a consistent perspective across our team.</p> <p>Today, our stakeholders believe they will get radically different answers to basic questions of capabilities and approach depending on who they ask. If they try to get a group of engineers to agree on an approach, they often find we derail into debate about approach rather than articulating a clear point of view that allows the conversation to move forward.</p> </li> <li> <p><strong>We&rsquo;re arguing a particularly large amount about adopting new technologies and rewrites.</strong> Most of our disagreements stem from adopting new technologies or rewriting existing components into new technology stacks. For example, can we extend this feature or do we have to migrate it to a service before extending it? Can we add this to our database or should we move it into a new Redis cache instead? 
Is JavaScript a sufficient programming language, or do we need to rewrite this functionality in Go?</p> <p>This is particularly relevant to next steps around the ongoing services migration, which has been in-flight for over a year, but is yet to move any core production code.</p> </li> <li> <p><strong>We are spending more time on infrastructure and platform work than product work.</strong> This is the combination of all the above issues, from the stability issues we are encountering in our database design, to the lack of engineering alignment on execution. This places us at odds with stakeholder expectation that we are predominantly focused on new product development.</p> </li> </ul> <h2 id="explore">Explore</h2> <p>Calm is a mobile application that guides users to build and maintain either a meditation or sleep habit. Recommendations and guidance across content is individual to the user, but the content is shared across all customers and is amenable to caching on a content delivery network (CDN). As long as the CDN is available, the mobile application can operate despite inability to access servers (e.g. the application remains usable from a user&rsquo;s perspective, even if the non-CDN production infrastructure is unreachable).</p> <p>In 2010, enabling a product of this complexity would have required significant bespoke infrastructure, along with likely maintaining a physical presence in a series of datacenters to run your software. In 2020, comparable applications are generally moving towards maintaining as little internal infrastructure as possible. This perspective is summarized effectively in Intercom&rsquo;s <a href="https://www.intercom.com/blog/run-less-software/">Run Less Software</a> and Dan McKinley&rsquo;s <a href="https://mcfunley.com/choose-boring-technology">Choose Boring Technology</a>.</p> <p>New companies founded in this space view essentially all infrastructure as a commodity bought off your cloud provider. 
This even extends to areas of innovation, such as machine learning, where the training infrastructure is typically run on an offering like AWS Bedrock, and the model infrastructure is provided by Anthropic or OpenAI.</p>Bridging theory and practice in engineering strategy.https://lethain.com/bridging-eng-strategy-theory-and-practice/Thu, 16 Jan 2025 04:00:00 -0700https://lethain.com/bridging-eng-strategy-theory-and-practice/<p>Some people I&rsquo;ve worked with have lost hope that engineering strategy actually exists within <em>any</em> engineering organizations. I imagine that they, reading through the <a href="https://lethain.com/components-of-eng-strategy/">steps to build engineering strategy</a>, or the <a href="https://lethain.com/private-equity-strategy/">strategy for navigating private equity ownership</a>, are not impressed. Instead, these ideas probably come across as theoretical at best. In less polite company, they might describe these ideas as fake constructs.</p> <p>Let&rsquo;s talk about it! Because they&rsquo;re right. In fact, they&rsquo;re right in two different ways. First, this book is focused on explaining how to create clean, refined, and definitive strategy documents, where initially most real strategy artifacts look rather messy. Second, applying these techniques in practice can require a fair amount of creativity. 
It might sound easy, but it&rsquo;s quite difficult in practice.</p> <p>This chapter will cover:</p> <ul> <li>Why strategy documents need to be clear and definitive, especially when strategy development has been messy</li> <li>How to iterate on strategy when there are demands for unrealistic timelines</li> <li>Using strategy as non-executives, where others might override your strategy</li> <li>Handling dynamic, quickly changing environments where diagnosis can change frequently</li> <li>Working with indecisive stakeholders who don&rsquo;t provide clarity on approach</li> <li>Surviving other people&rsquo;s bad strategy work</li> </ul> <p>Alright, let&rsquo;s dive into the many ways that praxis doesn&rsquo;t quite line up with theory.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> </div> <h2 id="clear-and-definitive-documents">Clear and definitive documents</h2> <p>As explored in <a href="https://lethain.com/readable-engineering-strategy-documents/">Making engineering strategies more readable</a>, documents that feel intuitive to write are often fairly difficult to read. That&rsquo;s because thinking tends to be a linear-ish journey from a problem to a solution. Most readers, on the other hand, usually just want to know the solution and then to move on. That&rsquo;s because good strategies are read for direction (e.g. when a team wants to understand how they&rsquo;re supposed to solve a specific issue at hand) far more frequently than they&rsquo;re read to build agreement (e.g. 
building stakeholder alignment during the initial development of the strategy).</p> <p>However, many organizations only produce writer-oriented strategy documents, and may not have any reader-oriented documents at all. If you&rsquo;ve predominantly worked in those sorts of organizations, then the first reader-oriented documents you encounter will seem artificial.</p> <p>There are also organizations that have many reader-oriented documents, but omit the rationale behind those documents. Those documents feel prescriptive and heavy-handed, because the infrequent reader who <em>does</em> want to understand the thinking can&rsquo;t find it. Further, when they want to propose an alternative, they have to do so without the rationale behind the current policies: the absence of that context often transforms what was a collaborative problem-solving opportunity into a political match.</p> <p>With that in mind, I&rsquo;d encourage you to see the frequent absence of these documents as a major opportunity to drive strategy within your organization, rather than evidence that these documents don&rsquo;t work. My experience is that they do.</p> <h2 id="doing-strategy-despite-unrealistic-timelines">Doing strategy despite unrealistic timelines</h2> <p>The most frequent failure mode I see for strategy is when it&rsquo;s rushed, and its authors accept that thinking must stop when the artificial deadline is reached. Taking annual planning at Stripe as an example, <a href="https://www.amazon.com/Scaling-People-Tactics-Management-Building/dp/1953953212/">Claire Hughes Johnson</a> argued that planning expands to fit any timeline, and consequently set a short planning timeline of several weeks. 
Some teams accepted that as a fixed timeline and <em>stopped planning</em> when the timeline ended, whereas effective teams never stopped planning before or after the planning window.</p> <p>When strategy work is given an artificial or unrealistic timeline, then you should deliver the best draft you can. Afterwards, rather than being finished, you should view yourself as <a href="https://lethain.com/refining-eng-strategy/">starting the refinement process</a>. An open strategy secret is that many strategies never leave the refinement phase, and continue to be tweaked throughout their lifespan. Why should a strategy with an early deadline be any different?</p> <p>Well, there is one important problem to acknowledge: I&rsquo;ve often found that the executive who initially provided the unrealistic timeline intended it as a forcing function to inspire action and quick thinking. If you have a discussion with them directly, they&rsquo;re usually quite open to adjusting the approach. However, the intermediate layers of leadership between that executive and you often calcify on a particular approach which they claim the executive insists on following precisely.</p> <p>Sometimes having the conversation with the responsible executive is quite difficult. In that case, you do have to work with individuals taking the strategy as literal and unalterable until either you can have the conversation or something goes wrong enough that the executive starts paying attention again. Usually, though, you can find someone who has a communication path, as long as you can articulate the issue clearly.</p> <h2 id="using-strategy-as-non-executives">Using strategy as non-executives</h2> <p>Some engineers will argue that the only valid <a href="https://lethain.com/when-write-down-engineering-strategy/">strategy altitude</a> is the highest one defined by executives, because any other strategy can be invalidated by a new, higher altitude strategy. 
They would claim that teams simply <em>cannot</em> do strategy, because executives might invalidate it. Some engineering executives would argue the same thing, instead claiming that they can&rsquo;t work on an engineering strategy because the missing product strategy or business strategy might introduce new constraints.</p> <p>I don&rsquo;t agree with this line of thinking at all. To do strategy at any altitude, you have to come to terms with the certainty that new information will show up, and you&rsquo;ll need to revise your strategy to deal with that.</p> <p><a href="https://lethain.com/uber-service-migration-strategy/">Uber&rsquo;s service provisioning strategy</a> is a good counterexample to the idea that you have to wait for someone else to set the strategy table. We were able to find a durable diagnosis despite being a relatively small team within a much larger organization that was relatively indifferent to helping us succeed. When it comes to using strategy, effective diagnosis trumps authority. In my experience, at least as many executives&rsquo; strategies are ravaged by reality&rsquo;s pervasive details as are overridden by higher altitude strategies. The only way to be certain your strategy will fail is waiting until you&rsquo;re certain that no new information might show up and require changing it.</p> <h2 id="doing-strategy-in-chaotic-environments">Doing strategy in chaotic environments</h2> <p><a href="https://lethain.com/llm-adoption-strategy/">How should you adopt LLMs?</a> discusses how a company should plot a path through the rapidly evolving LLM ecosystem. Periods of rapid technology evolution are one reason why your strategy might encounter a pocket of chaos, but there are many others. Pockets of rapid hiring, as well as layoffs, create chaos. The departure of load-bearing senior leaders can change a company quickly. 
Slowing revenue in a company&rsquo;s core business can also initiate chaotic actions in pursuit of a new business.</p> <p>Strategies don&rsquo;t require stable environments. Instead, strategies require awareness of the environment that they&rsquo;re operating in. In a stable period, a strategy might expect to run for several years and expect relatively little deviation from the initial approach. In a dynamic period, the strategy might know you can only protect capacity in two-week chunks before a new critical initiative pops up. It&rsquo;s possible to do good strategy in either scenario, but it&rsquo;s impossible to do good strategy if you don&rsquo;t diagnose the context effectively.</p> <h2 id="unreliable-information">Unreliable information</h2> <p>Oftentimes, the strategy forward would be very obvious if a few key decisions were made, and you know who is supposed to make those decisions, but you simply cannot get them to decide. My most visceral experience of this was conducting a layoff where the CEO wouldn&rsquo;t define a target cost reduction or a thesis of how much various functions (e.g. engineering, marketing, sales) should contribute to those reductions. With those two decisions, engineering&rsquo;s approach would be obvious, and without that clarity things felt impossible.</p> <p>Although I was frustrated at the time, I&rsquo;ve since come to appreciate that missing decisions are the norm rather than the exception. The strategy on <a href="https://lethain.com/private-equity-strategy/">Navigating Private Equity ownership</a> deals with this problem by acknowledging a missing decision, and expressly blocking one part of its execution on that decision being made. Other parts of its plan, like changing how roles are backfilled, went ahead to address the broader cost problem.</p> <p>Rather than blocking on missing information, your strategy should acknowledge what&rsquo;s missing, and move forward where you can. 
Sometimes that&rsquo;s moving forward by taking risk, sometimes that&rsquo;s delaying for clarity, but it&rsquo;s never accepting yourself as stuck without options other than pointing a finger.</p> <h2 id="surviving-other-peoples-bad-strategy-work">Surviving other people&rsquo;s bad strategy work</h2> <p>Sometimes you will be told to follow something which is described as a strategy, but is really just a policy without any strategic thinking behind it. This is an unavoidable element of working in organizations and happens for all sorts of reasons. Sometimes, your organization&rsquo;s leader doesn&rsquo;t believe it&rsquo;s valuable to explain their thinking to others, because they see themselves as the one important decision maker.</p> <p>Other times, your leader doesn&rsquo;t agree with a policy they&rsquo;ve been instructed to roll out. Adoption of &ldquo;high hype&rdquo; technologies like blockchain technologies during the crypto boom was often top-down direction from company leadership that engineering disagreed with, but was obligated to align with. In this case, your leader is finding that it&rsquo;s hard to explain a strategy that they themselves don&rsquo;t understand either.</p> <p>This is a frustrating situation. What I&rsquo;ve found most effective is writing a strategy of my own, one that acknowledges the broader strategy I disagree with in its diagnosis as a static, unavoidable truth. From there, I&rsquo;ve been able to make practical decisions that recognize the context, even if it&rsquo;s not a context I&rsquo;d have selected for myself.</p> <h2 id="summary">Summary</h2> <p>I started this chapter by acknowledging that the <a href="https://lethain.com/components-of-eng-strategy/">steps to building engineering strategy</a> are a theory of strategy, and one that can get quite messy in practice. 
Now you know why strategy documents often come across as overly pristine&ndash;because they&rsquo;re trying to communicate clearly about a complex topic.</p> <p>You also know how to navigate the many ways reality pulls you away from perfect strategy, such as unrealistic timelines, higher altitude strategies invalidating your own strategy work, working in a chaotic environment, and dealing with stakeholders who refuse to align with your strategy. Finally, we acknowledged that sometimes strategy work done by others is not what we&rsquo;d consider strategy; it&rsquo;s often unsupported policy with neither a diagnosis nor an approach to operating the policy.</p> <p>That&rsquo;s all stuff you&rsquo;re going to run into, and it&rsquo;s all stuff you&rsquo;re going to overcome on the path to doing good strategy work.</p>Uber's service migration strategy circa 2014.https://lethain.com/uber-service-migration-strategy/Thu, 09 Jan 2025 06:00:00 -0700https://lethain.com/uber-service-migration-strategy/<p>In early 2014, I joined as an engineering manager for Uber&rsquo;s Infrastructure team. We were responsible for a wide number of things, including provisioning new services. While the overall team I led grew significantly over time, the subset working on service provisioning never grew beyond four engineers.</p> <p>Those four engineers successfully migrated 1,000+ services onto a new, future-proofed service platform. More importantly, they did it while absorbing the majority, although certainly not the entirety, of the migration workload onto that small team rather than spreading it across the 2,000+ engineers working at Uber at the time. 
Their strategy serves as an interesting case study of how a team can drive strategy, even without any executive sponsor, by focusing on solving a pressing user problem, and providing effective ergonomics while doing so.</p> <div class="bg-light-gray br4 ph3 pv1"> <p>Note that after this introductory section, the remainder of this strategy will be written from the perspective of 2014, when it was originally developed.</p> </div> <p>More than a decade after this strategy was implemented, we have an interesting perspective to evaluate its impact. It&rsquo;s fair to say that it had some meaningful, negative consequences by allowing the widespread proliferation of new services within Uber. Those services contributed to a messy architecture that had to go through cycles of internal cleanup over the following years.</p> <p>As the principal author of this strategy, I&rsquo;ve learned a lot from meditating on the fact that this strategy was wildly successful, that I think Uber is better off for having followed it, and that it also meaningfully degraded Uber&rsquo;s developer experience over time. There&rsquo;s both good and bad here; with a wide enough lens, all evaluations get complicated.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> </div> <h2 id="reading-this-document">Reading this document</h2> <p>To apply this strategy, start at the top with <em>Policy</em>. To understand the thinking behind this strategy, read sections in reverse order, starting with <em>Explore</em>, then <em>Diagnose</em> and so on. 
Relative to the default structure, this document has one tweak, folding the <em>Operation</em> section in with <em>Policy</em>.</p> <p>More detail on this structure in <a href="https://lethain.com/readable-engineering-strategy-documents">Making a readable Engineering Strategy document</a>.</p> <h2 id="policy--operation">Policy &amp; Operation</h2> <p>We&rsquo;ve adopted these guiding principles for extending Uber&rsquo;s service platform:</p> <ul> <li> <p><strong>Constrain manual provisioning allocation to maximize investment in self-service provisioning.</strong> The service provisioning team will maintain a fixed allocation of one full time engineer on manual service provisioning tasks. We will move the remaining engineers to work on automation to speed up future service provisioning. This will degrade manual provisioning in the short term, but the alternative is permanently degrading provisioning by the influx of new service requests from newly hired product engineers.</p> </li> <li> <p><strong>Self-service must be safely usable by a new hire without Uber context.</strong> It is possible today to make a Puppet or Clusto change while provisioning a new service that negatively impacts the production environment. This must not be true in any self-service solution.</p> </li> <li> <p><strong>Move to structured requests, and out of tickets.</strong> Missing or incorrect information in provisioning requests creates significant delays in provisioning. Further, collecting this information is the first step of moving to a self-service process. As such, we can get paid twice by reducing errors in manual provisioning while also creating the interface for self-service workflows.</p> </li> <li> <p><strong>Prefer initializing new services with good defaults rather than requiring user input.</strong> Most new services are provisioned for new projects with strong timeline pressure but little certainty on their long-term requirements. 
These users cannot accurately predict their future needs, and expecting them to do so creates significant friction.</p> <p>Instead, the provisioning framework should suggest good defaults, and make it easy to change the settings later when users have more clarity. The gate from development environment to production environment is a particularly effective one for ensuring settings are refreshed.</p> </li> </ul> <p>We are materializing those principles into this sequenced set of tasks:</p> <ol> <li> <p>Create an internal tool that coordinates service provisioning, replacing the process where teams request new services via Phabricator tickets. This new tool will maintain a schema of required fields that must be supplied, with the aim of eliminating the majority of back and forth between teams during service provisioning.</p> <p>In addition to capturing necessary data, this will also serve as our interface for automating various steps in provisioning without requiring future changes in the workflow to request service provisioning.</p> </li> <li> <p>Extend the internal tool to generate Puppet scaffolding for new services, reducing the potential for errors in two ways. First, the data supplied in the service provisioning request can be directly included into the rendered template. Second, this will eliminate most human tweaking of templates where typos can create issues.</p> </li> <li> <p>Port allocation is a particularly high-risk element of provisioning, as reusing a port can break routing to an existing production service. 
As such, this will be the first area we fully automate, with the provisioning service supplying the allocated port rather than requiring requesting teams to provide an already allocated port.</p> <p>Doing this will require moving the port registry out of a Phabricator wiki page and into a database, which will allow us to guard access with a variety of checks.</p> </li> <li> <p>Manual assignment of new services to servers often leads to new services being allocated to already heavily utilized servers. We will replace the manual assignment with an automated system, and do so with the intention of migrating to the Mesos/Aurora cluster once it is available for production workloads.</p> </li> </ol> <p>Each week, we&rsquo;ll review the size of the service provisioning queue, along with the service provisioning time to assess whether the strategy is working or needs to be revised.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><strong>Prolonged strategy testing</strong></p> <p>Although I didn&rsquo;t have a name for this practice in 2014 when we created and implemented this strategy, the preceding paragraph captures an important truth of team-led bottom-up strategy: the entire strategy was implemented in a prolonged <a href="https://lethain.com/testing-strategy-iterative-refinement/">strategy testing</a> phase.</p> <p>This is an important truth of all low-altitude, bottom-up strategy, because you don&rsquo;t have the authority to mandate compliance. An executive&rsquo;s high-altitude strategy can be enforced despite not working due to their organizational authority, but a team&rsquo;s strategy will only endure while it remains effective.</p> </div> <h2 id="refine">Refine</h2> <p>In order to refine our diagnosis, we&rsquo;ve <a href="https://lethain.com/uber-service-onboarding-model/">created a systems model for service onboarding</a>. 
This will allow us to simulate a variety of different approaches to our problem, and determine which approach, or combination of approaches, will be most effective.</p> <p><img src="https://lethain.com/static/blog/strategy/uber-provis-model-errors.png" alt="A systems model of provisioning services at Uber circa 2014."></p> <p>As we exercised the model, it became clear that:</p> <ol> <li>we are increasingly falling behind,</li> <li>hiring onto the service provisioning team is not a viable solution, and</li> <li>moving to a self-service approach is our only option.</li> </ol> <p>While the model writeup justifies each of those statements in more detail, we&rsquo;ll include two charts here. The first chart shows the status quo, where new service provisioning requests, labeled as <code>Initial RequestedServices</code>, quickly accumulate into a backlog.</p> <p><img src="https://lethain.com/static/blog/strategy/uber-model-diag-1.png" alt="Initial diagram of Uber service provisioning model without error states."></p> <p>Second, we have a chart comparing the outcomes between the current status quo and a self-service approach.</p> <p><img src="https://lethain.com/static/blog/strategy/uber-model-chart-self-service.png" alt="Chart showing impact of self-service provisioning on provisioning rate."></p> <p>In that chart, you can see that the service provisioning backlog in the self-service model remains steady, as represented by the <code>SelfService RequestedServices</code> line. Of the various attempts to find a solution, none of the others showed promise, including eliminating all errors in provisioning and increasing the team&rsquo;s capacity by 500%.</p> <h2 id="diagnose">Diagnose</h2> <p>We&rsquo;ve diagnosed the current state of service provisioning at Uber as:</p> <ul> <li> <p>Many product engineering teams are aiming to leave the centralized monolith, which is generating two to three service provisioning requests each week. 
We expect this rate to increase roughly linearly with the size of the product engineering organization.</p> <p>Even if we disagree with this shift to additional services, there&rsquo;s no team responsible for maintaining the extensibility of the monolith, and working in the monolith is the number one source of developer frustration, so we don&rsquo;t have a practical counter proposal to offer engineers other than provisioning a new service.</p> </li> <li> <p>The engineering organization is doubling every six months. Consequently, a year from now, we expect eight to twelve service provisioning requests every week.</p> </li> <li> <p>Within infrastructure engineering, there is a team of four engineers responsible for service provisioning today. While our organization is growing at a similar rate to product engineering, none of that additional headcount is being allocated directly to the team working on service provisioning. We do not anticipate this changing.</p> <p>Some additional headcount is being allocated to Service Reliability Engineers (SREs) who can take on the most nuanced, complicated service provisioning work. However, their bandwidth is already heavily constrained across many tasks, so relying on SREs is an insufficient solution.</p> </li> <li> <p>The queue for service provisioning is already increasing in size as things are today. Barring some change, many services will not be provisioned in a timely fashion.</p> </li> <li> <p>Today, provisioning a new service takes about a week, with numerous round trips between the requesting team and the provisioning team. 
Missing and incorrect information between teams is the largest source of delay in provisioning services.</p> <p>If the provisioning team has all the necessary information, and it&rsquo;s accurate, then a new service can be provisioned in about three to four hours of work across configuration in Puppet, metadata in Clusto, allocating ports, assigning the service to servers, and so on.</p> </li> <li> <p>There are few safeguards on port allocation, server assignment and so on. It is easy to inadvertently cause a production outage during service provisioning unless done with attention to detail.</p> <p>Given our rate of hiring, training the engineering organization to use this unsafe toolchain is an impractical solution: even if we train the entire organization perfectly today, there will be just as many untrained individuals in six months. Further, product engineering leadership has no interest in their team being diverted to service provisioning training.</p> </li> <li> <p>It&rsquo;s widely agreed across the infrastructure engineering team that essentially every component of service provisioning should be replaced as soon as possible, but there is no concrete plan to replace any of the core components. Further, there is no team accountable for replacing these components, which means the service provisioning team will either need to work around the current tooling or replace that tooling ourselves.</p> </li> <li> <p>It&rsquo;s urgent to unblock development of new services, but moving those new services to production is rarely urgent, and occurs after a long internal development period. Evidence of this is that requests to provision a new service generally come with significant urgency and internal escalations to management. 
After the service is provisioned for development, there are relatively few urgent escalations other than one-off requests for increased production capacity during incidents.</p> </li> <li> <p>Another team within infrastructure is actively exploring adoption of Mesos and Aurora, but there&rsquo;s no concrete timeline for when this might be available for our usage. Until they commit to supporting our workloads, we&rsquo;ll need to find an alternative solution.</p> </li> </ul> <h2 id="explore">Explore</h2> <p>Uber&rsquo;s server and service infrastructure today is composed of a handful of pieces. First, we run servers on-prem within a handful of colocations. Second, we describe each server in Puppet manifests to support repeatable provisioning of servers. Finally, we manage fleet and server metadata in a tool named Clusto, originally created by Digg, which allows us to populate Puppet manifests with server and cluster appropriate metadata during provisioning. In general, we agree that our current infrastructure is nearing its end of lifespan, but it&rsquo;s less obvious what the appropriate replacements are for each piece.</p> <p>There&rsquo;s significant internal opposition to running in the cloud, up to and including our CEO, so we don&rsquo;t believe that will change in the foreseeable future. 
We do, however, believe there&rsquo;s opportunity to change our service definitions from Puppet to something along the lines of Docker, and to change our metadata mechanism towards a more purpose-built solution like Mesos/Aurora or Kubernetes.</p> <p>As a starting point, we find it valuable to read <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf">Large-scale cluster management at Google with Borg</a> which informed some elements of the approach to Kubernetes, and <a href="https://people.eecs.berkeley.edu/~alig/papers/mesos.pdf">Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center</a> which describes the Mesos/Aurora approach.</p> <div class="bg-light-gray br4 ph3 pv1"> <p>If you&rsquo;re wondering why there&rsquo;s no mention of <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44843.pdf">Borg, Omega, and Kubernetes</a>, it&rsquo;s because it wasn&rsquo;t published until 2016, a year after this strategy was developed.</p> </div> <p>Within Uber, we have a number of ex-Twitter engineers who can speak with confidence to their experience operating with Mesos/Aurora at Twitter. We have been unable to find anyone to speak with that has production Kubernetes experience operating a comparably large fleet of 10,000+ servers, although presumably someone is operating&ndash;or close to operating&ndash;Kubernetes at that scale.</p> <p>Our general belief about the evolution of the ecosystem at the time is <a href="https://lethain.com/wardley-compute-ecosystem/">described in this Wardley mapping exercise on service orchestration (2014)</a>.</p> <p><img src="https://lethain.com/static/blog/strategy/wardley-compute-v2.png" alt="Wardley map of evolution of service orchestration in 2014"></p> <p>One of the unknowns today is how the evolution of Mesos/Aurora and Kubernetes will look in the future. 
Kubernetes seems promising with Google&rsquo;s backing, but there are few if any meaningful production deployments today. Mesos/Aurora has more community support and more production deployments, but the absolute number of deployments remains quite small, and there is no large-scale industry backer outside of Twitter.</p> <p>Even further out, there&rsquo;s considerable excitement around &ldquo;serverless&rdquo; frameworks, which seem like a likely future evolution, but after canvassing the industry and our networks, we&rsquo;ve simply been unable to find enough real-world usage to make an active push towards this destination today.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><a href="https://lethain.com/wardley-mapping/">Wardley mapping</a> is introduced as one of the techniques for <a href="https://lethain.com/refining-eng-strategy/">strategy refinement</a>, but it can also be a useful technique for exploring a dynamic ecosystem like service orchestration in 2014.</p> <p>Assembling each strategy requires exercising judgment on how to compile the pieces together most usefully, and in this case I found that the map fits most naturally with the rest of exploration rather than in the more operationally-focused refinement section.</p> </div>Service onboarding model for Uber (2014).https://lethain.com/uber-service-onboarding-model/Thu, 09 Jan 2025 05:00:00 -0700https://lethain.com/uber-service-onboarding-model/<p>At the core of <a href="https://lethain.com/uber-service-migration-strategy/">Uber&rsquo;s service migration strategy (2014)</a> is understanding the service onboarding process, and identifying the levers to speed up that process. 
Here we&rsquo;ll develop a <a href="https://lethain.com/strategy-systems-modeling/">system model</a> representing that onboarding process, and exercise the model to test a number of hypotheses about how to best speed up provisioning.</p> <p>In this chapter, we&rsquo;ll cover:</p> <ol> <li>Where the model of service onboarding suggested we focus our efforts</li> <li>Developing a system model using the <a href="https://github.com/lethain/systems">lethain/systems</a> package on GitHub. That model <a href="https://github.com/lethain/eng-strategy-models/blob/main/UberServiceOnboarding.ipynb">is available in the lethain/eng-strategy-models</a> repository</li> <li>Exercising that model to learn from it</li> </ol> <p>Let&rsquo;s figure out what this model can teach us.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in</em> <em><a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> </div> <h2 id="learnings">Learnings</h2> <p>Even if we model this problem with a 100% success rate (e.g. no errors at all), then the backlog of requested new services continues to increase over time. This clarifies that the problem to be solved is not the quality of service the service provisioning team is providing, but rather that the fundamental approach is not working.</p> <p><img src="https://lethain.com/static/blog/strategy/uber-model-diag-1.png" alt="Initial diagram of Uber service provisioning model without error states."></p> <p>Although hiring is tempting as a solution, our model suggests it is not a particularly valuable approach in this scenario. 
Even increasing the Service Provisioning team&rsquo;s staff allocated to manually provisioning services by 500% doesn&rsquo;t solve the backlog of incoming requests.</p> <p><img src="https://lethain.com/static/blog/strategy/uber-model-chart-infra-hiring.png" alt="Chart showing impact of increased infrastructure engineering hiring on service provisioning."></p> <p>If reducing errors doesn&rsquo;t solve the problem, and increased hiring for the team doesn&rsquo;t solve the problem, then we have to find a way to eliminate manual service provisioning entirely. The most promising candidate is moving to a self-service provisioning model, which our model shows solves the backlog problem effectively.</p> <p><img src="https://lethain.com/static/blog/strategy/uber-model-chart-self-service.png" alt="Chart showing impact of self-service provisioning on provisioning rate."></p> <p>Refining our earlier statement, additional hiring may benefit the team if we are able to focus those hires on building self-service provisioning, and are able to <a href="https://lethain.com/productivity-in-the-age-of-hypergrowth/">ramp their productivity</a> faster than the increase of incoming service provisioning requests.</p> <h2 id="sketch">Sketch</h2> <p>Our initial sketch of service provisioning is a simple pipeline starting with <code>requested services</code> and moving step by step through to <code>server capacity allocated</code>. Some of these steps are likely much slower than others, but it gives a sense of the stages and where things might go wrong. 
It also gives us a sense of what we can measure to evaluate if our approach to provisioning is working well.</p> <p><img src="https://lethain.com/static/blog/strategy/uber-provis-model.png" alt="A systems model of provisioning services at Uber circa 2014."></p> <p>One element worth mentioning is the dotted lines from <code>hiring rate</code> to <code>product engineers</code> and from <code>product engineers</code> to <code>requested services</code>. These are called <em>links</em>, which are stocks that influence another stock, but don&rsquo;t flow directly into them.</p> <div class="bg-light-gray br4 ph3 pv1"> <p>A purist would correctly note that links should connect to flows rather than stocks. That is true! However, as we&rsquo;ll encounter when we convert this sketch into a model, there are actually several counterintuitive elements here that are necessary to model this system but make the sketch less readable. As a modeler, you&rsquo;ll frequently encounter these sorts of tradeoffs, and you&rsquo;ll have to decide what choices serve your needs best in the moment.</p> </div> <p>The biggest element the initial model is missing is error flows, where things can sometimes go wrong in addition to sometimes going right. There are many ways things can go wrong, but we&rsquo;re going to focus on modeling three error flows in particular:</p> <ol> <li> <p><code>Missing/incorrect information</code> occurs twice in this model, and throws a provisioning request back into the initial provisioning phase where information is collected.</p> <p>When this occurs during port assignment, this is a relatively small trip backwards. However, when it occurs in Puppet configuration, this is a significantly larger step backwards.</p> </li> <li> <p><code>Puppet error</code> occurs in the second to final stock, <code>Puppet configuration tested &amp; merged</code>. 
This sends requests back one step in the provisioning flow.</p> </li> </ol> <p>Updating our sketch to reflect these flows, we get a fairly complete, and somewhat nuanced, view of the service provisioning flow.</p> <p><img src="https://lethain.com/static/blog/strategy/uber-provis-model-errors.png" alt="A systems model of provisioning services at Uber circa 2014, with error transitions"></p> <p>Note that the combination of these two flows introduces the possibility of a service being almost fully provisioned, but then traveling from Puppet testing back to Puppet configuration due to <code>Puppet error</code>, and then backwards again to the initial step due to <code>Missing/incorrect information</code>. This means it&rsquo;s possible to lose almost all provisioning progress if everything goes wrong.</p> <p>There are more nuances we could introduce here, but there&rsquo;s already enough complexity here for us to learn quite a bit from this model.</p> <h2 id="reason">Reason</h2> <p>Studying our sketches, a few things stand out:</p> <ol> <li> <p>The hiring of product engineers is going to drive up service provisioning requests over time, but there&rsquo;s no counterbalancing hiring of infrastructure engineers to work on service provisioning. This means there&rsquo;s an implicit, but very real, deadline to scale this process independently of the size of the infrastructure engineering team.</p> <p>Even without building the full model, it&rsquo;s clear that we have to either stop hiring product engineers, turn this into a self-service solution, or find a new mechanism to discourage service provisioning.</p> </li> <li> <p>The error rates are going to influence results a great deal, particularly those for <code>Missing/incorrect information</code>. 
This is probably the most valuable place to start looking for efficiency improvements.</p> </li> <li> <p>Missing information errors are more expensive than the model implies, because they require coordination across teams to resolve. Conversely, Puppet testing errors are probably cheaper than the model implies, because they should be solvable within the same team and consequently benefit from a quick iteration loop.</p> </li> </ol> <p>Now we need to build a model that helps guide our inquiry into those questions.</p> <h2 id="model">Model</h2> <p>You can find the <a href="https://github.com/lethain/eng-strategy-models/blob/main/UberServiceOnboarding.ipynb">full implementation of this model on Github</a> if you want to see the entirety rather than these emphasized snippets.</p> <p>First, let&rsquo;s get the success states working:</p> <pre><code>HiringRate(10)
ProductEngineers(1000)
[PotentialHires] &gt; ProductEngineers @ HiringRate
[PotentialServices] &gt; RequestedServices(10) @ ProductEngineers / 10
RequestedServices &gt; InflightServices(0, 10) @ Leak(1.0)
InflightServices &gt; PortNameAssigned @ Leak(1.0)
PortNameAssigned &gt; PuppetGenerated @ Leak(1.0)
PuppetGenerated &gt; PuppetConfigMerged @ Leak(1.0)
PuppetConfigMerged &gt; ServerCapacityAllocated @ Leak(1.0)
</code></pre> <p>As we run this model, we can see that the number of requested services grows significantly over time. 
This makes sense, as we&rsquo;re only able to provision a maximum of ten services per round.</p> <p><img src="https://lethain.com/static/blog/strategy/uber-model-diag-1.png" alt="Initial diagram of Uber service provisioning model without error states."></p> <p>However, it&rsquo;s also the best case, because we&rsquo;re not capturing the three error states:</p> <ol> <li>Unique port and name assignment can fail because of missing or incorrect information.</li> <li>Puppet configuration can also fail due to missing or incorrect information.</li> <li>Puppet configurations can have errors in them, requiring rework.</li> </ol> <p>Let&rsquo;s update the model to include these failure modes, starting with unique port and name assignment. The error-free version looks like this:</p> <pre><code>PortNameAssigned &gt; PuppetGenerated @ Leak(1.0)
</code></pre> <p>Now let&rsquo;s add in an error rate, where 20% of requests are missing information and return to the requested services stock.</p> <pre><code>PortNameAssigned &gt; PuppetGenerated @ Leak(0.8)
PortNameAssigned &gt; RequestedServices @ Leak(0.2)
</code></pre> <p>Then let&rsquo;s do the same thing for Puppet configuration errors:</p> <pre><code># original version
PuppetGenerated &gt; PuppetConfigMerged @ Leak(1.0)
# updated version with errors
PuppetGenerated &gt; PuppetConfigMerged @ Leak(0.8)
PuppetGenerated &gt; InflightServices @ Leak(0.2)
</code></pre> <p>Finally, we&rsquo;ll make a similar change to represent errors made in the Puppet templates themselves:</p> <pre><code># original version
PuppetConfigMerged &gt; ServerCapacityAllocated @ Leak(1.0)
# updated version with errors
PuppetConfigMerged &gt; ServerCapacityAllocated @ Leak(0.8)
PuppetConfigMerged &gt; PuppetGenerated @ Leak(0.2)
</code></pre> <p>Even with relatively low error rates, we can see that the throughput of the system overall has been meaningfully impacted by introducing these errors.</p> <p><img 
src="https://lethain.com/static/blog/strategy/uber-model-diag-2.png" alt="Updated diagram of Uber service provisioning model with error states."></p> <p>Now that we have the foundation of the model built, it&rsquo;s time to start exercising the model to understand the problem space a bit better.</p> <h2 id="exercise">Exercise</h2> <p>We already know the errors are impacting throughput, but let&rsquo;s start by narrowing down which of the errors matter most by increasing the error rate for each of them independently and comparing the impact.</p> <p>To model this, we&rsquo;ll create three new specifications, each of which increases one error from a 20% error rate to a 50% error rate, and see how the overall throughput of the system is impacted:</p> <pre><code># test 1: port assignment errors increased
PortNameAssigned &gt; PuppetGenerated @ Leak(0.5)
PortNameAssigned &gt; RequestedServices @ Leak(0.5)
# test 2: puppet generated errors increased
PuppetGenerated &gt; PuppetConfigMerged @ Leak(0.5)
PuppetGenerated &gt; InflightServices @ Leak(0.5)
# test 3: puppet merged errors increased
PuppetConfigMerged &gt; ServerCapacityAllocated @ Leak(0.5)
PuppetConfigMerged &gt; PuppetGenerated @ Leak(0.5)
</code></pre> <p>Comparing the impact of increasing the error rates from 20% to 50% in each of the three error loops, we can get a sense of the model&rsquo;s sensitivity to each error.</p> <p><img src="https://lethain.com/static/blog/strategy/uber-model-chart-diff-errors.png" alt="Chart showing impact of increased error rates in different stages of provisioning."></p> <p>This chart captures why exercising is so impactful: we&rsquo;d assumed during sketching that errors in Puppet generation would matter the most because they caused a long trip backwards, but it turns out a very high error rate early in the process matters even more because there are still multiple other potential errors later on that compound on its increase.</p> <p>Next we can get a sense of the impact of hiring 
more people onto the service provisioning team to manually provision more services, which we can model by increasing the maximum size of the inflight services stock from <code>10</code> to <code>50</code>.</p> <pre><code># initial model
RequestedServices &gt; InflightServices(0, 10) @ Leak(1.0)
# with 5x capacity!
RequestedServices &gt; InflightServices(0, 50) @ Leak(1.0)
</code></pre> <p>Unfortunately, we can see that even increasing the team&rsquo;s capacity by 500% doesn&rsquo;t solve the backlog of requested services.</p> <p><img src="https://lethain.com/static/blog/strategy/uber-model-chart-infra-hiring.png" alt="Chart showing impact of increased infrastructure engineering hiring on service provisioning."></p> <p>There&rsquo;s some impact, but not that much, and the backlog of requested services remains extremely high. We can conclude that more infrastructure hiring isn&rsquo;t the solution we need, but let&rsquo;s see if moving to self-service is a plausible solution.</p> <p>We can simulate the impact of moving to self-service by removing the maximum size from inflight services entirely:</p> <pre><code># initial model
RequestedServices &gt; InflightServices(0, 10) @ Leak(1.0)
# simulating self-service
RequestedServices &gt; InflightServices(0) @ Leak(1.0)
</code></pre> <p>We can see this finally solves the backlog.</p> <p><img src="https://lethain.com/static/blog/strategy/uber-model-chart-self-service.png" alt="Chart showing impact of self-service provisioning on provisioning rate."></p> <p>At this point, we&rsquo;ve exercised the model a fair amount and have a good sense of what it wants to tell us. 
We know which errors matter the most to invest in early, and we also know that we need to make the move to a self-service platform sometime soon.</p>Refining strategy with Wardley Mapping.https://lethain.com/wardley-mapping/Thu, 02 Jan 2025 06:00:00 -0700https://lethain.com/wardley-mapping/<p>The first time I heard about Wardley Mapping was from Charity Majors discussing it on Twitter. Of the three core <a href="https://lethain.com/refining-eng-strategy/">strategy refinement techniques</a>, this is the technique that I&rsquo;ve personally used the least. Despite that, I decided to include it in this book because it highlights how many different techniques can be used for refining strategy, and also because it&rsquo;s particularly effective at looking at the broadest ecosystems your organization exists in.</p> <p>Where the other techniques like <a href="https://lethain.com/strategy-systems-modeling/">systems thinking</a> and <a href="https://lethain.com/testing-strategy-iterative-refinement/">strategy testing</a> often zoom in, Wardley mapping is remarkably effective at zooming out.</p> <p>In this chapter, we&rsquo;ll cover:</p> <ul> <li>A ten-minute primer on Wardley mapping</li> <li>Recommendations for tools to create Wardley maps</li> <li>When Wardley maps are an ideal strategy refinement tool, and when they&rsquo;re not</li> <li>The process I use to map, as well as integrate a Wardley map into strategy creation</li> <li>Breadcrumbs to specific Wardley maps that provide examples</li> <li>Documenting a Wardley map in the context of a strategy writeup</li> <li>Why I limited focus on two elements of Wardley&rsquo;s work: doctrines and gameplay</li> </ul> <p>After working through this chapter, and digging into some of this book&rsquo;s examples of Wardley Maps, you&rsquo;ll have a good background to start your own mapping practice.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that 
I&rsquo;m brainstorming in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> </div> <h2 id="ten-minute-primer">Ten minute primer</h2> <p>Wardley maps are a technique created by Simon Wardley to ensure your strategy is grounded in reality. Or, as mapping practitioners would say, it&rsquo;s a tool for creating situational awareness. If you have a few days, you might want to start your dive into Wardley mapping by reading Simon Wardley&rsquo;s book on the topic, <em><a href="https://medium.com/wardleymaps/on-being-lost-2ef5f05eb1ec">Wardley Maps</a></em>. If you only have ten minutes, then this section should be enough to get you up to speed on reading Wardley maps.</p> <p>Picking an example to work through, we&rsquo;re going to create a Wardley map that aims to understand a knowledge base management product, along the lines of a wiki like Confluence or Notion.</p> <p><img src="https://lethain.com/static/blog/strategy/intro-wardley-init.png" alt="Diagram showing a basic Wardley map for a knowledge base management application."></p> <p>You need to know three foundational concepts to read a Wardley map:</p> <ol> <li> <p>Maps are populated with three kinds of components: users, needs, and capabilities. Users exist at the top, and represent a cohort of users who will use your product. Each kind of user has a specific set of needs, generally tasks that they need to accomplish. Each need depends on certain capabilities required to fulfill it.</p> <p>Any box connecting directly to a user is a need. Any box connecting to a need is a capability. A capability can be connected to any number of needs, but can never connect directly to a user; they connect to users only indirectly via a need.</p> </li> <li> <p>The x-axis is divided into four segments, representing how commoditized a capability is. 
On the far left is genesis, which represents a brand-new capability that hasn&rsquo;t existed before. On the far right is commoditized, something so standard and expected that it&rsquo;s unremarkable, like turning on a switch causing electricity to flow. In between are custom and product, the two categories where most items fall on the map. Custom represents something that requires specialized expertise and operation to function, such as a web application that requires software engineers to build and maintain. Product represents something that can generally be bought.</p> <p>In this map, document reading is commoditized: it&rsquo;s unremarkable if your application allows its users to read content. On the other hand, document editing is somewhere on the border of product and custom. You might integrate an existing vendor for document editing needs, or you might build it yourself, but in either case document editing is less commoditized than document reading.</p> </li> <li> <p>The y-axis represents visibility to the user. In this map, reading documents is something that is extremely visible to the user. On the other hand, users depend on something indexing new documents for search, but your users will generally have no visibility into the indexing process or even that you have a search index to begin with.</p> </li> </ol> <p>Although maps can get quite complex, those three concepts are generally sufficient to allow you to decode an arbitrarily complex map.</p> <p>In addition to mapping the current state, Wardley maps are also excellent at exploring how circumstances might change over time. 
To illustrate that, let&rsquo;s look at a second iteration of our map, paying particular attention to the red arrows indicating capabilities that we expect to change in the future.</p> <p><img src="https://lethain.com/static/blog/strategy/intro-wardley-future.png" alt="Diagram showing a basic Wardley map for a knowledge base management application."></p> <p>In particular, the map now indicates that the current document creation experience will be superseded by an AI-enhanced editing process. Critically, the map also predicts that the AI-enhanced process will be more commoditized than its current authoring experience, perhaps because the AI-enhancement will be driven by commoditized foundational models from providers like Anthropic and OpenAI. Building on that, the only place left in the map for meaningful differentiation is in search indexing. Either the knowledge base company needs to accept the implication that they will increasingly be a search company, or they need to expand the user needs they service to find a new avenue for differentiation.</p> <p>Some maps will show evolution of a given capability using a &ldquo;pipeline&rdquo;, a box that describes a series of expected improvements in a capability over time.</p> <p><img src="https://lethain.com/static/blog/strategy/intro-wardley-future-pipeline.png" alt="Diagram showing a basic Wardley map for a knowledge base management application."></p> <p>Now instead of simply indicating that the authoring experience may be replaced by an AI-enhanced capability over time, we&rsquo;re able to express a sequence of steps. 
From the starting place of a typical editing experience, the next expected step is AI-assisted creation, and then finally we expect AI-led creation where the author only provides high-level direction to a machine learning-powered agent.</p> <p>For completeness, it&rsquo;s also worth mentioning that some Wardley maps will have an overlay, which is a box to group capabilities or requirements together by some common denominator. This happens most frequently to indicate the responsible team for various capabilities, but it&rsquo;s a technique that can be used to emphasize any interesting element of a map&rsquo;s topology.</p> <p><img src="https://lethain.com/static/blog/strategy/intro-wardley-team-overlay.png" alt="Diagram showing a basic Wardley map for a knowledge base management application, with an overlay to show which teams own which capabilities."></p> <p>At this point, you have the foundation to read a Wardley map, or get started creating your own. Maps you encounter in the wild might appear significantly more complex than these initial examples, but they&rsquo;ll be composed of the same fundamental elements.</p> <div class="bg-light-gray br4 ph3 pv1"> <p><strong>More Wardley Mapping resources</strong></p> <p><em><a href="https://itrevolution.com/product/the-value-flywheel-effect/">The Value Flywheel Effect</a></em> by David Anderson</p> <p><em><a href="https://medium.com/wardleymaps/on-being-lost-2ef5f05eb1ec">Wardley Maps</a></em> by Simon Wardley on Medium, also <a href="https://learnwardleymapping.com/book/">available as PDF</a></p> <p><a href="https://learnwardleymapping.com/">Learn Wardley Mapping</a> by Ben Mosior</p> <p><a href="https://list.wardleymaps.com/">wardleymaps.com&rsquo;s resources</a> and <a href="https://www.youtube.com/wardleymaps">@WardleyMaps on YouTube</a></p> </div> <h2 id="tools-for-wardley-mapping">Tools for Wardley Mapping</h2> <p>Systems modeling has a serious tooling problem, which often prevents would-be adopters from developing 
their systems modeling practice. Fortunately, Wardley Mapping doesn&rsquo;t suffer from that problem. You can simply print out a Wardley Map and draw on it by hand. You can also use OmniGraffle, Miro, Figma or whatever diagramming tool you&rsquo;re already familiar with.</p> <p>There are more focused tools as well, with Ben Mosior pulling together an excellent writeup on <a href="https://learnwardleymapping.com/2024/06/24/top-5-wardley-mapping-tools-for-2024/">Wardley Mapping Tools as of 2024</a>. Of those, I&rsquo;d strongly encourage starting with <a href="https://mapkeep.com/">Mapkeep</a> as a simple, free, and intuitive tool for your initial mapping needs.</p> <p>After you&rsquo;ve gotten some practice, you may well want to move back into your most familiar diagramming tool to make it easier to collaborate with colleagues, but initially prioritize the simplest tool you can to avoid losing learning momentum on configuration, setup and so on.</p> <h2 id="when-are-wardley-maps-useful">When are Wardley Maps useful?</h2> <p>All successful strategy begins with understanding the constraints and circumstances that the strategy needs to work within. Wardley mapping labels that understanding as situational awareness, and creating situational awareness is the foremost goal of mapping.</p> <p>Situational awareness is always useful, but it&rsquo;s particularly essential in highly dynamic environments where the industry around you, competitors you&rsquo;re selling against, or the capabilities powering your product are shifting rapidly. In the past several decades, there have been a number of these dynamic contexts, including the rise of web applications, the proliferation of mobile devices, and the expansion of machine learning techniques.</p> <p>When you&rsquo;re in those environments, it&rsquo;s obvious that the world is changing rapidly. 
What&rsquo;s sometimes easy to miss is that any strategy that needs to last longer than a year or two is built on an evolving foundation, even if things seem very stable at the time. For example, in the early 2010s, startups like Facebook, Uber and Digg were all operating in physical datacenters with their own hardware. Over a five year period, having a presence in a physical datacenter went from the default approach for startups to a relatively unconventional solution, as cloud-based infrastructure rapidly expanded. Any strategy written in 2010 that imagined the world of hosting was static was destined to be invalidated.</p> <p>No tool is universally effective, and that&rsquo;s true here as well. While Wardley maps are extremely helpful at understanding broad change, my experience is that they&rsquo;re less helpful in the details. If you&rsquo;re looking to optimize your onboarding funnel, then something like <a href="https://lethain.com/strategy-systems-modeling/">systems modeling</a> or <a href="https://lethain.com/testing-strategy-iterative-refinement/">strategy testing</a> is likely going to serve you better.</p> <h2 id="how-to-wardley-map">How to Wardley Map</h2> <p>Learning Wardley mapping is a mix of reading others&rsquo; maps and writing your own. A variety of maps for reading are collected in the following breadcrumbs section, and I&rsquo;d recommend skimming all of them. In this section are the concrete steps I&rsquo;d encourage you to follow for creating the first map of your own:</p> <ol> <li> <p><strong>Commit to starting small and iterating.</strong> Simple maps are the foundation of complex maps. Even the smallest Wardley map will have enough detail to reveal something interesting about the environment you&rsquo;re operating in.</p> <p>Conversely, by starting complex, it&rsquo;s easy to get caught up in all of your early map&rsquo;s imperfections. At worst, this will cause you to lose momentum in creating the map. 
At best, it will accidentally steer your attention rather than facilitating discovery of which details are important to focus on.</p> </li> <li> <p><strong>List users, needs and capabilities.</strong> Identify the first one or two users for your product. Going back to the knowledge management example from the primer, your two initial users might be an author and a reader. From there, identify those users&rsquo; needs, such as authoring content, finding content, and providing feedback on which content is helpful. Finally, write down the underlying technical capabilities necessary to support those needs, which might range from indexing content in a search index to a customer support process to deal with frustrated users.</p> <p>Remember to start small! On your first pass, it&rsquo;s fine to focus on a single user. As you iterate on your map, bring in more users, needs and capabilities until the map conveys something useful.</p> <p>Tooling for this can be a piece of paper or wherever you keep notes.</p> </li> <li> <p><strong>Establish value chains.</strong> Take your list and then connect each of the components into chains. For example, the reader in the above knowledge base example would then be connected to needing to discover content. Discovering content would be linked to indexing in the search index. That sequence from reader to discovering content to search index represents one value chain.</p> <p>Convergence across chains is a good thing. As your chains get more comprehensive, it&rsquo;s expected that a given capability would be referenced by multiple different needs. Similarly, it&rsquo;s expected that multiple users might have a shared need.</p> </li> <li> <p><strong>Plot value chains</strong> on a Wardley Map. 
You can do this using any of the tools discussed in the Tools for Wardley mapping section, including a piece of paper.</p> <p>Because you already have the value chains created, what you&rsquo;re focused on in this step is placing each component relative to its visibility to users (higher up is more visible to the user, lower down is less visible), and how mature the solutions are (leftward represents more custom solutions, rightward represents more commoditized solutions).</p> </li> <li> <p><strong>Study current state</strong> of the map. With the value chains plotted on your map, it will begin to reveal where your organization&rsquo;s attention should be focused, and what complexity you can delegate to vendors. Jot down any realizations you have from this topology.</p> </li> <li> <p><strong>Predict</strong> evolution of the map, and create a second version of your map that includes these changes. (Keep the previous version so you can better see the evolution of your thinking!)</p> <p>It can be helpful to create multiple maps that contemplate different scenarios. Thinking about the running knowledge base example, you might contemplate a future where AI-powered tools become the dominant mechanism for authors creating content. Then you could explore another future where such tools are regulated out of most products, and imagine how that would shape your approach differently.</p> <p>Picking the timeframe for these changes will depend on the environment you&rsquo;re mapping. Always prefer a timeframe that makes it easy to believe changes will happen, maybe that&rsquo;s five years, or maybe it&rsquo;s 12 months. 
If you&rsquo;re caught up wondering whether change might take longer than a certain timeframe, then simply extend your timeframe to sidestep that issue.</p> </li> <li> <p><strong>Study future state</strong> of the map, now that you&rsquo;ve predicted the future. Once again, write down any unexpected implications of this evolution, and how you may need to adjust your approach as a result.</p> </li> <li> <p><strong>Share with others</strong> for feedback. It&rsquo;s impossible for anyone to know everything, which is why the best maps tend to be a communal creation. That&rsquo;s not to suggest that you should perform every step in a broad community, or that your map should be the consensus of a working group. Instead, you should test your map against others, see what they find insightful and what they find artificial in the map, and include that in your map&rsquo;s topology.</p> </li> <li> <p><strong>Document</strong> what you&rsquo;ve learned as discussed below in the section on documentation. You should also connect that Wardley map writeup with your overall strategy document, typically in the <a href="https://lethain.com/components-of-eng-strategy/">Refine or Explore sections</a>.</p> </li> </ol> <p>One downside of presenting steps to do something is that the sequence can become a fixed recipe. These are the steps that I&rsquo;ve found most useful, and I&rsquo;d encourage you to try them if mapping is a new tool in your toolkit, but this is far from the canonical way. 
Start here, then experiment with other approaches until you find the best approach for you and the strategies that you&rsquo;re working on.</p> <h2 id="breadcrumbs-for-wardley-map-examples">Breadcrumbs for Wardley Map examples</h2> <div class="bg-light-gray br4 ph3 pv1"> <p><em>I&rsquo;ll update these examples as I continue writing more strategies for this book.</em> <em>Until then, I admit that some of these examples are &ldquo;what I have laying around&rdquo; more so than the &ldquo;ideal forms of Wardley maps.&rdquo;</em></p> </div> <p>With the foundation in place, the best way to build on Wardley mapping is writing your own maps. The second best way is to read existing maps that others have made, a number of which exist within this book:</p> <ul> <li><a href="wardley-llm-ecosystem">LLM evolution</a> studies the evolution of the Large Language Model ecosystem, and how that will impact product engineering organizations attempting to validate and deploy new paradigms like agentic workflows and retrieval augmented generation</li> <li><a href="https://lethain.com/measuring-developer-experience-benchmarks-theory-of-improvement/">Evolution of developer experience tooling space</a> explores how Wardley mapping has helped me refine my understanding of how the developer experience ecosystem will evolve over time</li> </ul> <p>In addition to the maps within this book, I also label maps that I create on my blog using the <a href="https://lethain.com/tags/wardley/">wardley category</a>.</p> <h2 id="how-to-document-a-wardley-map">How to document a Wardley Map</h2> <p>As explored in <a href="https://lethain.com/readable-engineering-strategy-documents/">how to create readable strategy documents</a>, the default temptation is to structure documents around the creation process. 
However, it&rsquo;s essentially always better to write in two steps: develop a writing-optimized version that&rsquo;s focused on facilitating thinking, and then rework it into a reading-optimized version that supports both readers who are, and are not, interested in the details.</p> <p>The writing-optimized version is what we discussed in &ldquo;How to Wardley Map&rdquo; above. For a reading-optimized version, I recommend:</p> <ol> <li> <p><strong>How things work today</strong> shares a map of the current environment, explains any interesting rationales or controversies behind placements on the map, and highlights the most interesting parts of the map</p> </li> <li> <p><strong>Transition to future state</strong> starts with a second map, this one showing the transition from the current state to a projected future state. It&rsquo;s very reasonable to have multiple distinct maps, each of which considers one potential evolution, or one step of a longer evolution.</p> </li> <li> <p><strong>Users and Value chains</strong> are the first place you start creating a Wardley map, but generally the least interesting part of explaining a map&rsquo;s implications. This isn&rsquo;t because the value chains are unimportant, rather it&rsquo;s because the map itself tends to implicitly explain the value chain enough that you can move directly to focusing on the map&rsquo;s most interesting implications.</p> <p>In a sufficiently complex map, it&rsquo;s very reasonable to split this into two sections, but generally I find it eliminates redundancy to cover users and value chains in one joint section rather than separately. This is a good example of the difference between reading and writing: splitting these two topics helps clarify thinking, but muddles reading.</p> </li> </ol> <p>This ordering may seem too brief or a bit counter-intuitive for you, as the person who has the full set of details, but my experience is that it will be simpler to read for most readers. 
That&rsquo;s because most readers read until they agree with the conclusion, then stop reading, and are only interested in the details if they disagree with the conclusion.</p> <p>This format is also fairly different than the format I recommend for documenting systems models. That is because systems model diagrams exclude much of the relevant detail, showing the relationship between stocks but not showing the magnitude of the flows. You can only fully understand a system model by seeing both the diagram and a chart showing the model&rsquo;s output. Wardley maps, on the other hand, tend to be more self-explanatory, and often can stand on their own with relatively less written description.</p> <h2 id="what-about-doctrines-and-gameplay">What about doctrines and gameplay?</h2> <p>This book&rsquo;s <a href="https://lethain.com/components-of-eng-strategy/">components of strategy</a> are most heavily influenced by Richard Rumelt&rsquo;s approach. Simon Wardley&rsquo;s approach to strategy built around Wardley Mapping could be viewed as a competing lens. For each problem that Rumelt&rsquo;s system solves, there is a Wardley solution as well, and it&rsquo;s worth mentioning some of the components I&rsquo;ve not included, and why I didn&rsquo;t.</p> <p>The two most important components I&rsquo;ve not discussed thus far are Wardley&rsquo;s ideas of <a href="https://learnwardleymapping.com/2020/08/17/principles-first/">doctrine</a> and <a href="https://www.wardleymaps.com/gameplay">gameplay</a>. Wardley&rsquo;s doctrines are universally applicable practices like knowing your users, biasing towards data, and designing for constant evolution. Gameplay is similar to doctrine, but is context-dependent rather than universal. 
Some examples of gameplay are talent raid (hiring from a knowledgeable competitor), bundling (selling products together rather than separately), and exploiting network effects.</p> <p>I decided not to spend much time on doctrine and gameplay because I find them lightly specialized on the needs of business strategy, and consequently a bit messy to apply to the sorts of problems that this book is most interested in solving: the problems of engineering strategy.</p> <p>To be explicit, I don&rsquo;t personally view Rumelt&rsquo;s and Wardley&rsquo;s approaches as competing efforts. What&rsquo;s most valuable is to have a broad toolkit, and pull in the pieces of that toolkit that feel most applicable to the problems at hand. I find Wardley Maps exceptionally valuable at enhancing exploration, diagnosis, and refinement in some problems. In other problems, typically shorter-duration or more internally-oriented, I find the Rumelt playbook more applicable. In all problems, I find the combination more valuable than anchoring in one camp&rsquo;s perspective.</p> <h2 id="summary">Summary</h2> <p>No refinement technique will let you reliably predict the future, but Wardley mapping is very effective at helping you plot out the various potential futures your strategy might need to operate in. With those futures in mind, you can tune your strategy to excel in the most likely, and to weather the less desirable.</p> <p>It took me years to dive into Wardley mapping. Once I finally did, it was simpler than I&rsquo;d feared, and now I find myself creating Wardley maps somewhat frequently. When you&rsquo;re working on your next strategy that&rsquo;s impacted by the ecosystem&rsquo;s evolution around it, try your hand at mapping, and soon you&rsquo;ll <a href="https://lethain.com/tags/wardley/">start to build your own collection of maps</a>.</p>