<h1>Irrational Exuberance</h1> <p>Recent content on Irrational Exuberance, by Will Larson: <a href="https://lethain.com/">https://lethain.com/</a></p> <h1>Developing domain expertise: get your hands dirty.</h1> <p><a href="https://lethain.com/domain-expertise/">https://lethain.com/domain-expertise/</a> (Tue, 16 Jul 2024)</p> <p>Recently, I’ve been thinking about developing domain expertise, and wanted to collect my thoughts here. Although I covered some parts of this in <a href="https://lethain.com/first-ninety-days-cto-vpe/">Your first 90 days as CTO</a> (understanding product analytics, shadowing customer support, talking to customers, and talking with your internal experts), I missed the most important dimension of effective learning: getting your hands dirty.</p> <p>At Carta, I’m increasingly spending time focused on our fund financials business, which requires a deep understanding of accounting. I did not join Carta with a deep understanding of accounting. Initially, I hoped that I would learn enough accounting through customer escalations, project review, and so on, but recently decided I <em>also</em> needed to work through <em><a href="https://www.amazon.com/dp/1119594596">Financial Accounting, 11th Edition</a></em> by Weygandt, Kimmel, and Kieso.</p> <p>The tools for building domain expertise vary quite a bit across companies, and I found the same tools ranged from excellent to effectively useless when applied across Stripe (an increasingly expansive platform for manipulating money online), SocialCode (a Facebook advertising optimization company), and Carta (a platform for fund administration and a platform for cap table management). Here are some notes about approaches taken at specific companies, followed by some generalized recommendations.</p> <h2 id="uber">Uber</h2> <p>Uber likely had the simplest and most effective strategy of any product I’ve worked on: each employee got several hundred dollars of Uber credits every month to use the product. This, combined with the fact that almost all early hires lived in markets that had an active Uber marketplace going, meant that our employees intimately experienced the challenges.</p> <p>This was particularly impactful for folks who traveled to other cities or countries and experienced using Uber there. Often the experience was pretty inconsistent across cities, and experiencing that inconsistency directly was extremely valuable.</p> <h2 id="carta">Carta</h2> <p>Returning to my starting paragraph on Carta, Carta operates in a series of deep, complex domains: equity management is a complex legal domain, and fund administration is a complex accounting domain. Ramping in either, let alone both, is difficult.</p> <p>Carta has an unlimited individual book budget, and they pay for the Certified Equity Professional (CEP) test. These are good educational benefits, but are more a platform that you can build on than the actual learning itself.
Teams working on products tend to develop deep domain expertise by building in that domain, but that approach is difficult to apply as an executive, because I’m typically engaging with so many different products and problems simultaneously.</p> <p>In addition to the standard foundation of domain learning (talking to customers, digging into product and business analytics, etc), I’ve found three mechanisms particularly helpful: our executive sponsor program, reading textbooks, and initiative-specific deep dives.</p> <p>For our executive sponsor program, we assign a C-level executive to each key customer; that sponsor is involved in every escalation, holds periodic check-ins, and advocates for those customers in our roadmap planning. By design, being a sponsor is painful when things don’t go well, and that is a very pure, effective learning mechanism: figure out the customer’s problem, and then track resolving it through the company. Some days I don’t <em>enjoy</em> being a sponsor, but it’s the most effective learning mechanism I’ve found for our exceptionally deep domains, and I’m grateful we rolled the program out.</p> <p>Second, I’ve found book learning very effective at creating a foundation to dig into product and technical considerations in the accounting domain. For example, soon after joining I got a short refresher on accounting by reading <em><a href="https://www.amazon.com/dp/0981454224">Accounting Made Simple</a></em> by Mike Piper in a couple of hours. Later, I worked through the <a href="https://www.udemy.com/course/partnership-accounting/?couponCode=LETSLEARNNOWPP">Partnership Accounting</a> course on Udemy, and now I’m working through two textbooks, <em><a href="https://www.amazon.com/dp/1119594596">Financial Accounting, 11th Edition</a></em> and <em><a href="https://www.amazon.com/gp/product/093118701X/">Understanding Partnership Accounting</a></em>.</p> <p>Finally, initiative-specific deep dives have been a good opportunity to work directly with a team on a narrow, complex problem until we solved it together. This taught me a lot about the domain and the individuals, and hopefully gave them a better sense of, and relationship with, me as an executive sponsoring a project they also cared about. My first big project was working with our payments infrastructure team to support automated money movement in our fund administration product, and I learned <em>so much</em> from the team on that project. I also know there’s no chance I’d understand the complexities at the intersection of money movement and fund administration so well if I hadn’t gotten to work with them on that project.</p> <h2 id="stripe">Stripe</h2> <p>At the time I joined Stripe, all new employees were encouraged to read <em><a href="https://www.amazon.com/Payments-Systems-U-S-Third-Professional/dp/0982789742">Payments Systems in the U.S.</a></em>. More ambitious folks usually built a straightforward Stripe store of some sort: <a href="https://blog.singleton.io/">David Singleton</a> created a site to sell journals, and <a href="https://michellebu.com/">Michelle Bu</a> maintained a store that sold t-shirts with the seconds since epoch printed on them. Building a store was a great educational experience, but maintaining the live store was significantly more valuable for understanding the friction that bothered our users.
Things like forced upgrades or late tax forms are abstract when imposed on others, and illuminating when you experience them directly.</p> <p>As Stripe got increasingly broad and complex, it became increasingly difficult for anyone to maintain a deep understanding of the entire stack. To combat that trend, executives relied more on mechanisms like project-driven learning on high-priority projects, and executive sponsors for key customers. They certainly also relied on standard mechanisms like talking to customers frequently, regularly reviewing product data, and so on.</p> <h2 id="intercom">Intercom</h2> <p>Some years back I met <a href="https://www.linkedin.com/in/scanlanb/">Brian Scanlan</a>, who told me that executives at <a href="https://www.intercom.com/">Intercom</a> would start each offsite by doing a quick integration of their product into a new website. The goal wasn’t to do a novel integration, but to stay close to the integration experience of their typical user. I still find this a fairly profound idea, and I tried a variation of it at Carta’s most recent executive offsite, making every executive start the offsite by performing a handful of fund administration tasks on a demo fund’s data.</p> <h2 id="felt">Felt</h2> <p>Chatting with one of the founders at <a href="https://felt.com/">Felt</a>, <a href="https://duruk.net/">Can Duruk</a>, about this topic, he mentioned that they maintain an <a href="https://gisfundamentals.felt.com/Fundamentals-of-GIS-0fb2a768ff37420cabe7d10b6cac98c2">introduction to Geographic Information Systems</a> for both employees and users to understand the domain. They also hired an in-house cartographer who helps educate the team on the details of map making.</p> <h2 id="recommendations">Recommendations</h2> <p>The recommendations I would make are embedded in the specific stories above, but I’ll compact them into a list as well for easier reference. Some particularly useful mechanisms for senior leaders to develop domain expertise are:</p> <ul> <li><strong>Review product analytics on a recurring basis</strong>. Your goal is to build an intuition around how the data should move, and then refine that intuition against how the data moves in reality.</li> <li><strong>Shadow customer support</strong> to see customer issues and how those issues are resolved.</li> <li><strong>Assign named executive sponsors for key customers.</strong> Those sponsors should meet with those customers periodically, be a direct escalation point for those customers, be aware of issues impacting those customers, and be an advocate for those customers’ success.</li> <li><strong>Directly use or integrate with the product</strong>. Try to find ways that more closely mirror different customer cohorts rather than just what you find most common.
For example, if you only used Uber in San Francisco in 2014, you had a radically misguided intuition about how well Uber worked.</li> <li><strong>Make an executive offsite ritual around using the product.</strong> Follow Intercom’s approach to routinely integrate the core parts of your product from scratch, experiencing the challenges of your new users over and over to ensure those first experiences don’t degrade.</li> <li><strong>Use executive initiatives as an opportunity to dig deep into particular areas of the business.</strong> Over the past year, the areas at Carta that I’ve learned best are the ones where I embedded myself temporarily into a team dealing with a critical problem and kept with them until the problem was resolved.</li> <li><strong>Use a textbook or course-driven approach</strong> to understand the underlying domain that you’re working in. This applies from Uber’s marketplace management to Carta’s accounting.</li> </ul> <p>The details of ramping up on a specific domain will always vary a bit, but hopefully something in there gives you a useful starting point for digging into yours. So often executives take the view that the constraints are a problem for their teams, but I think great executive leadership only exists when individuals can combine the abstract mist of grand strategy with the refined nuance of how things truly work. If this stuff seems like the wrong use of your time, that&rsquo;s something interesting to reflect on.</p> <h1>Physics and perception.</h1> <p><a href="https://lethain.com/physics-perception/">https://lethain.com/physics-perception/</a> (Sat, 29 Jun 2024)</p> <p>At one point in 2019, several parts of Stripe’s engineering organization were going through a polite civil war. The conflict was driven by one group’s belief that Java should replace Ruby. Java would, they posited, address the ongoing challenge of delivering a quality platform in the face of both a rapidly growing business and a rapidly growing engineering organization. The other group believed Stripe’s problems were driven by <a href="https://lethain.com/quality/">a product domain with high essential complexity</a> and numerous, demanding external partners ranging from users to financial institutions to governments; switching programming languages wouldn’t address any of those issues. I co-wrote the internal version of <a href="https://lethain.com/magnitudes-of-exploration/">Magnitudes of exploration</a> in an attempt to find a useful framework for navigating that debate, but nonetheless the two groups struggled to make much progress in understanding one another.</p> <p>I was reminded of those discussions while reading the <a href="https://hardcoresoftware.learningbyshipping.com/p/020-innovation-versus-shipping-the">Innovation versus Shipping: The Cairo Project</a> chapter of Steven Sinofsky&rsquo;s <em><a href="https://www.amazon.com/Hardcore-Software-Inside-Rise-Revolution-ebook/dp/B0CYBS9PFY">Hardcore Software</a></em>:</p> <blockquote> <p>Landing on my desk early in 1993 was the first of many drafts of Cairo plans and documents. Cairo took the maturity of the NT product process—heavy on documentation and architectural planning—and amped it up. Like a well-oiled machine, the Cairo team was in short order producing reams of documents assembled into three-inch binders detailing all the initiatives of the product. Whenever I would meet with people from Cairo, they would exude confidence in planning and their processes.
… While any observer should have rightfully taken the abundance of documentation and confidence of the team as a positive sign, the lack of working code and ever-expanding product definition seemed to set off some minor alarms, especially with the Apps person in me. While the Cairo product had the appearance of the NT project in documentation, it seemed to lack the daily rigorous builds, ongoing performance and benchmarking, and quality and compatibility testing. There was a more insidious dynamic, and one that would prove a caution to many future products across the company but operating systems in particular.</p> </blockquote> <p>The simple narrative regarding both the Cairo development and Java migration is that there&rsquo;s a group doing the &ldquo;right&rdquo; thing, and another group doing the &ldquo;wrong&rdquo; thing. The Cairo team was shipping vaporware. The Java team was incorrectly diagnosing the underlying problems. These sorts of descriptions are comforting because they create the familiar narrative structure of &ldquo;good&rdquo; in conflict with &ldquo;evil.&rdquo; Unfortunately, I&rsquo;ve never found these sorts of narratives very useful for understanding what causes a conflict, and they&rsquo;re worse than useless at actually resolving conflicts.</p> <p>What I have found useful is studying what each faction knows <a href="https://lethain.com/layers-of-context/">that the other doesn&rsquo;t</a>, and trying to <a href="https://lethain.com/multi-dimensional-tradeoffs/">understand those gaps deeply enough to find a solution</a>. Sometimes I summarize this as &quot;solving for both physics and perception.&quot;</p> <h2 id="solving-for-perception">Solving for perception</h2> <p>Sinofsky represents Cairo as an impossibly broad project that didn&rsquo;t ship, but he also explains why it picked up so many features:</p> <blockquote> <p>Cairo tended to take this as a challenge to incorporate more and more capabilities. New things that would come along would be quickly added to the list of potential features in the product. Worse, something that BillG might see conceptually related, like an application from a third party for searching across all the files on your hard disk, might become a competitive feature to Cairo. Or more commonly “Can’t Cairo just do this with a little extra work?” and then that little extra work was part of the revised product plans.</p> </blockquote> <p>It wasn&rsquo;t ill-intentioned; rather, they simply wanted to live up to their CEO&rsquo;s expectations. They wanted to be perceived as succeeding within their company&rsquo;s value system, because they correctly understood that their project would be canceled otherwise.</p> <p>Many incoming leaders find themselves immediately stuck in similar circumstances. They&rsquo;ve just joined and don&rsquo;t understand the domain or team very well, but are being told they need to immediately make progress on a series of problems that have foiled the company&rsquo;s efforts thus far. They know they need to appear to be doing something valuable, so they do <em>anything</em> that might look like progress. It&rsquo;s particularly common for leaders to begin a <a href="https://lethain.com/grand-migration/">Grand Migration</a> at that moment, which they hope will solve the problems at hand, but no matter what will be perceived as a brave, audacious initiative.</p> <p><img src="https://lethain.com/static/blog/2024/physics-perception.png" alt="Image of stacked layers, with each layer belonging to a different team.
Some of these layers are grouped into perception, and some are grouped into physics. Neither perception nor physics represents the entire set."></p> <p>This isn&rsquo;t a problem unique to executives or product engineers; I frequently see platform teams make the same mistake when <a href="https://lethain.com/migrations/">they undertake large-scale migrations</a>. Many platform migrations are structured as <a href="https://lethain.com/incident-response-programs-and-your-startup/">an organizational program</a> where a platform team tells product teams they need to complete a certain task (e.g. &ldquo;move to our monorepo&rdquo;) by a certain date, along with tracking dashboards that inform executives which teams have or haven&rsquo;t completed their tasks. This does a great job of applying pressure to the underlying teams, and a good job of managing perceptions by appearing to push hard, but these migrations often fail because there&rsquo;s little emphasis on the underlying ergonomics of the migration itself. If you tell teams they are failing if they miss a date, they will try to hit the date; if it&rsquo;s hard, they&rsquo;ll still fail. Platform teams in that case often blame the product teams for not prioritizing their initiative, when instead the platform teams should have the self-awareness to recognize that they made things difficult by not simplifying the underlying physics for the product teams they asked to migrate.</p> <p>There&rsquo;s nothing wrong with solving for perception, and indeed it&rsquo;s a necessary skill to be an effective leader. Rather, the lesson here is that most meaningful projects require solving for both perception <em>and</em> physics.</p> <h2 id="solving-for-physics">Solving for physics</h2> <p>When I joined Stripe, one of the first projects I wanted to take on was migrating to Kubernetes and away from hand-rolled tooling for managing VMs directly. This was heavily influenced by what I had learned migrating Uber from a monolithic Python application to polyglot applications in a polyrepo. After a few months of trying to build alignment within engineering, I postponed the Kubernetes migration for a few years because I couldn&rsquo;t convince them it solved a pressing problem. (I did come back to it, and it was a success when I did.) I could have forced the team to work on that project, but it goes against my instincts: generally when engineers push back on leadership ideas, there&rsquo;s a good reason for doing so.</p> <p>Similarly, my initial push at Stripe was not toward the Ruby typing work that became <a href="https://sorbet.org/">Sorbet</a>, but rather to design an incremental migration towards an existing statically-typed language such as Java or Go. The argument I got back was that this was impractical because it required too large a migration effort, and that Facebook&rsquo;s <a href="https://hacklang.org/">Hack</a> had already proven out the viability of moving from PHP to a PHP-like typed language. I took my time to understand the pushback, and over time shifted my thinking to focus instead on sequencing these efforts: even if we wanted to move to a different language, first we needed to improve the architecture to support migrating modules, and that effort would benefit from typing Ruby.</p> <p>I was fortunate in these cases, because there were few perceptions that I needed to solve for, and I was able to mostly focus on the physics.
Indeed, the opportunity to focus on physics is one of the undervalued advantages of working within infrastructure engineering. You&rsquo;ll rarely be lucky enough in senior leadership roles to focus on the physics.</p> <p>For example, when I joined Carta, there was pressure across the industry and internally to increase our investment in using LLMs. Most engineers were quite skeptical of the opportunity to use LLMs, so if I&rsquo;d listened exclusively to the physics, I would have probably ignored the pressure to adopt them. However, that would have led me astray in two ways. First, I would have seriously damaged the wider executive team&rsquo;s belief in my ability to incorporate new ideas. Second, physics are anchored in how we understand the world today, and LLMs are a place where things are evolving quickly. Our approach to <a href="https://lethain.com/video-mental-model-for-how-to-use-llms-in-products/">using LLMs in our product</a> is better than anything we would have gotten to by <em>only</em> solving for physics. (And vastly better than we&rsquo;d have come up with if we&rsquo;d only solved for perception.)</p> <p>I think the LLM example is instructive because it violates the expectation that &ldquo;physics&rdquo; are real and &ldquo;perceptions&rdquo; are false. It can go both ways, depending on the circumstances. As soon as you get complacent about your perspective representing reality, you&rsquo;ll quickly be disabused of that notion.</p> <h2 id="balancing-physics-and-perception">Balancing physics and perception</h2> <p>Effective leaders meld perception and physics into approaches that solve both. This is hard to do, takes a lot of energy, and when done well often doesn&rsquo;t even look like you&rsquo;re doing that much. Many leaders try to solve both, but eventually give in to the siren&rsquo;s song of applying perception pressure without a point of view on how that pressure should be channeled into a physical plan. Applying pressure without a plan is the same issue as the infrastructure migration example above, where you can certainly create accountability, but it&rsquo;s pretty likely to fail.</p> <p>Pressure without a plan <em>is</em> appropriate at some level of seniority, and it&rsquo;s important to understand within a given organization where responsibility lies for appending a plan to the pressure. In a small startup (10s of people), that&rsquo;s probably the founders. In a medium-sized company (100s of people), that&rsquo;s likely the executive team. As the company grows, more and more of the plan will be devised further from the physics, but you always have to decide where planning should start.</p> <p>There is always a point where an organization will simply give up on planning and allow the pressure to cascade undeterred. In a high-functioning organization, that pressure point is quite high. In lower-functioning organizations, it will occur frequently even if there&rsquo;s little pressure.</p> <p>Just as you can reduce pressure too little, you can also reduce pressure too much. One of my biggest regrets from my time at Stripe is that I allowed too little pressure to hit my organization, which over time created a <a href="https://lethain.com/values-oasis/">values oasis</a> that operated with a clear plan but also limited pressure. When I left, the pressure regulator came off, and my organization had a rough patch learning to operate in the new circumstances.</p> <p>Altogether, this balance is difficult to maintain.
I&rsquo;m still getting better at it slowly over time, learning mostly from mistakes. As a final thought here, respecting physics doesn&rsquo;t necessarily mean doing what engineers want you to do: those who speak for physics aren&rsquo;t necessarily right. Instead, it&rsquo;s making a deliberate, calculated tradeoff between the two that&rsquo;s appropriate to the circumstances. Sometimes that&rsquo;s courageously pushing back on an impossible timeline, sometimes it&rsquo;s firing a leader who insists change is impossible.</p> <h1>How to create software quality.</h1> <p><a href="https://lethain.com/quality/">https://lethain.com/quality/</a> (Sun, 16 Jun 2024)</p> <p>I’ve been reading Steven Sinofsky’s <em><a href="https://www.amazon.com/Hardcore-Software-Inside-Rise-Revolution-ebook/dp/B0CYBS9PFY">Hardcore Software</a></em>, and particularly enjoyed this quote from a memo discussed in the <a href="https://hardcoresoftware.learningbyshipping.com/p/006-zero-defects">Zero Defects</a> chapter:</p> <blockquote> <p>You <em>can</em> improve the quality of your code, and if you do, the rewards for yourself and for Microsoft will be immense. <em>The hardest part is to decide that you want to write perfect code.</em></p> </blockquote> <p>If I wrote that in an internal memo, I imagine the engineering team would mutiny, but software quality is certainly an interesting topic where I continue to refine my thinking. There are so many software quality playbooks out there, and I increasingly believe that all these playbooks work <em>in their intended context</em>, but are often misapplied.</p> <p>For example, pretty much every startup has someone on an infrastructure team who believes that all quality problems can be solved with a sufficiently nuanced automated rollout strategy. That&rsquo;s generally been true in my experience at companies with a high volume of engaged usage, and has not at all been true in environments with low or highly varied usage. Unsurprisingly, folks who&rsquo;ve only seen high volume scenarios tend to overestimate the value of rollout techniques, and folks who&rsquo;ve never seen high volume scenarios tend to underestimate the value of rollout techniques.</p> <p>This observation is the underpinning of my beliefs about creating software quality. Expanding from that observation, I&rsquo;ll try to convince you of two things:</p> <ol> <li> <p>Creating quality is context specific. There are different techniques for solving essential domain complexity, scaling complexity, and accidental complexity.</p> <p>For example, phased automated rollouts don&rsquo;t help much if there&rsquo;s little consistency among your users&rsquo; behaviors.</p> </li> <li> <p>Quality is created both within the development loop and across iterations of the development loop. Feedback within the loop, as opposed to across iterations, creates quality more quickly. Feedback across iterations tends to <em>measure</em> quality, which informs future quality, but does not necessarily create it.</p> <p>For example, bugs detected after software is nominally complete tend to be fixed locally, even if they reveal a suboptimal approach. I&rsquo;ve seen projects launch using Redis for no reason, which later caused a production incident, just because the developer was interested in learning about Redis and it was too late to rip it out without doing substantial rework.</p> </li> </ol> <p>Those are some nice words.
Let’s see if I can convince you that they’re meaningful words.</p> <h2 id="defining-quality">Defining quality</h2> <p>Generally I think quality is in the eye of the beholder, but my experience writing for the internet indicates that people will be upset if I don’t supply a definition of quality, so here’s my working definition:</p> <ul> <li>Software behaves as its users anticipate it should behave</li> <li>Software is easy to modify</li> <li>Software meets reasonable non-functional requirements (latency, cost, etc)</li> </ul> <p>There are, undoubtedly, better definitions out there, so feel free to insert yours.</p> <h2 id="kinds-of-complexity">Kinds of complexity</h2> <p>Managing quality is largely about finding useful ways to deal with complexity. For example, a codebase might be complex due to its size. Complexity in large codebases can be managed by using a strongly typed language, increasing test coverage, and so on.</p> <p>I find it useful to recognize whether complexity is mostly driven by high scale (e.g. you&rsquo;re performing 10,000s or 100,000s of requests per second) or whether complexity is mostly driven by a complex business domain (e.g. you&rsquo;re trying to capture the intent of open-ended business contracts into a structured database). I think of the former as &ldquo;scale complexity&rdquo; and the latter as &ldquo;essential domain complexity&rdquo;, extending the phrase from Fred Brooks&rsquo; <a href="https://en.wikipedia.org/wiki/No_Silver_Bullet">No Silver Bullet</a>.</p> <p><img src="https://lethain.com/static/blog/2024/complexity-grid.png" alt="Two-by-two grid of complexity due to scale versus due to essential domain complexity."></p> <p>The third common sort of complexity is accidental complexity. If you&rsquo;ve used a bunch of different technologies for each part of your product because your team was excited to &ldquo;try out something new&rdquo;, then your problems are likely of the accidental variety.</p> <h2 id="creating-quality-is-context-specific">Creating quality is context-specific</h2> <p>My experience is that most folks in technology develop strongly-held opinions about creating quality that anchor heavily on their first working experiences. For example, many folks whose early jobs included:</p> <ul> <li>high-volume websites or APIs believe that automated rollbacks driven by production metrics are a fundamental mechanism to prevent a low quality release from impacting users. However, in a domain with infrequent, complex and precise user actions–very much the case for Carta’s domains of fund accounting and managing cap tables–those techniques are less helpful. They certainly might prevent <em>some</em> issues, but they’d fail to prevent many others because the volume of user actions is insufficient to test the full cardinality of potential configurations</li> <li>high-volume consumer-facing applications believe that A/B testing can determine quality. However, Michelle Bu’s <a href="https://increment.com/apis/api-design-for-eager-discerning-developers/">Eagerly discerning, discerningly eager</a> discusses how many tools for validating quality don’t apply effectively when it comes to API design.
She proposes friction logs as a replacement for user interviews, pilot programs as a replacement for beta testing, and dogfooding as a replacement for A/B testing</li> <li>highly critical and narrow problem spaces might benefit from multiple heterogeneous implementations voting to determine the correct answer in the face of software errors, as described in Gloria Davis’ 1987 paper <a href="https://ntrs.nasa.gov/api/citations/19870019975/downloads/19870019975.pdf">An Analysis of Redundancy Management Algorithms for Asynchronous Fault Tolerant Control Systems</a>. However, maintaining and verifying multiple heterogeneous implementations would be extremely costly for broad problem domains that are expected to change frequently (the typical problem domain for startups)</li> </ul> <p>The key observation here is that there’s no universal solution that “just works” across all problem domains. Even the universally accepted ideal of “have a highly conscientious and high-context engineer do it all themselves” isn’t a viable solution in an environment where the current team is too small for the desired volume of work.</p> <p>Generally, I think your approach to creating quality will vary on these dimensions:</p> <ul> <li><strong>Essential domain complexity</strong> – a problem domain with few workflows and conditions can often be validated by looking at user usage (e.g. an application like Instagram, TikTok or Calm has very few workflows and conditions in the software, although the variety of mobile devices they need to run on does increase essential complexity). A problem domain with many workflows or many conditions requires a different approach, which might range from simulating usage to formal specification.</li> <li><strong>Scalability complexity</strong> – sufficient user traffic solves a lot of problems. At Calm, Uber or Stripe, the scale was high enough that errors were immediately visible in operational data, even with an incremental rollout strategy. At Carta, that would only be true in the case of an exceptionally broad error (a category of error that’s relatively straightforward to catch with automated testing), given the different user usage patterns.</li> <li><strong>Maturity and tenure of team</strong> – a team with deep context in the problem domain is able to drive quality with fewer distinct roles, more individual empowerment, and less structured process. A team with high turnover generally leans on more defined roles where it’s easier to quickly develop context. Similarly, a team that feels accountable for high quality will behave differently than a team that feels otherwise.</li> </ul> <p>To develop my point a bit, let’s think about two combinations and how we’d approach quality differently:</p> <ul> <li><strong>Low essential complexity problem domain, high traffic, and deep team context</strong> – a thorough test suite will go a long way to validating the expected behavior, and the team has enough familiarity with the problem space for engineers to effectively test as they implement. If you do miss something, an incremental rollout mechanism, using production operating metrics to pace rollout, will probably prevent testing gaps from significantly impacting your users (a minimal sketch of this kind of metric-paced rollout follows this list)</li> <li><strong>High essential complexity problem domain, low traffic, and limited team context</strong> – designing effective test suites is a challenge because of your team’s limited context and the complex problem space.
This means you may need dedicated individuals working on testing frameworks to encapsulate domain context into a reusable testing harness/framework rather than relying on individual team members’ context. It might also mean a focus on highly defined types that codify parts of the domain context into the typechecker itself rather than depending on the awareness of each individual writing incremental code. Incremental rollout mechanisms are unlikely to catch issues missed by testing, because conditions differ enough across users that even a small number of catastrophically failing users are unlikely to meet the necessary thresholds</li> </ul>
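<p>To make the first combination concrete, here&rsquo;s a minimal sketch of what &ldquo;using production operating metrics to pace rollout&rdquo; can look like. The step sizes, thresholds, and helper functions are illustrative assumptions rather than a description of any particular company&rsquo;s deploy tooling:</p> <pre><code>import time

# Hypothetical rollout pacer: widen exposure while production metrics stay
# healthy, and roll back as soon as they do not. The thresholds and step
# sizes below are illustrative assumptions, not real production values.
ROLLOUT_STEPS = [1, 5, 25, 50, 100]  # percent of traffic on the new release
ERROR_RATE_THRESHOLD = 0.01          # abort if over 1% of requests fail
SOAK_SECONDS = 300                   # how long each step observes metrics


def observed_error_rate():
    # Stand-in for querying your metrics provider (Datadog, Prometheus, etc.).
    raise NotImplementedError("wire this to your metrics system")


def set_exposure(percent):
    # Stand-in for the deploy tooling that shifts traffic to the new release.
    raise NotImplementedError("wire this to your deploy tooling")


def rollout():
    for percent in ROLLOUT_STEPS:
        set_exposure(percent)
        time.sleep(SOAK_SECONDS)  # accumulate enough traffic to be meaningful
        if observed_error_rate() &gt; ERROR_RATE_THRESHOLD:
            set_exposure(0)  # automated rollback driven by production metrics
            return False
    return True
</code></pre> <p>The sketch also shows why the technique depends on traffic volume: with only a handful of user actions per soak window, the observed error rate is mostly noise, which is exactly why it contributes less in the second combination.</p>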
<p>These are radically different approaches, and if you naively apply the solution for one combination to another, you will generate a lot of motion but are unlikely to generate much impact. Once you’re aware of these combinations, you can start to see what sort of techniques are adopted by folks working on similar problems, but the awareness also helps you start to build a model for creating quality across various circumstances.</p> <h2 id="is-tolerance-for-error-an-important-dimension">Is tolerance for error an important dimension?</h2> <p>Before developing a model for reasoning about creating quality, a few comments on “tolerance for error.” I think many people would argue that your tolerance for error is an important dimension in determining your approach to creating quality, but to be honest I haven’t found that a useful dimension.</p> <p>For example, if you tell me that you highly prioritize quality at your company, and that’s why you have a very large quality assurance organization, or a very formal verification process, my first response would be skepticism. Quality assurance teams are extremely useful in many scenarios. Formal verification processes are also very useful in many scenarios. However, neither is universally ideal, and both can be done poorly.</p> <p>In my personal experience, the highest quality organizations are those with detail-oriented <a href="https://lethain.com/inspection/">executives who actively inspect</a> quality in their teams’ work. That executive-driven inspection elevates quality into first-tier work done by their leadership, and so on down the chain. In those organizations, there are often teams that build quality assurance tools, but those teams support quality rather than being directly accountable for it.</p> <p>Others’ personal experiences will be the opposite, and that’s entirely the point: I’ve seen low tolerance for error drive a wide variety of different approaches. Similarly, I’ve seen organizations with a high tolerance for errors both go heavy on quality assurance and heavy on engineer-led quality. As such, I don’t think this is an important dimension when reasoning about creating quality.</p> <h2 id="iterating-on-a-model">Iterating on a model</h2> <p>My model of creating quality started as a <a href="https://lethain.com/modeling-reliability/">model for reliability</a>.</p> <p><img src="https://lethain.com/static/blog/2019/reliability-diagram-9.png" alt="Mental model for creating reliable systems."></p> <p>There&rsquo;s a lot to like about this model. I think the idea of &ldquo;latent incidents&rdquo; is a particularly useful one, because it acknowledges that even if you improve your quality practices, it may take a very long time to drain the backlog of latent incidents such that you actually feel like you&rsquo;ve improved quality. I imagine many teams actually solve their quality problem but accidentally abandon their successful approach before they drain the backlog and realize they&rsquo;ve solved it.</p> <p>I made the above model during a period when I was almost entirely focused on accidental complexity from a sprawling codebase and scaling complexity from an increasingly large volume of concurrent usage. Because of that focus, I spent too little time thinking about the third source of complexity, essential domain complexity.</p> <p>What I&rsquo;d like to do here is to develop a model that incorporates both the insights of my prior model <em>and</em> the fact that essential domain complexity has been the largest source of quality issues in my current and prior roles at Carta and Calm.</p> <p>I think it&rsquo;s useful to start with the smallest possible loop for iterating on software, the <a href="https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop">Read Eval Print Loop</a> (aka REPL), first popularized in 1964.</p> <p><img src="https://lethain.com/static/blog/2024/repl.png" alt="Diagram of the classic Read, Evaluate, Print loop"></p> <p>Not much modern development is done directly in a REPL, but developer-led testing provides a very similar loop, with a single developer looping from writing a piece of code, to writing tests for that code, to running the tests, to addressing the feedback raised by those tests.</p> <p><img src="https://lethain.com/static/blog/2024/code-loop.png" alt="Diagram of the writing code, writing tests for code, then reviewing output from tests"></p> <p>Now let&rsquo;s try to introduce the fact that this tight iteration loop is just one phase of shipping software. After writing software, we also need to release that software into production. After all, as Steve Jobs said, real artists ship!</p> <p><img src="https://lethain.com/static/blog/2024/dev-release-loop.png" alt="Diagram of software engineer writing code and then following a release process to release that code."></p> <p>In particular, it&rsquo;s worth noticing the loops that occur within one node versus the loops that occur across nodes. It&rsquo;s significantly faster to iterate within a node than it is to iterate across nodes.</p> <p>We&rsquo;re starting to get a bit closer, but we&rsquo;re still missing a few key things needed to reason about the quality of our code. First, even after releasing code, there can be a defect rate where the implementation doesn&rsquo;t wholly solve the domain&rsquo;s essential complexity, such that there is a latent defect. Alternatively, there might be an emergent issue in production caused by either accidental complexity (e.g. sloppy environment setup) or scaling complexity (e.g. unexpected traffic spike).</p> <p>Second, sometimes engineers don&rsquo;t understand the feature they&rsquo;re trying to implement, even when they&rsquo;re trying hard to do so. This might be because they&rsquo;re new to the team and it&rsquo;s a complex problem space. Or it might be because they had a miscommunication with their product manager, who is responsible for defining the required functionality.</p> <p><img src="https://lethain.com/static/blog/2024/prod-soft-deploy-loop.png" alt="Diagram of development loop including defining feature requirements and latent errors."></p> <p>This model starts to get interesting! The first thing to note is just how delayed the feedback is from writing software to rewriting software if that feedback requires releasing the software.
If the handoff of specification from product to engineer goes awry, it may take weeks to detect the issue. This is even more profound in &ldquo;high cardinality&rdquo; problem domains where there&rsquo;s a great deal of divergence in user usage and user data: it may take months or quarters for the feedback to reach the developer about something they did wrong, at which point they&ndash;at best&ndash;have forgotten much of their original intentions.</p> <p>Like any good model, this one can be iterated on endlessly to capture the details that are most interesting for your situation:</p> <ul> <li>If you&rsquo;re mostly focused on scalability complexity, then the release process is likely particularly interesting for you.</li> <li>If you&rsquo;re focused on accidental complexity, then proactive and reactive controls on software design&ndash;mechanisms that gate access to and departure from the software development cluster&ndash;would be the most interesting, such as <a href="https://lethain.com/scaling-consistency/">architecture reviews</a> or <a href="https://lethain.com/reclaim-unreasonable-software/">verifying properties</a> within pull requests before merging them.</li> <li>If you&rsquo;re focused on essential domain complexity, then you&rsquo;re focused on either feature specification or the software development feedback loop. Techniques to address it might range from embedding new hires onto teams with high context, to requiring pull requests be reviewed by a domain expert, to developing a comprehensive test harness that makes it easy for developers to test new functionality against the full spectrum of unusual scenarios (a minimal sketch of such a harness follows this list).</li> </ul>
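<p>As a concrete illustration of that last idea, here&rsquo;s a hypothetical sketch of a harness where people with deep domain context register unusual-but-valid scenarios once, and every engineer&rsquo;s new functionality gets exercised against all of them. The scenario data and the fee calculation are invented for illustration, not taken from any real fund accounting system:</p> <pre><code>import pytest

# Hypothetical domain scenarios captured by people with deep context, so that
# individual engineers don't need to know every unusual case themselves.
SCENARIOS = {
    "calendar_year_fund": {"period_days": 365, "committed": 100_000_000, "rate": 0.02},
    "stub_period_fund": {"period_days": 45, "committed": 100_000_000, "rate": 0.02},
    "zero_commitment_lp": {"period_days": 365, "committed": 0, "rate": 0.02},
}


def accrue_management_fee(committed, rate, period_days):
    # Toy implementation standing in for the feature under test.
    return committed * rate * (period_days / 365)


@pytest.mark.parametrize("name,scenario", list(SCENARIOS.items()))
def test_fee_never_exceeds_annual_rate(name, scenario):
    fee = accrue_management_fee(
        scenario["committed"], scenario["rate"], scenario["period_days"]
    )
    # Encode a domain invariant rather than a single hand-picked expected value.
    assert 0 &lt;= fee &lt;= scenario["committed"] * scenario["rate"]
</code></pre> <p>The reusable part is the scenario registry and the invariant-style assertions: as the registry grows, each new feature gets checked against the accumulated domain context without every individual engineer needing to hold all of it in their head.</p>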
<p>The best way to get a feel for any model is to experiment with it. Delete pieces that aren&rsquo;t interesting to you, add pieces that seem to be missing from your perspective, and see what it teaches you.</p> <h2 id="measurement-contributes-to--creates-quality">Measurement contributes to (!= creates) quality</h2> <p>One important observation from this model is that errors detected in production, or even in release, are much harder to address effectively than errors detected by the engineer writing the software initially. I think of detecting errors after the software engineer handoff primarily as <em>measuring quality</em> rather than <em>creating quality</em>. My reason for this distinction is that any improvement from this measurement occurs in a later iteration of the loop, as opposed to within the current loop.</p> <p>I think this is an important distinction because it provides the vocabulary to discuss the role of software engineering teams and the role of quality assurance (QA) teams.</p> <p>Software engineering teams write software to address problem domain and scaling complexity. Done effectively, developer-led testing happens within the small, local development loop, such that there&rsquo;s no delay and no coordination overhead separating implementation and verification. In that way, developer-led testing directly contributes to quality in the stage that it&rsquo;s written.</p> <p>QA teams write software (or run manual processes) to measure the quality of that software in the problem domain. QA-led testing happens in a distinct step, even if that step occurs concurrently, and as such the software&rsquo;s initial design is not influenced by QA tests. That influence only occurs when the quality loop next repeats.</p> <p>This is an important distinction, because the later in development an issue is detected, the more likely it is that it&rsquo;s addressed tactically rather than structurally. Quality issues detected late are more likely to drive improvement in specific correctness (fix the test case) rather than in fundamental approach (redesign the architecture). None of this means that QA-led testing isn&rsquo;t valuable: it&rsquo;s very valuable for managing the sort of cross-feature bugs that an engineer with narrow context would not know to test for. But it does mean that developer-led testing of their current work creates quality sooner, and in ways that QA-led testing does not.</p> <p>To ground this observation in something specific, think about John Ousterhout&rsquo;s idea of defining errors out of existence from <em><a href="https://www.amazon.com/Philosophy-Software-Design-2nd-ebook/dp/B09B8LFKQL/">Philosophy of Software Design</a></em>. That principle argues that you can eliminate many potential software errors by designing interfaces which prevent the error from occurring. For example, instead of throwing an error when I attempt to delete a non-existent file, the interface might simply confirm the file does not exist. QA-led testing might ensure that the function throws the error only at the correct times, but it would only be developer-led design (potentially including developer-led testing or dogfooding) that would allow the quick iteration loop that supports changing the interface entirely to define that error out of existence.</p>
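<p>As a minimal illustration of that principle (my own sketch, not an example from the book), compare a file-removal helper that treats a missing file as an error with one that defines the error away by only promising that the file is gone afterwards:</p> <pre><code>import os


def remove_file_strict(path):
    # Callers must anticipate and handle FileNotFoundError, and some will forget.
    os.remove(path)


def remove_file(path):
    # Defines the error out of existence: the postcondition is simply
    # "the file does not exist", so a missing file already counts as success.
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
</code></pre> <p>Only the engineer inside the tight development loop is well positioned to make that interface change; a QA suite written against the first version would verify that the exception is raised at the correct times rather than ask whether it should exist at all.</p>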
<hr> <p>As an aside, this is more-or-less the same point I tried to make in <a href="https://github.com/readme/guides/incident-response">my contribution to GitHub ReadME</a>, but there I was focused on incident response. Measuring prior incidents and instances of incident response is an <em>input</em> to improving future incident response, but does not directly improve reliability: only finishing projects that increase reliability does that, and investing more into measurement when you aren&rsquo;t completing any projects doesn&rsquo;t solve anything.</p> <h2 id="what-should-you-do">What should you do?</h2> <p>The intended takeaway from all this is exactly where we started in the introduction: creating quality is context specific. Be wary of following the playbooks you&rsquo;ve seen before, even if those playbooks were tremendously successful. They might work extremely well, but they often don&rsquo;t unless you have a useful model for reasoning about why they worked in the former environment.</p> <h2 id="related-materials">Related materials</h2> <p>Throughout this piece, I&rsquo;ve tried to explain and reference ideas as I&rsquo;ve invoked them, but here are some of the materials that might be worth reading through if this is an interesting topic to you:</p> <ul> <li>“Define errors out of existence” is an idea from John Ousterhout’s <em><a href="https://www.amazon.com/Philosophy-Software-Design-2nd-ebook/dp/B09B8LFKQL/">Philosophy of Software Design</a></em>, and described with some great examples in <a href="https://wiki.tcl-lang.org/page/Define+Errors+Out+of+Existence">this page from the TCL Lang wiki</a></li> <li>The distinction between essential and accidental complexity, discussed in Fred Brooks&rsquo; <a href="https://en.wikipedia.org/wiki/No_Silver_Bullet">No Silver Bullet</a> from <em><a href="https://www.amazon.com/Mythical-Man-Month-Software-Engineering-Anniversary/dp/0201835959">Mythical Man Month</a></em>, is a valuable dimension for reasoning about complexity (and consequently, quality)</li> <li><a href="https://increment.com/apis/api-design-for-eager-discerning-developers/">Eagerly discerning, discerningly eager</a> is a great piece from Michelle Bu that discusses how API design is distinct from many other sorts of product design (e.g. can&rsquo;t use A/B testing, but can work with other companies as design partners)</li> <li>Kent Beck&rsquo;s <em><a href="https://www.amazon.com/Tidy-First-Personal-Exercise-Empirical/dp/1098151240">Tidy First?</a></em> discusses a number of strategies for addressing accidental quality issues within a codebase, mostly those caused by inconsistent implementations across a large codebase. I <a href="https://lethain.com/notes-on-tidy-first/">wrote up some notes on this book</a> a while back</li> <li><a href="https://lethain.com/reclaim-unreasonable-software/">Reclaim unreasonable software</a> captures my thinking about creating quality in a codebase that has become challenging to reason about. In this piece&rsquo;s vocabulary, it&rsquo;s most interested in solving accidental and scaling complexity</li> <li><a href="https://staffeng.com/guides/manage-technical-quality/">Manage technical quality</a> is a chapter from <em>Staff Engineer</em> which describes many of the tools you can use to improve quality. Most of the discussion here is composition-agnostic, e.g.
useful techniques that might or might not apply to various compositions, and certainly you can use the quality model to evaluate which might apply well for your circumstances</li> <li><em><a href="https://www.amazon.com/Domain-Driven-Design-Distilled-Vaughn-Vernon/dp/0134434420">Domain-Driven Design Distilled</a></em> by Vaughn Vernon is a good overview of domain-driven design, which is an approach to software development that applies particularly well to working in problem domains with high essential complexity</li> <li><em><a href="https://www.amazon.com/Practical-TLA-Planning-Driven-Development/dp/1484238281">Practical TLA+</a></em> by Hillel Wayne is a useful introduction to formal specification, which is a topic that not many software engineers spend time thinking about, but an interesting one nonetheless</li> <li><em><a href="https://www.amazon.com/Building-Evolutionary-Architectures-Automated-Governance-dp-1492097543/dp/1492097543/">Building Evolutionary Architectures</a></em> by Ford, Parsons, Kua and Sadalage has a number of ideas about guided evolution of codebases (here are <a href="https://lethain.com/building-evolutionary-architectures/">my notes on the 1st edition</a>). Generally composition-agnostic in their recommendations</li> </ul> <p>These are all well worth your time.</p> <h1>Video of Using LLMs in your product.</h1> <p><a href="https://lethain.com/video-mental-model-for-how-to-use-llms-in-products/">https://lethain.com/video-mental-model-for-how-to-use-llms-in-products/</a> (Fri, 14 Jun 2024)</p> <p>A month ago, I wrote up some notes on <a href="https://lethain.com/mental-model-for-how-to-use-llms-in-products/">using LLMs in your product</a>, and yesterday I got to present an iteration on those notes to the folks at Sapphire Ventures&rsquo; <a href="https://events.sapphireventures.com/hypergrowthengineeringsummit24/">2024 Hypergrowth Engineering Summit</a>.</p> <iframe width="560" height="315" src="https://www.youtube.com/embed/EVPY9koFceU?si=nACUFx02cS7hC8nZ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe> <p>If you&rsquo;re interested, you can watch <a href="https://youtu.be/EVPY9koFceU">a recording of my talk on Youtube</a>. There&rsquo;s a lot of overlap with the notes, but I also go into Carta&rsquo;s approach thus far to incorporating LLMs into our product. (Note that it&rsquo;s a recording of a practice run I did earlier in the week, not a recording from the venue itself, so it&rsquo;s definitely amateur quality but the content is still all there!)</p> <h1>No Wrong Doors.</h1> <p><a href="https://lethain.com/no-wrong-doors/">https://lethain.com/no-wrong-doors/</a> (Wed, 22 May 2024)</p> <p>Some governmental agencies have started to <a href="https://nwd.acl.gov/our-initiatives.html">adopt No Wrong Door policies</a>, which aim to provide help–often health or mental health services–to individuals even if they show up to the wrong agency to request help. The core insight is that the employees at those agencies are far better equipped to navigate their own bureaucracies than an individual who knows nothing about the bureaucracy’s internal function.</p> <p>For the most part, technology organizations are not complex bureaucracies, but sometimes they do seem to operate that way.
A particularly common pattern is along the lines of:</p> <ul> <li><strong>Product Engineer joins #observability</strong></li> <li><strong>Product Engineer</strong>: Hey, I’m having trouble with alerts, can you help me with that?</li> <li><strong>Obs Engineer</strong>: Oh yeah, for sure, what alerts?</li> <li><strong>Product Engineer</strong>: Ok, so there’s this Datadog link…</li> <li><strong>Obs Engineer</strong>: Got it. Yeah, so that’s in the SRE Obs team now. We do observability for the product analytics data lake, not production observability.</li> <li><strong>Product Engineer</strong>: Ok. Yeah, good. Let me find SRE Obs</li> <li><strong>Product Engineer joins #sre-obs</strong></li> <li><strong>Product Engineer</strong>: Hi, I got steered here by #observability, I think this is where I can get help with issues like this Datadog link…</li> <li><strong>SRE Engineer</strong>: Oh, absolutely. That looks misconfigured. Where is your app’s completed Observability Checklist and when did you review it with us?</li> <li><strong>Product Engineer</strong>: …how would I know that?</li> </ul> <p>In that example, the product engineer is first forced to navigate the unintuitive organizational design to find the right team for questions about Datadog. After they find the right team, they are forced to figure out how the SRE Observability team records when a checklist is completed. In almost all cases, the product engineer ends up frustrated, but it’s not just them. Almost every time, the observability engineer and SRE engineer also probably feel frustrated that the product engineer didn’t know enough to navigate their bureaucracy successfully.</p> <p>Something I’ve been thinking about recently is how engineering organizations can adopt a variant of the No Wrong Doors policy to directly connect folks who have problems with the right team and information. Then the first contact point becomes a support system for navigating the bureaucracy successfully.</p> <p>For example, imagine if this had happened instead:</p> <ul> <li><strong>Product Engineer joins #observability</strong></li> <li><strong>Product Engineer</strong>: Hey, I’m having trouble with alerts, can you help me with that?</li> <li><strong>Obs Engineer</strong>: Oh yeah, for sure, what alerts?</li> <li><strong>Product Engineer</strong>: Ok, so there’s this Datadog link…</li> <li><strong>Obs Engineer</strong>: Got it. Yeah, let me start a thread in #sre-obs to help get this sorted</li> <li><strong>Product Engineer joins #sre-obs</strong></li> <li><strong>Obs Engineer</strong>: Hey all, Product Engineer is having trouble with Datadog (see link here). Product Engineer: if you look into this spreadsheet you can find the Observability Checklist entry for your app to add to this thread to help with debugging</li> <li><strong>SRE Engineer</strong>: OK, so..</li> </ul> <p>Now the product engineer gets support from the same two folks as before, but because they’re helping the product engineer navigate the process, everyone ends up in a better situation.</p> <p>Beyond being helpful to your colleagues, which is an obvious goal in some companies and not-at-all a cultural priority in others, I think there are a number of other advantages to think about here. First, being helpful creates positive relationships across organizations. Second, it makes it more obvious where you do have genuine areas of ambiguous ownership, and makes it possible for informed parties to escalate that rather than relying on folks with the least context to know to escalate the ambiguities.
Third, it educates folks asking for help about the right thing to do, because a knowledgeable person helping is a great role model of the best way to solve a problem. Finally, if you happen to route to the wrong person–it happens!–then you learn that immediately rather than forcing someone without context to navigate the confusion.</p> <p>The most effective mechanism I’ve found for rolling out No Wrong Door is initiating three-way conversations when asked questions. If someone direct messages me a question, then I will start a thread with the question asker, myself, and the person I believe is the correct recipient for the question. This is particularly effective because it’s a viral approach: rolling out No Wrong Door just requires any one of the three participants to adopt the approach. Even the question asker can do it, although the power dynamics of the interaction do make it a bit harder for them.</p> <h1>Making engineering strategies more readable</h1> <p><a href="https://lethain.com/readable-engineering-strategy-documents/">https://lethain.com/readable-engineering-strategy-documents/</a> (Sat, 18 May 2024)</p> <p>As discussed in <a href="https://lethain.com/components-of-eng-strategy/">Components of engineering strategy</a>, a complete engineering strategy has five components: explore, diagnose, refine (map &amp; model), policy, and operation. However, it&rsquo;s actually quite challenging to read a strategy document written that way. That&rsquo;s an effective sequence for <em>creating</em> a strategy, but it&rsquo;s a challenging sequence for those trying to quickly <em>read and apply</em> a strategy without necessarily wanting to understand the complete thinking behind each decision.</p> <p>This post covers:</p> <ul> <li>Why the order used for writing strategy makes it hard to read</li> <li>How to organize a strategy document for reading</li> <li>How to refactor and merge components for improved readability</li> <li>Additional tips for effective strategy documents</li> </ul> <p>After reading it, you should be able to take a written strategy and rework it into a version that&rsquo;s much easier for others to read.</p> <hr> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> <h2 id="why-writing-structure-inhibits-reading">Why writing structure inhibits reading</h2> <p>Most software engineers learn to structure documents early in their lives as students writing academic essays. Academic essays are focused on presenting evidence to support a clear thesis, and generally build forward towards their conclusion. Some business consultancies explicitly train their new hires in business writing, such as McKinsey teaching Barbara Minto&rsquo;s <em><a href="https://www.amazon.com/Pyramid-Principle-Logic-Writing-Thinking/dp/0273710516">The Pyramid Principle</a></em>, but that&rsquo;s the exception.</p> <p>While academic essays want to develop an argument, professional writing is a bit different.
Professional writing typically has one of three distinct goals:</p> <ul> <li><strong>Refining thinking about a given approach</strong> (&ldquo;how do we select databases for our new products?&rdquo;) &ndash; this is an area where the academic structure can be useful, because it focuses on the thinking behind the proposal rather than the proposal itself</li> <li><strong>Seeking approval from stakeholders or executives</strong> (&ldquo;what database have we selected for our new analytics product?&rdquo;) &ndash; this is an area where the academic structure creates a great deal of confusion, because it focuses on the thinking rather than the specific proposal, but the stakeholders view the specific proposal as the primary topic to review</li> <li><strong>Communicating a policy to your organization</strong> (&ldquo;databases allowed for new products&rdquo;) &ndash; helping engineers at your company understand the permitted options for a given problem, and also explaining the rationale behind the decision for the subset who may want to understand or challenge the current policy</li> </ul> <p>The ideal format for the first case is generally at odds with the other two, which is a frequent cause of strategy documents that struggle to graduate from brainstorm to policy. I find that most strategy writers are resistant to the idea that it&rsquo;s worth their time to restructure their initial documents, so let me expand on challenges I&rsquo;ve encountered when I&rsquo;ve personally tried to make progress without restructuring:</p> <ul> <li> <p><strong>Too long, didn&rsquo;t read.</strong> Thinking-oriented structures leave policy recommendations at the very bottom, but the vast majority of strategy readers are simply trying to understand that policy so they can apply it to their specific problem at hand. Many of those readers, in my experience a majority of them, will simply give up before reading the sections that answer their questions and assume that the document doesn&rsquo;t provide clear direction because finding that direction took too long.</p> <p>This is very much akin to the core lesson of Steve Krug&rsquo;s <a href="https://www.amazon.com/Dont-Make-Think-Revisited-Usability/dp/0321965515">Don&rsquo;t Make Me Think</a>: users (and readers) don&rsquo;t understand, they muddle through. Assuming that they will take the time to deeply understand is an act of hubris.</p> </li> <li> <p><strong>Approval meeting to nowhere.</strong> There are roughly three types of approval meetings. The first, you go in and no one has any feedback. Maybe someone gripes that it could have been done asynchronously instead of a meeting, but your document is approved. The second, there are two sets of stakeholders with incompatible goals, and you need a senior decision-maker to mediate between them. This is a very useful meeting, because you generally can&rsquo;t make progress without that senior decision-maker breaking the tie.</p> <p>The third sort of meeting is when you get derailed early with questions about the research, whether you&rsquo;d considered another option, and whether this is even relevant. You might think this is because your strategy is wrong, but in my experience it&rsquo;s usually because you failed to structure the document to present the policy upfront. Stakeholders might disagree with many elements of your thinking but still agree with your ultimate policy, but it&rsquo;s only useful to dig into your rationale if they actually disagree with the policy itself. 
Avoid getting stuck debating details when you agree on the overarching approach by presenting the policy <em>first</em>, and only digging into those details when there&rsquo;s disagreement.</p> </li> <li> <p><strong>Transient alignment.</strong> Sometimes you&rsquo;ll see two distinct strategy documents, with the first covering the full thinking, and the second only including the policy and operations sections. This tends to work quite well initially, but over time existing members of the team depart and new members are hired. At some point, a new member will challenge the thinking behind the strategy as obviously wrong, generally because it&rsquo;s a different set of policies than they used at their previous employer. If you omit the diagnosis and exploration sections entirely, then they can&rsquo;t trace through the decision making to understand the reasoning, which will often cause them to leap to simplistic conclusions like the ever popular, &ldquo;I guess the previous engineers here were just dumb.&rdquo;</p> </li> </ul> <p>As annoying as each of these challenges is, the solution is simple: use the writing structure for writing, and invert that structure for reading.</p> <h2 id="invert-structure-for-reading">Invert structure for reading</h2> <p>Reiterating a point from <a href="https://lethain.com/components-of-eng-strategy/">Components of engineering strategy</a>: it&rsquo;s always appropriate to change the structure that you use to develop or present a strategy, as long as you are making a deliberate, informed decision.</p> <p>While I&rsquo;ve generally found explore, diagnose, refine, policy, and operation to work well for writing strategy, I&rsquo;ve consistently found it a poor format for presenting strategy. Whether I&rsquo;m presenting a strategy for review or rolling the strategy out to be followed by the wider organization, I recommend an inverted structure:</p> <ul> <li><strong>Policy</strong>: what does the strategy require or allow?</li> <li><strong>Operation</strong>: how is the strategy enforced and carried out, and how do I get exceptions to the policy?</li> <li><strong>Refine</strong>: what were the load-bearing details that informed the strategy?</li> <li><strong>Diagnose</strong>: what are the more generalized trends and observations that steered the thinking?</li> <li><strong>Explore</strong>: what is the high-level, wide-ranging context that we brought into creating this strategy?</li> </ul> <p>When seeking approval, you&rsquo;ll probably focus on the <strong>Policy</strong> section. When rolling it out to your organization, you&rsquo;ll probably focus on the <strong>Operation</strong> section more. In both cases, those are the critical components and you want them upfront. Very few strategy readers want to understand the full thinking behind your strategy; instead, they just want to understand how it impacts the specific decision they are trying to make.</p> <p>The vast majority of strategy readers want the answer, not to understand the thinking behind the answer, and these are your least motivated readers. Someone who wants to really understand the thinking will invest time reading through the document, even if it isn&rsquo;t perfectly structured for them. Someone who just wants an answer will frequently give up and make up an answer rather than reading all the way through to where the document does in fact answer their question.</p> <p>Zooming out a bit, this is a classic &ldquo;lack of user empathy&rdquo; problem. 
Folks authoring the document are so deep in the details that they can&rsquo;t put themselves in the readers&rsquo; mindset to think about how overwhelming the document would be if they were simply trying to pop in, get an answer, and then pop out. This lack of empathy also means that most strategy writers refuse to structure their documents to support the large population of answer seekers over the tiny population of strategy authors, but just try it a few times and I think you&rsquo;ll see it helps a great deal. Even faster, go read someone else&rsquo;s strategy document that you aren&rsquo;t familiar with, and you&rsquo;ll quickly appreciate how challenging it can be to identify the actual proposal if they follow the academic structure.</p> <h2 id="strategy-refactoring">Strategy refactoring</h2> <p>Inverting the structure is the first step of optimizing a document for readability, but you don&rsquo;t have to stop there. Often you&rsquo;ll find that even the inverted strategy structure is somewhat confusing to read for a given document. I think of this process as &ldquo;strategy refactoring.&rdquo;</p> <p>For example, <a href="https://lethain.com/llm-adoption-strategy/">How should you adopt LLMs?</a> makes two refactors to the inverted format. First, it merges <em>Refine</em> into <em>Diagnose</em>, which keeps the map and models closer to the specific topics they explore. Second, it discards the <em>Operation</em> section entirely, and includes the relevant details with the policies they apply to in the <em>Policy</em> section.</p> <p>Strategy refactoring is about discarding structure where it interferes with usability. The strategy structure is very effective at separating concerns while reasoning through decision making, but most readers benefit more from engaging with the full implications at once. Once you&rsquo;re done thinking, refactor away the thinking tools: don&rsquo;t let the best tools for one workflow mislead you into thinking they&rsquo;re the best for an entirely different one.</p> <h2 id="additional-tips-for-effective-strategy-docs">Additional tips for effective strategy docs</h2> <p>In addition to the above advice, there are a handful of smaller tips that I&rsquo;ve found helpful for creating readable strategy documents:</p> <ul> <li>Before releasing a document widely, find someone entirely uninvolved with the strategy thus far and have them point out areas that are difficult to understand. Anyone who&rsquo;s been thinking about the strategy is going to gloss over areas that might be inscrutable to those who are approaching it with fresh eyes.</li> <li>Every strategy document should be rolled out with an explicit commenting period where you invite discussion, as well as office hours where you are available to explain how to apply the strategy correctly. These steps help with adoption, but even more importantly they help you identify dissenters who disagree with the strategy so that you can follow up to better understand their concerns.</li> <li>Every company should maintain its own internal engineering strategy template, along the lines of this book&rsquo;s <a href="https://lethain.com/engineering-strategy-template/">engineering strategy template</a>.</li> <li>Your template should include consistent metadata, particularly when it was created, the current approval status, and where to ask questions. 
Of these, a clear, durable place to ask questions is the most important, as it slows the rate at which these documents rot.</li> <li>After you release your strategy, disable in-document commenting. This isn&rsquo;t intended to prevent further discussion, but rather to move the discussion outside of the document. Nothing creates the impression of an unapproved, unfinished strategy document faster than a long string of open comments. Open comments also make it difficult to read the strategy document, as readers often get distracted from the document itself by the comments.</li> </ul> <h2 id="summary">Summary</h2> <p>After reading this chapter, you know how to escape the rigid structures imposed during the creation of a strategy to create a readable document that is easier for others to both approve and apply. Beyond initially inverting the structure for easier reading, you also understand how to refactor away entire sections that may have been essential for creation but interfere with understanding how to apply the strategy, which is by far the most common task for strategy readers.</p> <p>Most importantly, I hope you finish this chapter agreeing that it&rsquo;s worth your time to rework your thinking-optimized draft rather than leaving it as is. The deliberate refusal to structure documents for readers is the root cause of a surprising number of good strategies that utterly fail to have their intended impact.</p>How should you adopt LLMs?https://lethain.com/llm-adoption-strategy/Tue, 14 May 2024 06:00:00 -0700https://lethain.com/llm-adoption-strategy/<p>Whether you’re a product engineer, a product manager, or an engineering executive, you’ve probably been pushed to consider using Large Language Models (LLMs) to extend your product or enhance your processes. 2023-2024 is an interesting era for LLM adoption, where these capabilities have transitioned into the mainstream, with many companies worrying that they’re falling behind despite the fact that most integrations appear superficial.</p> <p>That context makes LLM adoption a great topic for a strategy case study. This document is an engineering strategy determining how a hypothetical company, Theoretical Ride Sharing, could adopt LLMs.</p> <p>Building out the scenario a bit before diving into the strategy: Theoretical has 2,000 employees, 300 of whom are software engineers. They’ve raised $400m, are doing $50m in annual revenue, and are operating in 200 cities across North America and Europe. They are a ride sharing business, similar to Uber or Lyft, but have innovated on the formula by using larger vehicles (also known as, they’ve reinvented public transit).</p> <hr> <p><em>This is an exploratory, draft chapter for a book on engineering strategy that I&rsquo;m brainstorming in <a href="https://lethain.com/tags/eng-strategy-book/">#eng-strategy-book</a>.</em> <em>As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.</em></p> <h2 id="reading-this-document">Reading this document</h2> <p>To apply this strategy, start at the top with <em>Policy</em>. To understand the thinking behind this strategy, read sections in reverse order, starting with <em>Explore</em>, then <em>Diagnose</em> and so on. 
Relative to the default structure, this document has been refactored in two ways to improve readability: first, <em>Operation</em> has been folded into <em>Policy</em>; second, <em>Refine</em> has been embedded in <em>Diagnose</em>.</p> <p>More detail on this structure in <a href="https://lethain.com/readable-engineering-strategy-documents">Making a readable Engineering Strategy document</a>.</p> <h2 id="policy">Policy</h2> <p>Our combined policies for using LLMs at Theoretical Ride Sharing are:</p> <ul> <li> <p><strong>Develop an LLM-backed process for verifying <em>I-9</em> and <em>US Driver License</em> documents such that we can wholly automate driver onboarding in the United States.</strong> Moving from an average onboarding delay of seven days to near-instant onboarding will increase driver supply and allow us to reprioritize the team on servicing rider complaints, which are a major source of concern.</p> <p>Verifying <em>I-9 Forms</em> and <em>US Drivers Licenses</em> will be directly useful for accelerating onboarding, and will also establish the framework for us to perform document extraction in other jurisdictions outside the US to the extent that this experiment outperforms our current hybrid automation/services model for onboarding.</p> <p>Report on progress monthly in <em>Exec Weekly Meeting</em>, coordinated in #exec-weekly</p> </li> <li> <p><strong>Start with Anthropic.</strong> We use Anthropic models, which are available through our existing cloud provider via <a href="https://aws.amazon.com/bedrock/">AWS Bedrock</a>. To avoid maintaining multiple implementations, where we view the underlying foundational model quality to be somewhat undifferentiated, we are not looking to adopt a broad set of LLMs at this point. (An illustrative sketch of what a Bedrock-backed call looks like follows this policy list.)</p> <p>Exceptions will be reviewed by the <em>Machine Learning Review</em> in #ml-review</p> </li> <li> <p><strong>Developer experience team (DX) must offer at least one LLM-backed developer productivity tool.</strong> This tool should enhance the experience, speed, or quality of writing software in TypeScript. This tool should help us develop our thinking for next year, such that we have conviction increasing (or decreasing!) our investment. This tool should be available to all engineers. Adopting one tool is the required baseline; if DX identifies further interesting tools, e.g. Github Copilot, they are empowered to bring the request to the <em>Engineering Exec</em> team for review. Review will focus on balancing our rate of learning, vendor cost, and data security. We&rsquo;ve <a href="https://lethain.com/dx-llm-model/">modeled options for measuring LLMs impact on developer experience</a>.</p> <p>Vendor approvals to be reviewed in #cto</p> </li> <li> <p><strong>Internal Toolings team (INT) must offer at least one LLM-backed ad-hoc prompting tool.</strong> This tool should support arbitrary non-engineering use cases for LLMs, such as text extraction, rewriting notes, and so on. It must be usable with customer data while also honoring our existing data processing commitments. This tool should be available to all employees.</p> <p>Vendor approvals to be reviewed in #coo</p> </li> <li> <p><strong>Refresh policy in six months.</strong> Our foremost goal is to learn as quickly as possible about a new domain where we have limited internal expertise, then review whether we should increase our investment afterwards.</p> <p>Flag questions and suggestions in #cto</p> </li> </ul>
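<p><em>To make the onboarding and Anthropic policies above slightly more concrete, here is a minimal sketch of what a Bedrock-backed document-extraction call could look like. It is illustrative rather than normative: the model ID, prompt, and extracted fields are assumptions, and a real onboarding pipeline would add validation, human review of low-confidence results, and audit logging.</em></p>
<pre><code class="language-python">import json

import boto3

# Illustrative model ID and request shape for an Anthropic model on AWS Bedrock;
# confirm the exact model IDs and schema enabled in your account before relying on this.
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


def extract_license_fields(document_text):
    """Ask the model to pull structured fields out of OCR'd driver license text."""
    prompt = (
        "Extract the full legal name, license number, and expiration date from "
        "the following US driver license text. Respond with JSON only.\n\n"
        + document_text
    )
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": prompt}],
    }
    response = bedrock.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
    payload = json.loads(response["body"].read())
    # Anthropic-style responses return a list of content blocks; take the first text block.
    return json.loads(payload["content"][0]["text"])
</code></pre>
<p><em>Keeping the prompt and the response parsing inside one small function also keeps the cost of switching models or providers low, which matters given the diagnosis below.</em></p>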
<h2 id="diagnose">Diagnose</h2> <p>The synthesis of the problem at hand regarding how we use LLMs at Theoretical Ride Sharing is:</p> <ol> <li> <p>There are, at minimum, <strong>three distinct needs</strong> that folks internally are asking us to solve (either separately or with a shared solution):</p> <ol> <li><em>productivity tooling for non-engineers</em>, e.g. ad-hoc document rewriting, document summarization</li> <li><em>productivity tooling for engineers</em>, e.g. advanced autocomplete tooling like Github Copilot</li> <li><em>product extensions</em>, e.g. high-quality document extraction in driver onboarding workflows</li> </ol> </li> <li> <p>Of the above, <strong>we see product extensions as potential strategic differentiation</strong>, and the other two as workflow optimizations that improve our productivity but don’t necessarily differentiate us from the wider industry. Some of the opportunities for strategic differentiation we see are:</p> <ol> <li><em>Faster driver onboarding</em> by processing driver documentation without human involvement, making it possible to bring new driver supply online more quickly, particularly as we move into new regions. We&rsquo;ve sized the potential impact by <a href="https://lethain.com/driver-onboarding-model/">developing a model of faster driver onboarding</a></li> <li><em>Improved customer support</em> by increasing the response speed and quality of our responses to customer inquiries</li> </ol> </li> <li> <p><strong>We currently have limited experience or expertise in using LLMs in the company and in the industry.</strong> Prolific thought leadership to the contrary, there are very few companies or products using LLMs in scaled, differentiated ways. That’s currently true for us as well.</p> </li> <li> <p><strong>We want to develop our expertise without making an irreversible commitment.</strong> We think that our internal expertise is a limiter for effective problem selection and utilization of LLMs, and that developing our expertise will help us become more effective in iterative future decisions on this topic. Conversely, we believe that making a major investment now, prior to developing our in-house expertise, would be relatively high risk and low reward given no other industry players appear to have identified a meaningful advantage at this point.</p> </li> <li> <p><strong>Switching across foundational models and foundational model providers is cheap</strong>. This is true both economically (low financial commitment) and from an integration cost perspective (APIs and usage are largely consistent across providers; see the sketch after this list).</p> </li> <li> <p><strong>Foundational models and providers are evolving rapidly, and it’s unclear how the space will evolve.</strong> It’s likely that current foundational model providers will train one or two additional generations of foundational models with larger datasets, but at some point they will become cost prohibitive to train (e.g. the next major version of OpenAI or Anthropic models seems likely to cost $500m+ to train). Differentiation might move into developer-experience at that point. Open source models like LLaMa might become significantly cost-advantaged. Or something else entirely. The future is wide open.</p> <p>We&rsquo;ve built a Wardley map to understand the <a href="https://lethain.com/wardley-llm-ecosystem/">possible evolution of the foundational model ecosystem</a>.</p> </li> <li> <p><strong>Training a foundational model is prohibitively expensive for our needs.</strong> We’ve raised $400m, and training a competitive foundational model would cost somewhere between $3m and $100m to match the general models provided by Anthropic or OpenAI.</p> </li> </ol>
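<p><em>The fifth point above is worth making concrete: the major providers expose similar chat-style APIs, so a thin internal wrapper is usually enough to keep switching costs low. The sketch below is an assumption-heavy illustration using the official Anthropic and OpenAI Python SDKs with example model names; it is not a recommendation of specific vendors or models.</em></p>
<pre><code class="language-python"># Thin provider-agnostic wrapper; the model names are illustrative placeholders.
import anthropic
import openai


def complete_with_anthropic(prompt, model="claude-3-sonnet-20240229"):
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model=model,
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


def complete_with_openai(prompt, model="gpt-4o"):
    client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def complete(prompt, provider="anthropic"):
    # Call sites depend on this one function rather than on a vendor SDK,
    # which is what keeps the cost of switching providers low.
    if provider == "anthropic":
        return complete_with_anthropic(prompt)
    return complete_with_openai(prompt)
</code></pre>
<p><em>Whether calls go through Bedrock or directly to a vendor SDK, the shape is the same, which is consistent with the later observation that the remaining lock-in is mostly contractual (data processing agreements) rather than technical.</em></p>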
<h2 id="explore">Explore</h2> <p>Large Language Models operate on top of a foundational model. Training these foundational models is exceptionally expensive, and growing more expensive over time as competition for more sophisticated models accelerates. <a href="https://www.cnbc.com/2023/10/16/metas-open-source-approach-to-ai-puzzles-wall-street-techies-love-it.html">Meta allegedly spent $20-30m training LLaMa 2</a>, up from about $3m training costs for LLaMa 1. OpenAI’s GPT-4 <a href="https://www.wired.com/story/openai-ceo-sam-altman-the-age-of-giant-ai-models-is-already-over/">allegedly cost $100m to train</a>. With some nuance related to the quality of corpus and its relevance to the task at hand, <a href="https://arxiv.org/abs/2309.16583">larger models outperform smaller models</a>, so there’s not much incentive to train a smaller foundational model unless you have a large, unique dataset to train against, and even in that case you might be better off fine-tuning or using in-context learning (ICL).</p> <p><a href="https://www.anthropic.com/api">Anthropic charges</a> between $0.25 and $15 per million tokens of input, and a bit more for output tokens. <a href="https://openai.com/api/pricing">OpenAI charges</a> between $0.50 and $60 per million tokens of input, and a bit more for output tokens. The average English word is about 1.3 tokens, which means you can do a significant amount of LLM work while spending less than most venture-funded startups spend on snacks.</p>
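<p><em>To make &ldquo;snacks&rdquo; concrete, here is a quick back-of-envelope sketch of what the driver-document workload might cost. The document volume, document length, and per-token rates are assumptions chosen from within the ranges quoted above, not actual prices or Theoretical&rsquo;s real volumes.</em></p>
<pre><code class="language-python"># Back-of-envelope cost estimate; every constant below is an assumption.
WORDS_PER_DOC = 2_000          # e.g. the OCR'd text of one onboarding document
TOKENS_PER_WORD = 1.3          # rough average for English text
INPUT_PRICE_PER_M = 3.00       # dollars per million input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.00     # dollars per million output tokens (assumed)
OUTPUT_TOKENS_PER_DOC = 300    # a short structured JSON response


def monthly_cost(docs_per_month):
    input_tokens = docs_per_month * WORDS_PER_DOC * TOKENS_PER_WORD
    output_tokens = docs_per_month * OUTPUT_TOKENS_PER_DOC
    return (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_M
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    )


print(monthly_cost(50_000))  # 615.0 dollars for 50,000 documents a month
</code></pre>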
<a href="https://mistral.ai/">Mistral</a>). Behaviors do vary across models, but it’s also true that behavior of existing models varies over time (e.g. <a href="https://arstechnica.com/information-technology/2023/12/is-chatgpt-becoming-lazier-because-its-december-people-run-tests-to-find-out/">GPT 3.5 allegedly got “lazier” over time</a>), which means the overhead of dealing with model differences is unavoidable even if you only adopt one. Altogether, vendor lock-in for models is very low from a technical perspective, although there is some lock-in created by regulatory overhead, for example it’s potentially painful to update your Data Processing Agreement multiple times, combined with the notification delay, to support multiple model vendors.</p> <p>Although there’s an ongoing investment boom in artificial intelligence, most scaled technology companies are still looking for ways to leverage these capabilities beyond the obvious, widespread practices like adopting <a href="https://github.com/features/copilot">Github Copilot</a>. For example, <a href="https://podcasts.apple.com/us/podcast/build-ai-products-at-on-ai-companies-with-emily/id1668002688?i=1000644619725">Stripe is investing heavily in LLMs for internal productivity</a>, including presumably relying on them to perform some internal tasks that would have previously been performed by an employee such as verifying a company’s website matches details the company supplied in their onboarding application, but it’s less clear that they have yet found an approach to meaningfully shift their product, or their product’s user experience, using LLMs.</p> <p>Looking at ridesharing companies more specifically, there don’t appear to be any breakout industry-specific approaches either. Uber is similarly adopting LLMs for internal productivity, and some operational efficiency improvements as documented in their <a href="https://www.uber.com/blog/the-transformative-power-of-generative-ai/">August, 2023 post describing their internal developer and operations productivity investments using LLMs</a> and <a href="https://www.uber.com/blog/from-predictive-to-generative-ai/">May, 2024 post describing those efforts in more detail</a>.</p>