Notes on how to use LLMs in your product.
Pretty much every company I know is looking for a way to benefit from Large Language Models. Even if their executives don’t see much applicability, their investors likely do, so they’re staring at the blank page nervously trying to come up with an idea. It’s straightforward to make an argument for LLMs improving internal efficiency somehow, but it’s much harder to describe a believable way that LLMs will make your product more useful to your customers.
I’ve been working fairly directly on meaningful applicability of LLMs to existing products for the last year, and wanted to type up some semi-disorganized notes. These notes are in no particular order, with an intended audience of industry folks building products.
You can watch a recording of my talk based on these notes on YouTube, or see the slides.
Rebuild your mental model
Many folks in the industry are still building their mental model for LLMs, which leads to frequent reasoning errors about what LLMs can do and how we should use them. Two unhelpful mental models I see a lot of folks hold regarding LLMs are:
- LLMs are magic: anything that a human can do, an LLM can probably do roughly as well and vastly faster
- LLMs are the same as reinforcement learning: current issues with hallucinations and accuracy are caused by small datasets. Accuracy problems will be solved with larger training sets, and we can rely on confidence scores to reduce the impact of inaccuracies
These are both wrong in different but important ways. To avoid falling into those mental models’ fallacies, I’d instead suggest these pillars for a useful mental model around LLMs:
- LLMs can predict reasonable responses to any prompt – an LLM will confidently provide a response to any textual prompt you write, and will increasingly provide a response to text plus other forms of media like images or video
- You cannot know whether a given response is accurate – LLMs generate unexpected results, called hallucinations, and you cannot concretely know when they are wrong. There are no confidence scores generated that help you reason about a specific answer from an LLM
- You can estimate accuracy for a model and a given set of prompts using evals – running the LLM against a known set of prompts, recording the responses, and grading those responses – which lets you estimate how likely an LLM is to perform well in a given scenario (a minimal sketch follows this list)
- You can generally increase accuracy by using a larger model, but it’ll cost more and have higher latency – for example, GPT-4 is a larger model than GPT-3.5 and generally provides higher quality responses, but it’s meaningfully more expensive (~20x) and meaningfully slower (2-5x). That said, quality, cost and latency are improving at every price point: you should expect the performance available at a given cost, latency or quality point to improve meaningfully year over year for the next five years (e.g. you should expect GPT-4 quality at the price and latency of GPT-3.5 within 12-24 months)
- Models generally get more accurate as the corpus they’re built from grows in size – the accuracy of reinforcement learning tends to grow predictably as the dataset grows. That remains generally true for LLMs, but is less predictable. Small models generally underperform large models, and large models generally outperform small models supplemented with higher quality data. Supplementing a large general model with specific data is called “fine-tuning,” and it’s currently ambiguous when fine-tuning a smaller model will outperform using a larger model. All you can really do is run evals against the available models and fine-tuning datasets for your specific use case
- Even the fastest LLMs are not that fast – even a fast LLM might take 10+ seconds to provide a reasonably sized response. If you need to perform multiple iterations to refine the initial response, or to use a larger model, it might take a minute or two to complete. These will get faster, but they aren’t fast today
- Even the most expensive LLMs are not that expensive for B2B usage; even the cheapest LLM is not that cheap for Consumer usage – because pricing is driven by usage volume, this is a technology that’s very easy to justify for B2B businesses with relatively small volumes of usage from paying customers. Conversely, it’s very challenging to figure out how you’re going to pay for significant LLM usage in a Consumer business without the risk of significantly shrinking your margin
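To make the evals pillar concrete, here’s a minimal sketch of what an eval harness can look like. The `call_model` parameter and the grading lambdas are stand-ins for whatever client and checks you actually use; real eval tooling is richer, but the loop is the same: run known prompts, record responses, grade them.

```python
# Minimal eval harness sketch: run a fixed set of prompts through a model
# and score the responses against checks you can apply mechanically.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    grade: Callable[[str], bool]  # returns True if the response is acceptable

def run_evals(call_model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases the model handled acceptably."""
    passed = 0
    for case in cases:
        response = call_model(case.prompt)  # call_model is a stand-in client
        if case.grade(response):
            passed += 1
    return passed / len(cases)

# Example cases with crude, mechanical grading:
cases = [
    EvalCase(
        prompt="Extract the loan amount from: 'Requesting a $450,000 mortgage.'",
        grade=lambda r: "450,000" in r or "450000" in r,
    ),
    EvalCase(
        prompt="Is a pay stub an acceptable proof of income? Answer yes or no.",
        grade=lambda r: "yes" in r.lower(),
    ),
]

# accuracy = run_evals(my_model_client, cases)  # compare across models or prompts
```

Running the same cases against different models, or the same model with different prompts, is how you decide which quality, cost and latency tradeoff is acceptable for your scenario.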
These pillars aren’t perfect, but hopefully they provide a good foundation for reasoning about what will or won’t work when it comes to applying LLMs to your product. With this foundation in place, now it’s time to dig into some more specific subtopics.
Revamp workflows
The workflows in most modern software are not designed to maximize benefit from LLMs. This is hardly surprising–they were built before LLMs became common–but it does require some rethinking about workflow design.
To illustrate this point, let’s think of software for a mortgage provider:
- User creates an account
- Product asks user to fill in a bunch of data to understand the sort of mortgage user wants and user’s eligibility for such a mortgage
- Product asks user to provide paperwork to support the data user just provided, perhaps some recent paychecks, bank account balances, and so on
- Internal team validates the user’s data against the user’s paperwork
In that workflow, LLMs can still provide significant value to the business, as you could make it more efficient to validate that the paperwork matches the user-supplied information, but the users themselves won’t see much benefit other than perhaps faster validation of their application.
However, you can adjust the workflows to make them more valuable:
- User creates an account
- Product asks user to provide paperwork
- Product uses LLM to extract values from paperwork (see the sketch after this list)
- User validates the extracted data is correct, providing some adjustments
- Internal team reviews the user’s adjustments, along with any high risk issues raised by a rule engine of some sort
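To make the revised workflow concrete, here’s a rough sketch of the extraction step referenced above. The prompt wording and the `call_model` client are hypothetical; the important part is that the model’s output pre-fills a form for the user to correct rather than being treated as ground truth.

```python
# Sketch of the LLM extraction step in the revised mortgage workflow:
# ask the model for structured values, then hand them to the user for
# confirmation. `call_model` is a stand-in for your actual LLM client.
import json

EXTRACTION_PROMPT = """\
Extract the following fields from the pay stub text below and reply with
JSON only: employer_name, pay_period_gross, pay_date.

Pay stub text:
{document_text}
"""

def extract_paystub_fields(call_model, document_text: str) -> dict:
    raw = call_model(EXTRACTION_PROMPT.format(document_text=document_text))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Malformed or hallucinated output: fall back to an empty draft the
        # user fills in manually, rather than blocking the workflow.
        return {}

# The returned dict pre-fills the form; the user, and later the internal
# team, remain responsible for validating every value.
```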
The technical complexity of these two products is functionally equivalent, but the user experience is radically different. The internal team experience is improved as well. My belief is that many existing products will find they can only significantly benefit their user experience from LLMs by rethinking their workflows.
Retrieval Augmented Generation (RAG)
Models have a maximum “token window” of text that they’ll consider in a given prompt. The maximum size of token windows is expanding rapidly, but larger token windows are slower and more expensive to evaluate, so even the expanding token windows don’t solve the entire problem.
One solution for navigating large datasets within a fixed token window is Retrieval Augmented Generation (RAG). As a concrete example, you might want to create a dating app that matches individuals based on their free-form answer to the question, “What is your relationship with books, tv shows, movies and music, and how has it changed over time?” No token window is large enough to include every user’s response from the dating app’s database in the LLM prompt, but you could find twenty plausible matching users by filtering on location, include those twenty users’ free-form answers in the prompt, and match amongst them.
This makes a lot of sense, and the two-phase combination of an unsophisticated algorithm to retrieve plausible components of a response, plus an LLM to filter through and package those components into an actual response, works pretty well.
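Here’s a rough sketch of that two-phase shape for the dating-app example. The `find_nearby_users` and `call_model` names are stand-ins, not a real API; the point is that a conventional retrieval step does the narrowing so the LLM only reasons over what fits in the token window.

```python
# Two-phase RAG sketch: conventional retrieval narrows the candidate pool,
# then the LLM filters and packages those candidates into a response.

MATCH_PROMPT = """\
Here is one user's answer about their relationship with books, TV shows,
movies and music, followed by {n} candidates' answers. Rank the three most
compatible candidates and briefly explain why.

User: {user_answer}

Candidates:
{candidate_answers}
"""

def suggest_matches(call_model, find_nearby_users, user) -> str:
    # Phase 1: cheap, conventional retrieval (a location filter, search index,
    # or SQL query) to get a plausible candidate set that fits in the prompt.
    candidates = find_nearby_users(user, limit=20)
    numbered = "\n".join(
        f"{i + 1}. {c.free_form_answer}" for i, c in enumerate(candidates)
    )
    # Phase 2: the LLM reasons only over the retrieved candidates.
    return call_model(
        MATCH_PROMPT.format(
            n=len(candidates),
            user_answer=user.free_form_answer,
            candidate_answers=numbered,
        )
    )
```

The quality of that first phase is exactly where search relevance sneaks back in, which is the trap described next.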
Where I see folks get into trouble is trying to treat RAG as a solution to a search problem, as opposed to recognizing that RAG requires useful search as part of its implementation. An effective approach to RAG depends on a high-quality retrieval and filtering mechanism to work well at a non-trivial scale. For example, with a high-level view of RAG, some folks might think they can replace their search technology (e.g. Elasticsearch) with RAG, but that’s only true if your dataset is very small and you can tolerate much higher response latencies.
The challenge, from my perspective, is that most corner-cutting solutions look like they’re working on small datasets, and let you pretend that things like search relevance don’t matter. In reality, relevance (whether literal search relevance or better-tuned SQL queries that retrieve more appropriate rows) significantly impacts the quality of responses once you move beyond prototyping. This creates a false expectation of how the prototype will translate into a production capability, with all the predictable consequences: underestimated timelines, poor production behavior and performance, and so on.
Rate of innovation
Model performance, essentially the quality of response for a given budget in either dollars or milliseconds, is going to continue to improve, but it’s not going to continue improving at this rate absent significant technology breakthroughs in the creation or processing of LLMs. I’d expect those breakthroughs to happen, but to happen less frequently after the first several years, and slow from there. It’s hard to determine where we are in that cycle because there’s still an extraordinary amount of capital flowing into this space.
In addition to technical breakthroughs, the other aspect driving innovation is building increasingly large models. It’s unclear whether today’s limiting factor on model size is the availability of Nvidia GPUs, larger datasets to train models on that are plausibly legal, capital to train new models, or financial models suggesting that the discounted future cashflow from training larger models doesn’t meet a reasonable payback period. My assumption is that each of these has been or will be the limiting constraint on LLM innovation over time, and that different competitors will be best suited to make progress depending on which constraint is most relevant. (Lots of fascinating, albeit fringe, scenarios to contemplate here, e.g. imagine a scenario where the US government disbands copyright laws to allow training on larger datasets because it fears losing the LLM training race to countries that don’t respect US copyright laws.)
It’s safe to assume model performance will continue to improve. It’s likely that performance will significantly improve over the next several years. I find it relatively unlikely that we’ll see a Moore’s Law scenario where LLMs continue to radically improve for several decades, but lots of things could easily prove me wrong. For example, at some point nuclear fusion is going to become mainstream and radically change how we think about energy utilization in ways that will truly rewrite the world’s structure, and LLM training costs could be one part of that.
Human-in-the-Loop (HITL)
Because you cannot rely on LLMs to provide correct responses, and you cannot generate a confidence score for any given response, you have to either accept potential inaccuracies (which makes sense in many cases, humans are wrong sometimes too) or keep a Human-in-the-Loop (HITL) to validate the response.
As discussed in the workflow section, many companies already have humans performing validation work who can now move into supervision of LLM responses rather than generating the responses themselves. In other scenarios, it’s possible to adjust your product’s workflows to rely on external users to serve as the HITL instead. I suspect most products will depend on both techniques along with heuristics to determine when internal review is necessary.
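As a rough sketch of what those heuristics might look like, continuing the mortgage example from earlier: the field names and rules below are illustrative assumptions, not a prescription.

```python
# HITL routing sketch: simple heuristics decide whether an LLM-assisted
# submission can proceed automatically or must wait in an internal review queue.

HIGH_RISK_FIELDS = {"loan_amount", "annual_income"}  # illustrative only

def needs_internal_review(extracted: dict, user_adjustments: dict) -> bool:
    # Route to a human if the model missed a high-risk field, or if the
    # user changed one of those fields after extraction.
    for field in HIGH_RISK_FIELDS:
        if field not in extracted:
            return True
        adjusted = user_adjustments.get(field)
        if adjusted is not None and adjusted != extracted[field]:
            return True
    return False

def handle_submission(extracted: dict, user_adjustments: dict, review_queue: list) -> str:
    if needs_internal_review(extracted, user_adjustments):
        review_queue.append({"extracted": extracted, "adjusted": user_adjustments})
        return "pending_review"
    return "auto_approved"
```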
Hallucinations and legal liability
As mentioned before, LLMs often generate confidently wrong responses. HITL is the design principle to prevent acting on confidently wrong responses, and it matters because it shifts responsibility (specifically, legal liability) away from the LLM and onto the specific human in the loop. For example, if you use Github Copilot to generate some code that causes a security breach, you are responsible for that security breach, not Github Copilot. Every large-scale adoption of LLMs today is being done in a mode that shifts responsibility for the responses to a participating human.
Many early-stage entrepreneurs are dreaming of a world with a very different loop, where LLMs are relied upon without a HITL, but I think that will only be true for scenarios where it’s possible to shift legal liability (e.g. the Github Copilot example) or there’s no meaningful legal liability to begin with (e.g. generating a funny poem based on a user’s profile picture).
“Zero to one” versus “One to N”
There’s a strong desire for a world where LLMs replace software engineers, or where software engineers move into a supervisory role rather than writing software. For example, an entrepreneur wants to build a copy of Reddit, and uses an LLM to generate that implementation. There’s enough evidence that you can assume it’s possible today to go from zero to one on a new product idea in a few weeks with an LLM and some debugging skills.
However, most entrepreneurs lack a deep intuition on operating and evolving software with a meaningful number of users. Some examples:
- Keeping users engaged after changing the UI requires active, deliberate work
- Ensuring user data is secure and meets various privacy compliance obligations
- Providing controls to meet SOC2 and providing auditable evidence of maintaining those controls
- Migrating a database schema with customer data in it to support a new set of columns
- Ratcheting down query patterns to a specific set of allowed patterns that perform effectively at higher scale
All of these are straightforward, basic components of scaling a product (e.g. going from “one to N”) that an LLM is simply not going to perform effectively at, and where I am skeptical that we’ll ever see a particularly reliable LLM-based replacement for skilled, human intelligence. It will be interesting to watch, though, as we see how far folks try to push the boundaries of what LLM-based automation can do to delay the onset of projects needing to hire expertise.
Copyright law
Copyright implications are very unclear today, and will remain unclear for the foreseeable future. All work done today using LLMs has to account for divergent legal outcomes. My best guess is that we will see an era of legal balkanization regarding whether LLM-generated content is copyrightable, and that longer-term LLMs will be viewed the same as any other basic technical component, e.g. running a spell checker doesn’t revoke your copyright on the spell-checked document. You can make all sorts of good arguments that this perspective isn’t fair to copyright holders whose data was used for training, but long-term I just don’t think any other interpretation is workable.
Data Processing Agreements
One small but fascinating reality of working with LLMs today is that many customers are sensitive to the LLM providers (OpenAI, Anthropic, etc) because these providers are relatively new companies building relatively new things with little legal precedent to derisk them. This means adding them to your Data Processing Agreement (DPA) can create some friction. The most obvious way around that friction is relying on LLM functionality served via your existing cloud vendor (AWS, Azure, GCP, etc).
Provider availability
I used to think this was very important, but my sense is that LLM hosting is already essentially equivalent to other cloud services (e.g. you can get Anthropic via AWS or OpenAI via Azure), and that very few companies will benefit from spending much time worrying about LLM availability. I do think that getting direct access to LLMs via cloud providers–companies that are well-versed in operating at scale–is likely the winning pick here as well.
There are lots of folks out there who have spent more time thinking deeply about LLMs than I have–e.g. go read some Simon Willison–but hopefully the notes here are useful. I’m curious to discuss if folks disagree with any of these perspectives.