Irrational Exuberancehttps://lethain.com/Recent content on Irrational ExuberanceHugo -- gohugo.ioen-usWill LarsonMon, 19 Jan 2026 10:00:00 -0700Learning from Every&rsquo;s Compound Engineeringhttps://lethain.com/everyinc-compound-engineering/Mon, 19 Jan 2026 10:00:00 -0700https://lethain.com/everyinc-compound-engineering/<p>One of the relatively few AI-native products I use is <a href="https://cora.computer/">Cora.computer</a>, which summarizes my personal inbox. It&rsquo;s not perfect, but it&rsquo;s done a much better job than my collection of filters at managing the ever-growing onslaught of spam and unsolicited email that flows in.</p> <p>I&rsquo;ve run into a few issues with Cora, which led me to follow folks at <a href="https://every.to/">Every</a> to report the issues, and more recently to their work on <a href="https://every.to/chain-of-thought/compound-engineering-how-every-codes-with-agents">compound engineering</a> and specifically the <a href="https://github.com/EveryInc/compound-engineering-plugin/">compound-engineering-plugin</a>.</p> <div class="ba b--light-gray"> <p><img src="https://lethain.com/static/blog/2026/compound-eng.png" alt="Screenshot of EveryInc&rsquo;s Compound Engineering summary"></p> </div> <p>Compound Engineering is two extremely well-known patterns, one moderately well-known pattern, and one pattern that I think many practitioners have intuited but have not found a consistent mechanism to implement. Those patterns are:</p> <ol> <li> <p><strong>Plan</strong> is decoupling implementation from research. This is well understood, e.g. Claude&rsquo;s plan mode, although it can certainly be done better or worse by being more specific about which resources to consult (specs, PRDs, RFCs, issues, etc.).</p> </li> <li> <p><strong>Work</strong> is implementing a plan. This is well understood, and the core of agentic coding. 
Again, this can be done better or worse, but much of that depends more on the quality of your codebase, tests, and continuous integration harness than on the agent itself.</p> </li> <li> <p><strong>Review</strong> is asking the agent to review the changes against your best practices, and identify ways they could be improved. I think most practitioners have <em>some</em> version of this, but standardization is low, even within a given company.</p> </li> <li> <p><strong>Compound</strong> is asking the agent to summarize its learnings from a given task into a well-defined, structured format (basically a wiki), which is consulted by future iterations of the <strong>plan</strong> pattern. This interplay between the <strong>compound</strong> and <strong>plan</strong> steps creates the compounding mechanism.</p> <p>Many practitioners are implicitly compounding, but it&rsquo;s often done manually through their own work. For example, I&rsquo;d often ask the agent to update our <code>AGENTS.md</code> or skills based on a specific problem encountered in a task, but it required my active attention to notice the issue and suggest incorporation.</p> </li> </ol> <p>Taken together, these four steps <em>are not</em> shocking but <em>are</em> an extremely effective way to convert these intuited best practices into something specific, concrete, and largely automatic within a company by adding a few commands (e.g. <code>workflow:plan</code>, <code>workflow:review</code>, &hellip;) and updating your <code>AGENTS.md</code> to instruct the agent when and how to use those commands.</p> <p>Implementing this within Imprint&rsquo;s frontend and backend monorepos was straightforward, taking about an hour. Most of this was iterating on the last mile of details, for example, we want our plans in <code>.claude/plan-*.md</code> format to match our existing <code>.gitignore</code> pattern, and none of it was complex. 
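</p> <p>As a concrete illustration, here is a hypothetical <code>AGENTS.md</code> fragment wiring those commands together. The names <code>workflow:plan</code> and <code>workflow:review</code> come from above; <code>workflow:compound</code> and the exact wording are assumptions for illustration, not Every&rsquo;s or our actual file:</p>

```markdown
<!-- Hypothetical AGENTS.md fragment; command names beyond workflow:plan
     and workflow:review are illustrative assumptions. -->
## Workflow commands

- Before starting any non-trivial task, run `workflow:plan` and write the
  resulting plan to `.claude/plan-<task>.md`.
- After implementing a plan, run `workflow:review` and address findings
  before opening a pull request.
- After the task completes, run `workflow:compound` to record learnings in
  the shared knowledge base consulted by future `workflow:plan` runs.
```

<p>The exact commands and phrasing will differ per codebase. 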
Most importantly, this settles a topic for which many of our engineers (including me) had been trying to find a standard approach. Now we have one, and can move on to the next problem.</p> <p>If recent history is our guide, it&rsquo;s a solid guess that many of the practices in compound engineering will get absorbed into the Claude Code and Cursor harnesses over the next couple of months, at which point folks using these techniques explicitly will be indistinguishable from folks who are entirely unaware they&rsquo;re using them. But we&rsquo;ll see. Until then, this is a cheap, useful experiment that you can implement in an hour.</p>Sharing Claude transcripts.https://lethain.com/sharing-claude-transcripts/Sat, 10 Jan 2026 07:15:00 -0800https://lethain.com/sharing-claude-transcripts/<p>One of my central premises for supporting <a href="https://lethain.com/company-ai-adoption/">internal adoption of LLMs</a> is that adoption depends on easy discovery of what&rsquo;s possible and what&rsquo;s good. That is why our internal prompts driving agents are stored in a shared Notion database, but it also raised a question: our most advanced prompting and interactions are happening in Claude Code, where they are hard to see.</p> <p>Thankfully, Simon Willison previously wrote <a href="https://simonwillison.net/2025/Dec/25/claude-code-transcripts/">a tool to extract transcripts from Claude Code</a> called <a href="https://github.com/simonw/claude-code-transcripts"><code>claude-code-transcripts</code></a>, which we were able to wire together into an internal repository of Claude Code sessions and a viewer on Cloudflare Pages (and behind Cloudflare authentication tied into our SSO).</p> <p>There are three components here. 
First, an index of all the pages.</p> <div class="ba b--light-gray"> <p><img src="https://lethain.com/static/blog/2026/claude-sessions-index.png" alt="Claude Sessions index showing transcript archive with contributors and sessions"></p> </div> <p>That page links into the transcripts generated by Simon&rsquo;s tool.</p> <div class="ba b--light-gray"> <p><img src="https://lethain.com/static/blog/2026/claude-transcript-detail.png" alt="Claude Code transcript detail view showing prompts, messages, and tool calls"></p> </div> <p>Finally, we have an internal CLI named <code>imp</code> that is available on every laptop, which now has an additional tool <code>imp claude share-session</code> that will open <code>claude-code-transcripts</code>, allow you to select a session of your choice, and then merge it into the holding repository.</p> <div class="ba b--light-gray"> <p><img src="https://lethain.com/static/blog/2026/claude-share-session-cli.png" alt="Terminal CLI for sharing Claude sessions"></p> </div> <p>Altogether, this was an hour or two of work, and a bit of an experiment in emergent process design. In the short term, I am enjoying asking our biggest Claude Code users to share their sessions so that I can cherry-pick their practices.</p>Moved newsletter from Mailchimp to Buttondown.https://lethain.com/newsletter-mailchimp-to-buttondown/Sat, 10 Jan 2026 07:00:00 -0800https://lethain.com/newsletter-mailchimp-to-buttondown/<p>In preparation for the release of <em>An Elegant Puzzle</em>, I set up the page to <a href="https://lethain.com/newsletter/">subscribe to my newsletter on January 20th, 2019</a>, heavily inspired by <a href="https://jvns.ca/blog/2017/12/28/making-a-weekly-newsletter/">Julia Evans&rsquo;s approach</a>. I didn&rsquo;t know anything about releasing a book, but <a href="https://www.briewolfson.com/">Brie Wolfson</a> coached me through it, and having a newsletter to tell folks about the book seemed like a good idea. 
My blog had already had an RSS feed for ~12 years at that point, but RSS usage has steadily declined since the golden era of the 2000s.</p> <p>Following Julia&rsquo;s post, I set up my newsletter to run on Mailchimp, and that has mostly worked well for me over the following six years. (Looking at Julia&rsquo;s website, it looks like she subsequently moved to Convertkit at some point.) However, over time I kept running into issues with Mailchimp. Those frustrations slowly mounted:</p> <ol> <li>I could not find the &ldquo;welcome to this newsletter&rdquo; template to change the recommended posts. I think this might have been related to them rewriting their UX entirely at some point, but I didn&rsquo;t really want to become a Mailchimp expert just to update this</li> <li>Every time I changed jobs, someone would tell me that I needed to update the contact address, and each time it felt a little bit harder to find the text field to update it</li> <li>The DMARC changes were confusing to navigate within Mailchimp. DMARC enforcement was absolutely not Mailchimp&rsquo;s fault, but configuring it within Mailchimp was a fairly opaque process. Presumably the documentation is much improved at this point, but it wasn&rsquo;t great at the point I cut over</li> <li>I was paying $326/month for something that was difficult to tune to work how I wanted. The cost has increased over time, so I wasn&rsquo;t paying this much the entire time, but back-of-envelope I paid Mailchimp somewhere around $15,000 over six years</li> </ol> <p>Every year or so I considered migrating off Mailchimp to something that was more purpose-built for my needs, but never quite got around to it. However, this year I decided to go ahead and migrate. 
I did some quick research, landed on <a href="https://buttondown.com/">Buttondown</a> (<a href="https://www.jmduke.com/">whose founder</a> I happened to overlap with at Stripe), and the next newsletter on Wednesday will be coming from Buttondown rather than Mailchimp. I&rsquo;m not quite sure how much I&rsquo;ll end up paying Buttondown, but it&rsquo;ll be either $79 or $139/month.</p> <p>The cutover was very straightforward, including getting to write a bit of Django template syntax for the first time in a decade or so and some DNS setup. Now the imported archive is up at <a href="https://archive.lethain.com/">archive.lethain.com</a>, and I&rsquo;ll start sending this upcoming Wednesday. One small feature I&rsquo;d wanted for a long time on Mailchimp is the ability to vary the format depending on whether the newsletter contains one post or several, which was a quick win on Buttondown.</p> <div class="ba b--light-gray"> <p><img src="https://lethain.com/static/blog/2026/bdn_config.png" alt="Configuring Buttondown email template."></p> </div> <p>This doesn&rsquo;t mean I have plans to meaningfully change how I&rsquo;ve been newslettering for the past six years, although I hope it&rsquo;ll get a bit more interesting in 2026 versus the prior two years, as I&rsquo;ve <a href="https://lethain.com/2025-in-review/">completed my book publishing goals for the 2020s</a>, and am excited to return to writing more widely about stuff I&rsquo;m working on! There are only so many years of sharing draft chapters before it starts to feel a bit stale.</p> <p>As a final thought, two years ago I think folks would have been confused by my decision not to move to Substack, just like six years ago they would have been confused by my decision not to move to Medium. The answer here is easy for me: my goals remain consistent ownership of my work, on domains I control. 
If I were writing to directly build a business, I imagine both of those choices would have been much harder, but at this point I&rsquo;m surprisingly anchored to my desire to be <a href="https://lethain.com/writers-who-operate/">an operator who writes</a>, which is where I think the most interesting writing happens.</p>Building internal agentshttps://lethain.com/agents-series/Thu, 01 Jan 2026 09:00:00 -0800https://lethain.com/agents-series/<p>A few weeks ago in <a href="https://lethain.com/company-ai-adoption/">Facilitating AI adoption at Imprint</a>, I mentioned the internal agent workflows we are developing. This is not the core of Imprint&ndash;our core is powering co-branded credit card programs&ndash;and I wanted to document how a company like ours is developing these internal capabilities.</p> <p>Building on that post&rsquo;s ideas like a company-public prompt library for the prompts powering internal workflows, I wanted to write up some of the interesting problems and approaches we&rsquo;ve taken as we&rsquo;ve evolved our workflows, split into a series of shorter posts:</p> <ol> <li><a href="https://lethain.com/agents-skills/">Skill support</a></li> <li><a href="https://lethain.com/agents-large-files/">Progressive disclosure and large files</a></li> <li><a href="https://lethain.com/agents-context-compaction/">Context window compaction</a></li> <li><a href="https://lethain.com/agents-evals/">Evals to validate workflows</a></li> <li><a href="https://lethain.com/agents-logging/">Logging and debugability</a></li> <li><a href="https://lethain.com/agents-subagents/">Subagents</a></li> <li><a href="https://lethain.com/agents-coordinators/">Code-driven vs LLM-driven workflows</a></li> <li><a href="https://lethain.com/agents-triggers/">Triggers</a></li> <li><a href="https://lethain.com/agents-iterative-refinement/">Iterative prompt and skill refinement</a></li> </ol> <p>In the same spirit as the original post, I&rsquo;m not writing these as an industry expert 
unveiling best practices; rather, these are just the things that we&rsquo;ve specifically learned along the way. If you&rsquo;re developing internal frameworks as well, then hopefully you&rsquo;ll find something interesting in these posts.</p> <h2 id="building-your-intuition-for-agents">Building your intuition for agents</h2> <p>As more folks have read these notes, a recurring response has been, &ldquo;How do I learn this stuff?&rdquo; Although I haven&rsquo;t spent time evaluating if this is the <em>best</em> way to learn, I can share what I have found effective:</p> <ol> <li>Read a general primer on how Large Language Models work, such as <em><a href="https://www.amazon.com/AI-Engineering-Building-Applications-Foundation/dp/1098166302">AI Engineering</a></em> by Chip Huyen. You could also do a brief tutorial; you don&rsquo;t need the ability to create an LLM yourself, just a mental model of what they&rsquo;re capable of</li> <li>Build a script that uses a basic LLM API to respond to a prompt</li> <li>Extend that script to support tool calling for some basic tools like searching files in a local git repository (or whatever)</li> <li>Implement a <code>tool_search</code> tool along the lines of <a href="https://www.anthropic.com/engineering/advanced-tool-use">Anthropic Claude&rsquo;s tool_search</a>, which uses a separate context window to evaluate your current context window against available skills and return only the relevant skills to be used within your primary context window</li> <li>Implement a virtual file system, such that tools can operate on references to files that are not within the context window. 
Also add a series of tools to operate on that virtual file system like <code>load_file</code>, <code>grep_file</code>, or whatnot</li> <li>Support Agent Skills, particularly a <code>load_skills</code> tool and enhancing the prompt with available skills</li> <li>Write a post-workflow eval that runs automatically after each workflow and evaluates the quality of the workflow run</li> <li>Add context-window compaction support to keep context windows below a defined size. Make sure that some of your tool responses are large enough to threaten your context-window&rsquo;s limit, such that you&rsquo;re forced to solve that problem</li> </ol> <p>After working through the implementation of each of these features, I think you will have a strong foundation in how to build and extend these kinds of systems. The only missing piece is supporting <a href="https://lethain.com/agents-coordinators/">code-driven agents</a>, but unfortunately I think it&rsquo;s hard to demonstrate the need for code-driven agents in simple examples, because LLM-driven agents are sufficiently capable to solve most contrived examples.</p> <h2 id="why-didnt-you-just-use-x">Why didn&rsquo;t you just use X?</h2> <p>There are many existing agent frameworks, including <a href="https://platform.openai.com/docs/guides/agents-sdk">OpenAI Agents SDK</a> and <a href="https://platform.claude.com/docs/en/agent-sdk/overview">Claude&rsquo;s Agents SDK</a>. Ultimately, I think these are fairly thin wrappers, and that you&rsquo;ll learn <em>a lot more</em> by implementing these yourself, but I&rsquo;m less confident that you&rsquo;re better off long-term building your own framework.</p> <p>My general recommendation would be to build your own to throw away, and then try to build on top of one of the existing frameworks if you find any meaningful limitations. 
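</p> <p>To ground what &ldquo;building your own&rdquo; involves, here is a minimal sketch of the LLM-plus-tool-calling loop from steps 2 and 3 above. The <code>chat</code> function is a stub standing in for a real model API, and the <code>search_files</code> tool is illustrative, not any specific implementation:</p>

```python
import json

# Toy corpus for the illustrative search_files tool.
FILES = {"README.md": "An internal agent demo.", "main.py": "print('hi')"}

def search_files(query: str) -> str:
    """Return a JSON list of file names whose name or body contains `query`."""
    return json.dumps(
        [name for name, body in FILES.items() if query in name or query in body]
    )

TOOLS = {"search_files": search_files}

def chat(messages):
    """Stub for a real LLM API call. A real version would send `messages`
    plus tool schemas to a model endpoint. This stub requests one tool
    call, then answers using the tool result."""
    last = messages[-1]
    if last["role"] == "user":
        return {"tool_call": {"name": "search_files", "arguments": {"query": "demo"}}}
    return {"content": f"Found: {last['content']}"}

def run_agent(prompt: str) -> str:
    """Core loop: call the model, run any requested tool, append the
    result to the transcript, and repeat until the model returns content."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(10):  # guard against runaway tool use
        reply = chat(messages)
        if "tool_call" not in reply:
            return reply["content"]
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("exceeded tool-call budget")

print(run_agent("Which files mention the demo?"))  # Found: ["README.md"]
```

<p>A real version swaps the <code>chat</code> stub for an actual model API call and passes tool schemas along with the messages; the loop structure stays the same. 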
That said, I really don&rsquo;t regret the decision to build our own, because it&rsquo;s just so simple from a code perspective.</p> <h2 id="final-thoughts">Final thoughts</h2> <p>I think every company should be doing this work internally, very much including companies that aren&rsquo;t doing any sort of direct AI work in their product. It&rsquo;s very fun work to do, there&rsquo;s a lot of room for improvement, and having an engineer or two working on this is a relatively cheap option to derisk things if AI-enhanced techniques continue to improve as rapidly in 2026 as they did in 2025.</p>Building an internal agent: Iterative prompt and skill refinementhttps://lethain.com/agents-iterative-refinement/Thu, 01 Jan 2026 08:30:00 -0800https://lethain.com/agents-iterative-refinement/<p>Some of our internal workflows are being used quite frequently, and usage reveals gaps in the current prompts, skills, and tools. Here is how we&rsquo;re working to iterate on these internal workflows.</p> <p><em>This is part of the <a href="https://lethain.com/agents-series/">Building an internal agent</a> series.</em></p> <h2 id="why-does-iterative-refinement-matter">Why does iterative refinement matter?</h2> <p>When companies push on AI-led automation, specifically meaning LLM agent-driven automation, there are two major goals. First is the short-term goal of increasing productivity. That&rsquo;s a good goal. Second, and I think even more importantly, is the long-term goal of helping their employees build a healthy intuition for how to use various kinds of agents to accomplish complex tasks.</p> <p>If we see truly remarkable automation benefits from the LLM wave of technology, they&rsquo;re not going to come from the first wave of specific tools we build, but from the output of a new class of LLM-informed users and developers. 
There is nowhere you can simply acquire that talent; instead, it&rsquo;s talent that you have to develop in-house, and involving more folks in iterative refinement of LLM-driven systems is the most effective approach that I&rsquo;ve encountered.</p> <h2 id="how-are-we-enabling-iterative-refinement">How are we enabling iterative refinement?</h2> <p>We&rsquo;ve taken a handful of different approaches here, all of which are currently in use. From earliest to latest, our approaches have been:</p> <ol> <li> <p><strong>Being responsive to feedback</strong> is our primary mechanism for solving issues. This means both responding quickly in an internal <code>#ai</code> channel and skimming through workflows each day to see humans interacting, for better and for worse, with the agents. This is the most valuable ongoing source of improvement.</p> </li> <li> <p><strong>Owner-led refinement</strong> has been our intended primary mechanism, although in practice it&rsquo;s more of the secondary mechanism. We store our prompts in Notion documents, where they can be edited by their owners in real-time. Permissions vary on a per-document basis, but most prompts are editable by anyone at the company, as we try to facilitate rapid learning.</p> <p>Editable prompts alone aren&rsquo;t enough; these prompts also need to be discoverable. To address that, whenever an action is driven by a workflow, we include a link to the prompt. For example, a Slack message sent by a chat bot will include a link to the prompt, as will a comment in Jira.</p> </li> <li> <p><strong>Claude-enhanced, owner-led refinement</strong> via the Datadog MCP to pull logs into the repository where the skills live has been fairly effective, although mostly as a technique used by the AI Engineering team rather than directly by owners. 
Skills are a bit of a platform, as they are used by many different workflows, so it may be inevitable that they are maintained by a central team rather than by workflow owners.</p> </li> <li> <p><strong>Dashboard tracking</strong> shows how often each workflow runs and the errors associated with those runs. We also track how often each tool is used, including how frequently each skill is loaded.</p> </li> </ol> <p>My guess is that we will continue to add more refinement techniques as we go, without being able to get rid of any of the existing ones. This is sort of disappointing&ndash;I&rsquo;d love to have the same result with fewer&ndash;but I think we&rsquo;d be worse off if we cut any of them.</p> <h2 id="next-steps">Next steps</h2> <p>What we don&rsquo;t do yet, but is the necessary next step to making this truly useful, is to include a subjective post-workflow eval that determines whether the workflow was effective. While we have <a href="https://lethain.com/agents-evals/">evals to evaluate workflows</a>, this would be using evals to evaluate individual workflow runs, which would provide a very useful level of detail.</p> <h2 id="how-its-going">How it&rsquo;s going</h2> <p>In our experience thus far, there are roughly three workflow archetypes: chatbots, very well understood iterative workflows (e.g. applying <code>:merged:</code> reacji to merged PRs as discussed in <a href="https://lethain.com/agents-coordinators/">code-driven workflows</a>), and not-yet-well-understood workflows.</p> <p>Once we build a code-driven workflow, it has always worked well for us, because at that point we have built a very focused, well-understood solution. 
Conversely, chatbots are an extremely broad, amorphous problem space, and I think post-run evals will provide a high-quality dataset to improve them iteratively with a small amount of human-in-the-loop effort to nudge the evolution of their prompts and skills.</p> <p>The open question, for us anyway, is how we do a better job of identifying and iterating on the not-yet-well-understood workflows, ideally without requiring a product engineer to understand and implement each of them individually. We&rsquo;ve not <em>scalably</em> cracked this one yet, and I do think scalably cracking it is the key to whether these internal agents are merely <em>somewhat useful</em> (frequently performed tasks performed by many people eventually get automated) or truly transformative (a significant percentage of tasks, even infrequent ones performed by a small number of people, get automated).</p>Building an internal agent: Subagent supporthttps://lethain.com/agents-subagents/Wed, 31 Dec 2025 09:45:00 -0800https://lethain.com/agents-subagents/<p>Most of the extensions to our internal agent have been the direct result of running into a problem that I couldn&rsquo;t elegantly solve within our current framework. Evals, compaction, and large-file handling all fit into that category. Subagents, allowing an agent to initiate other agents, are in a different category: I&rsquo;ve frequently thought that we needed subagents, and then always found an alternative that felt more natural.</p> <p>Eventually, I decided to implement them anyway, because it seemed like an interesting problem to reason through. Eventually I would need them&hellip; right? 
(Aside: I did, indeed, eventually use subagents to support <a href="https://lethain.com/agents-coordinators/">code-driven workflows</a> invoking LLMs.)</p> <p><em>This is part of the <a href="https://lethain.com/agents-series/">Building an internal agent</a> series.</em></p> <h2 id="why-subagents-matter">Why subagents matter</h2> <p>&ldquo;Subagents&rdquo; is the name for allowing your agents to invoke other agents, which have their own system prompt, available tools, and context windows. Some of the reasons you&rsquo;re likely to consider subagents:</p> <ol> <li>They provide an effective strategy for context window management. You could provide them access to uploaded files, and then ask them to extract specific data from those files, without polluting your primary agent&rsquo;s context window with the files&rsquo; content</li> <li>You could use subagents to support concurrent work. For example, you could allow invocation of multiple subagents at once, and then join on the completion of all subagents. If your agent workflows are predominantly constrained by network IO (to e.g. model evaluation APIs), then this could support a significant reduction in clock-time to complete your workflows</li> <li>I think you could convince yourself that there are some security benefits to performing certain operations in subagents with less access. 
I don&rsquo;t actually believe that&rsquo;s meaningfully better, but you could at least introduce friction by ensuring that retrieving external resources and accessing internal resources can only occur in mutually isolated subagents</li> </ol> <p>Of all these reasons, I think that either the first or the second will be most relevant to the majority of internal workflow developers.</p> <h2 id="how-we-implemented-subagents">How we implemented subagents</h2> <p>Our implementation for subagents is quite straightforward:</p> <ol> <li>We define subagents in <code>subagents/*.yaml</code>, where each subagent has a prompt, allowed tools (or option to inherit all tools from parent agent), and a subset of the configurable fields from our agent configuration</li> <li>Each agent is configured to allow specific subagents, e.g. the <code>planning</code> subagent</li> <li>Agents invoke subagents via the <code>subagent(agent_name, prompt, files)</code> tool, which allows them to decide which virtual files are accessible within the subagent, and also the user prompt passed to the subagent (the subagent already has a default system prompt within its configuration)</li> </ol> <p>This has worked fairly well. For example, it supported the quick addition of <code>planning</code> and <code>think</code> subagents, which the parent agent can use to refine its work. We further refactored the harness that runs agents to treat top-level agents and subagents identically, so effectively every agent is a subagent.</p> <h2 id="how-this-has-worked--what-next">How this has worked / what next</h2> <p>To be totally honest, I just haven&rsquo;t found subagents to be particularly important to our current workflows. However, user-facing latency is a bit of an invisible feature: it doesn&rsquo;t matter at all until, at some point, it starts subtly creating undesirable user workflows (e.g. 
starting a different task before checking the response), so I believe long-term this will be the biggest advantage for us.</p> <p>Addendum: as alluded to in the introduction, this subagent functionality ended up being extremely useful when we introduced <a href="https://lethain.com/agents-coordinators/">code-driven workflows</a>, as it allows handing off control to the LLM for a very specific determination, before returning control to the code.</p>Building an internal agent: Code-driven vs LLM-driven workflowshttps://lethain.com/agents-coordinators/Wed, 31 Dec 2025 09:30:00 -0800https://lethain.com/agents-coordinators/<p>When I started this project, I knew deep in my heart that we could get an LLM plus tool-usage to solve arbitrarily complex workflows. I still believe this is possible, but I&rsquo;m no longer convinced this is actually a good solution. Some problems are just vastly simpler, cheaper, and faster to solve with software. This post talks about our approach to supporting both code and LLM-driven workflows, and why we decided it was necessary.</p> <p><em>This is part of the <a href="https://lethain.com/agents-series/">Building an internal agent</a> series.</em></p> <h2 id="why-determinism-matters">Why determinism matters</h2> <p>When I joined Imprint, we already had a channel where folks would share pull requests for review. It wasn&rsquo;t <em>required</em> to add pull requests to that channel, but it was often the fastest way to get someone to review them, particularly for cross-team pull requests.</p> <p>I often start my day by skimming that channel for pull requests that need a review, and I quickly realized that a pull request would often get reviewed and merged without someone adding the <code>:merged:</code> reacji to the message. This felt inefficient, but also extraordinarily minor, and not the kind of thing I want to complain about. 
Instead, I pondered how I could solve it without requiring additional human labor.</p> <p>So, I added an LLM-powered workflow to solve this. The prompt was straightforward:</p> <ol> <li>Get the last 10 messages in the Slack channel</li> <li>For each one, if there was exactly one Github pull request URL, extract that URL</li> <li>Use the Github MCP to check the status of each of those URLs</li> <li>Add the <code>:merged:</code> reacji to messages where the associated pull request was merged or closed</li> </ol> <p>This worked so well! So, so well. Except, ahh, except that it sometimes decided to add <code>:merged:</code> to pull requests that weren&rsquo;t merged. Then no one would look at those pull requests. So, it worked in concept&ndash;so much smart tool usage!&ndash;but in practice it actually didn&rsquo;t solve the problem I was trying to solve: erroneous additions of the reacji meant folks couldn&rsquo;t evaluate whether to look at a given pull request in the channel based on the reacji&rsquo;s presence.</p> <p>(As an aside, some people really don&rsquo;t like the term <code>reacji</code>. Don&rsquo;t complain to me about it, this is <a href="https://docs.slack.dev/reference/methods/reactions.add/">what Slack calls them</a>.)</p> <h2 id="how-we-implemented-support-for-code-driven-workflows">How we implemented support for code-driven workflows</h2> <p>Our LLM-driven workflows are orchestrated by a software handler. That handler works something like:</p> <ol> <li>Trigger comes in, and the handler selects which configuration corresponds with the trigger</li> <li>Handler uses that configuration and trigger to pull the associated prompt, load the approved tools, and generate the available list of virtual files (e.g. files attached to a Jira issue or Slack message)</li> <li>Handler sends the prompt and available tools to an LLM, then coordinates tool calls based on the LLM&rsquo;s response, including e.g. making virtual files available to tools. 
The handler also has termination conditions where it prevents excessive tool usage, and so on</li> <li>Eventually the LLM will stop recommending tools, and the final response from the LLM will be used or discarded depending on the configuration (e.g. configuration can determine whether the final response is sent to Slack)</li> </ol> <p>We updated our configuration to allow running in one of two configurations:</p> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#75715e"># this is default behavior if omitted</span> </span></span><span style="display:flex;"><span><span style="color:#f92672">coordinator</span>: <span style="color:#ae81ff">llm</span> </span></span><span style="display:flex;"><span> </span></span><span style="display:flex;"><span><span style="color:#75715e"># this is code-driven workflow</span> </span></span><span style="display:flex;"><span><span style="color:#f92672">coordinator</span>: <span style="color:#ae81ff">script</span> </span></span><span style="display:flex;"><span><span style="color:#f92672">coordinator_script</span>: <span style="color:#ae81ff">scripts/pr_merged.py</span> </span></span></code></pre></div><p>When the <code>coordinator</code> is set to <code>script</code>, then instead of using the handler to determine which tools are called, custom Python is used. That Python code has access to the same tools, trigger data, and virtual files as the LLM-handling code. 
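</p> <p>As an illustration, a coordinator script for the merged-PR workflow might look like this sketch. The <code>tools</code> interface and its method names (<code>slack_history</code>, <code>github_pr_state</code>, <code>slack_react</code>) are hypothetical stand-ins, not our actual tool names:</p>

```python
import re

# Matches GitHub pull request URLs like https://github.com/org/repo/pull/123
PR_URL = re.compile(r"https://github\.com/[\w.-]+/[\w.-]+/pull/\d+")

def run(tools) -> int:
    """Deterministic reacji workflow: with no LLM in the loop, :merged: is
    only ever added when the pull request is actually merged or closed."""
    tagged = 0
    for message in tools.slack_history(limit=10):
        urls = PR_URL.findall(message["text"])
        if len(urls) != 1:  # only handle messages with exactly one PR link
            continue
        if tools.github_pr_state(urls[0]) in ("merged", "closed"):
            tools.slack_react(message["ts"], "merged")
            tagged += 1
    return tagged
```

<p>Because this is plain code, the failure mode of the LLM-driven version&ndash;tagging unmerged pull requests&ndash;cannot occur. 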
It can use the <a href="https://lethain.com/agents-subagents/">subagent</a> tool to invoke an LLM where useful (and that subagent can have full access to tools as well), but LLM control only occurs when explicitly desired.</p> <p>This means that these scripts&ndash;which are written and checked in by our software engineers, going through code review and so on&ndash;have the same permissions and capabilities as the LLM, although, since it&rsquo;s just code, any given commit could also introduce a new dependency, and so on.</p> <h2 id="hows-it-working--next-steps">How&rsquo;s it working? / Next steps?</h2> <p>Altogether, this has worked very well for complex workflows. I would describe it as a &ldquo;solution of frequent resort&rdquo;: we use code-driven workflows as a progressive enhancement where LLM prompts and tools alone aren&rsquo;t reliable or quick enough. We still start all workflows using the LLM, which works for many cases. When we do rewrite, Claude Code can almost always convert the prompt into a code workflow in one shot.</p> <p>Even as models get more powerful, relying on them narrowly, in cases where we truly need intelligence rather than for rote iterative workflows, seems like a long-term addition to our toolkit.</p>Building an internal agent: Logging and debuggabilityhttps://lethain.com/agents-logging/Wed, 31 Dec 2025 09:15:00 -0800https://lethain.com/agents-logging/<p>Agents are extremely impressive, but they also introduce a lot of non-determinism, and non-determinism means sometimes weird things happen. To combat that, we&rsquo;ve needed to instrument our workflows to make it possible to debug why things are going wrong.</p> <p><em>This is part of the <a href="https://lethain.com/agents-series/">Building an internal agent</a> series.</em></p> <h2 id="why-logging-matters">Why logging matters</h2> <p>Whenever an agent does something sub-optimal, folks flag it as a bug.
Often, the &ldquo;bug&rdquo; is ambiguity in the prompt that led to sub-optimal tool usage. That makes <em>me</em> feel better, but it doesn&rsquo;t make the folks relying on these tools feel any better: they just expect the tools to work.</p> <p>This means that debugging unexpected behavior is a significant part of rolling out agents internally, and it&rsquo;s important to make that debugging easy enough to do frequently. If it takes too much time or effort, or requires too many permissions, then your agents simply won&rsquo;t get used.</p> <h2 id="how-we-implemented-logging">How we implemented logging</h2> <p>Our agents run in an AWS Lambda, so the very first pass at logging was simply printing to standard out to be captured in the Lambda&rsquo;s logs. This worked OK for the very first steps, but it also meant that I had to log into AWS every time something went wrong, and many engineers didn&rsquo;t know where to find the logs.</p> <p>The second pass was creating the <code>#ai-logs</code> channel, where every workflow run shared its configuration, tools used, and a link to the AWS URL where logs could be found. This was a step up, but still required a bunch of log spelunking to answer basic questions.</p> <p>The third pass, which is our current implementation, was integrating <a href="https://docs.datadoghq.com/llm_observability/">Datadog&rsquo;s LLM Observability</a>, which provides an easy-to-use mechanism to view each span within the LLM workflow, making it straightforward to debug nuanced issues without digging through a bunch of logs. This is a massive improvement.</p> <p>It&rsquo;s also worth noting that the Datadog integration made it easy to introduce dashboarding for our internal efforts, which has been a very helpful, previously missing ingredient in our work.</p> <h2 id="how-is-it-working--whats-next">How is it working? / What&rsquo;s next?</h2> <p>I&rsquo;ll be honest: the Datadog LLM observability toolkit is just great.
The only problem I have at this point is that we mostly constrain Datadog accounts to folks within the technology organization, so workflow debugging isn&rsquo;t very accessible outside that group. However, in practice there are very few folks actively debugging these workflows who don&rsquo;t already have access, so it&rsquo;s more of a philosophical issue than a practical one.</p>
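<p>The span-per-step structure that makes this kind of observability useful can be approximated even with plain structured logging. A minimal sketch (hypothetical names; this is not the Datadog SDK):</p>

```python
# Minimal sketch of span-style workflow logging, approximating the
# structure an LLM observability tool provides. Hypothetical names;
# this is not the Datadog SDK.
import json
import time
import uuid
from contextlib import contextmanager

SPANS: list = []  # stand-in for a log sink or observability backend

@contextmanager
def span(workflow: str, step: str, **attrs):
    """Record one step (prompt, tool call, ...) of a workflow run."""
    record = {"id": str(uuid.uuid4()), "workflow": workflow,
              "step": step, "attrs": attrs, "start": time.time()}
    try:
        yield record
        record["status"] = "ok"
    except Exception as exc:
        record["status"] = f"error: {exc}"
        raise
    finally:
        record["duration_s"] = time.time() - record["start"]
        SPANS.append(record)
        print(json.dumps(record, default=str))  # e.g. captured by Lambda logs
```

<p>Each workflow run then emits one structured record per step, which is often enough to answer &ldquo;which tool call went wrong?&rdquo; without log spelunking.</p>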