Irrational Exuberancehttps://lethain.com/Recent content on Irrational ExuberanceHugo -- gohugo.ioen-usWill LarsonSun, 04 Jan 2026 08:00:00 -0800Building internal agentshttps://lethain.com/agents-series/Thu, 01 Jan 2026 09:00:00 -0800https://lethain.com/agents-series/<p>A few weeks ago in <a href="https://lethain.com/company-ai-adoption/">Facilitating AI adoption at Imprint</a>, I mentioned our internal agent workflows that we are developing. This is not the core of Imprint&ndash;our core is powering co-branded credit card programs&ndash;and I wanted to document how a company like ours is developing these internal capabilities.</p> <p>Building on that post&rsquo;s ideas like a company-public prompt library for the prompts powering internal workflows, I wanted to write up some of the interesting problems and approaches we&rsquo;ve taken as we&rsquo;ve evolved our workflows, split into a series of shorter posts:</p> <ol> <li><a href="https://lethain.com/agents-skills/">Skill support</a></li> <li><a href="https://lethain.com/agents-large-files/">Progressive disclosure and large files</a></li> <li><a href="https://lethain.com/agents-context-compaction/">Context window compaction</a></li> <li><a href="https://lethain.com/agents-evals/">Evals to validate workflows</a></li> <li><a href="https://lethain.com/agents-logging/">Logging and debugability</a></li> <li><a href="https://lethain.com/agents-subagents/">Subagents</a></li> <li><a href="https://lethain.com/agents-coordinators/">Code-driven vs LLM-driven workflows</a></li> <li><a href="https://lethain.com/agents-triggers/">Triggers</a></li> <li><a href="https://lethain.com/agents-iterative-refinement/">Iterative prompt and skill refinement</a></li> </ol> <p>In the same spirit as the original post, I&rsquo;m not writing these as an industry expert unveiling best practice, rather these are just the things that we&rsquo;ve specifically learned along the way. If you&rsquo;re developing internal frameworks as well, then hopefully you&rsquo;ll find something interesting in these posts.</p> <h2 id="building-your-intuition-for-agents">Building your intuition for agents</h2> <p>As more folks have read these notes, a recurring response has been, &ldquo;How do I learn this stuff?&rdquo; Although I haven&rsquo;t spent time evaluating if this is the <em>best</em> way to learn, I can share what I have found effective:</p> <ol> <li>Reading a general primer on how Large Language Models work, such as <em><a href="https://www.amazon.com/AI-Engineering-Building-Applications-Foundation/dp/1098166302">AI Engineering</a></em> by Chip Huyen. You could also do a brief tutorial too, you don&rsquo;t need the ability to create an LLM yourself, just a mental model of what they&rsquo;re capable of</li> <li>Build a script that uses a basic LLM API to respond to a prompt</li> <li>Extend that script to support tool calling for some basic tools like searching files in a local git repository (or whatever)</li> <li>Implement a <code>tool_search</code> tool along the lines of <a href="https://www.anthropic.com/engineering/advanced-tool-use">Anthropic Claude&rsquo;s tool_search</a>, which uses a separate context window to evaluate your current context window against available skills and return only the relevant skills to be used within your primary context window</li> <li>Implement a virtual file system, such that tools can operate on references to files that are not within the context window. 
Also add a series of tools to operate on that virtual file system like <code>load_file</code>, <code>grep_file</code>, or whatnot</li> <li>Support Agent Skills, particularly <code>load_skills</code> tool and enhancing the prompt with available skills</li> <li>Write post-workflow eval that runs automatically after each workflow and evaluates the quality of the workflow run</li> <li>Add context-window compaction support to keep context windows below a defined size Make sure that some of your tool responses are large enough to threaten your context-window&rsquo;s limit, such that you&rsquo;re forced to solve that problem</li> </ol> <p>After working through the implementation of each of these features, I think you will have a strong foundation into how to build and extend these kinds of systems. The only missing piece is supporting <a href="https://lethain.com/agents-coordinators/">code-driven agents</a>, but unfortunately I think it&rsquo;s hard to demonstrate the need of code-driven agents in simple examples, because LLM-driven agents are sufficiently capable to solve most contrived examples.</p> <h2 id="why-didnt-you-just-use-x">Why didn&rsquo;t you just use X?</h2> <p>There are many existing agent frameworks, including <a href="https://platform.openai.com/docs/guides/agents-sdk">OpenAI Agents SDK</a> and <a href="https://platform.claude.com/docs/en/agent-sdk/overview">Claude&rsquo;s Agents SDK</a>. Ultimately, I think these are fairly thin wrappers, and that you&rsquo;ll learn <em>a lot more</em> by implementing these yourself, but I&rsquo;m less confident that you&rsquo;re better off long-term building your own framework.</p> <p>My general recommendation would be to build your own to throw away, and then try to build on top of one of the existing frameworks if you find any meaningful limitations. That said, I really don&rsquo;t regret the decision to build our own, because it&rsquo;s just so simple from a code perspective.</p> <h2 id="final-thoughts">Final thoughts</h2> <p>I think every company should be doing this work internally, very much including companies that aren&rsquo;t doing any sort of direct AI work in their product. It&rsquo;s very fun work to do, there&rsquo;s a lot of room for improvement, and having an engineer or two working on this is a relatively cheap option to derisk things if AI-enhanced techniques continue to improve as rapidly in 2026 as they did in 2025.</p>Building an internal agent: Iterative prompt and skill refinementhttps://lethain.com/agents-iterative-refinement/Thu, 01 Jan 2026 08:30:00 -0800https://lethain.com/agents-iterative-refinement/<p>Some of our internal workflows are being used quite frequently, and usage reveals gaps in the current prompts, skills, and tools. Here is how we&rsquo;re working to iterate on these internal workflows.</p> <p><em>This is part of the <a href="https://lethain.com/agents-series/">Building an internal agent</a> series.</em></p> <h2 id="why-does-iterative-refinement-matter">Why does iterative refinement matter?</h2> <p>When companies push on AI-led automation, specifically meaning LLM agent-driven automation, there are two major goals. First is the short-term goal of increasing productivity. That&rsquo;s a good goal. 
Second, and I think even more importantly, is the long-term goal of helping their employees build a healthy intuition for how to use various kinds of agents to accomplish complex tasks.</p> <p>If we see truly remarkable automation benefits from the LLM wave of technology, it&rsquo;s not going to come from the first-wave of specific tools we build, but the output of a new class of LLM-informed users and developers. There is nowhere that you can simply acquire that talent, instead it&rsquo;s talent that you have to develop inhouse, and involving more folks in iterative refinement of LLM-driven systems is the most effective approach that I&rsquo;ve encountered.</p> <h2 id="how-are-we-enabling-iterative-refinement">How are we enabling iterative refinement?</h2> <p>We&rsquo;ve taken a handful of different approaches here, all of which are currently in use. From earliest to latest, our approaches have been:</p> <ol> <li> <p><strong>Being responsive to feedback</strong> is our primary mechanism for solving issues. This is both responding quickly in an internal <code>#ai</code> channel, but also skimming through workflows each day to see humans interacting, for better and for worse, with the agents. This is the most valuable ongoing source of improvement.</p> </li> <li> <p><strong>Owner-led refinement</strong> has been our intended primary mechanism, although in practice it&rsquo;s more of the secondary mechanism. We store our prompts in Notion documents, where they can be edited by their owners in real-time. Permissions vary on a per-document basis, but most prompts are editable by anyone at the company, as we try to facilitate rapid learning.</p> <p>Editable prompts alone aren&rsquo;t enough, these prompts also need to be discoverable. To address that, whenever an action is driven by a workflow, we include a link to the prompt. For example, a Slack message sent by a chat bot will include a link to the prompt, as well a comment in Jira.</p> </li> <li> <p><strong>Claude-enhanced, owner-led refinement</strong> via the Datadog MCP to pull logs into the repository where the skills live has been fairly effective, although mostly as a technique used by the AI Engineering team rather than directly by owners. Skills are a bit of a platform, as they are used by many different workflows, so it may be inevitable that they are maintained by a central team rather than by workflow owners.</p> </li> <li> <p><strong>Dashboard tracking</strong> shows how often each workflow runs and errors associated with those runs. We also track how often each tool is used, including how frequently each skill is loaded.</p> </li> </ol> <p>My guess is that we will continue to add more refinement techniques as we go, without being able to get rid of any of the existing ones. This is sort of disappointing&ndash;I&rsquo;d love to have the same result with fewer&ndash;but I think we&rsquo;d be worse off if we cut any of them.</p> <h2 id="next-steps">Next steps</h2> <p>What we don&rsquo;t do yet, but is the necessary next step to making this truly useful, is to include a subjective post-workflow eval that determines whether the workflow was effective. 
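To make that concrete, a minimal post-run judge could look something like the sketch below, where the <code>grade_run</code> helper, the transcript format, and the model choice are all illustrative assumptions rather than something we&rsquo;ve shipped:
<pre><code class="language-python"># Hypothetical post-run judge: grade one workflow run's transcript with an LLM.
# The prompt, transcript shape, and model name are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are reviewing a completed agent workflow run. "
    "Given the transcript, reply with JSON: "
    '{"effective": true or false, "reason": "one sentence"}'
)

def grade_run(transcript: list[dict]) -> dict:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": json.dumps(transcript)},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
</code></pre>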
While we have <a href="https://lethain.com/agents-evals/">evals to evaluate workflows</a>, this would be using evals to evaluate individual workflow runs, which would provide a level of very useful detail to understand.</p> <h2 id="how-its-going">How it&rsquo;s going</h2> <p>In our experience thus far, there are roughly three workflow archetypes: chatbots, very well understood iterative workflows (e.g. applying <code>:merge:</code> reacji to merged PRs as discussed in <a href="https://lethain.com/agents-coordinators/">code-driven workflows</a>), and not-yet-well-understood workflows.</p> <p>Once we build a code-driven workflow, they have always worked well for us, because we have built a very focused, well-understood solution at that point. Conversely, chatbots are an extremely broad, amorphous problem space, and I think post-run evals will provide a high quality dataset to improve them iteratively with a small amount of human-in-the-loop to nudge the evolution of their prompts and skills.</p> <p>The open question, for us anyway, is how we do a better job of identifying and iterating on the not-yet-well-understood workflows. Ideally without requiring a product engineer to understand and implement each of them individually. We&rsquo;ve not <em>scalably</em> cracked this one yet, and I do think scalably cracking it is the key to whether these internal agents are <em>somewhat useful</em> (frequently performed tasks performed by many people eventually get automated) and are truly transformative (a significant percentage of tasks, even infrequent ones performed by a small number of people get automated).</p>Building an internal agent: Subagent supporthttps://lethain.com/agents-subagents/Wed, 31 Dec 2025 09:45:00 -0800https://lethain.com/agents-subagents/<p>Most of the extensions to our internal agent have been the direct result of running into a problem that I couldn&rsquo;t elegantly solve within our current framework. Evals, compaction, large-file handling all fit into that category. Subagents, allowing an agent to initiate other agents, are in a different category: I&rsquo;ve frequently thought that we needed subagents, and then always found an alternative that felt more natural.</p> <p>Eventually, I decided to implement them anyway, because it seemed like an interesting problem to reason through. Eventually I would need them&hellip; right? (Aside: I did, indeed, eventually use subagents to support <a href="https://lethain.com/agents-coordinators/">code-driven workflows</a> invoking LLMs.)</p> <p><em>This is part of the <a href="https://lethain.com/agents-series/">Building an internal agent</a> series.</em></p> <h2 id="why-subagents-matter">Why subagents matter</h2> <p>&ldquo;Subagents&rdquo; is the name for allowing your agents to invoke other agents, which have their own system prompt, available tools, and context windows. Some of the reasons you&rsquo;re likely to consider subagents:</p> <ol> <li>Provide an effective strategy for context window management. You could provide them access to uploaded files, and then ask them to extract specific data from those files, without polluting your primary agent&rsquo;s context window with the files&rsquo; content</li> <li>You could use subagents to support concurrent work. For example, you could allow invocation of multiple subagents at once, and then join on the completion of all subagents. If your agent workflows are predominantly constrained by network IO (to e.g. 
model evaluation APIs), then this could support significant reduction in clock-time to complete your workflows</li> <li>I think you could convince yourself that there are some security benefits to performing certain operations in subagents with less access. I don&rsquo;t actually believe that&rsquo;s meaningfully better, but you could at least introduce friction by ensuring that retrieving external resources and accessing internal resources can only occur in mutually isolated subagents</li> </ol> <p>Of all these reasons, I think that either the first or the second will be most relevant to the majority of internal workflow developers.</p> <h2 id="how-we-implemented-subagents">How we implemented subagents</h2> <p>Our implementation for subagents is quite straightforward:</p> <ol> <li>We define subagents in <code>subagents/*.yaml</code>, where each subagent has a prompt, allowed tools (or option to inherit all tools from parent agent), and a subset of the configurable fields from our agent configuration</li> <li>Each agent is configured to allow specific subagents, e.g. the <code>planning</code> subagent</li> <li>Agents invoke subagents via the <code>subagent(agent_name, prompt, files)</code> tool, which allows them to decide which virtual files are accessible within the subagent, and also the user prompt passed to the subagent (the subagent already has a default system prompt within its configuration)</li> </ol> <p>This has worked fairly well. For example, supporting the quick addition of <code>planning</code> and <code>think</code> subagents which the parent agent can use to refine its work. We further refactored the implementation of the harness running agents to be equivalent to subagents, where effectively every agent is a subagent, and so forth.</p> <h2 id="how-this-has-worked--what-next">How this has worked / what next</h2> <p>To be totally honest, I just haven&rsquo;t found subagents to be particularly important to our current workflows. However, user-facing latency is a bit of an invisible feature, with it not mattering at all until at some point it starts subtly creating undesirable user workflows (e.g. starting a different task before checking the response), so I believe long-term this will be the biggest advantage for us.</p> <p>Addendum: as alluded to in the introduction, this subagents functionality ended up being extremely useful when we introduced <a href="https://lethain.com/agents-coordinators/">code-driven workflows</a>, as it allows handing off control to the LLM for a very specific determination, before returning control to the code.</p>Building an internal agent: Code-driven vs LLM-driven workflowshttps://lethain.com/agents-coordinators/Wed, 31 Dec 2025 09:30:00 -0800https://lethain.com/agents-coordinators/<p>When I started this project, I knew deep in my heart that we could get an LLM plus tool-usage to solve arbitrarily complex workflows. I still believe this is possible, but I&rsquo;m no longer convinced this is actually a good solution. Some problems are just vastly simpler, cheaper, and faster to solve with software. This post talks about our approach to supporting both code and LLM-driven workflows, and why we decided it was necessary.</p> <p><em>This is part of the <a href="https://lethain.com/agents-series/">Building an internal agent</a> series.</em></p> <h2 id="why-determinism-matters">Why determinism matters</h2> <p>When I joined Imprint, we already had a channel where folks would share pull requests for review. 
It wasn&rsquo;t <em>required</em> to add pull requests to that channel, but it was often the fastest way to get someone to review it, particularly for cross-team pull requests.</p> <p>I often start my day by skimming for pull requests that need a review in that channel, and quickly realized that often a pull request would get reviewed and merged without someone adding the <code>:merged:</code> reacji onto the chat. This felt inefficient, but also extraordinarily minor, and not the kind of thing I want to complain about. Instead, I pondered how I could solve it without requiring additional human labor.</p> <p>So, I added an LLM-powered workflow to solve this. The prompt was straightforward:</p> <ol> <li>Get the last 10 messages in the Slack channel</li> <li>For each one, if there was exactly one Github pull request URL, extract that URL</li> <li>Use the Github MCP to check the status of each of those URLs</li> <li>Add the <code>:merged:</code> reacji to messages where the associated pull request was merged or closed</li> </ol> <p>This worked so well! So, so well. Except, ahh, except that it sometimes decided to add <code>:merged:</code> to pull requests that weren&rsquo;t merged. Then no one would look at those pull requests. So, it worked in concept&ndash;so much smart tool usage!&ndash;but in practice it actually didn&rsquo;t solve the problem I was trying to solve: erroneous additions of the reacji meant folks couldn&rsquo;t evaluate whether to look at a given pull request in the channel based on the reacji&rsquo;s presence.</p> <p>(As an aside, some people really don&rsquo;t like the term <code>reacji</code>. Don&rsquo;t complain to me about it, this is <a href="https://docs.slack.dev/reference/methods/reactions.add/">what Slack calls them</a>.)</p> <h2 id="how-we-implemented-support-for-code-driven-workflows">How we implemented support for code-driven workflows</h2> <p>Our LLM-driven workflows are orchestrated by a software handler. That handler works something like:</p> <ol> <li>Trigger comes in, and the handler selects which configuration corresponds with the trigger</li> <li>Handler uses that configuration and trigger to pull the associated prompt, load the approved tools, and generate the available list of virtual files (e.g. files attached to a Jira issue or Slack message)</li> <li>Handler sends the prompt and available tools to an LLM, then coordinates tool calls based on the LLM&rsquo;s response, including e.g. making virtual files available to tools. The handler also has termination conditions where it prevents excessive tool usage, and so on</li> <li>Eventually the LLM will stop recommending tools, and the final response from the LLM will be used or discarded depending on the configuration (e.g. 
configuration can determine whether the final response is sent to Slack)</li> </ol> <p>We updated our configuration to allow running in one of two configurations:</p> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#75715e"># this is default behavior if omitted</span> </span></span><span style="display:flex;"><span><span style="color:#f92672">coordinator</span>: <span style="color:#ae81ff">llm</span> </span></span><span style="display:flex;"><span> </span></span><span style="display:flex;"><span><span style="color:#75715e"># this is code-driven workflow</span> </span></span><span style="display:flex;"><span><span style="color:#f92672">coordinator</span>: <span style="color:#ae81ff">script</span> </span></span><span style="display:flex;"><span><span style="color:#f92672">coordinator_script</span>: <span style="color:#ae81ff">scripts/pr_merged.py</span> </span></span></code></pre></div><p>When the <code>coordinator</code> is set to <code>script</code>, then instead of using the handler to determine which tools are called, custom Python is used. That Python code has access to the same tools, trigger data, and virtual files as the LLM-handling code. It can use the <a href="https://lethain.com/agents-subagents/">subagent</a> tool to invoke an LLM where useful (and that subagent can have full access to tools as well), but LLM control only occurs when explicitly desired.</p> <p>This means that these scripts&ndash;which are being written and checked in by our software engineers, going through code review and so on&ndash;have the same permission and capabilities as the LLM, although given it&rsquo;s just code, any given commit could also introduce a new dependency, etc.</p> <h2 id="hows-it-working--next-steps">How&rsquo;s it working? / Next steps?</h2> <p>Altogether, this has worked very well for complex workflows. I would describe it as a &ldquo;solution of frequent resort&rdquo;, where we use code-driven workflows as a progressive enhancement for workflows where LLM prompts and tools aren&rsquo;t reliable or quick enough. We still start all workflows using the LLM, which works for many cases. When we do rewrite, Claude Code can almost always rewrite the prompt into the code workflow in one-shot.</p> <p>Even as models get more powerful, relying on them narrowly in cases where we truly need intelligence, rather than for iterative workflows, seems like a long-term addition to our toolkit.</p>Building an internal agent: Logging and debugabilityhttps://lethain.com/agents-logging/Wed, 31 Dec 2025 09:15:00 -0800https://lethain.com/agents-logging/<p>Agents are extremely impressive, but they also introduce a lot of non-determinism, and non-determinism means sometimes weird things happen. To combat that, we&rsquo;ve needed to instrument our workflows to make it possible to debug why things are going wrong.</p> <p><em>This is part of the <a href="https://lethain.com/agents-series/">Building an internal agent</a> series.</em></p> <h2 id="why-logging-matters">Why logging matters</h2> <p>Whenever an agent does something sub-optimal, folks flag it as a bug. Often, the &ldquo;bug&rdquo; is ambiguity in the prompt that led to sub-optimal tool usage. 
That makes <em>me</em> feel better, but it doesn&rsquo;t make the folks relying on these tools feel any better: they just expect the tools to work.</p> <p>This means that debugging unexpected behavior is a significant part of rolling out agents internally, and it&rsquo;s important to make it easy enough to do it frequently. If it takes too much time, effort or too many permissions, then your agents simply won&rsquo;t get used.</p> <h2 id="how-we-implemented-logging">How we implemented logging</h2> <p>Our agents run in an AWS Lambda, so the very first pass at logging was simply printing to standard out to be captured in the Lambda&rsquo;s logs. This worked OK for the very first steps, but also meant that I had to log into AWS every time something went wrong, and even many engineers didn&rsquo;t know where to find logs.</p> <p>The second pass was creating the <code>#ai-logs</code> channel, where every workflow run shared its configuration, tools used, and a link to the AWS URL where logs could be found. This was a step up, but still required a bunch of log spelunking to answer basic questions.</p> <p>The third pass, which is our current implementation, was integrating <a href="https://docs.datadoghq.com/llm_observability/">Datadog&rsquo;s LLM Observability</a> which provides an easy to use mechanism to view each span within the LLM workflow, making it straightforward to debug nuanced issues without digging through a bunch of logs. This is a massive improvement.</p> <p>It&rsquo;s also worth noting that the Datadog integration also made it easy to introduce dashboarding for our internal efforts, which has been a very helpful, missing ingredient to our work.</p> <h2 id="how-is-it-working--whats-next">How is it working? / What&rsquo;s next?</h2> <p>I&rsquo;ll be honest: the Datadog LLM observability toolkit is just great. The only problem I have at this point is that we mostly constrain Datadog accounts to folks within the technology organization, so workflow debugging isn&rsquo;t very accessible to folks outside that team. However, in practice there are very few folks who would be actively debugging these workflows who don&rsquo;t already have access, so it&rsquo;s more of a philosophical issue than a practical one.</p>Building an internal agent: Evals to validate workflowshttps://lethain.com/agents-evals/Wed, 31 Dec 2025 09:00:00 -0800https://lethain.com/agents-evals/<p>Whenever a new pull request is submitted to our agent&rsquo;s GitHub repository, we run a bunch of CI/CD operations on it. We run an opinionated linter, we run typechecking, and we run a bunch of unittests. All of these work well, but none of them test entire workflows end-to-end. For that end-to-end testing, we introduced an <a href="https://platform.openai.com/docs/guides/evals">eval</a> pipeline.</p> <p><em>This is part of the <a href="https://lethain.com/agents-series/">Building an internal agent</a> series.</em></p> <h2 id="why-evals-matter">Why evals matter</h2> <p>The harnesses that run agents have a lot of interesting nuance, but they&rsquo;re generally pretty simple: some virtual file management, some tool invocation, and some context window management. However, it&rsquo;s very easy to create prompts that don&rsquo;t work well, despite the correctness of all the underlying pieces. 
Evals are one tool to solve that, exercising your prompts and tools together and grading the results.</p> <h2 id="how-we-implemented-it">How we implemented it</h2> <p>I had the good fortune to lead Imprint&rsquo;s implementation of <a href="https://sierra.ai/">Sierra</a> for chat and voice support, and I want to acknowledge that their approach has deeply informed my view of what does and doesn&rsquo;t work well here.</p> <p>The key components of Sierra&rsquo;s approach are:</p> <ol> <li>Sierra implements agents as a mix of React-inspired code that provide tools and progressively-disclosed context, and a harness runner that runs that software.</li> <li>Sierra allows your code to assign tags to conversations such as &ldquo;otp-code-sent&rdquo; or &ldquo;lang-spanish&rdquo; which can be used for filtering conversations, as well as other usecases discussed shortly.</li> <li>Every tool implemented for a Sierra agent has both a true and a mock implementation. For example, for a tool that searches a knowledge base, the true version would call its API directly, and the mock version would return a static (or locally generated) version for use in testing.</li> <li>Sierra names their eval implementation as &ldquo;simulations.&rdquo; You can create any number of simulations either in code or via the UI-driven functionality.</li> <li>Every evaluation has an initial prompt, metadata about the situation that&rsquo;s available to the software harness running the agent, and criteria to evaluate whether an evaluation succeeds.</li> <li>These evaluation criteria are both subjective and objective. The subjective criteria are &ldquo;agent as judge&rdquo; to assess whether certain conditions were met (e.g. was the response friendly?). The objective criteria are whether specific tags (&ldquo;login-successful&rdquo;) were, or were not (&ldquo;login-failed&rdquo;) added to a conversation.</li> </ol> <p>Then when it comes to our approach, we basically just reimplemented that approach as it&rsquo;s worked well for us. For example, the following image is the configuration for an eval we run.</p> <div class="ba b--light-gray"> <p><img src="https://lethain.com/static/blog/2025/evals-config.png" alt="YAML configuration for an eval showing a Slack reaction JIRA workflow test with expected tools and evaluation criteria"></p> </div> <p>Then whenever a new PR is opened, these run automatically along with our other automation.</p> <div class="ba b--light-gray"> <p><img src="https://lethain.com/static/blog/2025/evals-gha.png" alt="GitHub Actions bot comment showing eval results with 6 failed tests and tool mismatch details"></p> </div> <p>While we largely followed the map laid out by Sierra&rsquo;s implementation, we did diverge on the tags concept. For objective evaluation, we rely exclusively on tools that are, or are not, called. Sierra&rsquo;s tag implementation is more flexible, but since our workflows are predominantly prompt-driven rather than code-driven, it&rsquo;s not an easy one for us to adopt</p> <p>Altogether, following this standard implementation worked well for us.</p> <h2 id="how-is-it-working">How is it working?</h2> <p>Ok, this is working well, but not nearly as well as I hoped it would. The core challenge is the non-determinism introduced by these eval tests, where in practice there&rsquo;s very strong signal when they all fail, and strong signal when they all pass, but most runs are in between those two. 
A big part of that is sloppy eval prompts and sloppy mock tool results, and I&rsquo;m pretty confident I could get them passing more reliably with some effort (e.g. I did get our Sierra tests <em>almost</em> always passing by tuning them closely, although even they aren&rsquo;t perfectly reliable).</p> <p>The biggest issue is that our reliance on prompt-driven workflows rather than code-driven workflows introduces a lot of non-determinism, which I don&rsquo;t have a way to solve without the aforementioned prompt and mock tuning.</p> <h2 id="whats-next">What&rsquo;s next?</h2> <p>There are three obvious follow ups:</p> <ol> <li>More tuning on prompts and mocked tool calls to make the evals more probabilistically reliable</li> <li>I&rsquo;m embarrassed to say it out loud, but I suspect we need to retry failed evals to see if they pass e.g. &ldquo;at least once in three tries&rdquo; to make this something we can introduce as a blocking mechanism in our CI/CD</li> <li>This highlights the general limitation of LLM-driven workflows, and I suspect that I&rsquo;ll have to move more complex workflows away from LLM-driven workflows to get them to work more consistently</li> </ol> <p>Altogether, I&rsquo;m very glad that we introduced evals, they are an essential mechanism for us to evaluate our workflows, but we&rsquo;ve found them difficult to consistently operationalize as something we can rely on as a blocking tool rather than directionally relevant context.</p>Building an internal agent: Triggershttps://lethain.com/agents-triggers/Wed, 31 Dec 2025 08:00:00 -0800https://lethain.com/agents-triggers/<p>An internal agent only provides value when its workflows are initiated. Building out a library of workflow initializations, which we call triggers, is a core part of building an internal agent.</p> <p><em>This is part of the <a href="https://lethain.com/agents-series/">Building an internal agent</a> series.</em></p> <h2 id="why-triggers-matter">Why triggers matter</h2> <p>While there&rsquo;s a lot of chatter about AI empowering employees to trivially automate their day-to-day workflows, building these workflows requires an intuition about how agents work that requires iterative learning to develop, and few workplaces are providing the tools to facilitate that iterative learning. Easy triggers, combined with easy prompt iteration, are a foundational part of iterative learning.</p> <p>More practically, triggers are also the mechanisms that initiate workflows, so nothing else matters if they aren&rsquo;t effective and usable.</p> <h2 id="why-not-zapier-or-n8n">Why not Zapier or n8n?</h2> <p>The obvious question here is &ldquo;why not Zapier?&rdquo; and &ldquo;why not n8n?&rdquo;, both of which would have solved this triggering problem in its entirety. For what it&rsquo;s worth, you still could use Zapier or n8n to trigger agent workflows using a custom webhook trigger, so these approaches aren&rsquo;t incompatible. That said, for me there are only a small number of workflows I thought would matter, and I wanted to have full control of the nuances as we worked on the problem.</p> <p>The &ldquo;full control&rdquo; piece ties back to one of my underlying theses of this work: the quality details facilitate adoption in a way that Zapier integration&rsquo;s constraints simply do not. 
This is why I think internal agents need to spend so much time managing things like Slack-entity resolution to become a seamless experience.</p> <h2 id="how-we-implemented-triggers">How we implemented triggers</h2> <p>We&rsquo;ve implemented these triggers, and in this order:</p> <ol> <li> <p><strong>Notion webhooks</strong> can be configured to fire on any page or database modifications, including things going from &ldquo;draft&rdquo; status to &ldquo;ready for review&rdquo; status or other more nuanced changes. We&rsquo;ve updated all our Request for Comment and other structured document databases to hook into these.</p> <p>The triggers drive a variety of workflows. On the simpler end, they can be used to assign additional reviewers to a document based on topic, and on the more complex end they can use a prompt to provide in-line comments with suggestions.</p> </li> <li> <p><strong>Slack messages</strong> can trigger responses or other workflows. This is dependent on which channels the bot has been invited into, and I subsequently made a trigger to capture channel creation to support auto-joining new channels to increase availability. (Channels that are auto-joined have a simple, default prompt that only responds when the bot is mentioned directly, to avoid wearing out its welcome.)</p> <p>We also got this working in private channels (not the auto-join component, just responding to messages). The hard part was ensuring that logging didn&rsquo;t capture any evidence of joining or responding in those private channels. Exfiltrating private messages would have been a very easy way to quickly lose trust.</p> </li> <li> <p><strong>Jira webhooks</strong> can be configured to trigger notifications on issue creation, updates, comments and so on.</p> </li> <li> <p><strong>Slack reacji</strong> were a later addition, where we added support for listening to Slack reacji in either a given channel or in <em>any</em> channel where the bot is present. This has made it possible to quickly implement a trigger where the <code>:jira:</code> reacji in any channel turns a thread into a ticket, routing it using centralized routing instructions (which are imperfect, but much better than the average individual&rsquo;s ability to route across many different teams).</p> </li> <li> <p><strong>Scheduled events</strong> are our most recent addition, allowing periodic triggers. I&rsquo;ve done two distinct v0 implementations, with the first relying on Slack workflows publishing into new channels as triggers, and the other relying on Github actions. Both work well enough, despite being a bit messy.</p> </li> </ol> <p>Now that we have a fairly large category, and have updated our <code>AGENTS.md</code> to reflect our existing integrations, adding new sorts of triggers is generally pasting documentation into Claude and waiting a few minutes.</p> <h2 id="trigger-authn-and-authz">Trigger authn and authz</h2> <p>When it comes to authentication and authorization, where possible, we&rsquo;ve adopted the OAuth2 approach of authorization tokens and SSL. However, that isn&rsquo;t possible for scenarios like Slack where we can&rsquo;t inject an authorization token into the requests. In those situations we rely on the application-specific security mechanisms provided by the underlying platform, such as <a href="https://docs.slack.dev/authentication/verifying-requests-from-slack/">Slack&rsquo;s mechanisms for verifying requests</a>.</p> <h2 id="how-is-it-working-whats-next">How is it working? 
What&rsquo;s next?</h2> <p>Our current set of triggers are working well. The next task here&ndash;probably a two sentence prompt and ten minutes of Claude code away&ndash;is adding a trigger on more generic webhooks that includes the whole incoming message into the context window. This hasn&rsquo;t been necessary yet, but is a useful catch-all to support future workflows.</p> <p>In particular, it would be straightforward to pair with an existing Zapier subscription, which would offload supporting certain esoteric triggers to their large catalog of integrations, while still getting the control and nuance of the internal agents.</p>Building an internal agent: Context window compactionhttps://lethain.com/agents-context-compaction/Fri, 26 Dec 2025 09:00:00 -0700https://lethain.com/agents-context-compaction/<p>Although my model of choice for most internal workflows remains <a href="https://platform.openai.com/docs/models/gpt-4.1">ChatGPT 4.1</a> for its predictable speed and high-adherence to instructions, even its 1,047,576-token context window can run out of space. When you run out of space in the context window, your agent either needs to give up, or it needs to compact that large context window into a smaller one. Here are our notes on implementing compaction.</p> <p><em>This is part of the <a href="https://lethain.com/agents-series/">Building an internal agent</a> series.</em></p> <h2 id="why-compaction-matters">Why compaction matters</h2> <p>Long-running workflows with many tool calls or user messages, along with any workflow dealing with large files, often run out of space in their context window. Although context window exhaustion is not relevant in most cases you&rsquo;ll find for internal agents, ultimately it&rsquo;s not possible to implement a robust, reliable agent without solving for this problem, and compaction is a straightforward solution.</p> <h2 id="how-we-implemented-it">How we implemented it</h2> <p>Initially, in the beautiful moment where we assumed compaction wouldn&rsquo;t be a relevant concern to our internal workflows, we implemented an extremely naive solution to compaction: if we ever ran out of tokens, we discarded older tool responses until we had more space, then continued. Because we rarely ran into compaction, the fact that this worked poorly wasn&rsquo;t a major issue, but eventually the inelegance began to weigh on me as we started dealing with more <a href="https://lethain.com/agents-large-files/">workflows with large files</a>.</p> <p>In our initial brainstorm on our 2nd iteration of compaction, I initially got anchored on this beautiful idea that compaction should be sequenced after <a href="https://lethain.com/agents-subagents/">implementing support for subagents</a>, but I was never able to ground that intuition in a concrete reason why it was necessary, and we implemented compaction without subagent support.</p> <p>The gist of our approach to compaction is:</p> <ol> <li> <p>After every user message (including tool responses), add a system message with the consumed and available tokens in the context window. In that system message, we also include the updated list of available <code>files</code> that can be read from</p> </li> <li> <p>User messages and tool responses greater than 10,000 tokens are exposed as a new &ldquo;virtual file&rdquo;, with only their first 1,000 tokens included in the context window. 
The agent must use file manipulation tools to read more than those first 1,000 tokens (both 1k and 10k are configurable values)</p> </li> <li> <p>Add a set of &ldquo;base tools&rdquo; that are always available to agents, specifically including the virtual file manipulation tools, as we&rsquo;d finally reached a point where most agents simply could not operate without a large number of mostly invisible internal tools. These tools were <code>file_read</code> which can read entire files, lines ranges within a file, or byte ranges within a file, and <code>file_regex</code> which is similar but performs a regex scan against a file up to a certain number of matches.</p> <p>Every use of a file is recorded in the <code>files</code> data, so the agent knows what has and hasn&rsquo;t been read into the context window (particularly relevant for preloaded files), along the lines of:</p> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-xml" data-lang="xml"><span style="display:flex;"><span><span style="color:#f92672">&lt;files&gt;</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">&lt;file</span> <span style="color:#a6e22e">id=</span><span style="color:#e6db74">&#39;a&#39;</span> <span style="color:#a6e22e">name=</span><span style="color:#e6db74">&#39;image.png&#39;</span> <span style="color:#a6e22e">size=</span><span style="color:#e6db74">&#39;32kb&#39;</span><span style="color:#f92672">&gt;</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">&lt;file_read</span> <span style="color:#f92672">/&gt;</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">&lt;/file&gt;</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">&lt;file</span> <span style="color:#a6e22e">id=</span><span style="color:#e6db74">&#39;a&#39;</span> <span style="color:#a6e22e">name=</span><span style="color:#e6db74">&#39;image.png&#39;</span> <span style="color:#a6e22e">size=</span><span style="color:#e6db74">&#39;32kb&#39;</span><span style="color:#f92672">&gt;</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">&lt;file_read</span> <span style="color:#a6e22e">start_line=</span><span style="color:#e6db74">10</span> <span style="color:#a6e22e">end_line=</span><span style="color:#e6db74">20</span> <span style="color:#f92672">/&gt;</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">&lt;/file&gt;</span> </span></span><span style="display:flex;"><span><span style="color:#f92672">&lt;/files&gt;</span> </span></span></code></pre></div><p>This was surprisingly annoying to implement cleanly, mostly because I came onto this idea after iteratively building the agent as a part-time project for several months. If I could start over, I would <em>start</em> with files as a core internal construct, rather than adding it on later.</p> </li> <li> <p>If a message pushed us over 80% (configurable value) of the model&rsquo;s available context window, use <a href="https://www.reddit.com/r/ClaudeAI/comments/1jr52qj/here_is_claude_codes_compact_prompt/">the compaction prompt that Reddit claims Claude Code uses</a>. 
The prompt isn&rsquo;t particularly special, it just already exists and seems pretty good</p> </li> <li> <p>After compacting, add the prior context window as a virtual file to allow the agent to retrieve pieces of context that it might have lost</p> </li> </ol> <p>Each of these steps is quite simple, but in combination they really do provide a fair amount of power for handling complex, prolonged workflows. Admittedly, we still have a configurable cap on the number of tools that can be called in a workflow (to avoid agents spinning out), but this means that agents dealing with large or complex data are much more likely to succeed usefully.</p> <h2 id="how-is-it-working--whats-next">How is it working? / What&rsquo;s next?</h2> <p>Whereas for most of our new internal agent features, there are obvious problems or iterations, this one feels like it&rsquo;s good enough to forget for a long, long time. There are two reasons for this: first, most of our workflows don&rsquo;t require large context windows, and, second, honestly this seems to work quite well.</p> <p>If context windows get significantly larger in the future, which I don&rsquo;t see too much evidence will happen at this moment in time, then we will simply increase some of the default values to use more tokens, but the core algorithm here seems good enough.</p>Building an internal agent: Progressive disclosure and handling large fileshttps://lethain.com/agents-large-files/Fri, 26 Dec 2025 08:00:00 -0700https://lethain.com/agents-large-files/<p>One of the most useful initial extensions I made to our workflows was injecting associated images into the context window automatically, to improve the quality of responses to tickets and messages that relied heavily on screenshots. This was quick and made the workflows significantly more powerful.</p> <p>More recently, there are a number of workflows attempting to operate on large complex files like PDFs or DOCXs, and the naive approach of shoving them into the context window hasn&rsquo;t worked particularly well. This post explains how we&rsquo;ve adapted the principle of progressive disclosure to allow our internal agents to work with large files.</p> <p><em>This is part of the <a href="https://lethain.com/agents-series/">Building an internal agent</a> series.</em></p> <h2 id="large-files-and-progressive-disclosure">Large files and progressive disclosure</h2> <p><a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">Progressive disclosure</a> is the practice of limiting what is added to the context window to the minimum necessary amount, and adding more detail over time as necessary.</p> <p>A good example of progressive disclosure is <a href="https://lethain.com/agents-skills/">how agent skills are implemented</a>:</p> <ol> <li>Initially, you only add the description of each available skill into the context window</li> <li>You then load the <code>SKILL.md</code> on demand</li> <li>The <code>SKILL.md</code> can specify other files to be further loaded as helpful</li> </ol> <p>In our internal use-case, we have skills for JIRA formatting, Slack formatting, and Notion formatting. Some workflows require all three, but the vast majority of workflows require at most one of these skills, and it&rsquo;s straightforward for the agent to determine which are relevant to a given task.</p> <p>File management is a particularly interesting progressive disclosure problem, because files are so helpful in many scenarios, but are also so very large. 
For example, requests for help in Slack are often along the lines of &ldquo;I need help with this login issue <screenshot>&rdquo;, which is impossible to solve without including that image into the context window. In other workflows, you might want to analyze a daily data export in a very large PDF which is 5-10MB as a PDF, but only 10-20kb of tables and text when extracted from the PDF. This gets even messier when the goal is to compare across multiple PDFs, each of which is quite large.</p> <h2 id="our-approach">Our approach</h2> <p>Our high-level approach to the large-file problem is as follows:</p> <ol> <li> <p>Always include metadata about available files in the prompt, similar to the list of available skills. This will look something like:</p> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">Files</span>: </span></span><span style="display:flex;"><span> - <span style="color:#f92672">id</span>: <span style="color:#ae81ff">f_a1</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">name</span>: <span style="color:#ae81ff">my_image.png</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">size</span>: <span style="color:#ae81ff">500</span>,<span style="color:#ae81ff">000</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">preloaded</span>: <span style="color:#66d9ef">false</span> </span></span><span style="display:flex;"><span> - <span style="color:#f92672">id</span>: <span style="color:#ae81ff">f_b3</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">name</span>: <span style="color:#ae81ff">...</span> </span></span></code></pre></div><p>The key thing is that each <code>id</code> is a reference that the agent is able to pass to tools. This allows it to operate on files without loading their context into the context window.</p> </li> <li> <p>Automatically preload the first N kb of files into the context window, as long as they are appropriate mimetypes for loading (png, pdf, etc). This is per-workflow configurable, and could be set as low as <code>0</code> if a given workflow didn&rsquo;t want to preload any files.</p> <p>I&rsquo;m still of mixed minds whether preloading is worth doing, as it takes some control away from the agent.</p> </li> <li> <p>Provide three tools for operating on files:</p> <ul> <li><code>load_file(id)</code> loads an entire file into the context window</li> <li><code>peek_file(id, start, stop)</code> loads a section of a file into the context window</li> <li><code>extract_file(id)</code> transforms PDFs, PPTs, DOCX and so on into simplified textual versions</li> </ul> </li> <li> <p>Provide a <code>large_files</code> skill which explains how and when to use the above tools to work with large files. Generally, it encourages using <code>extract_file</code> on any PDF, DOCX or PPT file that it wants to work with, and otherwise loading or peeking depending on the available space in the context window</p> </li> </ol> <p>This approach was quick to implement, and provides significantly more control to the agent to navigate a wide variety of scenarios involving large files. 
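For a concrete sense of what that tool surface might look like, here is a minimal sketch against an in-memory virtual file store; the <code>VirtualFile</code> container, the byte-range semantics for <code>peek_file</code>, and the <code>pypdf</code> fallback are assumptions for illustration rather than our exact implementation:
<pre><code class="language-python"># Hypothetical sketch of the three file tools over an in-memory virtual file store.
# VirtualFile, FILES, and the pypdf-based extraction are illustrative assumptions.
from dataclasses import dataclass
from io import BytesIO

@dataclass
class VirtualFile:
    id: str
    name: str
    data: bytes

# Populated by the handler with trigger attachments (Jira files, Slack uploads, etc.).
FILES: dict[str, VirtualFile] = {}

def load_file(id: str) -> bytes:
    """Return the entire file so the harness can add it to the context window."""
    return FILES[id].data

def peek_file(id: str, start: int, stop: int) -> bytes:
    """Return only a slice, for files too large to load wholesale."""
    return FILES[id].data[start:stop]

def extract_file(id: str) -> str:
    """Reduce document formats to plain text; only PDF shown here."""
    f = FILES[id]
    if f.name.lower().endswith(".pdf"):
        from pypdf import PdfReader  # pure Python, so it runs inside a Lambda
        reader = PdfReader(BytesIO(f.data))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    return f.data.decode("utf-8", errors="replace")

# The harness exposes these to the model as tools, keyed by name.
FILE_TOOLS = {"load_file": load_file, "peek_file": peek_file, "extract_file": extract_file}
</code></pre>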
It&rsquo;s also a good example of how the &ldquo;glue layer&rdquo; between LLMs and tools is actually a complex, sophisticated application layer rather than merely glue.</p> <h2 id="how-is-this-working">How is this working?</h2> <p>This has worked well. In particular, one of our internal workflows oriented around giving feedback about documents attached to a ticket, in comparison to other similar, existing documents. The workflow simply did not work at all prior to this approach, and now works fairly well without workflow-specific support for handling these sorts of large files, because the <code>large_files</code> skill handles that in a reusable fashion without workflow authors being aware of it.</p> <h2 id="what-next">What next?</h2> <p>Generally, this feels like a stand-alone set of functionality that doesn&rsquo;t require significant future investment, but there are three places where we will need to continue building:</p> <ol> <li>Until we add subagent support, our capabilities are constrained. In many cases, the ideal scenario of dealing with a large file is opening it in a subagent with a large context window, asking that subagent to summarize its contents, and then taking that summary into the primary agent&rsquo;s context window.</li> <li>It seems likely that <code>extract_file</code> should be modified to return a referencable, virtual <code>file_id</code> that is used with <code>peek_file</code> and <code>load_file</code> rather than returning contents directly. This would make for a more robust tool even when extracting from very large files. In practice, extracted content has always been quite compact.</li> <li>Finally, operating within an AWS Lambda requires pure Python packages, and ultimately pure Python is not very fast at parsing complex XML-derived document formats like DOCX. Ultimately, we could solve this by adding a layer to our lambda with the <code>lxml</code> dependencies in it, and at some point we might.</li> </ol> <p>Altogether, a very helpful extension for our internal workflows.</p>Building an internal agent: Adding support for Agent Skillshttps://lethain.com/agents-skills/Fri, 26 Dec 2025 07:00:00 -0700https://lethain.com/agents-skills/<p>When Anthropic introduced <a href="https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview">Agent Skills</a>, I was initially a bit skeptical of the problem they solved&ndash;can we just use prompts and tools?&ndash;but I&rsquo;ve subsequently come to appreciate them, and have explicitly implemented skills in our internal agent framework. This post talks about the problem skills solves, how the engineering team at Imprint implemented them, how well they&rsquo;ve worked for us, and where we might work with them next.</p> <p><em>This is part of the <a href="https://lethain.com/agents-series/">Building an internal agent</a> series.</em></p> <h2 id="what-problem-do-agent-skills-solve">What problem do Agent Skills solve?</h2> <p>Agent Skills are a series of techniques that solve two important workflow problems:</p> <ol> <li>use <a href="https://lethain.com/agents-large-files/">progressive disclosure</a> to more effectively utilize the constrained context windows, minimizing conflicting or unnecessary context in the context window</li> <li>provide reusable snippets for solving recurring problems to avoid individual workflow-creators having to solve recurring problems like e.g. 
Slack formatting or dealing with large files</li> </ol> <p>These problems initially seemed very insignificant when we started building out our internal workflows, but once the number of internal workflows reached into the dozens, both become difficult to manage. Without reusable snippets, I lost the leverage to improve all workflows at once, and without progressive disclosure the agents would get a vast amount of irrelevant content that could confuse them, particularly when it came to things like inconsistencies between Markdown and slack&rsquo;s <code>mrkdwn</code> formatting language, both of which are important to different tools used by our workflows.</p> <h2 id="how-we-implemented-agent-skills">How we implemented Agent Skills</h2> <p>As a disclaimer, I recognize that it&rsquo;s not <em>necessary</em> to implement agent skills, as you can integrate with e.g. <a href="https://platform.claude.com/docs/en/build-with-claude/skills-guide">Claude&rsquo;s Agent Skills support for APIs</a>. However, one of our design decisions is being largely platform agnostic, such that we can switch across model providers, and consequently we decided to implement skills within our framework.</p> <p>With that out of the way, we started implementing by reviewing the Agent Skills documentation at <a href="https://agentskills.io/home">agentskills.io</a>, and cloning their Python reference implementation <a href="https://github.com/agentskills/agentskills/tree/main/skills-ref">skills-ref</a> into our repository to make it accessible to Claude Code.</p> <p>The resulting implementation has these core features:</p> <ol> <li> <p>Skills are in <code>skills/</code> repository, with each skill consisting of its own sub-directory with a <code>SKILL.md</code></p> </li> <li> <p>Each skill is a Markdown file with metadata along these lines:</p> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>--- </span></span><span style="display:flex;"><span><span style="color:#f92672">name</span>: <span style="color:#ae81ff">pdf-processing</span> </span></span><span style="display:flex;"><span><span style="color:#f92672">description</span>: <span style="color:#ae81ff">Extract text and tables...</span> </span></span><span style="display:flex;"><span><span style="color:#f92672">metadata</span>: </span></span><span style="display:flex;"><span> <span style="color:#f92672">author</span>: <span style="color:#ae81ff">example-org</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">version</span>: <span style="color:#e6db74">&#34;1.0&#34;</span> </span></span><span style="display:flex;"><span>--- </span></span></code></pre></div></li> <li> <p>The list of available skills&ndash;including their description from metadata&ndash;is injected into the system prompt at the beginning of each workflow, and the <code>load_skills</code> tool is available to the agent to load the entire file into the context window.</p> </li> <li> <p>Updated workflow configuration to optionally specify required, allowed, and prohibited skills to modify the list of exposed skills injected into the system prompt.</p> <p>My guess is that requiring specific skills for a given workflow is a bit of an anti-pattern, &ldquo;just let the agent decide!&rdquo;, but it was trivial to implement and the sort of thing that I could imagine is useful in the future.</p> </li> <li> <p>Used the Notion MCP 
to retrieve all the existing prompts in our prompt repository, identify existing implicit skills in the prompts we had created, write those initial skills, and identify which Notion prompts to edit to eliminate the now redundant sections of their prompts.</p> </li> </ol> <p>Then we shipped it into production.</p> <h2 id="how-theyve-worked">How they&rsquo;ve worked</h2> <p>Humans make mistakes <em>all the time</em>. For example, I&rsquo;ve seen many dozens of JIRA tickets from humans that don&rsquo;t explain the actual problem they are having. People are used to that, and when a human makes a mistake, they blame the human. However, when agents make a mistake, a surprising percentage of people view it as a fundamental limitation of agents as a category, rather than thinking that, &ldquo;Oh, I should go update that prompt.&rdquo;</p> <p>Skills have been extremely helpful as the tool to continue refining down these edge cases where we&rsquo;ve relied on implicit behavior because specifying the exact behavior was simply overwhelming. As one example, we ask that every Slack message end with a link to the prompt that drove the response. That always worked, but the details of the formatting would vary in an annoying, distracting way: sometimes it would be the equivalent of <code>[title](link)</code>, sometimes <code>link</code>, sometimes <code>[link](link)</code>. With skills, it is now (almost always) consistent, without anyone thinking to include those instructions in their workflow prompts.</p> <p>Similarly, handling large files requires a series of different tools that benefit from In-Context Learning (aka ICL, which is a fancy term for including a handful of examples of correct and incorrect usage), which absolutely no one is going to add to their workflow prompt but is extremely effective at improving how the workflow uses those tools.</p> <p>For something that I was initially deeply skeptical about, I now wish I had implemented skills much earlier.</p> <h2 id="where-we-might-go-next">Where we might go next</h2> <p>While our skills implementation is working well today, there are a few opportunities I&rsquo;d like to take advantage of in the future:</p> <ol> <li> <p>Add a <code>load_subskill</code> skill to support files in <code>skills/{skill}/*</code> beyond the <code>SKILL.md</code>. So far, this hasn&rsquo;t been a major blocker, but as some skills get more sophisticated, the ability to split varied use-cases into distinct files would improve our ability to use skills for progressive disclosure</p> </li> <li> <p>One significant advantage that Anthropic has over us is their sandboxed Python interpreter, which allows skills to include entire Python scripts to be specified and run by tools. For example, a script for parsing PDFs might be included in a skill, which is extremely handy. We don&rsquo;t currently have a sandboxed interpreter handy for our agents, but this could, in theory anyway, significantly cut down on the number of custom skills we need to implement.</p> <p>At a minimum, it would do a much better job at operations that require reliable math versus relying on the LLM to do its best at performing math-y operations.</p> </li> </ol> <p>I think both of these are actually pretty straightforward to implement. The first is just a simple feature that Claude could implement in a few minutes. 
The latter <em>feels annoying</em> to implement, but could also be implemented in less than an hour by standing up a second lambda running Node.js with <a href="https://pyodide.org/en/stable/usage/index.html">Pyodide</a>, and exposing access to that lambda as a tool. It&rsquo;s just so inelegant for a Python process to call a Node.js process to run sandboxed Python that I haven&rsquo;t done it quite yet.</p>
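<p>Of the two, the first is concrete enough to sketch. A minimal <code>load_subskill</code> tool might look like the following, where the signature and the path check are assumptions about a feature we haven&rsquo;t built yet:</p> <pre><code class="language-python"># Hypothetical load_subskill tool: read a supporting file from skills/{skill}/
# beyond SKILL.md. The signature and error handling are illustrative assumptions.
from pathlib import Path

SKILLS_ROOT = Path("skills")

def load_subskill(skill: str, relative_path: str) -> str:
    """Return the contents of a supporting file inside a skill's directory."""
    target = (SKILLS_ROOT / skill / relative_path).resolve()
    # Refuse paths that escape the skills directory, e.g. "../../secrets.txt".
    if SKILLS_ROOT.resolve() not in target.parents:
        raise ValueError(f"{relative_path} escapes the skills directory")
    return target.read_text()
</code></pre>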