Building an internal agent: Context window compaction
Although my model of choice for most internal workflows remains GPT-4.1, for its predictable speed and high adherence to instructions, even its 1,047,576-token context window can run out of space. When it does, your agent either needs to give up or needs to compact that large context window into a smaller one. Here are our notes on implementing compaction.
This is part of the Building an internal agent series.
Why compaction matters
Long-running workflows with many tool calls or user messages, along with any workflow dealing with large files, often run out of space in their context window. Although context window exhaustion isn't relevant to most internal agent use cases, it's ultimately impossible to build a robust, reliable agent without solving this problem, and compaction is a straightforward solution.
How we implemented it
Initially, in the beautiful moment when we assumed compaction wouldn't be a relevant concern for our internal workflows, we implemented an extremely naive solution: if we ever ran out of tokens, we discarded older tool responses until we had enough space, then continued. Because we rarely ran into compaction, the fact that this worked poorly wasn't a major issue, but eventually the inelegance began to weigh on me as we started handling more workflows with large files.
When we brainstormed the second iteration of compaction, I initially got anchored on the beautiful idea that compaction should be sequenced after implementing support for subagents, but I was never able to ground that intuition in a concrete reason why it was necessary, and we ended up implementing compaction without subagent support.
The gist of our approach to compaction is:
1. After every user message (including tool responses), add a system message with the consumed and available tokens in the context window. In that system message, we also include the updated list of available `files` that can be read from.
2. User messages and tool responses greater than 10,000 tokens are exposed as a new “virtual file”, with only their first 1,000 tokens included in the context window. The agent must use file manipulation tools to read more than those first 1,000 tokens (both 1k and 10k are configurable values). There's a rough sketch of this bookkeeping after this list.
3. Add a set of “base tools” that are always available to agents, specifically including the virtual file manipulation tools, as we'd finally reached a point where most agents simply could not operate without a large number of mostly invisible internal tools. These tools were `file_read`, which can read entire files, line ranges within a file, or byte ranges within a file, and `file_regex`, which is similar but performs a regex scan against a file up to a certain number of matches. (A sketch of both tools also follows the list.)
4. Every use of a file is recorded in the `files` data, so the agent knows what has and hasn't been read into the context window (particularly relevant for preloaded files), along the lines of:

   ```
   <files>
     <file id='a' name='image.png' size='32kb'>
       <file_read />
     </file>
     <file id='a' name='image.png' size='32kb'>
       <file_read start_line=10 end_line=20 />
     </file>
   </files>
   ```

   This was surprisingly annoying to implement cleanly, mostly because I came to this idea after iteratively building the agent as a part-time project for several months. If I could start over, I would start with files as a core internal construct, rather than adding them on later.
5. If a message pushes us over 80% (a configurable value) of the model's available context window, use the compaction prompt that Reddit claims Claude Code uses. The prompt isn't particularly special; it just already exists and seems pretty good.
6. After compacting, add the prior context window as a virtual file to allow the agent to retrieve pieces of context that it might have lost. The final sketch after the list shows this trigger and recovery step.
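To make the first two steps concrete, here is a minimal sketch of the per-message bookkeeping: counting tokens, spilling oversized messages into virtual files, and emitting the status system message. The `ContextManager`/`VirtualFile` shape, the crude token counter, and the truncation marker are all illustrative assumptions, not our actual implementation.

```python
from dataclasses import dataclass, field

MODEL_CONTEXT_TOKENS = 1_047_576   # GPT-4.1's context window
VIRTUAL_FILE_THRESHOLD = 10_000    # configurable: messages above this become virtual files
PREVIEW_TOKENS = 1_000             # configurable: how much of the file stays inline


def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer (e.g. tiktoken): roughly 4 characters per token.
    return max(1, len(text) // 4)


@dataclass
class VirtualFile:
    file_id: str
    name: str
    content: str
    accesses: list = field(default_factory=list)  # e.g. "<file_read start_line=10 end_line=20 />"


@dataclass
class ContextManager:
    files: dict = field(default_factory=dict)  # file_id -> VirtualFile
    used_tokens: int = 0

    def add_message(self, text: str) -> list:
        """Return the messages to append: the (possibly truncated) message plus a status message."""
        if count_tokens(text) > VIRTUAL_FILE_THRESHOLD:
            file_id = f"vf-{len(self.files) + 1}"
            self.files[file_id] = VirtualFile(file_id, f"{file_id}.txt", text)
            # Keep roughly the first PREVIEW_TOKENS worth of text inline.
            text = text[: PREVIEW_TOKENS * 4] + f"\n[truncated: file_read {file_id} for the rest]"
        self.used_tokens += count_tokens(text)
        return [text, self.status_message()]

    def status_message(self) -> str:
        # The status message pairs the token budget with the current file listing,
        # including which parts of each file have already been read.
        listing = "\n".join(
            f"  <file id='{vf.file_id}' name='{vf.name}'>{''.join(vf.accesses)}</file>"
            for vf in self.files.values()
        )
        return (
            f"Context window: {self.used_tokens:,} tokens used, "
            f"{MODEL_CONTEXT_TOKENS - self.used_tokens:,} available.\n"
            f"<files>\n{listing}\n</files>"
        )
```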
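The two base tools are similarly small. The description above only specifies their behavior (whole files, line ranges, byte ranges, and capped regex scans), so the signatures below are assumptions layered on the same `VirtualFile` sketch.

```python
import re


def file_read(vf: VirtualFile, start_line=None, end_line=None, start_byte=None, end_byte=None) -> str:
    """Read a whole file, a line range, or a byte range, recording the access on the file."""
    # Signature and recording format are assumptions; only the behavior is from the post.
    if start_line is not None or end_line is not None:
        lines = vf.content.splitlines()
        result = "\n".join(lines[(start_line or 1) - 1 : end_line])
        vf.accesses.append(f"<file_read start_line={start_line} end_line={end_line} />")
    elif start_byte is not None or end_byte is not None:
        result = vf.content.encode()[start_byte or 0 : end_byte].decode(errors="replace")
        vf.accesses.append(f"<file_read start_byte={start_byte} end_byte={end_byte} />")
    else:
        result = vf.content
        vf.accesses.append("<file_read />")
    return result


def file_regex(vf: VirtualFile, pattern: str, max_matches: int = 20) -> list:
    """Scan a file line by line with a regex, returning up to max_matches matching lines."""
    vf.accesses.append(f"<file_regex pattern='{pattern}' max_matches={max_matches} />")
    matches = [line for line in vf.content.splitlines() if re.search(pattern, line)]
    return matches[:max_matches]
```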
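Finally, the compaction trigger itself: when a message pushes usage past the threshold, stash the full history as a virtual file, then replace the history with a summary produced by the compaction prompt. The `summarize` callable stands in for that LLM call; again, this is a sketch under the assumptions above, not our production code.

```python
COMPACTION_THRESHOLD = 0.8  # configurable: fraction of the window that triggers compaction


def maybe_compact(ctx: ContextManager, history: list, summarize) -> list:
    """If the context is over threshold, compact it and keep the old history as a virtual file."""
    if ctx.used_tokens <= COMPACTION_THRESHOLD * MODEL_CONTEXT_TOKENS:
        return history

    # Preserve the pre-compaction context as a virtual file so the agent can
    # recover details that the summary inevitably drops.
    file_id = f"vf-{len(ctx.files) + 1}"
    full_history = "\n\n".join(history)
    ctx.files[file_id] = VirtualFile(file_id, "pre-compaction-context.txt", full_history)

    # `summarize` is a stand-in for an LLM call using the compaction prompt.
    summary = summarize(full_history)
    compacted = [summary, ctx.status_message()]
    ctx.used_tokens = sum(count_tokens(m) for m in compacted)
    return compacted
```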
Each of these steps is quite simple, but in combination they provide a fair amount of power for handling complex, prolonged workflows. Admittedly, we still have a configurable cap on the number of tool calls per workflow (to avoid agents spinning out), but with compaction in place, agents dealing with large or complex data are much more likely to succeed usefully.
How is it working? / What’s next?
Whereas most of our new internal agent features have obvious problems or next iterations, this one feels good enough to forget about for a long, long time. There are two reasons for this: first, most of our workflows don't require large context windows, and second, honestly, this seems to work quite well.
If context windows get significantly larger in the future, which I don't see much evidence of happening at the moment, we will simply increase some of the default values to use more tokens, but the core algorithm here seems good enough.