Context Window Engineering
A 200K token window sounds enormous — until you start filling it. Here is how coding agents budget every token.
The Context Budget Problem
Think of the context window as a fixed-size workspace. Every piece of information the agent needs must fit inside it: the instructions it was given, the user's request, every file it reads, every search result, and its own reasoning. A 200K token window seems vast, but the overhead adds up fast. The system prompt alone can consume 10,000 tokens. Tool definitions — the JSON schemas describing each capability — take another 5,000. A single file read might be 3,000-5,000 tokens. After 20-30 tool calls, you are brushing up against the ceiling.
Tool results dominate the budget. A single grep across a large codebase can return thousands of lines. A file read of a 500-line source file is roughly 4,000 tokens. The agent must be strategic about what it retrieves and when.
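The arithmetic above can be sketched as a small calculator. The 10,000 and 5,000 token overheads and the 4,000-token file read are the figures quoted above; the 2,000 tokens of per-step reasoning and tool-call overhead is an assumption added purely for illustration.

```python
WINDOW = 200_000          # total context window, in tokens
SYSTEM_PROMPT = 10_000    # figure quoted above
TOOL_DEFINITIONS = 5_000  # figure quoted above
TOOL_RESULT = 4_000       # roughly one 500-line file read
STEP_OVERHEAD = 2_000     # ASSUMPTION: reasoning plus the tool call itself


def calls_until_full(window: int = WINDOW) -> int:
    """Return how many tool calls fit before the window is exhausted."""
    fixed = SYSTEM_PROMPT + TOOL_DEFINITIONS
    per_call = TOOL_RESULT + STEP_OVERHEAD
    return max(0, (window - fixed) // per_call)


print(calls_until_full())  # prints 30
```

Under these assumptions the window holds about 30 tool calls, which lines up with the 20-30 range mentioned above.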
What Fills the Context
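The context window is a single, ordered sequence of messages. Every API call sends the full conversation history (up to the limit) to the model. From top to bottom, that sequence typically contains the system prompt and tool definitions, followed by the user's messages, the model's reasoning, and tool calls with their results.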
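A minimal sketch of that idea follows. The `Message` and `Conversation` classes and the role names are hypothetical and do not mirror any particular API; the point is only that each request is rebuilt from the whole history, so the payload grows with every turn.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    role: str      # hypothetical roles: "system", "user", "assistant", "tool"
    content: str


@dataclass
class Conversation:
    messages: list[Message] = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.messages.append(Message(role, content))

    def build_request(self) -> list[dict[str, str]]:
        """Every request carries the whole history, oldest first."""
        return [{"role": m.role, "content": m.content} for m in self.messages]


convo = Conversation()
convo.add("system", "You are a coding agent.")
convo.add("user", "Fix the login bug.")
first = convo.build_request()

convo.add("assistant", "Searching for handleLogin.")
convo.add("tool", "auth.ts:142: function handleLogin() {")
second = convo.build_request()

print(len(first), len(second))  # prints 2 4
```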
Retrieval vs. Loading
A naive approach would be to dump the entire codebase into the context window at the start. But even a modest project with 50 source files would consume the entire 200K budget before the agent does any work. Instead, agents treat the codebase like a database — they search for what they need on demand.
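To make the trade-off concrete, here is a rough sketch. It assumes the earlier estimate of about 4,000 tokens per 500-line file, and it uses a hypothetical in-memory `project` dictionary in place of a real file system. The `grep` helper is an illustration, not any agent's actual tool.

```python
import re

TOKENS_PER_FILE = 4_000   # rough size of a 500-line file, as estimated earlier
WINDOW = 200_000


def cost_of_loading_everything(file_count: int) -> int:
    """Tokens needed to load every file up front."""
    return file_count * TOKENS_PER_FILE


def grep(files: dict[str, str], pattern: str) -> list[str]:
    """Return 'path:line_number: text' for each matching line."""
    regex = re.compile(pattern)
    hits = []
    for path, text in files.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append(f"{path}:{number}: {line.strip()}")
    return hits


# Hypothetical two-file project used only for this demonstration.
project = {
    "auth.ts": "import x\nfunction handleLogin() {\n}\n",
    "session.ts": "const s = 1\nhandleLogin()\n",
}

print(cost_of_loading_everything(50) >= WINDOW)  # prints True
print(grep(project, r"handleLogin"))
```

Loading all 50 files costs the full 200,000 tokens, while the search returns only the two matching lines.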
The Search Toolkit
Compression Strategies
When the context approaches its limit, something has to give. The system cannot just truncate — that would lose recent work. Instead, it applies selective compression: keeping high-value content and summarizing or evicting low-value content.
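One way selective compression could work is sketched below. The `VALUE` ranking, the protected kinds, and the 200-token summary size are all assumptions chosen for illustration; a real harness may rank content differently.

```python
from dataclasses import dataclass

# ASSUMPTION: illustrative ranking; a lower number is compressed first.
VALUE = {"tool_result": 0, "reasoning": 1, "edit": 2, "user": 3, "system": 4}
PROTECTED = {"system", "user"}   # never compressed in this sketch
SUMMARY_TOKENS = 200             # ASSUMPTION: size of a summary stub


@dataclass
class Item:
    kind: str
    tokens: int
    summarized: bool = False


def compress(items: list[Item], limit: int) -> list[Item]:
    """Summarize lowest-value, oldest items in place until the total fits."""
    def total() -> int:
        return sum(i.tokens for i in items)

    # sorted() is stable, so older items come first within each kind.
    candidates = sorted(
        (i for i in items if i.kind not in PROTECTED and i.tokens > SUMMARY_TOKENS),
        key=lambda i: VALUE[i.kind],
    )
    for item in candidates:
        if total() <= limit:
            break
        item.tokens = SUMMARY_TOKENS
        item.summarized = True
    return items


history = [
    Item("system", 10_000),
    Item("tool_result", 5_000),
    Item("reasoning", 1_000),
    Item("tool_result", 4_000),
    Item("user", 500),
]
compress(history, limit=16_000)
print([i.tokens for i in history])  # prints [10000, 200, 1000, 4000, 500]
```

Only the oldest tool result is summarized here, because that single step is enough to bring the total under the limit.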
Content Value Hierarchy
The "Lost in the Middle" Problem
Research has shown that LLMs tend to follow a U-shaped performance curve over long contexts: they use content at the very beginning and the very end most reliably, and are more likely to overlook content in the middle. This is not a bug in any one model; it is generally attributed to how positional encodings and attention patterns behave in transformer architectures.
This is why the system prompt is placed first (always attended to) and why recent user messages are kept at the end. Middle content — old search results, intermediate reasoning — is the safest to compress because the model was already paying less attention to it.
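The positional rule can be sketched as a helper that picks which messages are candidates for compression. The `keep_head` and `keep_tail` defaults are hypothetical values, not numbers taken from any real harness.

```python
def compressible_indices(
    message_count: int, keep_head: int = 1, keep_tail: int = 4
) -> list[int]:
    """Indices of middle messages, leaving the head and tail untouched."""
    start = keep_head
    stop = message_count - keep_tail
    # max() guards against short conversations where head and tail overlap.
    return list(range(start, max(start, stop)))


print(compressible_indices(10))  # prints [1, 2, 3, 4, 5]
print(compressible_indices(3))   # prints []
```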
Token Counting and Budget Decisions
The agent harness (the code running the agent loop) tracks cumulative token usage after every API call. This is not guesswork — the API response includes exact token counts for both the input (context sent) and output (tokens generated). The harness uses these numbers to decide when compression is needed.
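A minimal sketch of that bookkeeping is shown below. The `Usage` shape, the `BudgetTracker` class, and the 80% trigger are assumptions for illustration; check your provider's response format for the actual field names.

```python
from dataclasses import dataclass

WINDOW = 200_000
COMPRESS_AT = 0.8   # ASSUMPTION: hypothetical trigger, as a fraction of the window


@dataclass
class Usage:
    input_tokens: int    # tokens in the context that was sent
    output_tokens: int   # tokens the model generated


class BudgetTracker:
    def __init__(self, window: int = WINDOW, threshold: float = COMPRESS_AT) -> None:
        self.window = window
        self.threshold = threshold
        self.context_tokens = 0

    def record(self, usage: Usage) -> None:
        # The input count already covers the full history sent on this call;
        # the output becomes part of the history on the next call.
        self.context_tokens = usage.input_tokens + usage.output_tokens

    def needs_compression(self) -> bool:
        return self.context_tokens >= self.window * self.threshold


tracker = BudgetTracker()
tracker.record(Usage(input_tokens=120_000, output_tokens=2_000))
early = tracker.needs_compression()
tracker.record(Usage(input_tokens=165_000, output_tokens=1_500))
late = tracker.needs_compression()
print(early, late)  # prints False True
```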
Auto-Memory: Escaping the Context Window
The context window resets between conversations. Everything the agent learned (project conventions, file locations, common patterns) is lost. Auto-memory solves this by persisting important information to files on disk (such as CLAUDE.md or memory files under .claude/) that get loaded into future sessions.
# Project Notes
- Test runner: Vitest (not Jest)
- API routes: src/server/api/
- Style: named exports preferred
This is a form of long-term memory that trades a small, fixed amount of context space (the memory file contents in the system prompt) for potentially thousands of tokens that would otherwise be spent re-discovering the same facts every session.
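The mechanism can be sketched in a few lines. The helper names `load_memory`, `remember`, and `build_system_prompt` are hypothetical, and the sketch does not describe how any particular agent manages its memory files. It only shows the general pattern of appending facts to a file and prepending that file to the next session's system prompt.

```python
import tempfile
from pathlib import Path


def load_memory(path: Path) -> str:
    """Return the memory file contents, or an empty string if absent."""
    return path.read_text(encoding="utf-8") if path.exists() else ""


def remember(path: Path, fact: str) -> None:
    """Append one fact as a bullet, skipping exact duplicates."""
    existing = load_memory(path)
    bullet = f"- {fact}"
    if bullet in existing.splitlines():
        return
    if not existing:
        existing = "# Project Notes\n"
    path.write_text(existing.rstrip("\n") + "\n" + bullet + "\n", encoding="utf-8")


def build_system_prompt(base: str, path: Path) -> str:
    """Attach persisted notes to the base instructions for a new session."""
    memory = load_memory(path)
    return f"{base}\n\n{memory}" if memory else base


with tempfile.TemporaryDirectory() as tmp:
    memory_file = Path(tmp) / "CLAUDE.md"
    remember(memory_file, "Test runner: Vitest (not Jest)")
    remember(memory_file, "Test runner: Vitest (not Jest)")  # ignored duplicate
    remember(memory_file, "API routes: src/server/api/")
    prompt = build_system_prompt("You are a coding agent.", memory_file)

print(prompt)
```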
Frequently Asked Questions
What happens when the context window is completely full?
The harness is designed never to let it reach absolute capacity: it triggers compression well before the limit. If the context still grows too large, the system aggressively summarizes or drops the oldest, lowest-value tool results. The system prompt and recent turns are always preserved. In extreme cases, the agent may suggest starting a new conversation to get a fresh context window.
Does a bigger context window always mean better performance?
Not necessarily. Larger windows enable handling bigger tasks, but they also increase latency (more tokens to process means slower responses) and cost (API pricing is per-token). The "lost in the middle" effect also means that stuffing more content in does not guarantee the model will actually use it. Targeted retrieval of relevant content often outperforms dumping everything into a huge window.
How does context compression differ from just truncating old messages?
Truncation is blind — it drops content regardless of importance. Compression is selective. It might replace a 5,000-token grep output with a 200-token summary: "searched for handleLogin, found matches in auth.ts (line 142), session.ts (line 87), and middleware.ts (line 23)." The key facts are preserved; the raw output is not. This is typically done by asking the LLM itself to summarize the older content.
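As a rough illustration of the output side, the sketch below collapses raw grep output into a one-line summary like the one quoted above. It is a deterministic stand-in, not the LLM-based summarization the answer describes, and the `summarize_grep` name and the `path:line:text` input format are assumptions.

```python
import re


def summarize_grep(query: str, raw_output: str) -> str:
    """Collapse 'path:line:text' grep output to the first hit per file."""
    first_hit: dict[str, int] = {}
    for line in raw_output.splitlines():
        match = re.match(r"^([^:]+):(\d+):", line)
        if match and match.group(1) not in first_hit:
            first_hit[match.group(1)] = int(match.group(2))
    if not first_hit:
        return f"searched for {query}, found no matches"
    places = ", ".join(f"{path} (line {n})" for path, n in first_hit.items())
    return f"searched for {query}, found matches in {places}"


raw = (
    "auth.ts:142:export function handleLogin() {\n"
    "auth.ts:180:  handleLogin()\n"
    "session.ts:87:import { handleLogin }\n"
    "middleware.ts:23:handleLogin(req)\n"
)
summary = summarize_grep("handleLogin", raw)
print(summary)
```

The summary keeps the file names and line numbers the agent would need to re-open a match, and discards the matched source text.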
Can the agent re-read a file it already read but that got compressed away?
Yes, and it does this regularly. If the agent needs to reference file contents that were evicted during compression, it simply reads the file again. The file on disk has not changed (unless the agent edited it), so the re-read is cheap in terms of wall-clock time. The tradeoff is that it costs context tokens again — but by that point, the compression freed up space to accommodate the re-read.
Why not just use RAG (Retrieval Augmented Generation) instead of tool-based search?
Traditional RAG systems pre-embed documents into a vector database and retrieve by semantic similarity. This works well for documentation lookup but poorly for code search — you often need exact matches (function names, error strings) rather than semantic similarity. Grep-based search is more precise for code. That said, some agents use hybrid approaches: semantic search for discovering relevant concepts, exact search for pinpointing specific code.