Context Window Engineering

A 200K token window sounds enormous — until you start filling it. Here is how coding agents budget every token.

The Context Budget Problem

Think of the context window as a fixed-size workspace. Every piece of information the agent needs must fit inside it: the instructions it was given, the user's request, every file it reads, every search result, and its own reasoning. A 200K token window seems vast, but the overhead adds up fast. The system prompt alone can consume 10,000 tokens. Tool definitions — the JSON schemas describing each capability — take another 5,000. A single file read might be 3,000-5,000 tokens. After 20-30 tool calls, you are brushing up against the ceiling.

System Prompt: ~10K tokens (fixed)

Tool Definitions: ~5K tokens (fixed)

User Messages: ~1-2K tokens (variable)

Assistant Reasoning: ~20K tokens (grows)

Tool Calls: ~2-10K tokens (grows)

Tool Results: ~50-150K tokens (biggest consumer)

Remaining Space: whatever is left

Tool results dominate the budget. A single grep across a large codebase can return thousands of lines. A 500-line source file is roughly 4,000-6,000 tokens, depending on line length. The agent must be strategic about what it retrieves and when.
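As a rough sketch of this arithmetic, the snippet below uses the approximate figures from this section (10K system prompt, 5K tool definitions, about 4K per tool result) to estimate how many tool-call turns fit before the window fills. The per-turn reasoning and tool-call costs are illustrative assumptions, not measurements.

```python
WINDOW = 200_000          # context window size in tokens
SYSTEM_PROMPT = 10_000    # fixed overhead (figure from the text)
TOOL_DEFINITIONS = 5_000  # fixed overhead (figure from the text)
USER_MESSAGE = 1_500      # assumed: midpoint of the 1-2K range

# Assumed per-turn costs; real sessions vary widely.
REASONING_PER_TURN = 1_000
TOOL_CALL_PER_TURN = 300
TOOL_RESULT_PER_TURN = 4_000


def turns_until_full(window: int = WINDOW) -> int:
    """Return how many complete tool-call turns fit in the window."""
    fixed = SYSTEM_PROMPT + TOOL_DEFINITIONS + USER_MESSAGE
    per_turn = REASONING_PER_TURN + TOOL_CALL_PER_TURN + TOOL_RESULT_PER_TURN
    return max(0, (window - fixed) // per_turn)


print(turns_until_full())
```

With these assumed numbers the window fills after a few dozen turns. Larger tool results shrink that number quickly.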

Interactive Context Window Simulation

[Interactive simulation: a 30-step agent session filling a 200K-token context window. As the window fills, older tool results are compressed, while recent edits and user messages stay intact.]

What Fills the Context

The context window is a single, ordered sequence of messages. Every API call sends the full conversation history (up to the limit) to the model. Here is what that sequence typically contains, from top to bottom:

System Message (Fixed)

System prompt: Behavioral instructions, safety rules, coding style preferences, environment details (OS, shell, working directory). This is always first and always present.

Tool definitions: JSON schemas for every available tool — Read, Edit, Write, Bash, Grep, Glob, and any MCP tools. Each schema describes parameters, constraints, and usage notes.

CLAUDE.md contents: Project-specific instructions loaded from disk files. These are injected into the system prompt so the agent "remembers" project conventions.

~10-15K tokens

User Messages

The task description. Often short: "fix the login bug" or "add pagination to the API." Occasionally long if the user pastes error logs or specifications.

~200-2K tokens

Assistant Reasoning

The model's thinking: analyzing the task, deciding which files to look at, forming a plan. This accumulates over multiple turns. Extended thinking (if enabled) adds more.

~500-2K per turn

Tool Call

A structured request: {"tool": "Grep", "pattern": "handleLogin", "path": "src/"}. Compact — usually a few hundred tokens at most.

~100-500 tokens

Tool Result

The output returned by the tool. This is where the budget gets eaten. A grep that matches in many files can return an enormous result. Reading a large source file costs thousands of tokens. Even a simple git diff on a meaningful changeset can be 5K+ tokens.

~500-10K+ tokens each

...this cycle repeats 10-50+ times per session...
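The sequence above can be sketched as an ordered list of messages. This is a minimal illustration, not any particular API's format: the `Message` class and role names are hypothetical, and the 4-characters-per-token estimate is only a rough rule of thumb standing in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str     # hypothetical labels: "system", "user", "assistant", "tool_call", "tool_result"
    content: str


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def context_usage(history: list[Message]) -> dict[str, int]:
    """Sum the estimated tokens per role across the whole history."""
    usage: dict[str, int] = {}
    for message in history:
        usage[message.role] = usage.get(message.role, 0) + estimate_tokens(message.content)
    return usage


history = [
    Message("system", "You are a coding agent. " * 100),
    Message("user", "fix the login bug"),
    Message("assistant", "I will search for handleLogin first."),
    Message("tool_call", '{"tool": "Grep", "pattern": "handleLogin", "path": "src/"}'),
    Message("tool_result", "src/auth.ts:142: function handleLogin() {\n" * 200),
]

usage = context_usage(history)
print(usage, sum(usage.values()))
```

Even in this toy history the tool result outweighs every other entry, which is the pattern the size estimates above describe.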

Retrieval vs. Loading

A naive approach would be to dump the entire codebase into the context window at the start. But even a modest project with 50 source files at 2,000-3,000 tokens each would take 100K-150K tokens, half or more of the 200K budget, before the agent does any work, and most real codebases are far larger. Instead, agents treat the codebase like a database: they search for what they need on demand.

Preloading (impractical)

Files: src/auth.ts (2K), src/api.ts (1.8K), src/db.ts (3K), src/utils.ts (1.5K), src/routes.ts (2.5K), ...47 more files

Action: stuff all into context

Result: context full (no room to think)

On-Demand Retrieval (how it works)

Files: src/auth.ts, src/api.ts, src/db.ts, src/utils.ts, src/routes.ts, ...47 more files

Action: grep "handleLogin" → 3 matches

Result: only relevant code loaded
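To make the contrast concrete, here is a minimal sketch of on-demand retrieval over a tiny, made-up in-memory codebase. The file contents and the `grep` helper are hypothetical stand-ins for a real search tool; the point is only that the matching lines are much smaller than the files they come from.

```python
import re

# Hypothetical in-memory "codebase": path -> file contents.
CODEBASE = {
    "src/auth.ts": "import { db } from './db';\nexport function handleLogin(user) {\n  return db.find(user);\n}\n",
    "src/api.ts": "import { handleLogin } from './auth';\nexport const login = handleLogin;\n",
    "src/utils.ts": "export function slugify(s) {\n  return s.toLowerCase();\n}\n",
}


def grep(pattern: str, files: dict[str, str]) -> list[str]:
    """Return 'path:line_number: text' for every line matching the pattern."""
    regex = re.compile(pattern)
    hits = []
    for path, text in files.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append(f"{path}:{number}: {line.strip()}")
    return hits


matches = grep("handleLogin", CODEBASE)
preload_chars = sum(len(text) for text in CODEBASE.values())  # cost of loading everything
retrieval_chars = sum(len(hit) for hit in matches)            # cost of loading only the hits
print(len(matches), preload_chars, retrieval_chars)
```

On a three-file toy project the saving is modest. On a real codebase the gap between "every file" and "the matching lines" grows with the number of files that do not match.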

The Search Toolkit

Grep

Pattern search across files. Like a search engine for code — find every reference to a function, variable, or error message without loading entire files.

Cost: proportional to number of matches

Glob

Find files by name pattern. Useful for discovering project structure — "show me all test files" or "find all TypeScript files in src/". Returns paths, not content.

Cost: very low (just file paths)

Read

Load a specific file or a range of lines within it. Supports reading just lines 50-80 instead of the whole file — surgical precision when you know exactly what you need.

Cost: proportional to lines read

Agent (sub-agent)

Delegate a research task to a sub-agent that gets its own context window. The sub-agent searches, reads, analyzes — then returns a compact summary to the parent. Isolation prevents one research task from polluting the main context.

Cost: only the returned summary
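The two cheapest tools in this toolkit can be sketched in a few lines. The helpers below are hypothetical, not the actual tool implementations: `glob_paths` returns only matching paths, and `read_lines` returns a numbered slice of a file so the agent pays only for the lines it asked for.

```python
from fnmatch import fnmatch


def glob_paths(pattern: str, paths: list[str]) -> list[str]:
    """Return the paths matching a shell-style pattern (names only, no content)."""
    return [path for path in paths if fnmatch(path, pattern)]


def read_lines(text: str, start: int, end: int) -> str:
    """Return lines start..end (1-indexed, inclusive), each prefixed with its number."""
    selected = text.splitlines()[start - 1:end]
    return "\n".join(f"{start + i}: {line}" for i, line in enumerate(selected))


paths = ["src/auth.ts", "src/auth.test.ts", "src/api.ts", "README.md"]
source = "\n".join(f"line {n}" for n in range(1, 101))  # stand-in for a 100-line file

print(glob_paths("*.test.ts", paths))
print(read_lines(source, 50, 52))
```

Prefixing each line with its number is a design choice: it lets a later edit refer to an exact location without re-reading the file.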

Compression Strategies

When the context approaches its limit, something has to give. The system cannot just truncate: blind truncation could drop content the agent still needs, such as the original task or a recent edit. Instead, it applies selective compression, keeping high-value content and summarizing or evicting low-value content.

Before Compression

System Prompt (10K)

User: "fix login bug" (200)

Plan: check auth module... (800)

Grep "handleLogin" (150)

Grep results: 45 matches across 12 files... (8K)

Analysis: the bug is in auth.ts... (600)

Read src/auth.ts (100)

Full file contents: 350 lines... (4K)

Found the issue on line 142... (500)

Read src/session.ts (100)

Full file contents: 280 lines... (3.2K)

Cross-referencing session handling... (700)

Edit src/auth.ts line 142 (300)

Edit applied successfully (100)

compression

After Compression

System Prompt (10K)

User: "fix login bug" (200)

Summary: searched for handleLogin, found 45 matches. Analyzed auth.ts and session.ts. (300)

Found the issue on line 142... (500)

Edit src/auth.ts line 142 (300)

Edit applied successfully (100)

Content Value Hierarchy

Preserved: System prompt, user instructions, recent edits, file modifications still in progress

Summarized: Older assistant reasoning, intermediate search results that led to decisions

Evicted: Large grep outputs already acted upon, file reads of files no longer being edited, verbose command output
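The before-and-after example in this section can be reproduced with a small sketch. The `Item` class, the `keep` flag, and the fixed 300-token summary are assumptions made for illustration; a real harness would decide what to keep with its own rules and would typically ask the model to write the summary.

```python
from dataclasses import dataclass


@dataclass
class Item:
    label: str
    tokens: int
    keep: bool  # True for content in the "Preserved" tier


def compress(history: list[Item], summary_tokens: int = 300) -> list[Item]:
    """Drop non-preserved items, leaving one summary where the first one stood."""
    result: list[Item] = []
    summarized = False
    for item in history:
        if item.keep:
            result.append(item)
        elif not summarized:
            # Placeholder for a model-written summary of everything dropped.
            result.append(Item("Summary of earlier steps", summary_tokens, True))
            summarized = True
    return result


history = [
    Item("System Prompt", 10_000, True),
    Item('User: "fix login bug"', 200, True),
    Item("Plan: check auth module", 800, False),
    Item('Grep "handleLogin"', 150, False),
    Item("Grep results", 8_000, False),
    Item("Analysis: the bug is in auth.ts", 600, False),
    Item("Read src/auth.ts", 100, False),
    Item("File contents: src/auth.ts", 4_000, False),
    Item("Found the issue on line 142", 500, True),
    Item("Read src/session.ts", 100, False),
    Item("File contents: src/session.ts", 3_200, False),
    Item("Cross-referencing session handling", 700, False),
    Item("Edit src/auth.ts line 142", 300, True),
    Item("Edit applied successfully", 100, True),
]

before = sum(item.tokens for item in history)
after = sum(item.tokens for item in compress(history))
print(before, after)
```

Using the token counts from the example, the history shrinks from 28,750 to 11,400 tokens, and almost all of the saving comes from the two file reads and the grep output.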

The "Lost in the Middle" Problem

Research has shown that LLMs often exhibit a U-shaped performance curve over long contexts: they use information at the very beginning and the very end most reliably, and are more likely to miss information in the middle. This is not an implementation bug. It is usually attributed to how positional encodings and attention patterns behave in transformer architectures, and its strength varies from model to model.

Beginning (high attention): system prompt, core instructions

Middle (lower attention): older tool results, mid-session reasoning

End (high attention): recent messages, latest user input

This is why the system prompt is placed first, where it is reliably attended to, and why recent user messages are kept at the end. Middle content, such as old search results and intermediate reasoning, is the safest to compress because the model was already paying less attention to it.

Token Counting and Budget Decisions

The agent harness (the code running the agent loop) tracks cumulative token usage after every API call. This is not guesswork — the API response includes exact token counts for both the input (context sent) and output (tokens generated). The harness uses these numbers to decide when compression is needed.

Thresholds, as a share of the 200,000-token window:

70%: monitor

85%: compress

95%: aggressive eviction

Track: after each API round-trip, the harness records input_tokens and output_tokens from the response.

Evaluate: if input_tokens exceeds a threshold (say 85% of the window), trigger compression.

Compress: summarize or remove old tool results. Preserve system prompt, user messages, recent edits, and the last few turns intact.

Continue: the next API call sends the compressed history. The model sees a smaller context with all critical information preserved.
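The loop above can be sketched as a single decision function. The 70%, 85%, and 95% thresholds come from this section; treating each as "at or above", and counting `output_tokens` toward usage because the model's reply is sent back as input on the next call, are assumptions of this sketch. The function name and return values are hypothetical.

```python
WINDOW = 200_000  # context window size in tokens


def budget_action(input_tokens: int, output_tokens: int, window: int = WINDOW) -> str:
    """Map current usage to 'ok', 'monitor', 'compress', or 'evict'."""
    # The next request carries the previous input plus the newly generated output.
    used = input_tokens + output_tokens
    ratio = used / window
    if ratio >= 0.95:
        return "evict"
    if ratio >= 0.85:
        return "compress"
    if ratio >= 0.70:
        return "monitor"
    return "ok"


for sample in [(100_000, 1_000), (140_000, 2_000), (170_000, 1_000), (190_000, 1_000)]:
    print(sample, budget_action(*sample))
```

Checking from the highest threshold down keeps the function simple: the first branch that matches is the most severe action that applies.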

Auto-Memory: Escaping the Context Window

The context window resets between conversations. Everything the agent learned — project conventions, file locations, common patterns — is lost. Auto-memory solves this by persisting important information to disk files (like CLAUDE.md or .claude/ memory files) that get loaded into future sessions.

Session 1

Agent discovers: "This project uses Vitest, not Jest"

Agent discovers: "API routes are in src/server/api/"

Agent discovers: "The team prefers named exports"

writes to CLAUDE.md

Disk (Persistent)


# Project Notes

- Test runner: Vitest (not Jest)

- API routes: src/server/api/

- Style: named exports preferred

Session 2

Loaded from CLAUDE.md into system prompt

Agent immediately knows project conventions

No need to re-discover through expensive searches

This is a form of long-term memory that trades a small, fixed amount of context space (the memory file contents in the system prompt) for potentially thousands of tokens that would otherwise be spent re-discovering the same facts every session.
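A minimal sketch of this persistence loop, assuming a single Markdown memory file: `remember` appends a note as a bullet (skipping duplicates), and `build_system_prompt` appends the file's contents to a base prompt at the start of the next session. Both helper names are hypothetical, and this is not a description of how any specific agent implements its memory.

```python
from pathlib import Path
import tempfile


def remember(memory_file: Path, note: str) -> None:
    """Append a note as a bullet, skipping notes already recorded."""
    if memory_file.exists():
        existing = memory_file.read_text(encoding="utf-8")
    else:
        existing = "# Project Notes\n"
    bullet = f"- {note}"
    if bullet not in existing.splitlines():
        existing = existing.rstrip("\n") + "\n" + bullet + "\n"
    memory_file.write_text(existing, encoding="utf-8")


def build_system_prompt(base_prompt: str, memory_file: Path) -> str:
    """Append persisted notes to the base prompt for a new session."""
    if not memory_file.exists():
        return base_prompt
    return base_prompt + "\n\n" + memory_file.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    memory = Path(tmp) / "CLAUDE.md"
    # Session 1: the agent records what it discovered.
    remember(memory, "Test runner: Vitest (not Jest)")
    remember(memory, "API routes: src/server/api/")
    remember(memory, "Test runner: Vitest (not Jest)")  # duplicate, ignored
    # Session 2: the notes are loaded before any work starts.
    prompt = build_system_prompt("You are a coding agent.", memory)

print(prompt)
```

Skipping duplicates matters because the memory file is paid for in every future session: each repeated note is a fixed cost with no benefit.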

Frequently Asked Questions

What happens when the context window is completely full?

The harness is designed never to let it reach absolute capacity: it triggers compression well before the limit. If the context still grows too large, for example after one very large tool result, the system aggressively summarizes or drops the oldest, lowest-value tool results. The system prompt and recent turns are always preserved. In extreme cases, the agent may suggest starting a new conversation to get a fresh context window.

Does a bigger context window always mean better performance?

Not necessarily. Larger windows enable handling bigger tasks, but they also increase latency (more tokens to process means slower responses) and cost (API pricing is per-token). The "lost in the middle" effect also means that stuffing more content in does not guarantee the model will actually use it. Targeted retrieval of relevant content often outperforms dumping everything into a huge window.

How does context compression differ from just truncating old messages?

Truncation is blind — it drops content regardless of importance. Compression is selective. It might replace a 5,000-token grep output with a 200-token summary: "searched for handleLogin, found matches in auth.ts (line 142), session.ts (line 87), and middleware.ts (line 23)." The key facts are preserved; the raw output is not. This is typically done by asking the LLM itself to summarize the older content.

Can the agent re-read a file it already read but that got compressed away?

Yes, and it does this regularly. If the agent needs file contents that were evicted during compression, it simply reads the file again. Reading a local file is fast in wall-clock time, and the re-read returns the current version, including any edits the agent has made since. The tradeoff is that it costs context tokens again, but by that point compression has freed up space to accommodate the re-read.

Why not just use RAG (Retrieval-Augmented Generation) instead of tool-based search?

Traditional RAG systems pre-embed documents into a vector database and retrieve by semantic similarity. This works well for documentation lookup but poorly for code search — you often need exact matches (function names, error strings) rather than semantic similarity. Grep-based search is more precise for code. That said, some agents use hybrid approaches: semantic search for discovering relevant concepts, exact search for pinpointing specific code.