The Agent Loop

How agentic coding assistants use a read-think-act-observe cycle to solve multi-step programming tasks

3-25 Loop iterations per task
5-40 Tool calls per task
~80% Tokens are tool results
<2s Avg tool execution time
200K Context window (tokens)

Why the Loop Pattern Exists

Large language models are powerful reasoners, but they are fundamentally text-in, text-out systems. They cannot open a file, run a terminal command, or check if their code compiles. They can only produce strings of characters. This creates a gap: the model can figure out what needs to be done but has no way to do it.

The agent loop bridges this gap by wrapping the LLM in a cycle. A harness program sits between the model and the outside world. When the model decides it needs to read a file, it emits a structured tool call. The harness executes that call against the real filesystem and feeds the result back as a new message. Now the model can see what the file contains, reason about the next step, and emit another tool call. This continues until the model decides the task is complete and responds with plain text instead of a tool call.

Think of it this way: the LLM is a brain in a jar. The tool system gives it hands to manipulate code and eyes to observe the results. The loop is the nervous system connecting them. Without it, the model can only give you advice. With it, the model can actually fix your code.

The Core Loop Visualized

Watch a token circulate through the agent loop. Each phase lights up as the agent processes a real task: the user asks to fix a failing test, and the agent searches, reads, diagnoses, edits, and verifies.

3x
Execution Trace

Anatomy of a Single Iteration

Each trip around the loop has four distinct phases. Understanding them is key to understanding why agents behave the way they do.

1

Think

The model receives the full conversation so far, including all prior tool results. It reasons about what to do next. This is where chain-of-thought happens. The model might consider multiple approaches, weigh trade-offs, or revise its plan based on new information. The output is either a tool call (continuing the loop) or a text response (ending it).

What consumes tokens: The entire conversation history, system prompt, tool definitions, and all prior tool results. On a complex task, the "think" input can easily reach 50K-100K tokens.
2

Select Tool

Rather than producing free-form text, the model emits a structured JSON object specifying which tool to call and with what arguments. This is not just string matching; the model has internalized the tool schemas from the system prompt and generates valid calls. It can call file search, code editing, terminal execution, web fetching, or spawn sub-agents.

Key design choice: Some systems allow multiple tool calls per turn (parallel execution), while others are strictly one-at-a-time. Parallel calls are faster but make error handling more complex.
3

Execute

The harness program (not the model) executes the tool call. This is the only point where real side effects happen. A file edit modifies actual bytes on disk. A bash command runs a real process. The model never touches the filesystem directly; the harness acts as a privileged intermediary, often checking permissions before executing.

Safety layer: The harness can reject tool calls, require user approval, sandbox commands, or apply allow-lists. This is where the permission model lives.
4

Observe

The tool's output (file contents, command output, error messages, search results) is appended to the conversation as a new message. The model now has fresh information it did not have before. This is what makes the loop powerful: each observation changes the model's understanding, allowing it to adapt its strategy in real time.

Truncation: Large tool outputs (e.g., a 5000-line file) are often truncated or summarized to fit within the context window. The harness decides what to keep and what to drop.

When Does the Loop Stop?

The loop is not infinite. Several conditions cause it to terminate, and understanding these explains many observed agent behaviors.

Task Complete

The model decides it has finished. Instead of emitting a tool call, it produces a text response summarizing what was done. This is the happy path. The model has agency over when to stop, which means it sometimes stops too early (missing edge cases) or too late (over-engineering).

!

Unrecoverable Error

A tool returns an error the model cannot work around, such as a permission denial on a protected file, a network timeout when fetching a required resource, or a compilation failure it cannot diagnose. The model will typically report the error to the user.

User Interruption

The user presses Escape or a stop button. Because the loop streams its reasoning in real time, users can watch the agent's thought process and interrupt if it goes off track. This is a critical feedback mechanism that prevents runaway loops.

$

Token Budget Exhausted

The context window fills up. On a long task with many tool results, the conversation can exceed 200K tokens. At that point, older messages must be evicted or summarized. If the essential context gets evicted, the agent may lose track of the task. Some systems set a hard turn limit as a guardrail.

Planning vs. Step-by-Step Execution

There are two dominant strategies for how agents approach multi-step tasks, and most real systems use a hybrid.

Plan-First Approach

Analyze the full task
Create a todo list of steps
Execute each step in order
Check off completed items

Strengths

  • Provides a roadmap the user can review before execution starts
  • Helps prevent forgetting steps in complex tasks
  • Makes progress visible and predictable

Weaknesses

  • Plans often become stale as the agent discovers new information
  • Consumes tokens maintaining a todo structure
  • Can lead to rigid execution when flexibility is needed

Reactive Step-by-Step

Look at what's in front of me
Do the most obvious next thing
Observe what happened
Decide if I'm done or keep going

Strengths

  • Adapts naturally as new information is discovered
  • No overhead from maintaining a plan structure
  • Works well for exploratory and debugging tasks

Weaknesses

  • Can lose the forest for the trees on large tasks
  • Harder for users to predict what the agent will do next
  • Risk of going in circles without a high-level map

In practice, effective agents blend both approaches. They form a loose mental plan during the first "think" phase but stay flexible enough to deviate when reality diverges from expectations. The best results come from planning at the right granularity: broad enough to maintain direction, detailed enough for the immediate next step.

Backtracking and Error Recovery

One of the most powerful properties of the agent loop is its ability to recover from mistakes. Because every tool result (including errors) is fed back to the model, the agent can see exactly what went wrong and try something different. This is fundamentally different from a script, which would simply fail and stop.

Example: A Failed Approach
Attempt 1
Agent tries to fix the bug by editing auth.ts line 42
Test Fails
Runs tests, gets: TypeError: Cannot read property 'token' of undefined
Re-Think
The model sees the error, realizes the root cause is upstream in session.ts, not in auth.ts
Attempt 2
Reverts the auth.ts change and fixes the null check in session.ts
Tests Pass
All 47 tests green. The agent reports success with a summary of what it changed and why.

This backtracking ability is why agent loops handle ambiguous tasks far better than simple code generation. The model does not need to get it right on the first try. It just needs to be able to recognize failure and adjust.

Streaming: Transparency Builds Trust

A critical UX decision in agent systems is making the loop visible to the user. Rather than running silently and returning a final result, streaming agents show their reasoning token by token. The user watches the agent think through the problem, sees it pick a tool, observes the tool output, and follows the next reasoning step.

Agent Output (streaming)

Streaming serves three purposes. First, it lets users verify the agent's understanding before it acts. If the model misunderstands the task, the user can interrupt immediately rather than waiting for an incorrect result. Second, it builds trust by showing the agent's work. A black box that silently edits your codebase is terrifying; one that explains each step is a collaborator. Third, it reduces perceived latency. A 30-second task feels faster when you can watch progress in real time.

Frequently Asked Questions

What happens if the agent gets stuck in an infinite loop?

Several safeguards prevent this. Most agent systems impose a maximum turn count (often 50-200 iterations). The token budget acts as a natural ceiling since every iteration consumes context window space. Users can also interrupt at any time via keyboard shortcuts. Additionally, well-tuned models learn to recognize circular patterns in their own output and break out by trying a different approach or asking the user for clarification.

How does the agent decide which tool to use?

The model sees tool descriptions in its system prompt, which include the tool name, parameter schema, and usage guidelines. Based on its current reasoning about the task, it generates a tool call as structured output. This is not a lookup table or decision tree; it is the same next-token prediction process the model uses for all text generation, but constrained to output valid tool-call JSON. The model learns tool selection patterns during training and from the detailed instructions in its prompt.

Why not just generate all the code at once instead of using a loop?

One-shot code generation works for small, well-defined tasks (write a sort function, create a React component). But real coding tasks are rarely self-contained. You need to understand the existing codebase, find the right files, check how functions are called, verify your change does not break anything, and handle edge cases you discover along the way. The loop lets the agent gather information incrementally, exactly like a human developer who reads code, makes a change, runs tests, and iterates. Without the loop, you would have to stuff the entire codebase into the prompt, which is impractical for any non-trivial project.

How much of the context window do tool results consume?

On a typical task, tool results account for roughly 70-85% of all tokens in the conversation. A single file read can be 500-5000 tokens. A grep search across a codebase might return 2000 tokens of matches. Command output from running a test suite can easily be 3000+ tokens. This is why context window management is so critical; the agent must be selective about which tools it calls and smart about which results to keep as the conversation grows. Most harnesses truncate large outputs and may summarize older results to free up space.

Can the agent undo its own changes if something goes wrong?

Yes, in several ways. The agent can use git to revert files, it can re-edit a file to restore its previous content, or it can use version control to reset to a known good state. Some harnesses also keep internal snapshots of file state before modifications, providing an implicit undo capability. The backtracking pattern described above is the model-level version of undo: the agent sees a test fail, recognizes its change was wrong, and applies a different fix. This combination of model-level reasoning and system-level safeguards makes the loop robust against mistakes.