The Agent Loop in Practice

Topics Covered

Implementing the Core Loop

The Basic Loop Structure

Why the Message History Matters

Tool Call Format

Parallel Tool Calls

Error Handling in the Loop

Multi-Agent Patterns

Stop Conditions and Iteration Limits

Three Types of Stop Conditions

Why Iteration Limits Are Non-Negotiable

Choosing the Right Limits

Graceful Degradation on Limit Hit

Combining Multiple Safety Stops

State Management

The Context Window Problem

Strategies for Managing State

State Beyond the Message History

Token Counting and Budget Awareness

The Scratchpad Pattern

Debugging Agent Behavior

Non-Determinism Is the Core Challenge

Trace Logging

Replay and Deterministic Testing

Step-Through Debugging

Common Agent Bugs

Building an Evaluation Harness

Observability in Production

Implementing the Core Loop

The agent loop is the simplest useful pattern in agentic AI, and also the most important one to get right. At its core, every AI agent is a while loop: call the LLM, check if it wants to use a tool, execute the tool if so, feed the result back, and repeat until the LLM produces a final text response.

Agent loop executing three steps: LLM call, tool execution, and final response

The Basic Loop Structure

Here is the agent loop in pseudocode:

python
def agent_loop(user_message, tools):
    messages = [{"role": "user", "content": user_message}]

    while True:
        response = call_llm(messages, tools)

        if response.has_tool_calls:
            # Record the assistant's tool-call turn once, then one
            # result message per call
            messages.append(assistant_message(response))
            for tool_call in response.tool_calls:
                result = execute_tool(tool_call)
                messages.append(tool_result_message(tool_call, result))
        else:
            # Model returned text — task is complete
            return response.text

Three things happen each iteration:

  1. Call the LLM with the full conversation history (system prompt + user message + all prior tool calls and results)
  2. Check the response type: did the model return a tool call or a text response?
  3. Branch: if tool call, execute it and append both the call and its result to the message history, then loop. If text, return it to the user.

The model decides what to do at each step. It sees the entire conversation history (including prior tool results) and chooses whether to call another tool or respond with text. This is what makes agents different from traditional programs: the control flow is determined by the LLM at runtime, not by the programmer at write time.

Why the Message History Matters

The message list is the agent's working memory. Every tool call and result becomes part of the conversation that the LLM sees on the next iteration. This is how the agent "remembers" what it has done and what it has learned.

If you read a file in iteration 1, the file contents are in the message history when the LLM decides what to do in iteration 2. If a database query returned an error in iteration 3, the LLM sees that error and can decide to fix the query in iteration 4. The agent loop's power comes from this feedback cycle: act, observe, reason, act again.

Tool Call Format

When the LLM wants to use a tool, it returns a structured response specifying the tool name and arguments. The format varies by provider but the concept is universal:

json
{
  "tool_calls": [
    {
      "name": "search_database",
      "arguments": {
        "query": "SELECT * FROM users WHERE email = '[email protected]'"
      }
    }
  ]
}

The orchestration code (your agent loop) parses this, executes the tool, and returns the result as a new message. The LLM never executes tools directly; it only expresses intent. Your code decides whether to honor that intent, which is the foundation of agent safety.

Key Insight

The agent loop inverts the control flow of traditional programming. In a normal program, the developer writes if/else branches that decide what happens next. In an agent, the LLM decides what happens next based on the conversation history. Your code provides the tools and enforces the boundaries, but the LLM chooses the path through the problem space.

Parallel Tool Calls

Some providers support parallel tool calling. The LLM returns multiple tool calls in a single response when the calls are independent. For example, if an agent needs to check both a user's order history and their account balance, it can request both simultaneously rather than sequentially.

Supporting parallel tool calls in your loop means iterating over response.tool_calls (plural) and executing them concurrently. This reduces the number of LLM round-trips and speeds up the agent significantly for tasks with independent subtasks.
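
A minimal sketch of concurrent execution with a thread pool, assuming your execute_tool function is thread-safe (the helper name is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def execute_tools_parallel(tool_calls, execute_tool):
    # Independent calls run concurrently; pool.map preserves input
    # order, so pairing each result with its call stays trivial.
    if not tool_calls:
        return []
    with ThreadPoolExecutor(max_workers=len(tool_calls)) as pool:
        return list(pool.map(execute_tool, tool_calls))
```

Because map returns results in call order, the loop can append tool-result messages in the same order the model requested them.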

Error Handling in the Loop

Tools fail. APIs return 500 errors, databases time out, files do not exist. How you handle tool failures determines whether your agent is fragile or resilient.

The most important rule: never crash the loop on a tool error. Instead, return the error as the tool result:

python
try:
    result = execute_tool(tool_call)
except Exception as e:
    result = f"Error: {type(e).__name__}: {str(e)}"

The LLM sees the error in its message history and decides what to do. Retry with different arguments, try a different tool, or inform the user that the operation failed. This keeps the agent loop running and lets the model reason about failures the same way it reasons about successes.

Never silently swallow errors. If a database query fails and you return an empty result instead of the error message, the LLM thinks the query returned no data. It might tell the user "no results found" when the real answer is "the database was unreachable." Return the error explicitly so the model can communicate accurately.

Multi-Agent Patterns

Not every task fits a single agent loop. Complex tasks benefit from multiple agents working together:

  • Orchestrator-worker: A top-level agent breaks the task into subtasks and delegates each to a specialized worker agent. The orchestrator collects results and synthesizes a final answer. Each worker has a focused system prompt and toolset.
  • Sequential pipeline: The output of one agent becomes the input to the next. Agent 1 researches, Agent 2 analyzes the research, Agent 3 writes the report. Each agent has a narrow scope and clear input/output contract.
  • Parallel fan-out: The orchestrator sends the same task to multiple agents with different strategies. The best result wins. Useful for tasks where the optimal approach is uncertain.

These patterns are compositions of the basic agent loop. Each agent runs the same while loop: call LLM, check for tool calls, execute, repeat. The complexity comes from how agents communicate, not from the loop itself.

The key insight for multi-agent systems: keep each agent's scope narrow. A single agent with 20 tools and a vague system prompt performs worse than three agents with 5-7 tools each and focused prompts. The orchestrator decides which specialist to call, and each specialist does one thing well. This mirrors software engineering best practices: small, focused components compose better than monolithic ones.

When designing multi-agent systems, define clear input and output contracts between agents. Agent 1 produces a structured output that Agent 2 can parse without ambiguity. If the contract is unclear, agents miscommunicate and the system produces garbage. This is why structured outputs (covered in the next lesson) are critical for multi-agent architectures: each inter-agent message must be machine-parseable.
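
As a sketch of such a contract, the handoff between agents can be a small validated JSON payload (the field names here are illustrative, not a standard):

```python
import json

REQUIRED_FIELDS = {"agent", "status", "findings"}

def make_handoff(agent, status, findings):
    # Serialize one agent's output in a fixed, machine-parseable shape.
    return json.dumps({"agent": agent, "status": status, "findings": findings})

def parse_handoff(payload):
    # Validate the contract before the next agent acts on it.
    message = json.loads(payload)
    missing = REQUIRED_FIELDS - message.keys()
    if missing:
        raise ValueError(f"handoff missing fields: {sorted(missing)}")
    return message
```

Validating at the boundary means a malformed handoff fails loudly at parse time instead of silently corrupting the downstream agent's reasoning.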

Stop Conditions and Iteration Limits

The agent loop runs until a stop condition is met. Getting stop conditions right is critical: too aggressive and the agent quits before finishing; too permissive and a confused agent loops forever, burning tokens and money.

Iteration limit halting an agent that would otherwise loop indefinitely

Three Types of Stop Conditions

Explicit stop: The LLM signals that the task is complete by returning a text response instead of a tool call. This is the normal, happy-path exit. The model has gathered enough information, performed the required actions, and is ready to deliver a final answer. In well-designed agents, this is how 90%+ of runs end.

Implicit stop: The LLM returns a response with no tool calls and no meaningful text, sometimes just an empty response or a response that does not advance the task. Your orchestration code detects this stall and terminates the loop. This catches cases where the model is confused but not looping.

Safety stop: Hard limits enforced by the orchestration code, not the model:

python
MAX_ITERATIONS = 25
MAX_TOKENS_SPENT = 100_000
MAX_WALL_TIME = 300  # seconds

iteration = 0
tokens_used = 0
start_time = time.time()

while True:
    if iteration >= MAX_ITERATIONS:
        return "Stopped: reached iteration limit"
    if tokens_used >= MAX_TOKENS_SPENT:
        return "Stopped: token budget exhausted"
    if time.time() - start_time > MAX_WALL_TIME:
        return "Stopped: timeout reached"

    response = call_llm(messages, tools)
    iteration += 1
    tokens_used += response.usage.total_tokens
    # ... rest of loop

Why Iteration Limits Are Non-Negotiable

Without an iteration limit, a single confused agent run can consume your entire API budget. Consider: an agent that misunderstands a task and keeps trying different approaches will call the LLM 50, 100, 200 times, each call costing tokens. At $15 per million output tokens, a 200-iteration loop with 1,000 tokens per response costs $3 for a single failed run. Multiply by hundreds of concurrent users and you have a financial incident.

The iteration limit is not a performance optimization. It is a safety mechanism. Set it based on the longest legitimate task your agent handles. If the hardest task requires 15 iterations, set the limit to 25. If a run hits 25 iterations, something has gone wrong and continuing will not fix it.

Common Pitfall

Never deploy an agent loop without an iteration limit. A missing limit turns a confused model into an unbounded cost generator. In production, also add a token budget limit and a wall-clock timeout. These three limits together prevent runaway cost, runaway computation, and stuck processes.

Choosing the Right Limits

The right limits depend on your agent's task complexity:

  • Simple Q&A agents (1-3 tool calls per task): MAX_ITERATIONS = 10
  • Research agents (5-10 tool calls): MAX_ITERATIONS = 25
  • Coding agents (10-20+ tool calls for complex tasks): MAX_ITERATIONS = 50-100
  • Token budgets: 10x your average successful run. If a typical task uses 5,000 tokens, set the budget to 50,000.
  • Timeouts: 2-3x your p95 task completion time

The key principle: limits should be generous enough that legitimate tasks never hit them, but tight enough that runaway loops are caught within seconds, not minutes.

Graceful Degradation on Limit Hit

When an agent hits a limit, do not just return an error. Return what the agent has accomplished so far:

python
if iteration >= MAX_ITERATIONS:
    summary = summarize_progress(messages)
    return f"I reached my iteration limit. Here is what I found so far: {summary}"

This gives the user partial results instead of nothing. A research agent that found 7 of 10 requested papers before hitting its limit is more useful than one that returns "Error: iteration limit exceeded."

Combining Multiple Safety Stops

In production, use all three safety mechanisms together. The iteration limit catches infinite loops. The token budget catches expensive loops (large context per iteration). The wall-clock timeout catches slow loops (tool calls that hang or network timeouts).

Each mechanism catches a different failure mode:

  • A loop that makes cheap, fast calls but never terminates: caught by iteration limit
  • A loop that makes few calls but passes huge documents each time: caught by token budget
  • A loop where a single tool call hangs for minutes: caught by timeout

No single limit covers all cases. An agent that reads a 50,000-token file on every iteration might hit the token budget in 3 iterations but would not be caught by an iteration limit of 25. An agent that makes 100 small, fast tool calls might hit the iteration limit but never approach the token budget. Use all three, and log which limit was hit. This data tells you whether you need to tune the limits or fix the underlying issue.

State Management

Every iteration of the agent loop adds to the conversation history: the LLM's response, the tool call, and the tool result. This means the state grows with every step. A 10-iteration agent run might accumulate 20,000+ tokens of history: tool results, reasoning, intermediate outputs. Managing this growing state is one of the central engineering challenges of agent development.

Five context management strategies compared

The Context Window Problem

LLMs have a finite context window: the maximum number of tokens they can process in a single call. Claude supports 200K tokens, GPT-4 supports 128K, and smaller models may support only 8K-32K. When the conversation history exceeds the context window, the LLM call fails.

Even before hitting the hard limit, performance degrades. LLMs process long contexts more slowly (higher latency per call) and may lose track of information buried deep in a long conversation (the "lost in the middle" effect). A 100K-token context that is 90% tool results from early iterations is wasteful. The model is paying attention cost for information it no longer needs.

Strategies for Managing State

Truncation: Remove the oldest messages when the history exceeds a threshold. Simple and effective for tasks where recent context matters most. The risk: the agent forgets early tool results that are still relevant.

Summarization: Periodically ask the LLM to summarize the conversation so far, then replace the full history with the summary. This compresses 10,000 tokens into 500 while preserving key facts. The risk: the summary may lose important details.

Sliding window: Keep the system prompt, the first N messages, and the last M messages. Drop everything in between. This preserves the original task and the most recent context while bounding total size. Many production agents use a window of the last 10-20 messages.

Selective retention: Tag certain messages as "important" (the original user request, key findings, error messages) and always keep them. Drop unimportant messages (intermediate tool calls that returned no useful data). This requires more logic but produces the highest-quality compressed context.
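
A minimal sketch of the sliding-window strategy (the default window sizes are illustrative):

```python
def sliding_window(messages, keep_first=2, keep_last=10):
    # Preserve the original task (first messages) and recent context
    # (last messages); drop the middle once the history grows.
    if len(messages) <= keep_first + keep_last:
        return list(messages)
    return messages[:keep_first] + messages[-keep_last:]
```

Run this before each LLM call so the history size stays bounded regardless of how many iterations the agent takes.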

State Beyond the Message History

Not all agent state belongs in the message history. Production agents often maintain external state:

  • Task progress tracker: which subtasks are complete, which remain. Stored in a database or in-memory data structure, not in the LLM context.
  • File system state: files the agent has created or modified. The file contents do not need to stay in the message history after the write is confirmed.
  • Accumulated results: search results, computation outputs, extracted data. Store these externally and reference them by ID rather than keeping the raw data in every LLM call.

The principle: keep the LLM context focused on what the model needs to make its next decision. Move everything else to external storage and retrieve it only when needed.

Token Counting and Budget Awareness

Production agents track token usage at every iteration. Before calling the LLM, count the tokens in your message history (most provider SDKs include a tokenizer). If the count approaches the context limit, trigger one of the strategies above (summarize, truncate, or drop non-essential messages) before the call fails.

A simple budget-aware loop looks like this: set a soft limit at 75% of the context window (e.g., 96K tokens for a 128K model). When the message history exceeds the soft limit, compress it before the next LLM call. This leaves 25% headroom for the model's response and prevents the hard-limit failure that would crash the loop.
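
A sketch of that soft-limit check, assuming you supply your own token counter and compression function (the names are illustrative):

```python
def enforce_soft_limit(messages, count_tokens, compress,
                       context_limit=128_000, soft_ratio=0.75):
    # Compress once the history crosses the soft limit (default 75%
    # of the window), leaving headroom for the model's response.
    if count_tokens(messages) > int(context_limit * soft_ratio):
        return compress(messages)
    return messages
```

The compress callback can be any of the strategies above: summarization, truncation, or a sliding window.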

Token counting also feeds into cost tracking. If you know each iteration costs roughly 2,000 tokens, and you have a budget of 50,000 tokens, you know the agent can run for about 25 iterations before hitting the budget. This makes iteration limits and token budgets complementary: the iteration limit catches infinite loops, and the token budget catches iterations with unexpectedly large contexts.

The Scratchpad Pattern

The scratchpad is the most common external state pattern in production agents. It is a key-value store (often just a Python dictionary) where the agent writes intermediate results during execution. The scratchpad lives outside the message history, so it does not consume context tokens.

When the agent needs data from a previous step, the orchestration code retrieves it from the scratchpad and injects only the relevant portion into the next LLM prompt. For example, an agent that reads 10 files stores each file's key findings in the scratchpad under the filename. When the agent needs to compare findings across files, the orchestration code pulls the relevant entries and adds them to the system prompt, not the entire file contents.

The scratchpad also enables resumability. If an agent crashes mid-task, the scratchpad contains all progress made so far. A new agent run can load the scratchpad and continue from where the previous run stopped, without repeating expensive tool calls. This is particularly valuable for long-running agents that process dozens of files or make dozens of API calls.
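
A minimal scratchpad sketch with persistence for resumability (the class and method names are illustrative):

```python
import json

class Scratchpad:
    # Key-value store for intermediate results; it lives outside the
    # message history, so it consumes no context tokens.
    def __init__(self, entries=None):
        self.entries = dict(entries or {})

    def write(self, key, value):
        self.entries[key] = value

    def read(self, key, default=None):
        return self.entries.get(key, default)

    def dump(self):
        # Persist progress so a new run can resume after a crash.
        return json.dumps(self.entries)

    @classmethod
    def load(cls, payload):
        return cls(json.loads(payload))
```

In a real system the dump would go to a database or file rather than a string, but the resume path is the same: load, inspect what is done, continue.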

Interview Tip

A common pattern is the 'scratchpad': an external data structure where the agent stores intermediate results. Instead of keeping a 5,000-token database query result in the message history for every subsequent LLM call, the agent writes the key findings to a scratchpad and includes only a one-line summary in the context. This keeps the context small and focused while preserving all the data the agent might need later.

Debugging Agent Behavior

Debugging an agent is fundamentally different from debugging a traditional program. A traditional program follows the same code path every time given the same input. An agent can take completely different paths on consecutive runs with identical input. The LLM might choose different tools, explore different strategies, or reason differently about the same observations. This non-determinism makes traditional debugging techniques (setting breakpoints, reading stack traces) insufficient.

Agent trace replay for debugging non-deterministic behavior

Non-Determinism Is the Core Challenge

LLMs are stochastic: even with temperature=0, the output is not strictly deterministic across different API calls due to floating point precision, batching effects, and infrastructure changes. With the default temperature (typically 0.7-1.0), the same prompt can produce meaningfully different responses.

For an agent, this means the same task can produce different tool call sequences:

Run 1: search_web → read_page → summarize → respond

Run 2: search_web → search_web (different query) → read_page → read_page → summarize → respond

Run 3: search_web → respond (decides it has enough from search snippets alone)

All three runs might produce correct answers through different paths. A bug that appears in 1 of 10 runs is much harder to find than a bug that appears in every run.

Trace Logging

The most important debugging tool for agents is a complete trace of every LLM call and tool execution. For each iteration, log:

  • The full prompt sent to the LLM (or at least the last few messages if the context is large)
  • The LLM's response (including tool calls and reasoning)
  • The tool that was executed and its arguments
  • The tool's result (or error)
  • Token counts and latency for the LLM call
  • The iteration number and elapsed time

This trace is the agent's equivalent of a stack trace. When a run fails, you replay the trace to understand the agent's decision-making process: what did it see at each step, and why did it make the choice it made?
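
A sketch of one trace record per iteration, capturing the fields above (the field names are illustrative; adapt them to your orchestration code):

```python
import time

def record_iteration(trace, iteration, started_at, prompt_tail,
                     response_text, tool_call, tool_result, tokens):
    # Append one structured record per loop iteration so a failed
    # run can be reconstructed step by step.
    trace.append({
        "iteration": iteration,
        "elapsed_s": round(time.time() - started_at, 2),
        "prompt_tail": prompt_tail,  # last few messages, not the full context
        "response": response_text,
        "tool_call": tool_call,
        "tool_result": tool_result,
        "tokens": tokens,
    })
    return trace
```

Writing each record as a line of JSON makes traces easy to grep and to load into a replay harness later.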

Replay and Deterministic Testing

Save conversation histories from production runs. When a bug is reported, replay the saved history to reproduce the exact sequence of events. This is more reliable than trying to reproduce a non-deterministic bug by running the agent again.

For testing, build a replay harness that feeds pre-recorded tool results instead of calling real tools. This makes agent tests deterministic: the LLM call is mocked or cached, and the tool results are fixed. You can then assert that the agent produces the expected output given a specific sequence of tool results.
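
A minimal sketch of the mocked-LLM side of such a harness: a stand-in for call_llm that replays recorded responses in order (the helper name is hypothetical):

```python
def make_replay_llm(recorded_responses):
    # Deterministic stand-in for call_llm: replays pre-recorded
    # responses in sequence instead of hitting a live API.
    remaining = list(recorded_responses)

    def call_llm(messages, tools):
        if not remaining:
            raise RuntimeError("agent made more LLM calls than were recorded")
        return remaining.pop(0)

    return call_llm
```

Failing loudly when the recording runs out catches the case where a code change makes the agent take more iterations than the recorded run did.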

Step-Through Debugging

For complex agents, implement a step-through mode where the agent pauses after each iteration and waits for human approval before continuing. The human reviews the LLM's proposed tool call, approves or modifies it, and the agent continues.

This is invaluable during development: you watch the agent's reasoning in real time, catch bad decisions before they execute, and build intuition for how the model approaches different tasks. Many production agents include a "verbose" or "debug" mode that shows each step without requiring approval.

Common Agent Bugs

Infinite tool call loops: The agent calls the same tool with the same arguments repeatedly. Usually caused by the tool returning a result that the model does not understand, so it tries again. Fix: improve the tool's error messages or add loop detection that identifies repeated identical calls.
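
One way to sketch that loop detection, flagging identical calls within a recent window (the names are illustrative):

```python
from collections import deque

def make_loop_detector(window=3):
    # Flag an identical (tool name, arguments) pair seen within the
    # last few calls -- a strong signal the agent is stuck.
    recent = deque(maxlen=window)

    def seen_before(tool_call):
        key = (tool_call["name"], repr(sorted(tool_call["arguments"].items())))
        repeated = key in recent
        recent.append(key)
        return repeated

    return seen_before
```

When the detector fires, the orchestration code can inject a message telling the model the call was already made, or terminate the run.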

Premature stopping: The agent returns a text response before completing the task. Often caused by ambiguous instructions or a context that overwhelms the model. Fix: make the system prompt explicit about what "done" means and test with varied task descriptions.

Tool argument errors: The agent calls a tool with wrong argument types or missing required fields. This is a structured output problem. The next lesson covers it in depth.

Context amnesia: After many iterations, the agent "forgets" earlier context due to attention dilution in long contexts. Fix: use the context management strategies from the previous section, such as summarizing the history and pinning important information.

Wrong tool selection: The agent picks a tool that cannot accomplish the subtask. For example, it calls a web search tool when the data is in a local database. This usually indicates that the system prompt's tool descriptions are unclear. Fix: improve tool descriptions to specify when each tool should be used and what data it can access.

Building an Evaluation Harness

Testing individual agent runs is not enough. You need to evaluate the agent's behavior across a distribution of tasks. An evaluation harness runs the agent on a suite of test cases and measures success rate, average iterations, average cost, and failure modes.

Each test case specifies: the user input, the expected outcome (a correct answer, a set of tool calls that should be made, or a condition the final output must satisfy), and optionally the maximum iterations and token budget. The harness runs the agent, compares the output to the expected outcome, and reports pass/fail with the execution trace.

Evaluation harnesses catch regressions. When you change a system prompt, add a tool, or update the model version, the harness reveals whether the change improved or degraded the agent across the full test suite. Without it, you discover regressions from user complaints, which is slower, more expensive, and more damaging.

A good test suite covers edge cases: tasks that require many iterations, tasks where the first tool call fails, tasks with ambiguous instructions, tasks that should be refused (out-of-scope requests). Run the suite on every prompt change and every model upgrade. Treat the evaluation harness as your agent's unit test suite.

Observability in Production

Beyond traces and evaluations, production agents need real-time observability. Track these metrics on a dashboard:

  • Success rate: percentage of agent runs that produce a correct or acceptable result. This is your top-line metric.
  • Average iterations per run: a rising average often signals that the model is struggling, the tools are returning less useful results, or the task distribution has shifted.
  • P95 latency: the time from user request to final response. Includes all LLM calls, tool executions, and retries. A spike in P95 points to slow tools or degraded model performance.
  • Token cost per run: track the mean and P95. Sudden increases indicate that the model is using more context per call (possibly reading larger tool results).
  • Limit hit rate: how often agents hit iteration, token, or timeout limits. A rising rate means agents are struggling to complete tasks within their budgets.

Alert on anomalies in these metrics. A 10% drop in success rate or a 2x increase in average iterations is a signal to investigate: check recent prompt changes, model version updates, or changes to tool behavior.