Error Handling and Recovery

Topics Covered

Error Taxonomy for Agents

Tool Errors

LLM Errors

Logic Errors

Resource Errors

Retry and Fallback Strategies

Retry With Context

Structured Error Feedback

Fallback Strategies

Retry Budgets

Checkpoint and Resume

What to Checkpoint

When to Checkpoint

Resume Logic

Idempotent Steps

Graceful Degradation

Complete What You Can

Report What Remains

Dead Letter Handling

Traditional software fails in predictable ways: a function throws an exception, a database query times out, a network request returns a 500. You can enumerate the failure modes and write handlers for each one. Agents fail in all of those ways plus entirely new categories: the LLM generates unparseable output, hallucinates a tool that does not exist, gets stuck in a reasoning loop, or silently misunderstands the task. Robust error handling for agents requires a taxonomy that covers both the familiar failures and the novel ones.

[Figure: Agent error taxonomy covering tool errors, LLM errors, and orchestration errors]

Tool Errors

Tool errors are the most straightforward category because they behave like traditional software failures. The API returns an error code. The file does not exist. The database connection times out. The external service rate-limits the request.

What makes tool errors different in an agent context is that the agent can adapt. In traditional software, a failed API call triggers a retry or a fallback handler written by the developer. An agent can read the error message, reason about what went wrong, and try a different approach: a different file path, a modified query, or an alternative tool entirely.

Tool error examples:
- 404 Not Found    → Agent searches for the correct path
- 429 Rate Limited → Agent waits and retries
- 403 Forbidden    → Agent reports permission issue to user
- Timeout          → Agent tries a simpler query
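The adaptation pattern above can be sketched as a small recovery wrapper. This is an illustrative sketch, not a real API: `call_tool`, the result dictionary shape, and the specific recovery policies are all assumptions.

```python
import time

def call_with_recovery(call_tool, args, max_attempts=3):
    """Hypothetical sketch: route tool errors to different recovery strategies.

    Transient errors (429) are retried with backoff; unrecoverable errors
    (403) are reported; everything else is returned so the agent can read
    the error and adapt its next action.
    """
    for attempt in range(max_attempts):
        result = call_tool(args)
        if result.get("status") == "ok":
            return result
        code = result.get("code")
        if code == 429:
            # Rate limited: wait with exponential backoff, then retry.
            time.sleep(2 ** attempt)
        elif code == 403:
            # Permission errors won't succeed on retry; surface to the user.
            return {"status": "error", "report": "Permission denied; ask the user."}
        else:
            # 404, timeout, etc.: hand the error back to the agent loop
            # so the model can reason about a different approach.
            return result
    return {"status": "error", "report": "Retries exhausted."}
```

The key design choice is that only rate limits are retried blindly; other errors are fed back to the model, which is what lets the agent adapt rather than repeat.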

LLM Errors

LLM errors are unique to agent systems. The model produces output that your system cannot process. This includes malformed JSON in tool calls (missing quotes, trailing commas), hallucinated tool names that do not exist in the tool registry, refusal to perform a task the model considers unsafe, and responses that ignore the requested format entirely.

These errors are particularly insidious because the model does not know it has erred. A hallucinated tool name looks perfectly valid to the model. It generated a tool call with arguments and expects a result. Your system must detect the error, explain what went wrong, and give the model a chance to correct itself.

LLM error examples:
- Malformed JSON       → Parse error, ask model to regenerate
- Hallucinated tool    → "Tool 'search_web' not found.
                          Available tools: web_search,
                          file_read, database_query"
- Format violation     → "Expected JSON, received plain text.
                          Please respond in the required format."
- Refusal              → Log and escalate to human

Key Insight

The most effective error messages for agents are structured the same way you would explain an error to a junior developer: what happened, what was expected, and what the available options are. Telling an agent 'Tool not found' is unhelpful. Telling it 'Tool search_web not found. Available tools are: web_search, file_read, database_query. Did you mean web_search?' gives the agent enough context to self-correct on the next iteration.
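That "did you mean" suggestion can be generated automatically. A minimal sketch, assuming a simple list-based tool registry (the tool names are the ones used in the example above; real systems would pull them from their registry):

```python
import difflib

# Illustrative registry; in practice this comes from your tool definitions.
AVAILABLE_TOOLS = ["web_search", "file_read", "database_query"]

def tool_not_found_message(requested: str) -> str:
    """Build a structured error message the model can self-correct from:
    what happened, what the options are, and a best-guess suggestion."""
    msg = (f"Tool '{requested}' not found. "
           f"Available tools: {', '.join(AVAILABLE_TOOLS)}.")
    # Fuzzy-match the hallucinated name against real tool names.
    close = difflib.get_close_matches(requested, AVAILABLE_TOOLS, n=1)
    if close:
        msg += f" Did you mean {close[0]}?"
    return msg
```

Feeding this string back as the tool result usually lets the model correct itself on the next iteration instead of failing the whole run.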

Logic Errors

Logic errors occur when the agent does the wrong thing without any technical failure. The agent misunderstands the task and solves the wrong problem. The agent enters an infinite loop, repeating the same action because it does not recognize that it already tried it. The agent takes a correct but catastrophically inefficient approach: reading every file in a repository instead of searching for the relevant one.

Logic errors are the hardest to detect because everything looks correct at the technical level. Tool calls succeed. Output is well-formatted. But the result is wrong. Detection requires either human review, output validation against expected criteria, or self-evaluation where the agent reviews its own work.
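One logic error that can be caught mechanically is the repeated-action loop. A hypothetical sketch: track a sliding window of recent actions and flag when the same (tool, arguments) pair recurs; the window size, threshold, and signature scheme are all assumptions.

```python
from collections import deque

class LoopDetector:
    """Flag when the agent repeats the exact same action, a common
    logic error. Real systems might hash normalized arguments instead."""

    def __init__(self, window: int = 5, threshold: int = 3):
        self.recent = deque(maxlen=window)  # sliding window of signatures
        self.threshold = threshold

    def record(self, tool_name: str, args: dict) -> bool:
        """Record one action; return True if it has repeated too often."""
        signature = (tool_name, repr(sorted(args.items())))
        self.recent.append(signature)
        return self.recent.count(signature) >= self.threshold
```

When the detector fires, the orchestrator can interrupt the loop and inject a message telling the model it has already tried that action, which is often enough to break the cycle.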

Resource Errors

Resource errors occur when the agent exceeds system limits. The context window fills up (too many tokens). The cost budget is exhausted. The execution time exceeds the timeout. The maximum iteration count is reached.

These are guardrail errors: they exist to prevent runaway agents from consuming unbounded resources. The correct response is not to retry (the same action will hit the same limit) but to stop, summarize what was accomplished, and report what remains incomplete.
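These guardrails can be centralized in a small budget object checked once per agent iteration. The specific limits below are illustrative defaults, not recommendations:

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Budget:
    """Guardrails against runaway agents; limit values are illustrative."""
    max_iterations: int = 20
    max_cost_usd: float = 1.00
    max_seconds: float = 300.0
    iterations: int = 0
    cost_usd: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def exceeded(self) -> Optional[str]:
        """Return the name of the first exhausted limit, or None."""
        if self.iterations >= self.max_iterations:
            return "iteration limit"
        if self.cost_usd >= self.max_cost_usd:
            return "cost budget"
        if time.monotonic() - self.started >= self.max_seconds:
            return "timeout"
        return None
```

The orchestrator increments `iterations` and `cost_usd` as the agent runs; when `exceeded()` returns a limit name, the correct move is the one described above: stop, summarize progress so far, and report what remains incomplete, rather than retrying into the same wall.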