Observability and Debugging

Topics Covered

Tracing agent execution

Traces, Spans, and Events

What Makes Agent Traces Different

Implementing Tracing in Practice

Trace Sampling and Retention

Correlating Traces with External Events

Logging decisions and tool calls

What to Log

What NOT to Log

Structured Logging

Log Aggregation and Retention

Privacy and Compliance Considerations

Visualizing agent trajectories

The Trajectory View

Tree View for Decision Branching

Comparison Views

Dashboards for Aggregate Patterns

Alerting on Trajectory Anomalies

Debugging non-deterministic systems

Reproducing Failures

Flaky Tests and Non-Deterministic Failures

Common Failure Patterns

The Debugging Workflow

Building a Debugging Culture

Monitoring and Alerting Strategy

When a traditional program fails, you read the stack trace. It shows exactly which function called which, what arguments were passed, and where the exception was thrown. When an agent fails, you need something entirely different.

The agent made a sequence of decisions: it read a file, decided to search for something, chose a tool, interpreted the result, and then made another decision based on that interpretation. A stack trace cannot capture this because the "logic" is not in the code. It is in the LLM's reasoning. The code just runs the agent loop. The intelligence (and the bugs) are in the LLM calls that decide what to do at each step.

This lesson covers how to build the observability infrastructure that makes agent failures visible, traceable, and fixable, from individual trace capture to team-wide debugging workflows.

Agent tracing captures this decision chain. Every agent run produces a trace: a structured record of everything the agent did, why it did it, and what happened as a result. Without tracing, debugging an agent failure is like debugging a web application with no server logs. You know it failed, but you have no idea what path it took to get there.

Agent Trace Visualization

Traces, Spans, and Events

A trace represents one complete agent run: from receiving the task to producing the final output. Think of it as the top-level container that holds everything the agent did for one task. Every trace gets a unique trace ID that links all the operations within that run.

A span is a single operation within a trace. Each LLM call is a span. Each tool call is a span. Each decision point is a span. Spans nest hierarchically: the trace contains an LLM span, which triggers a tool call span, which contains a sub-span for the actual API request. This nesting forms a tree that mirrors the agent's execution flow.

Each span records:

  • Start and end time: duration tells you where time is spent
  • Input and output: what went into the operation and what came out
  • Metadata: model name, token counts, cost, tool name, error status
  • Parent span ID: links this span to its parent in the tree

An event is a point-in-time annotation attached to a span. "Agent decided to switch from search to edit" is an event. "Agent detected loop and changed strategy" is another. Events capture decision moments that fall between tool calls: the reasoning that connects one action to the next.

Here is what a typical agent trace looks like in practice. The root span is the agent run. Its children are the iteration spans, one per loop iteration. Each iteration span has children for the LLM call and any resulting tool calls:

 
Agent Run (trace_id: abc-123, duration: 4200ms)
├── Iteration 1 (span_id: iter-1, 1200ms)
│   ├── LLM Call (model: gpt-4o, tokens: 3200/150, 800ms)
│   └── Tool: file_read (path: /src/main.py, 400ms)
├── Iteration 2 (span_id: iter-2, 1500ms)
│   ├── LLM Call (model: gpt-4o, tokens: 5100/280, 900ms)
│   └── Tool: file_edit (path: /src/main.py, 600ms)
└── Iteration 3 (span_id: iter-3, 1500ms)
    ├── LLM Call (model: gpt-4o, tokens: 5800/120, 700ms)
    └── Tool: terminal (cmd: npm test, 800ms)

Reading this trace tells you: the agent read a file, edited it, and ran tests. It consumed 14,100 input tokens and 550 output tokens across 3 LLM calls. The most expensive operation was the second LLM call (5,100 input tokens, the growing context from accumulated results). The total duration was 4.2 seconds, with LLM inference accounting for 2.4 seconds and tool execution for 1.8 seconds.

This is the kind of insight that tracing provides and stack traces cannot. You can immediately see where time and tokens are spent, whether the agent's approach was efficient, and which step to optimize for the biggest impact.

What Makes Agent Traces Different

Application performance monitoring (APM) tools like Datadog and New Relic trace HTTP requests through microservices. Agent traces look similar but capture fundamentally different information. An HTTP trace shows data flow through deterministic code paths. An agent trace shows reasoning flow through non-deterministic decision paths. The same agent on the same input might produce a different trace each time: different tools called, different reasoning, different order of operations.

This means you cannot write assertions like "this span should always appear." Instead, you analyze trace patterns: does the agent typically call the search tool first? How many LLM calls does a successful run take versus a failed one? Trace analysis for agents is statistical, not deterministic.
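As a sketch of what statistical trace analysis looks like, the snippet below computes two of the pattern questions just posed: how often the search tool is called first, and how many LLM calls successful versus failed runs take. The trace summaries are hypothetical; in practice you would derive them from stored spans.

```python
from statistics import mean

# Hypothetical per-trace summaries derived from stored spans:
# outcome, ordered tool names, and the number of LLM calls in the run.
traces = [
    {"outcome": "success", "tools": ["search", "file_edit"], "llm_calls": 3},
    {"outcome": "success", "tools": ["search", "terminal"], "llm_calls": 4},
    {"outcome": "failure", "tools": ["file_edit", "file_edit"], "llm_calls": 9},
]

def first_tool_rate(traces, tool):
    """Fraction of traces whose first tool call is `tool`."""
    hits = [t for t in traces if t["tools"] and t["tools"][0] == tool]
    return len(hits) / len(traces)

def mean_llm_calls(traces, outcome):
    """Average LLM-call count for traces with the given outcome."""
    return mean(t["llm_calls"] for t in traces if t["outcome"] == outcome)
```

Note that both functions return distributions over a population of traces, not assertions about any single trace; that is the shift from deterministic to statistical analysis.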

Implementing Tracing in Practice

Most agent tracing uses OpenTelemetry-compatible libraries or specialized tools like Langfuse and Arize. The implementation pattern is consistent: wrap each LLM call and tool call in a span, attach metadata (model, tokens, cost), and export the trace to a collector. The trace ID propagates through the entire agent run, linking all spans into a single tree.

A minimal tracing setup needs three components: (1) a trace collector that receives and stores spans, (2) instrumentation in the agent code that creates spans for each operation, and (3) a query interface that lets you search and visualize traces. You can start with console output for development and graduate to a full tracing backend for production.

Instrumenting your agent code is straightforward. Wrap each LLM call in a span creation block that records the start time, captures the input/output, and logs the metadata when the call completes. Do the same for tool calls. Most agent frameworks (LangChain, CrewAI, custom loops) have natural hook points where you can insert span creation without modifying the core agent logic. If you are building a custom agent loop, add tracing at the loop level: one span per iteration, with child spans for the LLM call and tool calls within that iteration.
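The instrumentation pattern above can be sketched with a context manager: every operation opens a span, records its metadata and error status, and exports on completion. This is a minimal stand-in, not the OpenTelemetry or Langfuse API; the in-memory `SPANS` list substitutes for a real collector.

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # stand-in collector; a real backend receives spans over the network

@contextmanager
def span(name, trace_id, parent_id=None, **metadata):
    record = {"span_id": uuid.uuid4().hex, "name": name, "trace_id": trace_id,
              "parent_id": parent_id, "metadata": metadata,
              "start": time.time(), "error": None}
    try:
        yield record
    except Exception as exc:
        record["error"] = repr(exc)  # failed spans carry their error status
        raise
    finally:
        record["duration_ms"] = (time.time() - record["start"]) * 1000
        SPANS.append(record)         # export when the operation completes

# In a custom agent loop: one span per iteration, with child spans
# for the LLM call and any tool calls within that iteration.
def run_iteration(trace_id, i):
    with span(f"iteration-{i}", trace_id) as it:
        with span("llm_call", trace_id, parent_id=it["span_id"], model="gpt-4o"):
            pass  # the actual model call goes here
        with span("tool:file_read", trace_id, parent_id=it["span_id"]):
            pass  # the actual tool execution goes here
```

Because the trace ID is threaded through every call, all spans from one run can later be reassembled into the tree shown earlier.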

The overhead of tracing is minimal: a few milliseconds per span for serialization and network send, which is negligible compared to the hundreds of milliseconds spent on LLM calls. The storage cost scales with the number of agent runs and the verbosity of span data. At thousands of runs per day, budget for a few GB of trace storage daily. The debugging value of full tracing far exceeds this storage cost. A single production incident debugged in minutes instead of hours justifies months of trace storage.

Trace Sampling and Retention

In high-volume production systems, storing every trace at full fidelity is expensive. Use a sampling strategy: store 100% of failed traces (these are the ones you debug), 100% of traces exceeding a cost threshold (these are the outliers you investigate), and a random sample (10-20%) of successful traces (these establish the baseline). Successful traces that are sampled are stored at full fidelity; the rest are stored as summary records (trace ID, duration, outcome, cost, step count) that let you compute aggregate metrics without storing every span.

Retention follows a similar logic. Keep failed traces for 90 days; you may need to investigate an issue weeks after it occurred. Keep sampled successful traces for 14 days; they establish recent baselines. Keep summary records indefinitely for trend analysis. This tiered approach keeps storage costs manageable while preserving full debugging capability for the traces that matter.

Correlating Traces with External Events

Agent behavior changes often correlate with external events that the agent team does not directly control. A model provider updates GPT-4o's weights; agent behavior shifts. An API endpoint the agent calls changes its response format; tool results change. A team member updates the system prompt; agent reasoning changes.

Annotate your trace timeline with deployment markers (when the prompt, model, or tool definitions changed) and external event markers (model provider updates, API changes). When you see a quality regression in the trace dashboard, overlaying these markers immediately shows whether the regression correlates with a known change. Without markers, you spend hours hunting for the cause of a regression that a 5-second visual correlation would reveal.

Build an integration between your deployment pipeline and your trace system. Every prompt change, model version update, or tool definition modification should automatically create an annotation in the trace timeline. This is cheap to implement and invaluable for debugging.

Model provider updates are particularly tricky because they happen without your knowledge. Subscribe to provider changelogs (OpenAI's model updates page, Anthropic's release notes) and create manual annotations when updates are announced. Some teams run a daily "canary evaluation": a fixed set of 10 tasks that the agent runs every day at the same time. If canary scores drop, something changed externally. This early warning system catches model regressions before they affect production traffic.
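The canary comparison itself is a one-line statistical check. The 10%-drop threshold below is an illustrative choice; tune it to your score scale and day-to-day variance.

```python
def canary_regressed(today_scores: list[float], baseline_scores: list[float],
                     drop_threshold: float = 0.1) -> bool:
    """Return True when today's mean canary score falls more than
    drop_threshold below the baseline mean, signaling a likely
    external change (model update, API change)."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    today = sum(today_scores) / len(today_scores)
    return (baseline - today) > drop_threshold
```

Because the task set is fixed and the schedule is fixed, the only variables left are the external ones, which is exactly what makes the canary a clean signal.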

Key Insight

Agent traces capture decisions, not just data flow. A traditional trace shows what happened. An agent trace must also show why it happened: what the LLM reasoned, why it chose tool A over tool B, and how the tool result changed its next decision. Without the 'why,' you can see the failure but cannot diagnose the cause.