Evaluating Agent Performance

Topics Covered

Why agent evaluation is hard

Non-Determinism

Open-Ended Outputs

Compound Errors

Subjective Quality

The Evaluation Spectrum

Task completion metrics

Task Completion Rate

Step Efficiency

Token Cost Per Task

Latency

Success Rate by Task Category

Quality and accuracy metrics

Accuracy

Completeness

Relevance

Coherence and Style

Combining Quality Dimensions

Human Evaluation as Gold Standard

Automated evaluation with LLM judges

How LLM-as-Judge Works

Rubric Design Matters

Reference-Based vs Reference-Free Judging

Pitfalls of LLM Judges

Calibrating Your Judge

Building evaluation pipelines

Evaluation Datasets

Three Evaluation Dimensions

Regression Detection

Key Benchmarks and Tools

Building Your First Eval Dataset

You cannot improve what you cannot measure. But measuring agent performance is fundamentally harder than measuring traditional software. A REST API returns 200 or 500: pass or fail. An agent that summarizes a document, writes code, or plans a trip produces open-ended output where "correct" has no single right answer. Evaluating an agent requires different tools and techniques than evaluating deterministic software: techniques designed for ambiguity, non-determinism, and subjective quality. This lesson covers why, and what to use instead.

Four challenges of agent evaluation: non-determinism, open-ended outputs, compound errors, and subjective quality

Non-Determinism

Run the same agent on the same input twice and you get different outputs. The LLM uses sampling (temperature, top-p) to generate tokens, so the reasoning path, tool selection, and final answer vary across runs. A traditional unit test asserts exact equality. Agent evaluation must assess quality across a distribution of possible outputs. Is this answer in the set of acceptable answers, even if it differs from the reference?
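The idea above can be sketched as a distribution-based check: run the agent several times and score the fraction of acceptable outputs instead of asserting exact equality. The agent and acceptance check here are hypothetical stand-ins.

```python
import random

def evaluate_over_runs(agent_fn, prompt, is_acceptable, n_runs=10):
    """Run a non-deterministic agent several times and score the
    distribution of outputs, rather than asserting one exact string."""
    results = [is_acceptable(agent_fn(prompt)) for _ in range(n_runs)]
    return sum(results) / n_runs  # fraction of acceptable runs

# Toy agent: samples one of several phrasings, like a temperature > 0 LLM.
def toy_agent(prompt):
    return random.choice(["Paris", "The capital is Paris", "It's Paris."])

# Membership in a set of acceptable answers, not equality with a reference.
acceptable = lambda out: "Paris" in out
rate = evaluate_over_runs(toy_agent, "Capital of France?", acceptable)
```

A real pipeline would replace `toy_agent` with the agent under test and report the acceptance rate (with its variance) rather than a single pass/fail bit.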

Open-Ended Outputs

Ask an agent to "research competitors for a SaaS startup" and there are hundreds of valid responses. The agent might focus on pricing, features, market share, or all three. It might produce a table, a narrative, or bullet points. Each format could be equally valuable to the user. Evaluation must handle this ambiguity, testing whether the output is useful, not whether it matches a single expected string.
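One common way to handle this ambiguity is coverage scoring: check whether the output mentions the facts a good answer should contain, regardless of format. A minimal sketch (the report text and fact list are illustrative):

```python
def coverage_score(output: str, required_facts: list[str]) -> float:
    """Score an open-ended answer by the fraction of required facts it
    mentions, instead of comparing it to one expected string."""
    text = output.lower()
    hits = sum(1 for fact in required_facts if fact.lower() in text)
    return hits / len(required_facts)

report = "Competitor A leads on pricing; Competitor B wins on features."
facts = ["pricing", "features", "market share"]
score = coverage_score(report, facts)  # 2 of 3 facts covered
```

Substring matching is the crudest version; production systems typically use embedding similarity or an LLM judge for the same "is the key content present?" question.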

Compound Errors

Agents execute multi-step plans. A coding agent reads a file, identifies the bug, writes a fix, and runs tests. If step 1 misidentifies the bug, every subsequent step is wasted. The agent confidently fixes the wrong thing and reports success. Traditional metrics like "did the tests pass?" miss this entirely. You need to evaluate intermediate reasoning, not just the final output.
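Trajectory-level evaluation can be sketched as checking each step in order, so a late "tests passed" cannot mask an early misdiagnosis. The step names and pass/fail flags below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    passed: bool

def first_failure(trajectory: list[Step]):
    """Return the name of the first failed step, or None if all passed.
    Checking in order surfaces the root cause, not the final symptom."""
    for step in trajectory:
        if not step.passed:
            return step.name
    return None

run = [
    Step("read_file", True),
    Step("identify_bug", False),  # misidentified the bug
    Step("write_fix", True),      # "succeeds", but fixes the wrong thing
    Step("run_tests", True),      # final check passes; root cause hidden
]
root_cause = first_failure(run)  # "identify_bug"
```

A final-output metric would score this run as a success; the per-step view pinpoints where the compound error began.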

Subjective Quality

Is this email draft "good"? Is this code refactoring "clean"? Quality judgments depend on context, audience, and preferences. Two human reviewers often disagree on quality scores by 10-20%. Any automated evaluation must acknowledge this inherent subjectivity and measure consistency rather than chasing a single "correct" score.
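Measuring consistency rather than correctness can be as simple as computing an agreement rate between two reviewers within a tolerance. The ratings below are illustrative:

```python
def agreement_rate(scores_a, scores_b, tolerance=1):
    """Fraction of items on which two reviewers agree within a tolerance.
    Measures consistency instead of assuming one 'correct' score exists."""
    agree = sum(1 for a, b in zip(scores_a, scores_b)
                if abs(a - b) <= tolerance)
    return agree / len(scores_a)

reviewer_1 = [4, 3, 5, 2, 4]  # 1-5 quality ratings (hypothetical)
reviewer_2 = [5, 3, 4, 2, 2]
rate = agreement_rate(reviewer_1, reviewer_2)  # 4 of 5 within one point
```

More rigorous setups use chance-corrected statistics such as Cohen's kappa, but even this simple rate gives an upper bound on what any automated evaluator can be expected to achieve.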

The Evaluation Spectrum

These challenges create a spectrum of evaluation difficulty. At the easy end, a coding agent's output can be verified by running tests: either the code works or it does not. In the middle, a research agent's output can be checked against known facts, but completeness and relevance require judgment. At the hard end, a creative writing agent's output is almost entirely subjective. Effective agent evaluation requires matching the evaluation method to the task's position on this spectrum: automated checks for the objective dimensions, LLM judges for the semi-objective ones, and human review for the subjective ones.
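The matching rule above can be sketched as a simple router from task category to evaluation method; the category names are illustrative, not a standard taxonomy:

```python
def pick_evaluator(task_type: str) -> str:
    """Route a task to an evaluation method based on where it sits
    on the objective-to-subjective spectrum (hypothetical categories)."""
    routing = {
        "code": "automated_tests",  # objective: run the test suite
        "research": "llm_judge",    # semi-objective: fact/relevance rubric
        "creative": "human_review", # subjective: human raters
    }
    # When in doubt, fall back to the most reliable (and most expensive) option.
    return routing.get(task_type, "human_review")

method = pick_evaluator("research")  # "llm_judge"
```

In practice a single task usually mixes dimensions, so the router would return a set of evaluators rather than one, but the principle is the same: let the task's objectivity decide the method.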

Key Insight

Agent evaluation is closer to grading an essay than checking a test result. There is no single right answer, quality is multidimensional (accuracy, completeness, style, efficiency), and reasonable evaluators can disagree. This is why layered evaluation (automated metrics for the objective dimensions, LLM judges for the subjective ones, humans for calibration) is the standard approach.