Structured Outputs and Reliable Extraction
LLMs produce text. Agent orchestration code consumes data structures. This gap is the source of a huge class of agent failures: the model returns a response that looks right to a human but cannot be parsed by the code that needs to act on it. A missing closing brace in a JSON response, a field name that does not match the expected schema, a string where the code expects a number: any of these breaks the agent loop.

The Parsing Problem
Consider an agent that needs to decide which tool to call. The orchestration code expects a structured response like:
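For instance, a parseable tool-call decision (tool name and arguments illustrative):

```json
{"tool": "search_web", "arguments": {"query": "weather in Paris tomorrow"}}
```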
But the LLM might return:
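A typical unparseable variant looks something like:

```
Sure! I'll use the search_web tool to look this up.
Arguments: query = "weather in Paris tomorrow"
```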
The second response contains the same information but in a format that json.loads() cannot parse. The agent loop crashes with a JSONDecodeError, and the user sees an error instead of a result. The LLM did the right thing (it chose the right tool with the right arguments) but expressed it in a way the code cannot understand.
This is why structured outputs are not a convenience. They are a reliability requirement. Every point in the agent loop where code needs to parse LLM output is a potential failure point, and structured output mechanisms exist to eliminate those failures.
Where Structured Outputs Are Needed
In an agent system, structured outputs appear at several points:
- Tool calls: The model must specify the tool name and arguments in a parseable format. This is the most critical use case and is handled natively by most LLM providers via their tool calling API.
- Routing decisions: When an agent must choose between multiple strategies ("search the web" vs "query the database" vs "ask the user for clarification"), the decision must be extractable as a discrete choice, not a paragraph of reasoning.
- Data extraction: The agent reads a document and must extract specific fields (name, date, amount) into a structured format for downstream processing. A free-text summary is useless if the accounting system needs a JSON object with exact field names.
- Classification and labeling: The agent categorizes support tickets, labels sentiment, or assigns priority levels. The output must be one of a known set of values, not a creative rephrasing.
Tool calling APIs from LLM providers (Claude, GPT, Gemini) are themselves a structured output mechanism. When you define tools with schemas, the model returns tool calls as structured JSON that your code can parse reliably. The 'structured output problem' is most acute when you need structured data outside the tool calling framework: extracting data from documents, making routing decisions, or producing structured final responses.
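For example, a routing decision can be constrained to a discrete choice with an enum schema (route names illustrative):

```json
{
  "type": "object",
  "properties": {
    "route": {
      "type": "string",
      "enum": ["search_web", "query_database", "ask_user"]
    },
    "reason": { "type": "string" }
  },
  "required": ["route"]
}
```

The "reason" field gives the model room to explain itself without the explanation leaking into the decision your code has to parse.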
The Reliability Spectrum
Not all structured output approaches are equally reliable:
- Prompt-based ("Please respond in JSON format"): least reliable. The model often wraps JSON in markdown code fences, adds explanatory text before or after, or produces invalid JSON. Works for prototyping but breaks in production.
- JSON mode: moderately reliable. The provider constrains the model to output valid JSON, but does not enforce a specific schema. You get valid JSON but the fields and types might not match what you expect.
- Schema-enforced: most reliable. The provider constrains the model to output JSON that matches a specific schema (required fields, types, enums). If the model tries to produce non-conforming output, the provider's sampling logic steers it back to compliance.
Each step up the reliability spectrum reduces parsing failures but may add constraints on which models and providers you can use. The right choice depends on how critical the parsing reliability is for your application.
The Cost of Parsing Failures
A parsing failure does not just return an error. It wastes the tokens spent generating the response and forces a retry that doubles the cost and latency of that extraction. In a pipeline processing thousands of documents, a 5% parse failure rate means 5% of your LLM spend is wasted on responses that cannot be used.
Worse, silent parsing failures corrupt downstream data. A lenient parser that accepts {"vendor": "Acme"} when the schema expects {"vendor_name": "Acme"} produces a record with a missing vendor_name field. If the downstream system does not validate, this corrupt record propagates. An invoice with no vendor name gets processed, a report shows blank fields, or a database constraint violation crashes a batch job hours later.
The cost of moving from prompt-based ("please respond in JSON") to schema-enforced structured outputs is minimal: a few lines of code to define the schema. The cost of not doing it is unpredictable failures in production that are hard to reproduce and harder to fix after they have corrupted downstream data.
JSON mode and schema enforcement are provider-level mechanisms that constrain the LLM's output at the token generation level. Instead of hoping the model follows a prompt instruction, these mechanisms guarantee structural compliance by modifying the sampling process itself.

JSON Mode
JSON mode tells the provider: "only generate tokens that produce valid JSON." The model can still choose any JSON structure (any keys, any nesting, any values) but the output is guaranteed to be parseable by json.loads().
How it works internally: at each token generation step, the provider masks (sets probability to zero) any token that would break JSON syntax. If the model has generated {"name": "Alice", the next token must continue valid JSON; it cannot be a bare word, a markdown header, or explanatory text. This constraint is invisible to the model's reasoning but absolute in its enforcement.
JSON mode solves one problem completely (syntactic validity) but leaves another unsolved: you get valid JSON, but the keys, types, and structure are up to the model. It might return {"user_name": "Alice"} when you expected {"name": "Alice"}.
Schema Enforcement (Structured Outputs)
Schema enforcement goes further: you provide a JSON Schema that specifies exact field names, types, required fields, and allowed values. The provider constrains generation to produce output that validates against your schema.
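For example, a person-record schema (the enum values here are illustrative):

```json
{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "integer" },
    "role": { "type": "string", "enum": ["admin", "editor", "viewer"] }
  },
  "required": ["name", "age", "role"],
  "additionalProperties": false
}
```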
With this schema, the model must include "name", "age", and "role" with the correct types. The "role" field must be one of the three enum values. No extra fields are allowed. The provider's sampling logic enforces all of these constraints at the token level.
Provider Implementations
Different providers implement structured outputs differently:
Anthropic (Claude): Tool use with input schemas provides schema enforcement for tool call arguments. For general structured output, Claude supports a dedicated tool-based pattern where you define a "response" tool with the desired schema, and the model returns its answer as tool arguments.
OpenAI: response_format: { type: "json_schema", json_schema: {...} } provides full schema enforcement. JSON mode (without schema) is also available via response_format: { type: "json_object" }.
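As a sketch, the schema-enforced variant embeds the schema in the request body (names and fields illustrative):

```json
{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "person",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": { "name": { "type": "string" } },
        "required": ["name"],
        "additionalProperties": false
      }
    }
  }
}
```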
Google (Gemini): Supports schema enforcement via response_schema in the generation config. Defines the output structure using a subset of OpenAPI schema.
Limitations of Schema Enforcement
Schema enforcement guarantees format, not correctness. The model will produce a JSON object that matches your schema, but the values might be hallucinated, wrong, or nonsensical. A schema requiring "age": integer prevents the model from returning "age": "twenty-five", but it cannot prevent the model from returning "age": 250.
Schema enforcement guarantees structure, not truth. A model constrained to a "sentiment" enum of "positive", "negative", and "neutral" will always return one of those three values, but it might return "positive" for a clearly negative review if the text confuses it. Schema enforcement eliminates parsing failures but not reasoning failures. You still need validation logic that checks whether the values make sense.
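That validation logic is ordinary code. A minimal sketch of domain-specific sanity checks for an invoice record (field names and thresholds are illustrative, not from any particular schema):

```python
from datetime import date

def validate_invoice(record: dict) -> list[str]:
    """Domain-specific sanity checks that schema enforcement cannot perform.

    Returns a list of problems; an empty list means the record passed.
    The field names ("amount", "due_date") and thresholds are assumptions.
    """
    problems = []
    amount = record.get("amount", 0)
    if amount <= 0:
        problems.append("amount must be positive")
    elif amount > 1_000_000:
        problems.append("amount is implausibly large; flag for review")
    due = record.get("due_date")
    if due and date.fromisoformat(due) < date(2000, 1, 1):
        problems.append("due_date is implausibly old")
    return problems
```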
Recursive or deeply nested schemas can also cause issues. Some providers limit schema depth or complexity. Very long enum lists (100+ values) may degrade model performance because the constraint space is large. Keep schemas as simple as your use case allows.
Schema Design Best Practices
A well-designed schema balances precision with flexibility:
- Flat over nested: Prefer a flat object with clear field names over deeply nested structures. The model produces flat JSON more reliably than three-level-deep nesting.
- Nullable fields for optional data: If a field might not be extractable from the source, make it nullable rather than required. This prevents the model from hallucinating values for fields it genuinely cannot determine.
- Enums for known categories: When the output must be one of a fixed set of values, use enum constraints. This eliminates spelling variations, casing differences, and creative rephrasing.
- String over number for ambiguous data: If the source might say "approximately 500" or "500-600," using a string field lets the model preserve the original phrasing. An integer field forces it to pick one number, losing nuance.
- Add description fields to the schema: Many providers pass field descriptions to the model during generation. A description like "The total invoice amount in USD, as a decimal number" guides the model more effectively than the field name "amount" alone.
- Version your schemas: When extraction requirements change, version the schema rather than modifying it in place. This lets you compare extraction quality across schema versions and roll back if a new schema causes regressions.
- Test with adversarial inputs: Before deploying a schema, test it with inputs that are ambiguous, missing expected data, or contain unexpected formats. These edge cases reveal schema design flaws before they cause production failures. Run a batch of 50-100 diverse inputs through the schema and measure field-level accuracy before deploying.
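The practices above can be combined in a single schema. A sketch of an invoice-extraction schema with illustrative field names: flat structure, a nullable field, an enum, and descriptions on every field:

```json
{
  "type": "object",
  "properties": {
    "vendor_name": {
      "type": "string",
      "description": "The vendor's legal name as printed on the invoice"
    },
    "amount": {
      "type": "string",
      "description": "The total invoice amount in USD, preserving the original phrasing (e.g. 'approximately 500')"
    },
    "due_date": {
      "type": ["string", "null"],
      "description": "ISO 8601 due date, or null if the invoice does not state one"
    },
    "priority": {
      "type": "string",
      "enum": ["low", "medium", "high"]
    }
  },
  "required": ["vendor_name", "amount", "due_date", "priority"]
}
```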
Even with JSON mode or schema enforcement, production agents need a parsing and validation layer. Schema enforcement guarantees structure but not correctness. JSON mode guarantees syntax but not structure. And many agents work with models or providers that do not support either feature, requiring robust parsing of free-text output.

The Validation-Retry Loop
The most effective pattern for reliable structured extraction is: generate, validate, retry on failure.
The key insight: feeding the error message back to the model is remarkably effective. LLMs are good at correcting their own mistakes when told specifically what went wrong. "Missing required field 'due_date'" gives the model enough information to fix the output in one retry. Vague feedback like "invalid output" does not.
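A minimal sketch of the loop, assuming a `call_llm` callable that stands in for your provider client and a `validate` callable that raises ValueError with a specific message:

```python
import json

def extract_with_retry(call_llm, prompt, validate, max_retries=3):
    """Generate, validate, and retry with error feedback.

    call_llm: hypothetical callable taking a prompt string and returning
    the model's raw text (a stand-in for a real provider client).
    validate: callable that raises ValueError with a specific message
    when the parsed output violates the schema.
    """
    current_prompt = prompt
    for _ in range(max_retries):
        raw = call_llm(current_prompt)
        try:
            data = json.loads(raw)
            validate(data)  # raises ValueError on schema violations
            return data
        except (json.JSONDecodeError, ValueError) as err:
            # Feed the specific error back: "Missing required field 'due_date'"
            # is far more actionable for the model than "invalid output".
            current_prompt = (
                f"{prompt}\n\nYour previous response failed validation: "
                f"{err}\nReturn only corrected JSON."
            )
    raise RuntimeError(f"Extraction failed after {max_retries} attempts")
```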
Defensive Parsing
Before validation, you need to extract JSON from a response that might contain extra text. Common patterns:
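A sketch of a layered parser, assuming the three fallback patterns described below (direct parse, code fence, brace extraction):

```python
import json
import re

def parse_llm_json(raw: str):
    """Extract a JSON object from LLM output that may contain extra text."""
    # Layer 1: the response is already raw JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Layer 2: JSON wrapped in a markdown code fence.
    fence = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    if fence:
        try:
            return json.loads(fence.group(1))
        except json.JSONDecodeError:
            pass
    # Layer 3: the first '{' through the last '}' in the response.
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            pass
    raise ValueError("No parseable JSON found in response")
```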
This layered approach handles the most common LLM output patterns: raw JSON, JSON in code fences, and JSON surrounded by explanatory text. Each layer is a fallback for when the previous one fails.
XML and Markdown as Alternatives
JSON is not always the best structured format for LLM output. XML and markdown have advantages in certain scenarios:
XML: More verbose but the model rarely produces malformed XML because opening and closing tags provide clear structure. XML is particularly good for nested content extraction where the model needs to preserve hierarchy. Claude natively supports XML-style tags in prompts and responses.
Markdown: For semi-structured output (a report with sections, a comparison with bullet points), markdown is more natural for the model to produce than JSON. Parse it with a markdown parser and extract sections by heading. Less rigid than JSON but sufficient when the structure is simple.
The trade-off: JSON is best for machine consumption, XML for hierarchical content, and markdown for human-readable semi-structured output. Use whichever format the model produces most reliably for your specific task.
When to Skip Structured Output Entirely
Not every agent output needs structured extraction. If the agent's final response is shown directly to a human (a chatbot answer, a summary, a recommendation), free text is appropriate. Forcing the agent to produce JSON when the output is displayed as text adds complexity without benefit.
Structured output is essential at machine-to-machine boundaries, where your code parses the LLM's response to make a decision, call a tool, or write to a database. At human-facing boundaries, let the model produce natural language. The common mistake is using structured output everywhere (including user-facing responses) or nowhere (including machine-facing interfaces). Match the output format to the consumer: structured output where code is the consumer (tool calls, data extraction, routing decisions), natural language everywhere else.
When building a validation-retry loop, always include the specific validation error in the retry prompt. 'Missing required field: due_date' is far more effective than 'Your output was invalid.' LLMs excel at targeted corrections when told exactly what to fix. Most models correct schema violations in a single retry when given specific error messages.
Even with schema enforcement and validation-retry loops, agents encounter malformed outputs in production. Handling these failures gracefully (without crashing, losing progress, or producing corrupt data) is what separates a prototype from a production system.

Types of Malformed Output
Syntactic failures: Invalid JSON (missing quotes, trailing commas, unescaped characters). These are caught by json.loads() and are the easiest to handle. JSON mode eliminates them entirely.
Structural failures: Valid JSON but wrong structure. Missing required fields, wrong types ("42" as string instead of integer 42), extra fields that confuse downstream consumers. Schema enforcement eliminates most of these, but not all: nullable fields that should have values, or arrays that should have at least one element.
Semantic failures: Correct structure, wrong content. The model returns {"sentiment": "positive"} for a clearly negative review, or {"amount": 0.0} when the invoice clearly shows $1,500. These are the hardest to detect because the output passes all structural validation. Detecting semantic failures requires domain-specific validation rules or human review.
Partial outputs: The model produces a truncated response due to hitting the max token limit. The JSON is cut off mid-string or mid-object. Common with large extraction tasks where the output exceeds the model's maximum output token count.
Recovery Strategies
Retry with error feedback: The primary strategy for syntactic and structural failures. Feed the specific error back to the model and ask it to correct. Typically resolves the issue in 1-2 retries. Set a max retry count (3 is standard) to prevent infinite retry loops.
Fallback to a simpler format: If JSON extraction fails repeatedly, ask the model to output in a simpler format: a bulleted list, key-value pairs separated by colons, or XML tags. Simpler formats have fewer structural requirements and fewer points of failure. Parse the simpler format and convert to your target structure in code.
Chunked extraction for partial outputs: If the model's response is truncated because the output is too large, split the task. Instead of extracting all 50 invoice line items in one call, extract items 1-10, then 11-20, and so on. Merge the chunks in code. This ensures each extraction fits within the output token limit.
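The chunk-and-merge step can be sketched as follows, assuming a hypothetical `extract_page` callable that performs one LLM extraction call over a slice of items:

```python
def extract_in_chunks(extract_page, items, chunk_size=10):
    """Split a large extraction into chunks that fit the output token limit.

    extract_page: hypothetical callable that takes a list of item
    references and returns a list of extracted records (stands in for
    one LLM call per chunk).
    """
    results = []
    for i in range(0, len(items), chunk_size):
        chunk = items[i:i + chunk_size]
        results.extend(extract_page(chunk))  # merge the chunks in code
    return results
```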
Default values for missing fields: When a field is missing and retries have failed, use a sensible default rather than crashing. Mark the default as unverified so downstream consumers know it was not extracted from the source. This is better than blocking the entire pipeline for one missing field.
Escalation to human review: For semantic failures that automated validation cannot catch, route the output to a human reviewer. Flag low-confidence extractions (where the model itself expressed uncertainty or where values seem implausible) for review. This creates a human-in-the-loop safety net without requiring humans to review every extraction.
Output Grounding
Output grounding asks the model to cite its sources: to include the exact text it extracted a value from. This serves two purposes: it reduces hallucination (the model is more accurate when it knows it must justify its answer with a direct quote), and it makes errors detectable (if the source_quote does not contain the extracted value, the extraction is likely wrong).
A grounded extraction schema looks like:
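Sketched as a schema fragment, with an extracted value paired with the quote it came from (field names illustrative):

```json
{
  "type": "object",
  "properties": {
    "amount": { "type": "number" },
    "amount_source": {
      "type": "string",
      "description": "The exact text from the document that the amount was read from"
    }
  },
  "required": ["amount", "amount_source"]
}
```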
If the "amount" is 1500.00 but the "amount_source" says "Subtotal: $1,200.00", you know the extraction is wrong. This pattern turns every extraction into a verifiable claim with an audit trail.
Monitoring Extraction Quality
Track these metrics in production:
- Parse success rate: percentage of LLM responses that parse successfully on the first attempt (target: 95%+ with JSON mode, 99%+ with schema enforcement)
- Retry rate: how often the validation-retry loop needs more than one attempt (high retry rate means the prompt or schema needs improvement)
- Field-level accuracy: for each extracted field, what percentage of values are correct? This requires ground truth data (human-verified extractions) and periodic sampling
- Truncation rate: how often the model's response is cut off by the token limit (high rate means the extraction task is too large for a single call)
These metrics tell you where to invest: if parse success rate is low, improve the structured output mechanism. If retry rate is high, improve the prompt or schema. If field-level accuracy is low, the model needs better instructions or the task needs domain-specific validation.
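As a minimal in-process sketch of tracking these counters (a real system would export them to a metrics backend; names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ExtractionMetrics:
    """Minimal counters for parse success, retries, and truncation."""
    attempts: int = 0
    first_try_parses: int = 0
    retries: int = 0
    truncations: int = 0

    def record(self, parsed_first_try: bool, retry_count: int, truncated: bool):
        self.attempts += 1
        if parsed_first_try:
            self.first_try_parses += 1
        self.retries += retry_count
        if truncated:
            self.truncations += 1

    @property
    def parse_success_rate(self) -> float:
        return self.first_try_parses / self.attempts if self.attempts else 0.0
```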
End-to-End Pipeline Design
A production extraction pipeline combines all the strategies in this section into a layered defense:
- First attempt: Use schema enforcement if available. This eliminates syntactic and structural failures at the source.
- Defensive parsing: If schema enforcement is not available, use the layered JSON parser (direct parse, code fence extraction, brace extraction) to handle common output formats.
- Validation: Check the parsed output against the schema. Verify required fields, correct types, enum values, and any domain-specific rules (amounts must be positive, dates must be in the future).
- Retry with feedback: If validation fails, feed the specific error back to the model and retry. Most failures resolve in 1-2 retries.
- Fallback: If retries are exhausted, try a simpler extraction format (XML tags, key-value pairs) or split the task into smaller chunks.
- Escalation: If all automated strategies fail, route to human review. Log the failure for analysis. Recurring failures on specific document types indicate a prompt or schema improvement is needed.
- Monitoring: Track success rates, retry rates, and field accuracy. Alert when metrics drop below thresholds. Use the data to continuously improve prompts, schemas, and validation rules.
Each layer catches failures that the previous layer missed. The result is a pipeline where 99%+ of documents are processed automatically, the remaining 1% are caught by fallbacks or escalated to humans, and zero corrupt records reach downstream systems.