Cost and Latency Optimization

Topics Covered

Token Economics

Input vs Output Token Costs

Why Agents Are Expensive

Context Growth Problem

Model Routing

How Model Routing Works

Plan-and-Execute Pattern

Router Implementation Strategies

When Routing Fails

Caching Strategies

Provider-Level Prompt Caching

Tool Result Caching

Semantic Caching

Context Minimization

Streaming and Perceived Latency

How Streaming Works

When Streaming Helps Agents

Streaming With Tool Calls

Batching and Parallelism

Parallel Tool Calls

Request Batching

Async Processing for Background Tasks

[Figure: agent context growing across iterations, with an accelerating cost curve]

Every time an agent runs, it spends money. Not CPU-hours or storage bytes: tokens. An LLM API call charges for both input tokens (the prompt you send) and output tokens (the response the model generates). This is fundamentally different from traditional software, where compute cost scales with the number of requests but not with the content of each request. With agents, a single run can cost $0.01 or $5.00 depending on how many loop iterations it takes and how much context it carries.

Input vs Output Token Costs

Output tokens are typically 3-5x more expensive than input tokens across major providers. This is because generating each output token requires a full forward pass through the model, while input tokens are processed in parallel in a single prefill step. For a typical agent call with a 4,000-token system prompt and a 500-token response, the input cost dominates in absolute terms, but the output tokens cost more per token.

 
Example pricing (illustrative):
Input:  $3.00 per million tokens
Output: $15.00 per million tokens

Single agent call:
  Input:  4,000 tokens × $3.00/M  = $0.012
  Output: 500 tokens × $15.00/M   = $0.0075
  Total:  $0.0195 per call

Why Agents Are Expensive

A simple chain (prompt in, response out) makes one API call. An agent running a ReAct loop might make 5-20 calls to complete a task. Each call carries the full system prompt, the conversation history so far, and all tool definitions. As the conversation grows, so does the input token count. By iteration 10, the agent might be sending 15,000 input tokens per call, and 80% of that is the same context it sent in iteration 1.

Scenario                 Calls   Avg Tokens   Total Tokens
Simple chain             1       4,500        4,500
Agent (10 iterations)    10      8,000        80,000
Agent (20 iterations)    20      12,000       240,000

That 50x difference between a simple chain and a 20-iteration agent run is real. A task that costs $0.02 as a chain costs $1.00 as a complex agent. At 10,000 tasks per day, that is the difference between $200/day and $10,000/day.

Key Insight

The biggest cost driver in agent systems is not the model you choose; it is the number of loop iterations. A 10-iteration agent run on a cheap model can cost more than a single call to the most expensive model. The first optimization is always reducing iteration count: better prompts, fewer retries, smarter stopping conditions.
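To make this concrete, here is a quick sketch using made-up rates: a hypothetical cheap model at $0.50/$2.00 per million input/output tokens, and the illustrative $3.00/$15.00 pricing from earlier as the expensive model. The context-growth numbers (4,000 base tokens, 1,200 appended per iteration, 500 output tokens per call) are likewise illustrative.

```python
def call_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Cost of one API call, with rates in dollars per million tokens."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# 10-iteration agent run on the cheap model: context grows each call.
cheap_agent_run = sum(
    call_cost(4_000 + i * 1_200, 500, in_rate=0.50, out_rate=2.00)
    for i in range(10)
)

# One call to the expensive model at the illustrative rates from earlier.
expensive_single = call_cost(4_000, 500, in_rate=3.00, out_rate=15.00)

print(f"Cheap model, 10-iteration run: ${cheap_agent_run:.4f}")   # ~$0.057
print(f"Expensive model, single call:  ${expensive_single:.4f}")  # $0.0195
```

Even at a 6x cheaper rate per token, the looping agent ends up costing roughly three times the single expensive call.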

Context Growth Problem

Each iteration appends the previous tool call and its result to the conversation. By iteration 10, the agent carries the entire history of what it has done. This creates a compounding cost: iteration 1 sends 4,000 tokens, iteration 2 sends 5,200, iteration 3 sends 6,400, and so on. The total cost is not N times the per-call cost; it is the sum of a growing series.
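Under the growth pattern described above (4,000 tokens on the first call, 1,200 more on each subsequent one), total input volume is an arithmetic series, which a short sketch can sum:

```python
def total_input_tokens(iterations, base=4_000, growth=1_200):
    """Total input tokens across a run whose context grows linearly:
    base + (base + growth) + (base + 2*growth) + ..."""
    return iterations * base + growth * iterations * (iterations - 1) // 2

flat_estimate = 10 * 4_000          # what "N x the first call" would predict
actual = total_input_tokens(10)     # the growing series

print(flat_estimate, actual)        # 40000 94000
```

Over 10 iterations, the run sends 94,000 input tokens, more than double the 40,000 a flat context would cost.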

The practical fix is context management. Summarize older conversation turns instead of carrying them verbatim. Truncate large tool results (a 10,000-line file read does not need to stay in context for the next 15 iterations). Drop tool results that are no longer relevant. The goal is to keep the context window at a stable size across iterations rather than letting it grow linearly.
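A minimal sketch of two of those tactics, truncating oversized tool results and dropping old turns before each call. The caps and message shape here are assumptions chosen to keep the example self-contained; a production version would summarize dropped turns with a cheap model rather than discard them, and tune the limits to the workload.

```python
MAX_TOOL_RESULT_CHARS = 2_000   # assumed cap on any single tool result
MAX_HISTORY_MESSAGES = 12       # assumed number of recent turns kept verbatim

def truncate_tool_result(text: str, limit: int = MAX_TOOL_RESULT_CHARS) -> str:
    """Keep the head of an oversized tool result and mark the elision."""
    if len(text) <= limit:
        return text
    omitted = len(text) - limit
    return text[:limit] + f"\n[... {omitted} characters truncated ...]"

def trim_history(messages: list[dict], keep: int = MAX_HISTORY_MESSAGES) -> list[dict]:
    """Keep the system prompt (assumed to be messages[0]) plus only the
    most recent turns, so per-iteration input stays roughly flat."""
    system, rest = messages[0], messages[1:]
    return [system] + rest[-keep:]
```

Applied before every API call, these keep the context window at a stable size instead of letting it grow with each tool result.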