Reasoning Patterns
When you ask an LLM "What is 17 times 24?" it might answer incorrectly. When you ask it "What is 17 times 24? Think step by step," it almost always gets it right. This is not a prompt trick. It fundamentally changes the computation the model performs.
The key insight is that output tokens serve as working memory. An LLM generates one token at a time, and each new token can attend to all previous tokens. Without chain-of-thought (CoT), the model must compute the entire answer in a single forward pass, the equivalent of solving a multi-step problem in your head without writing anything down. With CoT, each intermediate step becomes part of the context, and the model can reference its own reasoning when computing the next step.

This matters for agent design because agents face multi-step problems constantly. "Find the bug in this code" requires reading the code, identifying suspicious patterns, tracing data flow, and formulating a hypothesis. Without CoT, the model tries to jump directly from code to diagnosis. With CoT, it can work through the problem systematically, just like a human developer would.
Chain-of-thought reasoning is not the model "explaining its answer." It is the model performing additional computation. Each output token becomes part of the input for generating the next token, effectively giving the model a scratchpad. More reasoning tokens mean more computation, which is why CoT improves accuracy on hard problems but wastes tokens on easy ones.
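The scratchpad effect can be made concrete with the 17 times 24 example. This is a toy illustration (plain Python, not LLM code): each intermediate result is written down before being used in the next step, which is exactly what CoT tokens do for the model.

```python
# Decompose 17 x 24 the way a CoT chain would: each partial
# result is recorded and then referenced by the next step.
steps = []

partial_a = 17 * 20              # step 1: 17 x 20
steps.append(f"17 x 20 = {partial_a}")

partial_b = 17 * 4               # step 2: 17 x 4
steps.append(f"17 x 4 = {partial_b}")

answer = partial_a + partial_b   # step 3: combine the recorded partials
steps.append(f"{partial_a} + {partial_b} = {answer}")

print(answer)  # 408
```

Without the recorded partials, the final step would have to recompute everything at once, which is the single-forward-pass situation described above.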
Zero-Shot vs Few-Shot CoT
Zero-shot CoT uses a simple instruction like "think step by step" or "reason through this carefully." The model generates its own reasoning structure. This works well for most tasks and is the default approach in agent systems because you cannot predict what problems the agent will encounter.
Few-shot CoT provides examples of the reasoning format you want. You show the model a worked example with the exact step structure, and it follows the same pattern. This is more reliable for specific, repeated task types, like extracting structured data from documents or classifying support tickets, where you know the reasoning steps in advance.
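The two prompting styles can be sketched as simple prompt builders. The worked example and the exact wording below are illustrative assumptions, not a standard format:

```python
COT_SUFFIX = "Think step by step."

# Hypothetical worked example for few-shot CoT. The step structure
# is the point: the model is expected to imitate it.
FEW_SHOT_EXAMPLE = (
    "Q: A ticket mentions a refund and a broken login.\n"
    "Step 1: Identify each issue (billing, authentication).\n"
    "Step 2: Pick the issue the user leads with (billing).\n"
    "A: billing\n"
)

def zero_shot_cot(question: str) -> str:
    # Zero-shot: the model invents its own reasoning structure.
    return f"{question}\n{COT_SUFFIX}"

def few_shot_cot(question: str) -> str:
    # Few-shot: show the exact step structure to be repeated.
    return f"{FEW_SHOT_EXAMPLE}\nQ: {question}\n"

print(zero_shot_cot("Classify this ticket: 'I was double charged.'"))
```

Zero-shot is the default for agents because the task mix is unpredictable; few-shot pays off when the same task type recurs and the step structure is known in advance.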
The Scratchpad Pattern
Some agent implementations explicitly separate reasoning from output. The model writes its thinking in a designated scratchpad section, then produces the final answer separately. This has two benefits. First, you can parse the final answer without stripping reasoning text. Second, you can log the reasoning for debugging without exposing it to downstream systems. Claude's extended thinking and OpenAI's o-series models formalize this pattern at the model level. The reasoning happens in a dedicated thinking block that is separate from the response.
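A minimal sketch of the parsing side of this pattern, assuming the model was instructed to wrap its reasoning in `<scratchpad>` tags (the tag name is an assumption, not a fixed convention):

```python
import re

def split_scratchpad(raw: str) -> tuple[str, str]:
    # Pull the reasoning out of the scratchpad section, if present.
    match = re.search(r"<scratchpad>(.*?)</scratchpad>", raw, re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    # The final answer is whatever remains once reasoning is stripped.
    final = re.sub(r"<scratchpad>.*?</scratchpad>", "", raw,
                   flags=re.DOTALL).strip()
    return reasoning, final

reasoning, final = split_scratchpad(
    "<scratchpad>340 + 68 = 408</scratchpad>The answer is 408."
)
print(final)  # The answer is 408.
```

The reasoning string can go to logs for debugging while only the final answer is passed downstream, which is the separation the pattern is after.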
Chain-of-thought improves accuracy, but a single reasoning chain can still go wrong. The model might make an arithmetic error in step 3 that propagates through the rest of the reasoning. Self-consistency addresses this by generating multiple independent reasoning chains for the same problem and taking the majority answer.

The intuition is simple. If you ask three people to independently solve the same math problem and two of them get 42 while one gets 37, you trust 42. Self-consistency applies the same logic to LLM reasoning chains. Each chain reasons differently, with different intermediate steps, phrasing, and order of operations, but correct reasoning tends to converge on the same answer while errors are random and diverge.
How Self-Consistency Works
Generate N reasoning chains (typically 3-5) using temperature > 0 to ensure diversity. Extract the final answer from each chain. Take the majority vote. If all chains agree, confidence is high. If they split 2-1 or 3-2, confidence is lower and you might escalate to a human or apply additional verification.
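The vote-and-escalate step can be sketched in a few lines. The answers would come from N sampled chains; the margin threshold is an illustrative choice:

```python
from collections import Counter

def self_consistency(answers: list[str], escalate_margin: int = 1):
    # Majority-vote over the final answers extracted from each chain.
    # Returns (winner, confident); confident is False when the vote
    # is close, signalling escalation or extra verification.
    counts = Counter(answers)
    (winner, top), *rest = counts.most_common()
    runner_up = rest[0][1] if rest else 0
    confident = (top - runner_up) > escalate_margin
    return winner, confident

# Chains split 2-1: right answer, but low confidence.
print(self_consistency(["42", "42", "37"]))  # ('42', False)
# Unanimous chains: high confidence.
print(self_consistency(["42", "42", "42"]))  # ('42', True)
```

With this margin, a 2-1 or 3-2 split is flagged for escalation while unanimous or lopsided votes pass, matching the confidence reading described above.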
The cost is linear. 3 chains cost 3x the tokens. This is why self-consistency is used selectively, not on every query. Reserve it for high-stakes decisions where the cost of an error exceeds the cost of additional tokens.
Verification Strategies
Self-consistency is one form of verification. Others include asking the model to check its own work ("Review your answer. Is it correct?"), using a second model to verify the first model's output, and running the output through deterministic validators (type checkers, unit tests, schema validators). In agent systems, the most effective verification is often executing the output. A code agent that runs its generated code and checks for errors is more reliable than one that reasons about correctness without testing.
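Execution-based verification can be sketched as a small wrapper. This is a bare-bones illustration; a production system would run untrusted generated code in a sandbox, not with a raw `exec`:

```python
def verify_by_execution(code: str, check: str) -> bool:
    # Run generated code, then run a deterministic check against it.
    # `check` is any snippet that raises (e.g. an assert) on failure.
    namespace: dict = {}
    try:
        exec(code, namespace)    # run the generated code
        exec(check, namespace)   # run the validator in the same namespace
        return True
    except Exception:
        return False

generated = "def add(a, b):\n    return a + b"
print(verify_by_execution(generated, "assert add(2, 3) == 5"))  # True
print(verify_by_execution(generated, "assert add(2, 3) == 6"))  # False
```

The check is the deterministic validator from the list above: it catches errors regardless of how plausible the model's reasoning about its own output was.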
Standard chain-of-thought lets the model think for a few dozen to a few hundred tokens. Extended thinking takes this to an extreme: reasoning chains of thousands or tens of thousands of tokens. Models like Claude's extended thinking mode and OpenAI's o-series are specifically trained to reason deeply, producing long internal monologues before answering.
The difference is not just length. Extended thinking models are trained with reinforcement learning on tasks where longer reasoning leads to better answers. They learn to explore multiple approaches, backtrack when a line of reasoning fails, and verify their own intermediate steps. Standard CoT prompting asks a general model to show its work. Extended thinking uses a model that has been specifically optimized to think deeply.
How Extended Thinking Works
When you enable extended thinking (Claude) or use an o-series model (OpenAI), the model generates a long reasoning trace in a dedicated thinking block. This trace is visible for debugging but separate from the final response. The model might reason for 5,000 tokens internally before producing a 200-token answer. The thinking block often contains self-correction: the model writes "wait, that's not right" and tries a different approach.
For agent systems, extended thinking is most valuable at decision points. When the agent must choose between multiple possible approaches to a complex task, extended thinking lets it reason through the trade-offs rather than picking the first plausible option. The cost is significant: those 5,000 thinking tokens are billed, but for high-stakes decisions, the improved accuracy is worth it.
When Extended Thinking Is Overkill
Extended thinking is the most expensive reasoning strategy. A simple tool routing decision ("should I search the web or query the database?") does not need 5,000 tokens of deliberation. The engineering challenge is knowing when to activate deep reasoning and when to keep it shallow. Most production agent systems use extended thinking selectively: for planning phases, complex debugging, and multi-constraint decisions, while using standard inference for routine operations.
Every reasoning token costs money and time. A standard Claude Sonnet call that generates 100 tokens takes about 1 second and costs a fraction of a cent. An extended thinking call that generates 10,000 reasoning tokens takes 10-30 seconds and costs significantly more. For an agent that makes 20 LLM calls per task, these costs compound quickly.

When designing agent systems, treat reasoning depth as a tunable parameter, not a fixed choice. Simple tool routing decisions should use the fastest, cheapest model. Complex planning and debugging decisions should use deeper reasoning. Most production agents use 2-3 reasoning tiers: fast (standard model, no CoT), medium (standard model with CoT), and deep (extended thinking or reasoning model). The router that selects the tier is itself a fast, cheap call.
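A tier router can be as simple as a keyword heuristic or a fast, cheap model call. The keyword lists below are illustrative assumptions, not a recommended taxonomy:

```python
# Hypothetical hints for routing a task to a reasoning tier.
DEEP_HINTS = ("plan", "debug", "architecture", "trade-off")
MEDIUM_HINTS = ("why", "explain", "compare", "analyze")

def pick_tier(task: str) -> str:
    text = task.lower()
    if any(hint in text for hint in DEEP_HINTS):
        return "deep"    # extended thinking / reasoning model
    if any(hint in text for hint in MEDIUM_HINTS):
        return "medium"  # standard model with CoT
    return "fast"        # standard model, no CoT

print(pick_tier("Look up the user's email"))           # fast
print(pick_tier("Explain why this test fails"))        # medium
print(pick_tier("Plan the migration to the new API"))  # deep
```

In practice the router is often itself a small classifier or a fast LLM call; the point is that it must cost far less than the tiers it selects between.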
The Cost Curve
The relationship between reasoning depth and accuracy is not linear. For easy tasks, adding reasoning tokens gives diminishing returns. The model is already correct without them. For medium tasks, CoT provides a significant accuracy boost at moderate cost. For hard tasks, extended thinking provides another jump in accuracy but at 10-100x the token cost. The optimal strategy depends on your task distribution. If 80% of tasks are easy, 15% medium, and 5% hard, applying deep reasoning to everything wastes 80% of your budget on tasks that do not benefit.
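The budget arithmetic under that 80/15/5 distribution is worth running. The per-tier token costs below are illustrative assumptions:

```python
# Task mix from the text and assumed tokens per call for each tier.
mix = {"easy": 0.80, "medium": 0.15, "hard": 0.05}
cost = {"fast": 100, "cot": 500, "deep": 10_000}

# Strategy A: deep reasoning on everything.
always_deep = sum(p * cost["deep"] for p in mix.values())

# Strategy B: match reasoning depth to task difficulty.
tiered = (mix["easy"] * cost["fast"]
          + mix["medium"] * cost["cot"]
          + mix["hard"] * cost["deep"])

print(always_deep)  # 10000.0 tokens per task on average
print(tiered)       # 655.0 tokens per task on average
```

Under these assumed numbers, tiering cuts the average spend by roughly 15x, and nearly all of the savings come from not applying deep reasoning to the 80% of tasks that never needed it.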
Latency Implications
Users notice latency. A 1-second response feels instant. A 5-second response feels slow. A 30-second response requires a progress indicator and user patience. For interactive agents (chatbots, coding assistants), latency directly affects user satisfaction. Streaming helps: the user sees tokens appearing in real-time, but the total time to a complete answer still matters. For background agents (data processing, automated testing), latency is less critical and deeper reasoning is more acceptable.
Token Budgeting
Production agent systems set token budgets per task. A support agent might have a budget of 5,000 tokens per ticket (across all LLM calls). If a single extended thinking call consumes 10,000 tokens, it blows the budget for the entire task. Token budgeting forces the team to be deliberate about where reasoning tokens are spent, rather than using deep reasoning by default.
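A minimal sketch of per-task budgeting, using the 5,000-token support-agent figure from the text (the class itself is illustrative):

```python
class TokenBudget:
    # Track token spend across all LLM calls for one task.
    def __init__(self, limit: int = 5_000):
        self.limit = limit
        self.spent = 0

    def charge(self, tokens: int) -> None:
        self.spent += tokens

    def can_afford(self, tokens: int) -> bool:
        # Check before an expensive call, e.g. extended thinking.
        return self.spent + tokens <= self.limit

budget = TokenBudget()
budget.charge(1_200)              # routing + first answer
print(budget.can_afford(10_000))  # False: deep call would blow the budget
print(budget.can_afford(2_000))   # True
```

The `can_afford` check is what forces the deliberate choice: an extended thinking call has to be justified against the remaining budget rather than taken by default.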
Not every problem benefits from more thinking. The skill of building effective agent systems is knowing when to reason deeply and when to act quickly. This is not a theoretical question. It directly determines your agent's cost, speed, and user experience.

A common mistake is assuming that more reasoning always means better results. For simple, well-defined tasks, chain-of-thought can actually hurt performance. The model over-thinks the problem, introduces unnecessary considerations, and sometimes reasons itself into the wrong answer. If a straightforward lookup achieves 98% accuracy and CoT achieves 97% accuracy while taking 3x longer, the reasoning is actively harmful.
When Reasoning Helps
Multi-step problems where the answer depends on combining several pieces of information. Math word problems, multi-hop question answering, and code debugging all benefit from step-by-step reasoning.
Ambiguous instructions where the agent must interpret what the user actually wants. "Clean up this code" could mean reformatting, refactoring, adding error handling, or removing dead code. Reasoning helps the agent consider the context and choose the right interpretation.
Novel situations that do not match common patterns in the training data. When the agent encounters something it has not seen before, reasoning through first principles is more reliable than pattern matching.
Complex tool selection when multiple tools could satisfy the request and the agent must evaluate which one is appropriate based on the specific context.
When Reasoning Hurts
Simple lookups: "What is the user's email?" Just call the tool.
Well-defined classifications: routing a message to one of 5 departments. The categories are clear and the model does not need to deliberate.
Latency-critical paths: autocomplete suggestions, real-time transcription, and other cases where the user expects sub-second responses. Any reasoning overhead is directly felt.
Repetitive tasks with identical structure: processing 1,000 invoices with the same format. Reason about the first one to establish the pattern, then apply mechanically to the rest.