Introduction to Agentic AI
LLM Foundations
Reasoning and Planning
Memory and Knowledge
Agent Architectures
Safety and Reliability
Production Engineering
Real-World Agent Patterns
What Makes an AI Agent
There are three archetypes of LLM-powered systems, and confusing them leads to bad architecture decisions. Each one gives a different answer to the question: who decides what happens next?
Understanding these archetypes is not academic. Every design decision (what infrastructure you need, what guardrails to build, how to test, how to estimate cost) depends on which archetype your system falls into. Building agent infrastructure for a chatbot wastes engineering time. Building chatbot infrastructure for an agent creates a system without the safety guardrails it needs.

Chatbots
A chatbot is the simplest LLM application. The user sends a message, the LLM generates a response, and the conversation continues. The user drives every step. The LLM has no tools, no ability to take actions in the world, and no autonomy. It is text in, text out.
A customer support chatbot that answers questions from a knowledge base is the canonical example. The user asks "what is your return policy?" and the LLM generates a response from its training data or a retrieval-augmented context. Nothing else happens until the user speaks again. The LLM does not go look up the user's order, does not check if they have a pending return, and does not proactively suggest next steps. It waits.
Chatbots are surprisingly effective for many use cases. FAQ answering, creative writing, brainstorming, and conversational search all work well as chatbots. The mistake is confusing a chatbot's limitations (no tools, no autonomy) with inadequacy. A chatbot that answers 90% of support questions correctly is more valuable than an agent that takes 30 seconds to answer each question because it insists on looking up the user's account first.
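The archetype is simple enough to sketch in a few lines. The generate argument below stands in for whatever LLM provider call you use; the point is that there is no loop and no tools, and the user drives every turn:

```python
def chatbot_turn(history, user_message, generate):
    """One turn of a chatbot: append the user's message, get one reply.

    There are no tools and no autonomy -- nothing happens until the
    user sends the next message.
    """
    history.append({"role": "user", "content": user_message})
    reply = generate(history)  # text in, text out
    history.append({"role": "assistant", "content": reply})
    return reply
```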
Pipelines
A pipeline is a predetermined sequence of LLM calls designed by the developer. Each step feeds its output into the next step. For example, a document processing pipeline might: (1) summarize a legal contract, (2) extract key dates and obligations, (3) classify the contract type, and (4) generate a compliance checklist. The sequence never changes. The developer hardcoded every step. If step 2 fails, the pipeline fails. It does not reason about what went wrong or try an alternative approach.
The key trait of a pipeline is that the developer controls the flow. The LLM is a component within a fixed architecture, not a decision-maker. Even a pipeline with 50 steps and sophisticated prompts is not an agent if the developer predetermined every step.
Pipelines are the workhorse of production LLM applications. They are predictable, testable, and cheap to monitor. You know exactly how many LLM calls each input will trigger, so you can estimate cost precisely. You know exactly which step failed when something goes wrong, so debugging is straightforward. Many systems marketed as "AI agents" are actually pipelines, and that is fine. The label matters less than whether the architecture fits the problem.
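The contract-processing example above can be sketched as a fixed sequence. Everything here is illustrative, and call_llm stands in for a provider call; what matters is that the path is hardcoded:

```python
def process_contract(text, call_llm):
    """A fixed pipeline: four hardcoded steps, no branching, no loop."""
    summary   = call_llm(f"Summarize this contract:\n{text}")
    dates     = call_llm(f"Extract key dates and obligations:\n{summary}")
    kind      = call_llm(f"Classify the contract type:\n{summary}")
    checklist = call_llm(
        f"Generate a compliance checklist for a {kind} contract "
        f"with these obligations:\n{dates}"
    )
    return checklist  # exactly 4 LLM calls for every input, every time
```

Because the call count is fixed, cost per document is known in advance, and a failure points directly at one of four steps.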
Agents
An agent is an LLM in a loop that decides what to do next based on what it observes. It has tools: functions it can call to interact with the world (read files, search the web, query databases, execute code). At each step, the LLM examines the current state, reasons about what action to take, executes that action, observes the result, and decides whether to continue or stop.
Consider a coding assistant. The user says "fix the failing test in auth_service.py." The agent reads the test file, reads the source code, runs the test to see the error, edits the source code, runs the test again, sees it still fails, reads the error more carefully, makes a different fix, runs the test again, and finally confirms the test passes. No developer predetermined this sequence. The LLM decided each step based on what it observed.
Another example: a research agent asked to "find recent papers about transformer efficiency." The agent searches an academic database, reads the abstracts of the top results, notices most are about inference optimization, searches again with a refined query targeting training efficiency, finds a relevant survey paper, reads its references section, follows two promising citations, and finally synthesizes a summary. The path through these steps was determined by what the agent found at each step, not by a developer who anticipated this specific research question.
The Defining Distinction
The key distinction across these three archetypes is who decides the next step. In a chatbot, the user decides by sending the next message. In a pipeline, the developer decides by hardcoding the sequence. In an agent, the LLM itself decides based on observations.
This distinction matters because it determines the system's flexibility, cost, and risk profile. A chatbot cannot surprise you; it only responds to user input. A pipeline cannot adapt; it follows the same path every time. An agent can adapt to novel situations, but it can also make unexpected decisions, call tools in unintended ways, or loop indefinitely.

The Gray Areas
In practice, systems often sit between categories. A chatbot with a single tool (like a calculator) is not purely a chatbot and not purely an agent. A pipeline with one conditional branch based on LLM classification is not purely a pipeline and not purely a router. These gray areas are normal. The three archetypes are reference points, not rigid boxes.
When a system sits between categories, classify it by its dominant pattern. If the system primarily responds to user messages and occasionally calls a tool, it is a chatbot with tool augmentation. If the system runs a fixed sequence with one routing decision, it is a pipeline with a router. If the system loops with multiple tool calls and the LLM decides the sequence, it is an agent, even if some of the tools are simple.
Why the Labels Get Confused
In practice, these categories blur because marketing pushes everything toward "agent." A pipeline with a single routing step gets called an agent. A chatbot with retrieval-augmented generation gets called an agent. The way to cut through the noise is to ask: "Does the LLM decide what to do next based on what it observes?" If the answer is no (if the developer or the user determines the next step) it is not an agent, regardless of what the product page says.
This matters for engineering decisions. If you label a pipeline an "agent," you might add unnecessary complexity (a loop controller, stop conditions, error recovery) to a system that just runs three steps in order. If you label an agent a "pipeline," you might skip the guardrails (step limits, cost caps) that prevent runaway behavior. Getting the category right leads to the right architecture.
Common Misconceptions
"More LLM calls = agent." A pipeline that makes 10 LLM calls is still a pipeline if the developer predetermined each call. The number of LLM calls is irrelevant. What matters is whether the LLM decides the sequence.
"RAG makes it an agent." Retrieval-augmented generation (RAG) adds external knowledge to a chatbot or pipeline. The LLM receives retrieved documents as context, but it does not decide to search, evaluate results, and search again. RAG is a technique, not an architecture. A chatbot with RAG is still a chatbot.
"Agents are always better." Agents are more flexible but also more expensive, slower, and harder to test. A pipeline that works is better than an agent that "sometimes works differently." The right question is not "should we build an agent?" but "does this task require the LLM to make decisions based on intermediate observations?"
The defining feature of an agent is not intelligence or complexity. It is agency. The LLM decides what action to take next based on what it observes. A pipeline with 50 steps and a PhD-level prompt is not an agent if the developer hardcoded every step. A simple loop with 3 tools is an agent if the LLM chooses which tool to call and when to stop.
The core architecture of every agent is a loop. Whether the agent writes code, conducts research, handles customer support, or manages infrastructure, the underlying mechanism is the same: observe the current state, reason about what to do, act, and repeat. The loop is simple to describe but deceptively difficult to build well. Understanding it deeply is the foundation for everything else in this course.

Observe, Reason, Act, Repeat
Every iteration of the agent loop follows the same three-phase pattern.
Observe: The agent reads the current state. On the first iteration, this is the user's message. On subsequent iterations, it includes the results of previous tool calls, error messages, or other environmental feedback. The observation is everything the LLM needs to decide what to do next. Think of it as the agent opening its eyes and looking around. What does the world look like right now?
Reason: The LLM processes the observation and decides on the next action. Should it call a tool? Which tool? With what arguments? Or should it stop and return a final response to the user? This reasoning step is where the LLM's "intelligence" matters most. It must interpret results, detect errors, adjust its approach, and know when the task is complete. The quality of reasoning determines whether the agent takes 3 efficient steps or 15 confused ones.
Act: The agent executes the chosen action. If the LLM decided to call a tool, the system runs that tool and captures the result. If the LLM decided to respond to the user, the loop ends.
Then the cycle repeats. The tool result becomes a new observation, the LLM reasons again, and the process continues until the agent decides it is done or hits a safety limit.
The elegance of this pattern is that it applies universally. Every agent you will encounter (regardless of the domain, the LLM provider, or the framework) implements some version of this loop. The differences between agents lie in the tools they have, the prompts they use, and how they manage memory and stop conditions. The loop itself is always the same.
A Concrete Walkthrough
Imagine an agent with three tools (read_file, write_file, and run_tests) asked to "fix the bug in login.py that causes test_login_expired to fail."
Step 1. Observe: The agent sees the user's request. Reason: It needs to understand the failing test first. Act: Calls read_file("tests/test_login.py").
Step 2. Observe: The agent reads the test code and sees that test_login_expired checks whether expired tokens are rejected. Reason: It needs to see the implementation. Act: Calls read_file("login.py").
Step 3. Observe: The agent reads the login code and notices the token expiry check compares timestamps incorrectly. Reason: It knows how to fix the comparison. Act: Calls write_file("login.py", fixed_code).
Step 4. Observe: The file was written successfully. Reason: It should verify the fix. Act: Calls run_tests("tests/test_login.py").
Step 5. Observe: All tests pass. Reason: The task is complete. Act: Returns "Fixed the token expiry comparison in login.py. All tests pass."
Five steps, three different tools, each step chosen based on the previous result. If the tests had failed in step 4, the agent would have continued with a different fix. If the test file had imported from a different module, the agent would have read that module instead. The path adapts to what the agent finds.
Notice what did not happen: no developer wrote "if the test fails, re-read the error and try again." The LLM made that decision because it observed the test failure and reasoned that the fix was incorrect. This adaptive behavior is exactly what makes agents powerful for tasks with unpredictable execution paths, and exactly what makes them risky, because the LLM might reason incorrectly and take a counterproductive action instead.
The Loop in Code
Here is the agent loop reduced to its essential structure:
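(Python-flavored pseudocode; llm.generate, tool_result_message, and StepLimitExceeded stand in for provider- and framework-specific pieces.)

```python
def run_agent(task, tools, max_steps=20):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):                      # safety limit
        response = llm.generate(messages, tools)    # Reason
        if response.has_tool_calls:                 # the decision point
            for tool_call in response.tool_calls:   # Act (possibly several)
                result = tools[tool_call.name](**tool_call.args)
                messages.append(tool_result_message(tool_call, result))  # Observe
        else:
            return response.text                    # the LLM says it is done
    raise StepLimitExceeded(messages)
```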
This pseudocode captures the entire pattern. The LLM receives the conversation history (including all previous tool results), decides whether to call a tool or respond, and the loop either continues or terminates. Every production agent (from simple coding assistants to complex research agents) is a variation of this loop.
Notice the critical line: if response.has_tool_calls. This is the decision point. When the LLM returns tool calls, it is saying "I need more information or need to take an action." When it returns plain text with no tool calls, it is saying "I am done." The LLM controls the loop's continuation, not the developer, not a counter, not an external signal. This is the architectural expression of agency.
Also notice that messages grows on every iteration. Each tool call and its result are appended. By step 8, the LLM is reading the original task plus 8 rounds of tool calls and results. This is why context management becomes the dominant engineering challenge in practice.
Why the Loop Is Hard
The pseudocode is roughly ten lines. The production implementation is thousands. The gap between the pseudocode and a production agent is filled with engineering challenges that do not appear until you deploy.
When to stop: The LLM must recognize when the task is complete. If it stops too early, the user gets an incomplete result. If it never stops, the agent burns tokens in an infinite loop.
Production agents use multiple stop conditions: the LLM explicitly says it is done, the maximum step count is reached, a timeout fires, or the token budget is exhausted. Getting these thresholds right requires observing real-world usage patterns.
How to handle errors: A tool call might fail (API timeout, invalid input, permission denied). The agent needs to observe the error, reason about whether to retry, try a different tool, or give up.
Naive agents retry the same failing call indefinitely. Good agents adjust their strategy after observing an error. Error handling in agents is fundamentally different from error handling in traditional software. In a pipeline, you write explicit try/catch blocks and define the recovery path in code. In an agent, the error becomes an observation that the LLM reasons about. The LLM sees "Error: connection timeout when querying database" and must decide: should I retry? Wait and retry? Try a different database? Ask the user for help? The quality of this reasoning depends on the LLM's capabilities and on how well the error message describes the problem. Returning a raw stack trace is less useful than returning "Database connection timed out after 5 seconds. This might indicate the database is under heavy load. Consider retrying in a few seconds or querying a different replica."
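One concrete implication: the tool runner should catch exceptions and return them as descriptive observations rather than letting the loop crash. A minimal sketch (the function and message text are illustrative):

```python
def run_tool_safely(tool, args):
    """Convert tool failures into observations the LLM can reason about."""
    try:
        return {"status": "ok", "result": tool(**args)}
    except TimeoutError:
        return {
            "status": "error",
            "result": ("The call timed out after 5 seconds. The service may "
                       "be under load; consider retrying in a few seconds or "
                       "using a different tool."),
        }
    except Exception as exc:
        # A plain description beats a raw stack trace as an observation.
        return {"status": "error", "result": f"{type(exc).__name__}: {exc}"}
```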
How to manage growing context: Every tool call and result is appended to the conversation history. After 20 tool calls, the context might contain thousands of tokens of intermediate results. The LLM must process all of this on every iteration. Context grows linearly with each step, and cost and latency grow with it.
Production agents use several strategies to manage context growth:
- Summarization replaces verbose tool results with concise summaries after they have been processed.
- Truncation drops the oldest messages when the context approaches the window limit.
- Selective retention keeps only the messages that are relevant to the current subtask.
- External storage writes intermediate results to a database and retrieves them only when needed.
We will cover these strategies in depth in later lessons on context engineering and memory management.
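To give a flavor of what these look like, here is a minimal truncation sketch. Token counting is approximated by character length here; real systems use the model's tokenizer:

```python
def truncate_history(messages, max_chars=20_000, keep_head=2):
    """Drop the oldest tool exchanges while keeping the first `keep_head`
    messages (typically the system prompt and the original task) intact."""
    head, tail = messages[:keep_head], list(messages[keep_head:])

    def size(msgs):
        return sum(len(m["content"]) for m in msgs)

    while tail and size(head) + size(tail) > max_chars:
        tail.pop(0)  # oldest tool call/result goes first
    return head + tail
```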
Non-Determinism Is the Default
A critical property of the agent loop is that it is non-deterministic. Give the same task to the same agent twice, and it may take different paths. On the first run, the agent might search for "Python async patterns," read three results, and synthesize an answer. On the second run, it might search for "Python concurrency," read different results, and produce a different answer. Both could be correct, but the paths differ.
This non-determinism comes from two sources. First, the LLM itself is probabilistic. The same input can produce different outputs, especially at higher temperature settings. Second, tools may return different results at different times. A web search today returns different results than the same search tomorrow.
Non-determinism has practical consequences for testing and debugging. You cannot write a traditional unit test that asserts "the agent will call tool X with argument Y on step 3." Instead, you test outcomes: "Did the agent produce a correct answer? Did it stay within the step limit? Did it avoid calling forbidden tools?"
This shift from testing execution paths to testing outcomes is one of the biggest mental model changes for engineers building their first agent system. Traditional software engineers are used to deterministic tests: given input X, the function always produces output Y. Agent engineers must accept that the path to Y may differ between runs, and focus their testing on whether Y is correct regardless of the path taken.
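In practice, an outcome-oriented test looks something like this illustrative sketch (run_agent, the result fields, and the helper names are hypothetical):

```python
def test_agent_fixes_failing_test():
    result = run_agent(
        task="fix the failing test in auth_service.py",
        tools=TOOLS,
        max_steps=15,
    )
    # Assert on outcomes, not on the exact path the agent took.
    assert result.succeeded
    assert result.steps_taken <= 15
    assert "delete_file" not in result.tools_used   # no forbidden tools
    assert run_test_suite("auth_service").all_passed
```

Note what is absent: no assertion that the agent called read_file before run_tests, and no assertion about step 3. Any path that produces a passing test within the limits is acceptable.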
Debugging is similarly affected. When a traditional program fails, you examine the stack trace and reproduce the exact sequence of calls. When an agent fails, you examine the trace of reasoning steps and tool calls, but reproducing the exact sequence may not be possible because the LLM might reason differently on the next run. Agent debugging requires logging every step (observation, reasoning, action, result) so you can reconstruct what happened after the fact.
Parallel Tool Calls
Modern LLMs can return multiple tool calls in a single response. Instead of calling read_file("login.py") and then in the next step calling read_file("test_login.py"), the LLM can return both calls at once. The agent executes them in parallel, appends both results, and makes one LLM call to reason about both files simultaneously.
This is an important optimization. An agent that needs to read 3 files sequentially takes 3 iterations (3 LLM reasoning calls). An agent that reads all 3 in parallel takes 1 iteration, saving 2 LLM calls' worth of cost and latency. The pseudocode already handles this: the for tool_call in response.tool_calls loop processes multiple calls from a single response.
In practice, agents that support parallel tool calls complete tasks in fewer iterations and at lower cost. This is why the tool call interface matters. It is not just about what tools the agent has, but how efficiently it can use them in combination.
Not all LLM providers support parallel tool calls, and not all tasks benefit from them. Sequential tool calls are necessary when the second call depends on the result of the first. You cannot run read_file and fix_based_on_contents in parallel because the fix depends on what the file contains. Parallel calls work when the tools are independent: reading three different files, searching two different databases, or checking multiple conditions simultaneously.
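When the LLM does return independent calls, the runner can execute them concurrently. A sketch using a thread pool, which is a reasonable fit when tools are I/O-bound (the tool-call shape is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def execute_tool_calls(tool_calls, tools):
    """Run independent tool calls concurrently; return results in call order."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(tools[call["name"]], **call["args"])
            for call in tool_calls
        ]
        return [f.result() for f in futures]  # preserves submission order
```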
The Cost of Each Iteration
Every iteration of the agent loop has a concrete cost. The LLM reads the entire conversation history (input tokens) and generates a response (output tokens). As the conversation grows, each iteration costs more because the input is larger.
Consider an agent that takes 8 steps to complete a task. On step 1, the input is 500 tokens (the user message plus system prompt). On step 8, the input might be 8,000 tokens (the original message plus 7 rounds of tool calls and results).
If the LLM costs $0.01 per 1,000 input tokens, step 1 costs $0.005 and step 8 costs $0.08, a 16x increase.
The total cost of the 8-step task is not 8 times the cost of step 1; it is the sum of a growing series.
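To make that concrete, assume the context grows roughly linearly, adding about 1,070 tokens of tool results per step, at the illustrative $0.01 per 1,000 input-token rate above:

```python
PRICE_PER_1K_INPUT = 0.01  # illustrative rate from the example above

def agent_input_cost(steps=8, base_tokens=500, growth_per_step=1072):
    """Total input-token cost when the context grows linearly per iteration."""
    cost = 0.0
    for step in range(1, steps + 1):
        input_tokens = base_tokens + (step - 1) * growth_per_step
        cost += input_tokens / 1000 * PRICE_PER_1K_INPUT
    return cost
```

With these numbers the 8-step task costs about $0.34 in input tokens, roughly 8.5x the naive estimate of 8 times the step-1 cost.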
This cost growth is why max step limits and context management strategies are not optional. An agent that averages 5 steps is affordable. An agent that occasionally spirals to 50 steps can generate a surprise bill that exceeds the cost of all other steps combined.
In an interview, describe the agent loop as: 'The LLM calls a tool, gets the result, and decides whether to call another tool or respond to the user. It keeps going until it believes the task is done or hits a safety limit.' This two-sentence description covers 90% of what interviewers want to hear about agent architecture.
Not every LLM-powered system needs full agent capabilities. Autonomy is not binary. It is a spectrum. Placing your system at the right point on this spectrum is one of the most consequential design decisions you will make. Too little autonomy means the system cannot handle the variability your task requires. Too much autonomy means the system is expensive, unpredictable, and risky. The right level depends on the task, the stakes, and the cost you are willing to accept.

Level 1: Fixed Pipeline
The developer hardcodes every step. The LLM executes each step in order with zero decision-making. Input goes in, output comes out, the path is identical every time.
A document summarization pipeline that always summarizes, then extracts, then classifies is Level 1. The developer chose every step, every prompt, and every connection between steps. The LLM is a tool within the pipeline, not the decision-maker driving it.
This is the cheapest and most predictable option. You know exactly how many LLM calls each input triggers, debugging is straightforward (which step failed?), and testing is deterministic (same input, same output). It is also the least flexible. Any new requirement means changing the code, not just the prompt. If a new document type requires a different extraction approach, a developer must update the pipeline.
Level 2: Router
The LLM classifies the input and selects one of N predetermined paths.
A customer support system that routes incoming requests to billing, technical support, or account management based on the LLM's classification of the message is Level 2. The LLM makes one decision (which path), but the paths themselves are developer-defined.
Routers are the sweet spot for many production systems. The LLM handles the ambiguity of natural language classification (something that is difficult to do with rules alone) but the business logic remains deterministic. If the billing path has 5 steps, those 5 steps always execute in the same order. The only intelligence is in the routing decision.
This is a pragmatic balance. Natural language is messy and hard to classify with regex or keyword matching, so the LLM adds genuine value at the routing step. But once the category is determined, the remaining work is predictable and does not benefit from LLM reasoning. You get the flexibility of AI where it matters (understanding user intent) and the reliability of deterministic code where it matters (executing business logic).
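A router can be sketched as one classification call followed by a deterministic dispatch. The category names and call_llm are illustrative:

```python
WORKFLOWS = {
    "billing": lambda msg: f"billing workflow handled: {msg}",
    "technical": lambda msg: f"technical workflow handled: {msg}",
    "account": lambda msg: f"account workflow handled: {msg}",
}

def route_request(message, call_llm):
    """One LLM decision (the category), then a fixed developer-defined path."""
    category = call_llm(
        "Classify this support request as billing, technical, or account. "
        f"Reply with one word.\n\n{message}"
    ).strip().lower()
    # Fall back to a safe default if the classification is unexpected.
    workflow = WORKFLOWS.get(category, WORKFLOWS["technical"])
    return workflow(message)
```

The only non-determinism is the single classification; everything downstream is ordinary, testable code.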
Level 3: Agent
The LLM loops, calling tools as needed, deciding what to do at each step.
The developer defines the available tools and constraints, but the LLM chooses the sequence, handles branching, and decides when to stop. Most coding assistants, research agents, and task automation agents operate at this level. A human may review the agent's final output before it takes effect.
Level 3 is where things get interesting and dangerous. The agent can handle tasks the developer did not explicitly plan for, but it can also take unexpected actions, enter loops, or make errors that compound across steps. The developer's role shifts from designing the execution path to designing the constraints: what tools are available, what the agent is allowed to do, and what safety limits apply. This is a fundamentally different kind of engineering. In traditional software, the developer controls what happens. In agent engineering, the developer controls what can happen and trusts the LLM to make good decisions within those boundaries.
Level 4: Autonomous Agent
The agent operates with minimal human oversight.
It handles errors on its own, makes decisions without human approval, and may run for hours or days on complex tasks. A continuous monitoring agent that detects anomalies, investigates root causes, and applies fixes without human intervention is approaching Level 4.
Very few production systems operate at Level 4 today. The risk is too high for most applications. An autonomous agent that makes a wrong decision has no human checkpoint to catch the error before it propagates. The rare exceptions are low-stakes tasks where errors are easily reversible: automated code formatting, test generation, or content tagging where a wrong tag is trivially corrected.
Choosing the Right Level
More autonomy means more capability but also more risk, more cost (each reasoning step is an LLM call), and harder debugging (non-deterministic execution paths are difficult to reproduce and test).
Most production systems today operate at Level 2 or Level 3. Level 4 is reserved for low-stakes tasks where errors are cheap to reverse or for research environments where experimentation is the goal.
The right question is not "how autonomous can we make it?" but "what is the minimum autonomy required to solve this task well?" Start at Level 1 and move up only when the task genuinely requires it. This principle (minimum viable autonomy) appears throughout agent design and is one of the most important concepts in this course.
Real-World Examples by Level
Level 1 (Fixed Pipeline): A content moderation system that checks each post through a fixed sequence (toxicity detection, PII scanning, spam classification) and flags posts that fail any check. The steps never change.
Level 2 (Router): A help desk system that reads the customer's message, classifies it as "password reset," "billing question," "technical issue," or "general inquiry," and routes it to the appropriate workflow. Each workflow is a fixed sequence, but the routing step uses LLM intelligence.
Level 3 (Agent): A data analyst agent that takes a business question ("why did revenue drop last Tuesday?"), queries multiple databases, generates visualizations, identifies correlations, and writes a report. The agent decides which databases to query based on what it finds in earlier queries.
Level 4 (Autonomous Agent): A site reliability agent that monitors production systems, detects anomalies, investigates root causes by examining logs and metrics, and applies remediation (scaling up servers, rolling back deployments) without human approval. Very few teams trust this level of autonomy today.
The Cost of Each Level
Each level up the spectrum roughly doubles the complexity and cost of the system.
Level 1 costs exactly N LLM calls per input, where N is the number of pipeline steps. Cost is perfectly predictable. Debugging is straightforward. You check each step's output in sequence.
Level 2 costs 1 + N LLM calls, where 1 is the routing call and N is the number of steps in the chosen path. Cost is predictable per path. Debugging requires checking the routing decision first, then the path execution.
Level 3 costs an unpredictable number of LLM calls, anywhere from 2 to the maximum step limit. Cost varies per task. Debugging requires examining the full sequence of reasoning steps and tool calls, which may differ between runs. You need logging and observability infrastructure.
Level 4 multiplies Level 3's unpredictability by duration. The agent runs continuously, potentially making thousands of decisions per day. Debugging requires retroactive analysis of decision chains. You need alerting, audit trails, and rollback mechanisms.
Each level provides more capability but demands more engineering investment in safety and observability. The investment is not optional. It is the price of admission for that level of autonomy.
Upgrading Between Levels
A practical development strategy is to start at a lower level and upgrade only when you have evidence that the current level is insufficient.
Start with a Level 1 pipeline. If you notice that different inputs need different processing paths and you are writing too many if/else branches, upgrade the routing step to Level 2 by using an LLM to classify inputs.
If you find that even the correct path sometimes fails because intermediate results require adaptive handling, upgrade to Level 3 by adding a loop with tool access. But keep the scope tight: limit the tools, set conservative step limits, and require human review of outputs.
Only move to Level 4 after your Level 3 agent has been running reliably for months with human review, and the human reviewer is approving 99%+ of results unchanged. At that point, the human review step has become rubber-stamping, and removing it saves time without adding meaningful risk.
This incremental approach is safer, cheaper, and more debuggable than starting at Level 3 or Level 4. Each upgrade is a small, measured step with clear before-and-after metrics. You never invest in autonomy infrastructure you do not need yet, and you never carry the operational burden of a higher level without evidence that the current level is insufficient.
The opposite approach (starting at Level 3 or 4 and then trying to "lock down" an agent that is too autonomous) is much harder. Reducing autonomy after deployment means removing capabilities that users and workflows have come to depend on. It is always easier to add autonomy than to take it away.
Knowing what an agent is matters less than knowing when to use one. The previous sections explained the what: three archetypes, the agent loop, the autonomy spectrum. This section addresses the when. Agents are powerful but expensive, slow, and non-deterministic. Choosing an agent when a simpler approach works is one of the most common and costly mistakes in AI system design. Equally, choosing a pipeline when the task genuinely requires adaptability leads to brittle systems patched with endless special cases.
The guidelines below are not rigid rules. They are heuristics based on how production agent systems succeed and fail. Use them as a starting point for your design decisions.
When Agents Are the Right Choice
Tasks requiring iteration: The task involves trying something, observing the result, and adjusting. Writing code that must pass tests is the clearest example. The agent writes code, runs the tests, sees failures, fixes the code, and repeats until the tests pass. You cannot pipeline this because the fix depends on the specific error, which is unknown until the tests run. The number of iterations is also unpredictable. Sometimes the first fix works, sometimes it takes five attempts with completely different approaches.
Tasks requiring research: The task involves searching multiple sources, evaluating results, and synthesizing information. A research agent that searches the web, reads relevant articles, identifies gaps in its understanding, searches again with refined queries, and produces a summary is a natural fit. The number and sequence of searches depend on what the agent finds, making a fixed pipeline impossible. A pipeline can do one search, but it cannot decide "the results are insufficient, let me search with different keywords." That decision requires reasoning about the quality of intermediate results.
Tasks with unpredictable steps: The task has a clear goal but the path to reach it varies. Customer support is a good example. The agent does not know whether the issue is a billing error, a technical bug, or an account problem until it investigates. It needs to ask questions, look up records, and adapt its approach based on what it discovers. A debugging agent faces the same challenge. The root cause of a production error could be a code bug, a configuration change, a dependency failure, or a resource exhaustion issue, and the agent must investigate each possibility until it finds the answer.
Tasks requiring multi-tool coordination: The task involves using several tools in a sequence that depends on intermediate results. An agent that reads a database, generates a chart, writes an analysis, emails the report, and creates a calendar follow-up needs to coordinate tools in an order that depends on the data it finds. If the database query returns no results, the agent should try a different query rather than generating an empty chart. If the chart reveals an anomaly, the agent might run an additional query to investigate before writing the analysis. The conditional branching based on intermediate results is what makes this an agent task rather than a pipeline task.
Notice the common thread across all four cases: the next step depends on what the agent observes. If you can predict every step in advance, you do not need an agent.
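The contrast can be made concrete with a toy sketch. Here `run_tests` and `llm_fix` are invented stand-ins for a real test runner and a real LLM call; the point is only the control flow: the pipeline's steps are fixed in advance, while the agent's next step depends on what it just observed.

```python
# Toy stand-ins so the control flow is runnable; not a real LLM or test runner.
def run_tests(code):
    """Pretend test runner: passes only if the specific error was addressed."""
    return "PASS" if "off-by-one" in code else "FAIL: off-by-one"

def llm_fix(code, error):
    """Pretend LLM: the fix it produces depends on the error it was shown."""
    return code + "  # fixed: " + error

# Pipeline: one blind fix attempt, hardcoded by the developer.
def pipeline_fix(code):
    code = llm_fix(code, "generic cleanup")  # cannot see the real error
    return code, run_tests(code)

# Agent: observe the failure, adjust, repeat until pass or safety limit.
def agent_fix(code, max_steps=5):
    for _ in range(max_steps):
        result = run_tests(code)      # observe
        if result == "PASS":          # stop condition: task complete
            return code, result
        code = llm_fix(code, result)  # act based on the specific failure
    return code, run_tests(code)      # safety limit reached
```

The agent converges because each fix is conditioned on the observed failure; the pipeline applies the same step regardless of what the tests actually reported.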
When Agents Are the Wrong Choice
Deterministic tasks: If every step is known in advance and does not change based on intermediate results, write a pipeline. ETL jobs, data validation, report generation with fixed templates, and batch processing are all examples where the developer knows every step before the system runs. An agent would "decide" to do the same thing every time, wasting tokens on unnecessary reasoning. You are paying for intelligence the task does not require.
Latency-sensitive tasks: Agents take seconds to minutes because each step requires an LLM call. If your application needs sub-second responses (autocomplete, real-time recommendations, fraud detection at transaction time), an agent is too slow. A single LLM call takes 500ms-2s; an agent making 5-10 calls takes 3-20 seconds total. Use a traditional ML model or a fixed pipeline for anything that needs to respond in under a second.
High-stakes irreversible actions without review: An agent that can delete production data, send financial transactions, or publish content without human review is an outage waiting to happen. The agent might be correct 99% of the time, but the 1% failure on an irreversible action can be catastrophic. If the action is irreversible and high-stakes, require a human approval step. This makes it a Level 3 system with a human checkpoint, not a Level 4 autonomous agent. The human review adds latency but prevents the rare catastrophic error that no amount of prompt engineering can eliminate entirely.
Tasks where simple logic works: If the problem can be solved with an if/else statement, a regex, or a database query, do not use an LLM at all. An agent that calls an LLM 5 times to classify an email as spam or not-spam is doing what a simple classifier does in milliseconds at a fraction of the cost. Before reaching for an agent, always ask: "Could I solve this with a Python function?" If yes, write the function.
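To make the point concrete, here is a hypothetical spam check (the keyword list is invented for illustration): a few lines of deterministic Python that run in microseconds, at effectively zero cost.

```python
import re

# Invented keyword list for illustration; a real filter would use a
# trained classifier, but the architecture lesson is the same.
SPAM_PATTERN = re.compile(r"\b(free money|act now|winner|click here)\b",
                          re.IGNORECASE)

def looks_like_spam(subject: str) -> bool:
    """Deterministic check: microseconds per call, no LLM, no agent."""
    return bool(SPAM_PATTERN.search(subject))
```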
A Simple Decision Framework
When deciding between an agent and a simpler architecture, ask three questions.
First: "Do the steps depend on intermediate results?" If the answer is no (the same steps run in the same order regardless of what happens along the way), use a pipeline.
Second: "Does the system need to recover from failures adaptively?" If a failed step should trigger a different approach rather than a retry or abort, an agent can reason about alternatives. A pipeline would need explicit error-handling branches coded by the developer.
Third: "Is the latency acceptable?" If users expect results in under a second, an agent loop with multiple LLM calls is too slow. Use a pipeline or a traditional algorithm.
If all three answers point to an agent, build an agent. If any answer points away, consider a simpler architecture first. You can always upgrade from a pipeline to an agent later, but downgrading from an agent to a pipeline means admitting you over-engineered the first version.
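The three questions can be encoded as a small helper, useful in design reviews. This is illustrative, not prescriptive; the sub-second threshold mirrors the latency discussion above.

```python
def recommend_architecture(steps_depend_on_results: bool,
                           needs_adaptive_recovery: bool,
                           latency_budget_seconds: float) -> str:
    """Apply the three questions; any answer pointing away from an
    agent wins, since upgrading to an agent later is cheap."""
    if latency_budget_seconds < 1.0:
        return "pipeline or traditional algorithm"  # agents are too slow
    if not (steps_depend_on_results or needs_adaptive_recovery):
        return "pipeline"   # every step is known in advance
    return "agent"          # the path genuinely varies at runtime
```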
The Cost of Getting It Wrong
The cost difference between agents and simpler architectures is not marginal. It is orders of magnitude.
A deterministic pipeline that processes a document in 3 fixed steps costs roughly the price of 3 LLM calls. An agent doing the same task might make 3-15 LLM calls depending on its reasoning path, and each call after the first includes a growing context of previous results. In practice, agents cost 3-10x more per task than pipelines for equivalent functionality.
Latency follows the same pattern. A 3-step pipeline completes in the time of 3 sequential LLM calls (roughly 2-4 seconds total). An agent with 8 steps makes 8 sequential LLM calls (roughly 6-15 seconds), and each call is slower than the last because the context is larger.
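A back-of-the-envelope calculation makes the gap concrete. The per-call figures below are illustrative averages, and they understate the agent's cost because real agent calls get slower and pricier as the context accumulates.

```python
def estimate(num_calls, seconds_per_call=1.0, dollars_per_call=0.01):
    """Sequential LLM calls: latency and cost scale at least linearly
    with call count; growing context pushes real agents above this."""
    return num_calls * seconds_per_call, num_calls * dollars_per_call

pipeline_latency, pipeline_cost = estimate(3)  # fixed 3-step pipeline
agent_latency, agent_cost = estimate(8)        # typical 8-step agent run
```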
These costs are acceptable when the agent provides genuine value, when the flexibility and adaptability of the agent loop leads to better results than a fixed pipeline could achieve. They are wasteful when the task is deterministic and the agent's "reasoning" always arrives at the same sequence of steps.
A useful heuristic: if an agent consistently takes the same path for a given type of input, that path should be hardcoded as a pipeline. Agents should be reserved for tasks where the path genuinely varies based on intermediate results. Monitor your agent's behavior in production: if 80% of tasks follow the same 3-step pattern, extract that pattern into a pipeline and use the agent only for the 20% that require adaptive reasoning.
The Anti-Pattern
The most common anti-pattern is building an agent when a pipeline would work. Teams get excited about agent capabilities and apply them to every problem. But agents are slower (multiple round trips to the LLM), more expensive (each step costs tokens), and non-deterministic (the same input may produce different tool call sequences). If you can draw the solution as a flowchart with no conditional branches based on LLM observations, you do not need an agent.
The reverse anti-pattern is less common but equally problematic: building a rigid pipeline for a task that genuinely needs adaptability, then patching it with endless if/else branches to handle edge cases. When your pipeline has more error-handling branches than happy-path steps, it is time to consider an agent.
The Hybrid Approach
In practice, the best systems are often hybrids. You do not have to choose between "agent for everything" and "pipeline for everything."
A common pattern is a pipeline with an agent fallback. The system tries the fast, deterministic pipeline first. If the pipeline fails or produces low-confidence results, it escalates to an agent that can reason about the edge case. This gives you the speed and cost efficiency of a pipeline for the 80% of cases that are straightforward, and the adaptability of an agent for the 20% that require reasoning.
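A minimal sketch of the fallback pattern, assuming the caller supplies `pipeline` and `agent` callables and that the pipeline reports a confidence score alongside its result (both names and the threshold are illustrative):

```python
def run_with_fallback(task, pipeline, agent, confidence_threshold=0.8):
    """Try the cheap deterministic pipeline first; escalate to the agent
    on failure or low confidence. `pipeline` returns (result, confidence)."""
    try:
        result, confidence = pipeline(task)
        if confidence >= confidence_threshold:
            return result, "pipeline"   # fast path for straightforward cases
    except Exception:
        pass                            # pipeline failed outright; escalate
    return agent(task), "agent"         # slow, adaptive path for edge cases
```

Tagging the result with which path produced it is deliberate: in production you want to monitor the escalation rate, since a rising rate means the pipeline no longer covers the common cases.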
Another hybrid pattern is an agent that uses pipelines as tools. The agent decides what to do next, but some of its "tools" are actually multi-step pipelines. For example, a research agent might call a "generate_report" tool that internally runs a fixed 4-step pipeline (query data, compute statistics, generate charts, format output). The agent handles the high-level reasoning (what to research, when to stop), while the pipeline handles the deterministic subtasks.
The key principle is: use the simplest architecture that handles each subtask correctly, and compose them as needed. Agents for decisions that require reasoning. Pipelines for sequences that are always the same. Simple functions for deterministic transformations. This principle (minimum viable autonomy at each layer) produces systems that are cheaper, faster, more predictable, and easier to debug than systems that use agents for everything.
The most expensive mistake in agent design is using an agent when a simple pipeline works. An agent that calls an LLM 5 times at $0.01 per call costs $0.05 per task; a deterministic function doing the same job costs effectively nothing and returns in milliseconds. If the steps are known in advance and don't change based on intermediate results, write a pipeline. Reserve agents for tasks where the LLM genuinely needs to decide what to do next.
Every agent, regardless of complexity, is built from five components. A coding assistant, a customer support agent, and a research agent all share the same five building blocks. The implementation details differ, but the architecture is the same. Missing any one of these components leads to a specific, predictable failure mode. Understanding the components gives you a checklist for designing and evaluating agent systems, and a diagnostic framework when things go wrong.

1. System Prompt
The system prompt is the agent's brain. It defines who the agent is, what it can do, what it must not do, and how it should behave. Consider a code review agent. Its system prompt might include:
- Identity: "You are a code review assistant specializing in Python security and performance."
- Capabilities: "You can read files, run static analysis tools, and search the codebase."
- Constraints: "Never modify files directly. Never approve changes automatically. Always explain your reasoning."
- Behavior: "Start by understanding the purpose of the change before reviewing individual lines."
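One way to keep a long system prompt maintainable is to assemble it from labeled sections. This sketch reuses the four parts of the code review agent above; in a real system each constant would be far longer.

```python
# Each section can be reviewed and revised independently.
IDENTITY = ("You are a code review assistant specializing in "
            "Python security and performance.")
CAPABILITIES = ("You can read files, run static analysis tools, "
                "and search the codebase.")
CONSTRAINTS = ("Never modify files directly. Never approve changes "
               "automatically. Always explain your reasoning.")
BEHAVIOR = ("Start by understanding the purpose of the change "
            "before reviewing individual lines.")

SYSTEM_PROMPT = "\n\n".join([IDENTITY, CAPABILITIES, CONSTRAINTS, BEHAVIOR])
```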
Without a system prompt, the agent has no identity or constraints. It will try to do anything the user asks, including tasks outside its domain, and its behavior will be inconsistent across conversations. The system prompt is the primary mechanism for controlling agent behavior, more reliable than fine-tuning and faster to iterate than code changes.
In practice, system prompts for production agents are long, often 500 to 2,000 words. They include the agent's role, its available tools and when to use each one, explicit constraints ("never delete files without user confirmation"), output format expectations, and examples of correct behavior. The system prompt is not a formality; it is the most important piece of engineering in the entire agent.
2. Tools
Tools are the agent's hands: the mechanism through which it interacts with the world beyond generating text. Reading files, searching the web, querying databases, sending emails, executing code, creating tickets, deploying software. All of these are tools. Without tools, the agent is just a chatbot. It can reason about the world but cannot act on it. The moment you give an LLM a tool and let it decide when to use it, you have crossed the line from chatbot to agent.
Each tool has a name, a description (which the LLM reads to decide when to use it), and parameters (which the LLM fills in based on context). The quality of tool descriptions matters enormously. A tool described as "search" gives the LLM no guidance on when to use it. A tool described as "search the company knowledge base for internal documentation about a specific technical topic" tells the LLM exactly when this tool is appropriate and when it should use a different search tool.
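Tool definitions are commonly expressed as JSON-Schema-style objects, though the exact field names vary by provider. A sketch of the well-described search tool from above:

```python
# A JSON-Schema-style tool definition (field names vary by provider).
# The description tells the LLM exactly when this tool applies and,
# just as importantly, when it does not.
search_tool = {
    "name": "search_knowledge_base",
    "description": (
        "Search the company knowledge base for internal documentation "
        "about a specific technical topic. Use web search instead for "
        "public or general-knowledge questions."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Keywords describing the technical topic.",
            },
            "max_results": {
                "type": "integer",
                "description": "Maximum number of documents to return.",
            },
        },
        "required": ["query"],
    },
}
```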
The number and design of tools also shape agent behavior. Too few tools and the agent lacks the capability to complete tasks. Too many tools (more than 15-20) and the LLM struggles to choose the right one, leading to incorrect tool selections. Well-designed agent systems provide focused, well-described tools with clear boundaries between them.
Tool design follows a key principle: make the right action easy and the wrong action hard. If you want the agent to search before answering, give it a search tool with a clear description and do not give it a "just guess" tool. If you want the agent to never delete data, either do not provide a delete tool, or gate it behind a confirmation step. The tool set is the strongest lever you have (aside from the system prompt) for shaping agent behavior.
3. Memory
Memory is the agent's context: everything it knows about the current task and past interactions. Without memory, an agent cannot build on its own previous actions within a single task, let alone across multiple tasks. Memory has two forms.
Short-term memory is the conversation history: every user message, assistant response, tool call, and tool result in the current session. This is what the LLM sees on each iteration of the agent loop. It grows with every step and is bounded by the context window size. When the context window fills up, the agent must either compress earlier messages (losing detail) or drop them entirely (losing context). Neither option is ideal, which is why context management is one of the most active areas of agent engineering research.
Long-term memory is persistent state across sessions: stored knowledge, user preferences, past conversation summaries, or domain knowledge retrieved from a database. Without long-term memory, every conversation starts from scratch. The agent cannot learn from past interactions or access information beyond what fits in the current context window. Imagine a customer support agent that resolves an issue for a customer on Monday. When the same customer contacts it on Wednesday about a related issue, the agent has no knowledge of Monday's interaction and may ask the customer to repeat everything.
Without any memory, the agent is a goldfish. It cannot even reference what happened two steps ago in the current task, making multi-step reasoning impossible.
The tension with memory is capacity. Short-term memory (the conversation history) is bounded by the LLM's context window. A 128K-token window sounds large, but after 30 tool calls with verbose results, it fills up fast. Long-term memory requires external storage (databases, vector stores) and retrieval logic, which adds complexity. Getting memory right is one of the hardest problems in agent engineering.
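A minimal short-term memory strategy, as a sketch (the message format and budget are illustrative): keep the head of the history, which holds the system prompt and the original task, and drop the oldest tool results first once the budget is exceeded.

```python
def append_and_trim(history, message, max_messages=50, keep_head=2):
    """Minimal context management: append the new message, and when the
    history exceeds the budget, keep the head (system prompt and original
    task) plus the most recent messages, dropping the middle."""
    history.append(message)
    if len(history) > max_messages:
        history = history[:keep_head] + history[-(max_messages - keep_head):]
    return history
```

Dropping the middle loses detail, which is exactly the trade-off described above; production systems often summarize the dropped span into a single message instead of discarding it outright.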
4. Loop Controller
The loop controller is the agent's heartbeat: the runtime engine that drives the observe-reason-act cycle. It manages state between iterations, appends tool results to the conversation history, checks stop conditions, and hands control back to the LLM for the next decision. In the pseudocode from the previous section, the for loop and the message appending logic are the loop controller.
Without a loop controller, the agent is a one-shot system. It receives input, generates one response, and stops: no iteration, no tool use, no adaptation. Many "agent" demos are actually one-shot systems without a proper loop controller. The LLM generates a tool call, the tool executes, and the result is returned to the user without giving the LLM a chance to reason about it and decide what to do next. The loop controller transforms a single LLM call into an iterative problem-solving system.
The loop controller is also responsible for managing state between iterations: appending tool results to the message history, tracking how many steps have been taken, and deciding whether to continue the loop or invoke a stop condition. In production systems, the loop controller often includes logging, cost tracking, and latency measurement at each step.
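A minimal loop controller, as a sketch. The `llm` and `tools` arguments are caller-supplied stand-ins for a real model client and tool registry, and the reply format (a dict with optional `tool_calls`) is an assumption for illustration.

```python
def agent_loop(llm, tools, messages, max_steps=20):
    """Minimal loop controller: drives observe-reason-act, manages state
    between iterations, and enforces a step-count safety limit."""
    for _ in range(max_steps):
        reply = llm(messages)                        # reason
        messages.append({"role": "assistant", **reply})
        if not reply.get("tool_calls"):              # implicit/explicit stop
            return reply["content"]
        for call in reply["tool_calls"]:             # act
            result = tools[call["name"]](**call["args"])
            messages.append({"role": "tool",         # observe: feed back
                             "name": call["name"],
                             "content": str(result)})
    return "stopped: max_steps reached"              # safety limit
```

Everything outside the `llm(...)` call is the loop controller: appending results, checking for tool calls, and bounding the iteration count. In production this is also where logging, cost tracking, and latency measurement live.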
5. Stop Conditions
Stop conditions tell the agent when to halt. Every production agent needs multiple stop conditions because no single mechanism is reliable enough on its own.
Explicit stop: The LLM decides the task is complete and generates a final response without any tool calls. This is the ideal case. The agent finished its work and knows it. Some systems reinforce this by including instructions in the system prompt: "When you have completed the task, respond with your final answer and do not call any more tools."
Implicit stop: The LLM generates a response with no tool calls, which the system interprets as "done." This is the same mechanism as explicit stop but the LLM may not have consciously decided to stop. It simply had nothing left to do.
Safety limits: Maximum iteration count, maximum token budget, maximum wall-clock time. These fire when the LLM fails to stop on its own, typically because it entered a loop, encountered a task it cannot complete, or is making progress too slowly. Safety limits prevent infinite loops that burn money. Typical production values are 10-25 maximum steps, a token budget of 50,000-200,000 tokens per task, and a wall-clock timeout of 2-5 minutes.
Without stop conditions, the agent will loop indefinitely. Every iteration costs tokens and time. An agent with no stop conditions and a stuck reasoning loop can burn through hundreds of dollars in API costs before anyone notices. This is not a theoretical risk. It happens regularly in production systems without proper guardrails.
The best practice is defense in depth: use all three types of stop conditions simultaneously. The LLM's own judgment handles the happy path (task complete, stop naturally). Safety limits handle the failure path (stuck loop, unreachable goal). The combination ensures the agent always terminates within a bounded cost and time envelope.
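The defense-in-depth idea can be packaged as a small tracker that the loop controller consults after every iteration. The defaults below mirror the typical production values given above; the class itself is a sketch, not a library API.

```python
import time

class StopConditions:
    """Defense in depth: halt when ANY limit trips, in addition to the
    LLM's own decision to stop. Defaults follow the typical values above."""
    def __init__(self, max_steps=20, max_tokens=100_000, max_seconds=180):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.steps = 0
        self.tokens = 0
        self.started = time.monotonic()

    def record(self, tokens_used):
        """Call once per loop iteration with that step's token usage."""
        self.steps += 1
        self.tokens += tokens_used

    def should_stop(self):
        return (self.steps >= self.max_steps
                or self.tokens >= self.max_tokens
                or time.monotonic() - self.started >= self.max_seconds)
```

Because `should_stop` is a single disjunction, adding a new limit (for example a dollar budget) is one extra clause rather than a structural change to the loop.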
Common Failure Modes
Each missing or misconfigured component produces a distinct failure pattern that is recognizable once you know what to look for:
No system prompt: The agent behaves inconsistently. It answers questions outside its domain, uses different tones across conversations, and occasionally takes dangerous actions because it has no constraints. The fix is straightforward: write a comprehensive system prompt.
No tools (or wrong tools): The agent generates plausible-sounding but fabricated information because it cannot access real data. Or it repeatedly tells the user "I cannot do that" because it lacks the tool required for the task. The fix is to give the agent the tools it needs, with clear descriptions.
No memory management: The agent works well on short tasks (2-3 steps) but degrades on longer tasks. It starts forgetting earlier steps, repeating actions it already took, or hitting context window limits. The fix is to implement context summarization or selective retention.
No stop conditions: The agent runs up large bills on tasks where it gets stuck. A single confused conversation generates 50+ LLM calls before a human notices. The fix is to add a maximum step limit (start with 15-20 and tune from there).
The Component Checklist
When designing or evaluating an agent, check for all five components. Each missing component maps to a predictable failure:
- No system prompt = unpredictable behavior
- No tools = just a chatbot
- No memory = cannot do multi-step reasoning
- No loop controller = one-shot only
- No stop conditions = can loop forever
This checklist is useful in interviews, architecture reviews, and debugging. When an agent behaves unexpectedly, the first diagnostic step is: which of the five components is misconfigured? A system prompt that lacks constraints? A missing tool for a required capability? A context window that overflows after too many steps? A loop controller that does not append error messages? A stop condition set too high or too low? Most agent failures trace back to one of these five components.
How the Components Interact
The five components are not independent. They influence each other in important ways.
The system prompt affects tool selection. A system prompt that says "always search the knowledge base before answering" changes how the LLM uses its tools. A system prompt that says "never call the delete tool without user confirmation" constrains which tools the agent considers at each step.
Memory affects reasoning quality. As short-term memory fills up with tool results, the LLM has more context for its decisions, but also more noise. A tool result from step 2 that is irrelevant by step 8 still consumes attention capacity. The loop controller can mitigate this by summarizing or truncating older context.
Stop conditions interact with the loop controller. The loop controller checks stop conditions after each iteration. If the max step limit is too low, the agent terminates before completing complex tasks. If it is too high, stuck agents waste resources. Tuning stop conditions requires observing how many steps typical tasks take and setting limits with appropriate headroom.
Tools affect memory growth. A tool that returns 5,000 tokens of output per call fills the context window 10x faster than a tool that returns 500 tokens. Designing tools with concise output formats directly improves how many steps the agent can take before hitting context limits.
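The arithmetic behind that 10x claim is easy to check (window and prompt sizes are illustrative, and the model's own output tokens are ignored):

```python
def steps_before_full(context_window=128_000, base_prompt=2_000,
                      tokens_per_step=500):
    """Roughly how many tool-result steps fit before the context window
    fills, ignoring summarization and the model's own output tokens."""
    return (context_window - base_prompt) // tokens_per_step
```

With 500-token tool results the agent gets hundreds of steps; with 5,000-token results it gets a tenth as many before context management has to kick in.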
Understanding these interactions helps you design agents where the components work together rather than against each other. A well-designed agent is not five independent components bolted together. It is an integrated system where each component is tuned with awareness of the others.
For example, if you know your tools return verbose output (large database query results, full web pages), you should design your memory strategy to summarize these results aggressively, set your stop conditions to account for faster context growth, and write your system prompt to instruct the agent to extract only the relevant information from each tool result. All five components are shaped by the same design decision about tool output verbosity.