Retrieval-Augmented Generation
The Three Knowledge Problems
Large language models are trained on massive datasets, but their knowledge is frozen at a cutoff date. An LLM trained in early 2024 knows nothing about events, product updates, or policy changes from late 2024. This is not a bug; it is a fundamental property of how these models work. The knowledge lives in the model's weights, and updating those weights means retraining. Consider the timeline: GPT-4 was trained on data through April 2023, then released months later. By the time users interact with it, the knowledge gap is already 6-12 months wide. For a fast-moving company, that gap means the model knows nothing about last quarter's product pivot, the new compliance regulation that took effect last month, or the pricing change announced two weeks ago. The problem compounds because model training itself takes months: a new model version trained on data from January 2025 and released in June 2025 is already six months stale on arrival. There is no way to "patch" a model's knowledge without retraining or using external retrieval.
Worse, when an LLM encounters a question it cannot answer from its training data, it does not say "I don't know." It fabricates a plausible-sounding answer with complete confidence. This is hallucination, and it is the single biggest obstacle to deploying LLMs in production where accuracy matters: legal, medical, financial, customer support. Hallucination is not a random glitch. It happens because LLMs are trained to predict the most likely next token. When the model has no factual basis for an answer, it still generates the statistically most plausible continuation, which sounds authoritative but is entirely invented. The model cannot distinguish between "I know this fact from training data" and "I am generating a plausible-sounding sequence." This is why hallucinations are so dangerous: they are indistinguishable from correct answers without external verification.
LLMs also have zero access to private or proprietary data. Your company's internal wikis, HR policies, customer records, and product documentation never appeared in the training set. The model literally cannot answer questions about them, but it will try anyway and produce fiction. In an enterprise context, this is a showstopper. Consider a bank's internal risk models, a hospital's patient treatment protocols, or a law firm's case files: none of this data exists in any public training corpus. An enterprise that deploys a vanilla LLM for internal Q&A gets an assistant that confidently fabricates HR policies, invents compliance procedures, and hallucinates employee benefit details. The liability exposure alone makes ungrounded LLM deployment untenable for most organizations.
Here is a concrete example that makes these three problems tangible. Ask an LLM trained in 2024 about your company's Q1 2025 product launch and it will either refuse or fabricate details. Ask a RAG-powered system the same question and it retrieves the actual launch documents from your knowledge base and generates an answer grounded in those documents. The difference is not subtle. One response is fiction, the other is fact with a citation pointing to the exact source document.
How RAG Works
Retrieval-Augmented Generation solves these problems by adding a retrieval step before generation. Instead of asking the LLM to answer from memory, you first search a knowledge base for documents relevant to the user's question, inject those documents into the prompt as context, and then ask the LLM to generate an answer grounded in that context. The model shifts from "answer from what you memorized" to "answer from what I just showed you."
This approach has a name (retrieval-augmented generation) because the retrieval step augments what the model knows at generation time. The model is not smarter or more knowledgeable than before. It is simply reading relevant documents before answering, much like a human expert who consults reference materials before giving advice. The quality of the answer depends entirely on the quality of the retrieved documents, which is why the retrieval pipeline is the most critical component of the system.
The architecture has two phases that run at different times. The offline phase (ingestion) runs when documents are created or updated: the system parses documents, splits them into chunks, converts each chunk into a vector embedding, and stores the vectors in a searchable index. The online phase (query time) runs when a user asks a question: the system embeds the query, searches the index for similar vectors, retrieves the top matching chunks, constructs a prompt with those chunks as context, and sends it to the LLM. Keeping these phases separate is what makes RAG efficient: the expensive document processing happens once, while the lightweight retrieval and generation happen per query.
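The two phases can be sketched in a few lines of Python. This is a toy illustration, not a production design: the bag-of-words embed() is a stand-in for a real embedding model, and the in-memory list stands in for a vector database; the sample documents are invented for the example.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in embedding: lowercase bag-of-words counts. A real system
    # would call an embedding model here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Offline phase (ingestion): embed each document and store it in an index.
documents = [
    "The Q1 launch ships the new analytics dashboard.",
    "Employee travel must be booked through the internal portal.",
    "The pricing change takes effect on March 1.",
]
index = [(doc, embed(doc)) for doc in documents]

# Online phase (query time): embed the query, rank by similarity,
# and build a prompt grounded in the top matches.
def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query: str) -> str:
    context = "\n".join(f"- {chunk}" for chunk in retrieve(query))
    return f"Answer only from the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

print(build_prompt("When does the pricing change take effect?"))
```

Note how the query about pricing ranks the pricing document first purely through vector similarity; no keyword matching rules were written.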
Grounding is the key concept here. Grounding means the model's response is anchored in specific retrieved text rather than vague parametric memory. A grounded answer can point to the exact document and passage that supports each claim. An ungrounded answer comes from the model's statistical patterns, which may or may not reflect reality. In production systems, grounding is not optional; it is the mechanism that transforms an LLM from a creative writing tool into a reliable information system.
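One common way to enforce grounding is in the prompt itself: label each retrieved chunk with a source id and instruct the model to cite those ids. A minimal sketch, with hypothetical chunk ids and policy text invented for illustration:

```python
# Hypothetical retrieved chunks; the ids and policy text are invented.
chunks = [
    {"id": "policy-42", "text": "Remote employees receive a $500 annual home-office stipend."},
    {"id": "policy-07", "text": "Stipend claims must be filed within 90 days of purchase."},
]

def grounded_prompt(question: str, chunks: list[dict]) -> str:
    # Label each chunk so the model can cite it, and constrain the
    # model to answer only from the provided sources.
    sources = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer using ONLY the sources below, and cite a source id for every claim.\n"
        "If the sources do not contain the answer, say \"I don't know.\"\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )

print(grounded_prompt("What is the home-office stipend?", chunks))
```

The explicit "I don't know" instruction matters: it gives the model a sanctioned alternative to fabricating an answer when retrieval misses.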
To summarize, RAG solves three distinct problems. First, the knowledge cutoff problem: the model's training data has a fixed end date, but the retrieval database can be updated in real time. Second, the hallucination problem: by grounding responses in retrieved documents and instructing the model to answer only from the provided context, fabrication drops dramatically. Third, the private data access problem: proprietary documents never need to enter the training set because they live in an external knowledge base that the model reads at query time. Each of these solutions reinforces the others — a system that retrieves current documents avoids knowledge cutoff issues, and a system that cites its sources makes hallucination detectable even when it occurs.
It is worth emphasizing that RAG does not eliminate hallucination entirely. The LLM can still misinterpret retrieved context, over-extrapolate from partial information, or blend facts from multiple chunks in incorrect ways. What RAG does is dramatically reduce the surface area for hallucination. Instead of making up facts from nothing, the model is constrained to work with specific retrieved documents. When it does hallucinate, the error is usually a misinterpretation of real data rather than pure fabrication, which is easier to detect and correct through prompt engineering and evaluation.
The practical implication is that RAG does not replace the need for evaluation and quality assurance. It dramatically reduces the problem surface, but it does not eliminate it. Production RAG systems still need monitoring, evaluation pipelines, and human review of edge cases. The difference is that the failure modes are more predictable and diagnosable than with a standalone LLM.
RAG vs Fine-Tuning
Why not fine-tuning? Fine-tuning bakes knowledge into model weights. It is expensive (hundreds to thousands of dollars per run) and slow, taking hours to days. Every time your data changes, you must retrain. For a company whose documents change weekly, this means weekly retraining runs. RAG keeps knowledge external in a searchable database. When a document changes, you re-index it in seconds without touching the model.
Beyond cost and speed, RAG has three additional advantages over fine-tuning. Controllability: you can change what the model knows by updating the knowledge base without any model modification. If a policy changes, you update the document and re-embed it. The next query automatically gets the new information. With fine-tuning, you have no way to surgically update one fact without retraining the entire model. Auditability: every RAG answer can cite the specific documents it drew from, creating a clear audit trail. A fine-tuned model produces answers from opaque weights with no way to trace which training document caused a specific output. Cost predictability: RAG's ongoing cost is proportional to query volume and index size, not to retraining cycles that can fail and require reruns.
Fine-tuning does have a legitimate role, but it is about behavior rather than knowledge. If you need the model to adopt a specific tone, follow a particular response format, or learn domain-specific vocabulary and reasoning patterns, fine-tuning is the right tool. Many production systems use both: fine-tuning to shape the model's behavior and RAG to supply the model's knowledge. The two approaches are complementary, not competing.
A useful mental model: fine-tuning changes how the model thinks and responds. RAG changes what the model knows. A medical chatbot might be fine-tuned to always include disclaimers, use clinical terminology appropriately, and structure responses with diagnosis and next steps. But the actual medical knowledge (drug interactions, treatment protocols, clinical guidelines) comes from RAG retrieval over a curated medical knowledge base. Trying to fine-tune all that knowledge into the model's weights would be expensive, impossible to update, and impossible to audit when a patient outcome depends on which guideline the model referenced.
RAG vs Long Context
Why not just use bigger context windows? Models now support 100K to 1M tokens, so why not load everything? Three reasons. First, cost: a 200K-token prompt costs roughly 100 times more than a 2K-token prompt. To put concrete numbers on this: at typical API pricing, a 200K-token input costs around $0.60 per query. A focused 2K-token RAG prompt costs around $0.006. At 10,000 queries per day, the long-context approach costs $6,000 daily versus $60 for RAG, a 100x difference that compounds rapidly.
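The arithmetic behind these figures is simple enough to verify. The sketch below assumes a flat $3 per million input tokens, a representative rate rather than any specific provider's pricing:

```python
# Assumed flat rate of $3 per million input tokens; real pricing varies
# by provider and model.
PRICE_PER_TOKEN = 3.00 / 1_000_000

def daily_cost(tokens_per_query: int, queries_per_day: int) -> float:
    return tokens_per_query * queries_per_day * PRICE_PER_TOKEN

long_context = daily_cost(200_000, 10_000)  # load everything, every query
rag = daily_cost(2_000, 10_000)             # focused retrieved context

print(f"long context: ${long_context:,.0f}/day, RAG: ${rag:,.0f}/day")
```

At these assumed rates the gap is the 100x difference cited above, and it scales linearly with query volume.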
Second, latency: processing 200K tokens takes seconds to tens of seconds. Third, the lost-in-the-middle problem: research shows LLMs pay disproportionate attention to the beginning and end of long contexts while partially ignoring information in the middle. RAG retrieves only the 3-5 most relevant chunks, keeping the prompt focused and affordable.
There is a fourth reason that is often overlooked: the needle-in-a-haystack problem. Even when an LLM can technically process a million tokens, its ability to find and use a specific piece of information buried deep in that context degrades as the context grows. Studies show that when the relevant information is placed in the middle of a very long context, the model frequently ignores it entirely. RAG sidesteps this problem by extracting only the relevant needles from the haystack and placing them front and center in a short, focused prompt. The model does not need to search through a million tokens; it receives exactly the 3-5 chunks it needs and nothing else.
Long context windows are a valid choice in one narrow scenario: when the total knowledge base is small (under 50K tokens), changes infrequently, and query volume is low enough that the per-query cost is acceptable. Beyond that threshold, RAG is the economically and technically superior approach.
There is a hybrid approach worth considering: use RAG to retrieve the most relevant chunks and then use a moderately sized context window (8K-16K tokens) to fit those chunks alongside the query and system prompt. This gives you the precision of RAG retrieval with enough context window to include 5-10 substantial chunks. The cost stays low because the prompt is focused, and the LLM receives only the information it needs rather than everything you have.
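The budget-filling step of this hybrid approach can be sketched as a greedy loop over the ranked chunks. This sketch uses whitespace word counts as a crude token estimate; a real system would use the model's tokenizer, and the chunk data here is synthetic.

```python
def assemble_context(ranked_chunks: list[str], budget_tokens: int) -> list[str]:
    # Greedily keep top-ranked chunks until the token budget is spent.
    # Whitespace word count is a crude stand-in for real tokenization.
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())
        if used + cost > budget_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected

# Three synthetic chunks of ~300 "tokens" each against a 700-token budget:
chunks = ["alpha " * 300, "beta " * 300, "gamma " * 300]
print(len(assemble_context(chunks, budget_tokens=700)))  # the first two fit
```

Because the chunks arrive ranked by relevance, truncating at the budget drops the least relevant material first, which is exactly the trade-off the hybrid approach is making.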
RAG flips the economics of knowledge. Fine-tuning costs hundreds of dollars per run and takes hours to update. RAG updates knowledge by re-indexing a document in seconds. This is why RAG dominates production AI, not because it produces better answers, but because it lets you change the answers without retraining the model.