RAG Architecture for Production

ML Systems & Infrastructure

Large language models are trained on snapshots of the internet. The moment training ends, they know nothing about events after that date, nothing about your company's internal documents, and nothing about data that was never public. Retrieval-Augmented Generation (RAG) solves this by fetching relevant information at query time and injecting it into the LLM's prompt. Instead of asking the model to recall facts from its parameters, you give it the facts and ask it to reason over them. This is why RAG became the default architecture for enterprise AI applications: it lets you use a general-purpose LLM with your private, up-to-date data without retraining.

Why RAG Over Fine-Tuning

Fine-tuning bakes knowledge into model weights. This works for stable domain knowledge (medical terminology, legal language patterns) but fails for dynamic data. If your knowledge base changes daily, you cannot fine-tune daily. RAG keeps the knowledge external in a searchable index that you update independently of the model. The model stays general, and the index stays current.

Fine-tuning also requires training infrastructure, GPU hours, and expertise in hyperparameter tuning. RAG requires a vector database and an embedding model, both of which are available as managed services. For most production use cases, RAG is the faster path to a working system.

The Full Pipeline

A production RAG system has two phases: ingestion and query.

The ingestion phase runs offline. Documents enter the system, get split into chunks, each chunk gets embedded into a vector, and those vectors are stored in a vector database alongside the original text and metadata. This phase runs once per document and again when documents are updated.
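As a minimal sketch of the ingestion phase, the code below chunks a document, embeds each chunk, and stores the result. Everything named here is an assumption made for illustration: `toy_embed` is a hashed word-count stand-in for a real embedding model, the plain Python list stands in for a vector database, and `chunk_text` uses naive fixed-size word windows.

```python
import hashlib
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    """One stored chunk: its vector plus the original text and metadata."""
    vector: list[float]
    text: str
    metadata: dict = field(default_factory=dict)


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed word counts, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def ingest(doc_id: str, text: str, index: list[Record], max_words: int = 50) -> int:
    """Chunk, embed, and store one document; re-ingesting replaces its old chunks."""
    index[:] = [r for r in index if r.metadata.get("source") != doc_id]
    chunks = chunk_text(text, max_words)
    for position, chunk in enumerate(chunks):
        metadata = {"source": doc_id, "position": position}
        index.append(Record(toy_embed(chunk), chunk, metadata))
    return len(chunks)


# The in-memory list plays the role of the vector database.
index: list[Record] = []
stored = ingest("handbook.txt", "alpha beta gamma " * 40, index)
print(f"stored {stored} chunks")
```

Deleting a document's old records before inserting the new ones is one simple way to handle updates; a real vector database would typically do this with a delete-by-filter or an upsert keyed on chunk IDs.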

The query phase runs in real time. A user's question arrives, gets embedded using the same embedding model, and the vector database returns the most similar chunks. Those chunks are passed to the LLM as context along with the question. The LLM generates an answer grounded in the retrieved context.

Figure: The full production RAG pipeline. Ingestion: document to chunks to embeddings to vector DB. Query: question to embedding to retrieval to LLM generation.
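The query phase can be sketched in the same spirit. This example assumes a hypothetical `toy_embed` function (hashed word counts, not a real model), a brute-force similarity scan in place of a vector database, and an illustrative prompt template. The LLM call itself is left out, since it depends on the provider.

```python
import hashlib
import math
import re


def toy_embed(text: str, dim: int = 512) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed word counts, L2-normalized."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dot(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity because the vectors are normalized."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(question: str, index: list[tuple[list[float], str]], k: int = 2) -> list[str]:
    """Embed the question with the same model and return the k most similar chunks."""
    query_vector = toy_embed(question)
    ranked = sorted(index, key=lambda item: dot(query_vector, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble retrieved chunks and the question into a single prompt string."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up chunks standing in for an ingested knowledge base.
chunks = [
    "Refunds are issued within 14 days of purchase.",
    "The office is closed on public holidays.",
    "Support is available by email on weekdays.",
]
index = [(toy_embed(chunk), chunk) for chunk in chunks]

question = "When are refunds issued after a purchase?"
top = retrieve(question, index, k=1)
prompt = build_prompt(question, top)
print(prompt)  # this string would be sent to the LLM
```

The important constraint is that the question and the chunks go through the same embedding function; vectors from different models are not comparable.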

Embedding Model Selection

The embedding model converts text into dense vectors, and your choice here sets the ceiling on retrieval quality. General-purpose models like OpenAI's text-embedding-3-small (1536 dimensions) or Cohere's embed-v3 work well across domains. Domain-specific models trained on medical, legal, or scientific text tend to outperform general models on those domains but underperform elsewhere.

Dimension count matters. Higher dimensions can capture more nuance but increase storage costs and search latency. A 1536-dimension model storing 10 million vectors as 32-bit floats needs roughly 60 GB of memory for the vectors alone (10,000,000 × 1536 × 4 bytes ≈ 61 GB), before any index overhead. Some models offer configurable dimensions: OpenAI's text-embedding-3 models accept a dimensions parameter, so text-embedding-3-large can be shortened from its native 3072 dimensions to 1024 or 256, letting you trade precision for efficiency.
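The storage arithmetic is easy to check for other configurations. The helper below is a small sketch, assuming 4 bytes per value (float32) and decimal gigabytes; `vector_memory_gb` is a name made up for this example, and the figure excludes index structures and metadata.

```python
def vector_memory_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in decimal GB, excluding index overhead and metadata."""
    return num_vectors * dimensions * bytes_per_value / 1e9


# Compare storage for 10 million vectors at several dimension counts.
for dims in (256, 1024, 1536, 3072):
    print(f"{dims:>4} dims: {vector_memory_gb(10_000_000, dims):.2f} GB")
```

Passing `bytes_per_value=2` or `1` gives a rough estimate for half-precision or 8-bit quantized storage, assuming the database supports it.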

Key Insight

The embedding model is the single most impactful choice in a RAG system. A better chunking strategy with a weak embedding model will underperform a simple chunking strategy with a strong embedding model. When debugging RAG quality, start by evaluating whether your embedding model captures the semantic distinctions your queries require.
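One way to make that evaluation concrete is to measure how often the correct chunk lands in the top k results for a small set of labeled queries. The sketch below is an assumption about how such a check might look, not a prescribed method: `hit_rate_at_k` is a hypothetical helper, `keyword_embed` is a deliberately crude stand-in for the model under test, and the chunks and queries are made up.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def hit_rate_at_k(
    embed: Callable[[str], list[float]],
    chunks: list[str],
    labeled_queries: list[tuple[str, int]],
    k: int = 3,
) -> float:
    """Fraction of queries whose labeled relevant chunk appears in the top k results."""
    chunk_vectors = [embed(chunk) for chunk in chunks]
    hits = 0
    for query, relevant_index in labeled_queries:
        query_vector = embed(query)
        ranked = sorted(
            range(len(chunks)),
            key=lambda i: cosine(query_vector, chunk_vectors[i]),
            reverse=True,
        )
        hits += relevant_index in ranked[:k]
    return hits / len(labeled_queries)


# Hypothetical embedding under test: counts a few fixed keywords.
KEYWORDS = ("refund", "invoice", "password")


def keyword_embed(text: str) -> list[float]:
    lowered = text.lower()
    return [float(lowered.count(word)) for word in KEYWORDS]


chunks = [
    "Refund requests are handled by the billing team.",
    "Each invoice is emailed at the start of the month.",
    "Reset your password from the account settings page.",
]
# Each query is paired with the index of the chunk that should be retrieved.
labeled_queries = [
    ("How do I get a refund?", 0),
    ("Where can I find my invoice?", 1),
    ("I forgot my password", 2),
]
score = hit_rate_at_k(keyword_embed, chunks, labeled_queries, k=1)
print(f"hit rate at k=1: {score:.2f}")
```

Running the same labeled queries through two candidate embedding functions gives a direct, if rough, comparison on your own data.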

Vector Databases and Metadata Filtering

Vector databases store embeddings and support approximate nearest neighbor (ANN) search. Pinecone, Weaviate, Qdrant, Milvus, and pgvector (a PostgreSQL extension) are common choices. They differ in deployment model (managed or self-hosted), indexing algorithms, and the metadata filters they support.

Metadata filtering narrows the search space before vector similarity runs. If a user asks about Q1 2025 revenue, you filter to documents tagged with date metadata in Q1 2025 before running the vector search. This prevents the retriever from returning semantically similar but temporally irrelevant chunks. Production systems typically store metadata like source document, date, author, department, and document type alongside each vector.
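The effect of pre-filtering can be sketched with a tiny in-memory index. The records, the two-dimensional vectors, and the `filtered_search` helper below are all made up for illustration; a real vector database would express the date condition in its own filter syntax.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    vector: list[float]
    text: str
    doc_date: date
    doc_type: str


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def filtered_search(
    query_vector: list[float],
    index: list[Record],
    start: date,
    end: date,
    k: int = 3,
) -> list[Record]:
    """Keep only records dated within [start, end], then rank them by similarity."""
    candidates = [r for r in index if start <= r.doc_date <= end]
    candidates.sort(key=lambda r: dot(query_vector, r.vector), reverse=True)
    return candidates[:k]


# Hand-written toy vectors: the first axis loosely means "revenue", the second "hiring".
index = [
    Record([1.0, 0.0], "Q1 2024 revenue summary", date(2024, 3, 31), "report"),
    Record([0.9, 0.1], "Q1 2025 revenue summary", date(2025, 3, 31), "report"),
    Record([0.1, 0.9], "Q1 2025 hiring plan", date(2025, 2, 10), "memo"),
]
query_vector = [1.0, 0.0]  # stands in for the embedded question about Q1 2025 revenue

# Without a filter, the most similar record is from the wrong year.
unfiltered_best = max(index, key=lambda r: dot(query_vector, r.vector))

# With the date filter applied first, only Q1 2025 records are ranked.
results = filtered_search(query_vector, index, date(2025, 1, 1), date(2025, 3, 31))
print(unfiltered_best.text)
print([r.text for r in results])
```

The unfiltered search prefers the 2024 report because it is the closest vector, which is exactly the temporally irrelevant match the filter is meant to exclude.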
