Embedding Stores and Vector Databases
An embedding is a dense vector of floating-point numbers that encodes the semantic meaning of an input. Two sentences that mean the same thing end up close together in vector space, even if they share no words. That property is what makes semantic search, recommendation, and retrieval-augmented generation (RAG) possible.
Generating embeddings is straightforward in a notebook. Running an embedding pipeline that serves billions of objects reliably is an entirely different engineering problem.
How the pipeline works
A production embedding pipeline has three stages. First, raw content (text, images, audio) is preprocessed: cleaned, chunked, and normalized. Second, a model converts each chunk into a vector. Third, vectors are written to a vector store alongside metadata. Every stage must be fault-tolerant — a crash midway through a billion-document corpus is expensive to restart.
The chunking strategy matters more than it might seem. A 10,000-word article should not become a single embedding — the resulting vector averages meaning across the whole document and loses specificity. Instead, split it into overlapping 256–512 token chunks, where each chunk represents one coherent idea. The vector for "managing memory in C++" should not be diluted by content about "Python garbage collection" just because they appear in the same article.
Common model choices:
- sentence-transformers/all-MiniLM-L6-v2: 384 dimensions, fast, English-focused, low cost
- sentence-transformers/all-mpnet-base-v2: 768 dimensions, higher quality, English
- text-embedding-3-small (OpenAI): 1536 dimensions, strong multilingual coverage
- text-embedding-3-large (OpenAI): 3072 dimensions, highest accuracy, highest cost
- Domain-specific fine-tuned models: better recall in narrow verticals (legal, medical, code)
Batch vs real-time generation
Batch generation runs offline: you iterate over a corpus, embed in large batches (512–2048 items per request), and write results to the vector store. GPU throughput is the bottleneck, and you optimize for items-per-second. Real-time generation runs at ingest time for streaming data — latency is the bottleneck and you optimize for p99 under a target SLA.
Most systems use both paths. Batch handles the initial corpus and nightly re-processing jobs. Real-time handles newly created or updated documents that need to be searchable within seconds. The two paths write to the same vector store but operate independently.
For batch jobs on a GPU, a single A100 can embed roughly 5,000–15,000 documents per second depending on model size, document length, and batch size. At 10K docs/second, processing 1 billion documents takes about 28 hours of continuous GPU time. Plan for this when scoping model upgrade windows.
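The 28-hour figure can be reproduced with a short calculation. This is a minimal sketch: the function name is illustrative, and the assumption that throughput scales linearly with GPU count is optimistic for real clusters.

```python
def batch_embedding_hours(num_docs, docs_per_second, num_gpus=1):
    """Wall-clock hours to embed num_docs at a sustained per-GPU throughput."""
    if docs_per_second <= 0 or num_gpus <= 0:
        raise ValueError("throughput and GPU count must be positive")
    seconds = num_docs / (docs_per_second * num_gpus)
    return seconds / 3600


one_gpu = batch_embedding_hours(1_000_000_000, 10_000)
eight_gpus = batch_embedding_hours(1_000_000_000, 10_000, num_gpus=8)
print(f"{one_gpu:.1f} hours on one GPU, {eight_gpus:.1f} hours on eight")
```

Running the same function at the low and high ends of the throughput range gives a quick bound on the upgrade window.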
Dimensionality trade-offs
Higher dimensions capture more semantic nuance but cost more everywhere: storage, memory, network bandwidth, and compute during both index build and query execution.
Raw storage per vector at float32 (4 bytes per dimension):
- 384d = 1,536 bytes = 1.5 KB per vector
- 768d = 3,072 bytes = 3.0 KB per vector
- 1536d = 6,144 bytes = 6.0 KB per vector
- 3072d = 12,288 bytes = 12.0 KB per vector
At 1 billion vectors with 768 dimensions: 1B × 768 × 4 = 3 TB of raw vector data before any index structure overhead. A purpose-built index (HNSW or IVF) adds another 1–4x on top of that, meaning the total index footprint for 1B 768d vectors ranges from 6 TB to 15 TB depending on the algorithm.
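The footprint arithmetic above can be sketched as a small helper. The function names are illustrative, and the 1x to 4x overhead range is simply the one quoted in the text rather than a measured value for any particular index.

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw storage for the vectors alone (float32 by default)."""
    return num_vectors * dims * bytes_per_value


def index_footprint_range_tb(num_vectors, dims, overhead_low=1.0, overhead_high=4.0):
    """Total footprint in TB when the index adds overhead_low..overhead_high x raw size."""
    raw_tb = raw_vector_bytes(num_vectors, dims) / 1e12
    return raw_tb * (1 + overhead_low), raw_tb * (1 + overhead_high)


for dims in (384, 768, 1536, 3072):
    print(dims, raw_vector_bytes(1, dims), "bytes per vector")

low_tb, high_tb = index_footprint_range_tb(1_000_000_000, 768)
print(f"1B x 768d: {low_tb:.1f} TB to {high_tb:.1f} TB including index overhead")
```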
One mitigation is int8 quantization: convert each float32 to an 8-bit integer, reducing storage by 4x with a small recall penalty (typically 1–3% on recall@10). Most modern vector databases support int8 storage natively.
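To make the idea concrete, here is a minimal sketch of symmetric per-vector int8 quantization in plain Python. It is one simple scheme chosen for illustration; production vector databases may use different calibration (for example per-dimension ranges), and the function names are assumptions.

```python
from array import array


def quantize_int8(vector):
    """Map floats to int8 codes in [-127, 127] using one scale per vector."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = array("b", (max(-127, min(127, round(x / scale))) for x in vector))
    return codes, scale


def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


vec = [0.12, -0.5, 0.33, 0.0, 0.49]
codes, scale = quantize_int8(vec)
restored = dequantize_int8(codes, scale)

float32_bytes = array("f", vec).itemsize * len(vec)
int8_bytes = codes.itemsize * len(codes)
max_error = max(abs(a - b) for a, b in zip(vec, restored))
print(float32_bytes, int8_bytes, max_error)
```

The per-vector scale adds a few bytes of overhead, which is negligible at hundreds of dimensions. The reconstruction error per component is bounded by half the scale, which is where the small recall penalty comes from.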

The model-change problem
Embedding models are not interchangeable. Two different models produce vectors in unrelated high-dimensional spaces, so a vector from all-MiniLM-L6-v2 (384 dimensions) is meaningless when compared to a vector from text-embedding-3-large (3072 dimensions). When the dimensions differ, cosine similarity cannot even be computed; when they happen to match, the score is noise, not a relevance signal.
When you upgrade your embedding model, every vector in your store must be recomputed and re-indexed. At a billion documents that can take days of GPU time. The migration requires careful coordination:
- Provision a shadow index sized for the new model's dimensionality
- Run a background batch job to re-embed the full corpus into the shadow index
- Set up dual-write so newly ingested documents go to both indexes during migration
- Once the shadow index is complete, run offline evaluation on a labeled query set
- If recall meets the target, atomically swap the serving index pointer
- Keep the old index alive for 48 hours as a rollback target
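The migration steps above can be sketched with in-memory stand-ins. Everything here is hypothetical scaffolding (the VectorIndex and MigratingStore classes, the toy embedders, the recall numbers); a real system would swap an alias or pointer in the vector database rather than a Python attribute.

```python
class VectorIndex:
    """In-memory stand-in for a real vector index, keyed by document ID."""

    def __init__(self, model_version, dims):
        self.model_version = model_version
        self.dims = dims
        self.vectors = {}

    def upsert(self, doc_id, vector):
        if len(vector) != self.dims:
            raise ValueError(f"expected {self.dims} dims, got {len(vector)}")
        self.vectors[doc_id] = vector


class MigratingStore:
    """Tracks the serving index, an optional shadow index, and a rollback target."""

    def __init__(self, serving, embedders):
        self.serving = serving
        self.shadow = None
        self.previous = None
        self.embedders = embedders  # model_version -> function(text) -> vector

    def _embed_into(self, index, doc_id, text):
        index.upsert(doc_id, self.embedders[index.model_version](text))

    def start_migration(self, shadow):
        self.shadow = shadow

    def ingest(self, doc_id, text):
        # Dual-write: each index receives a vector from its own model.
        self._embed_into(self.serving, doc_id, text)
        if self.shadow is not None:
            self._embed_into(self.shadow, doc_id, text)

    def backfill(self, corpus):
        # Background re-embedding of the existing corpus into the shadow index.
        for doc_id, text in corpus.items():
            self._embed_into(self.shadow, doc_id, text)

    def swap_if_ready(self, shadow_recall, target_recall):
        if self.shadow is None or shadow_recall < target_recall:
            return False
        # Promote the shadow index and keep the old one as the rollback target.
        self.previous, self.serving, self.shadow = self.serving, self.shadow, None
        return True

    def rollback(self):
        if self.previous is None:
            raise RuntimeError("no rollback target retained")
        self.serving, self.previous = self.previous, None


# Toy embedders: real ones would call the old and new models.
embedders = {
    "v1": lambda text: [float(len(text))] * 4,
    "v2": lambda text: [float(len(text))] * 8,
}
corpus = {"a": "first document", "b": "second document"}

store = MigratingStore(VectorIndex("v1", 4), embedders)
for doc_id, text in corpus.items():
    store.ingest(doc_id, text)

store.start_migration(VectorIndex("v2", 8))
store.backfill(corpus)
store.ingest("c", "arrives during migration")
swapped = store.swap_if_ready(shadow_recall=0.93, target_recall=0.90)
print(swapped, store.serving.model_version, sorted(store.serving.vectors))
```

Because ingestion dual-writes for the whole migration, the document that arrived mid-migration is present in both indexes, so neither the swap nor a rollback loses it.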
This is why teams treat embedding model versions like database schema versions — migrations need to be planned, tested, and executed with the same care as a major schema migration.
Multi-modal embeddings
Embedding models are not limited to text. The same vector store and ANN index infrastructure works for images, audio, and video — as long as the modality has an appropriate embedding model.
Image embeddings are produced by vision transformers (ViT) or CLIP. CLIP (Contrastive Language-Image Pretraining) is particularly powerful because it embeds both images and text into the same vector space, enabling cross-modal queries: searching for images by text description ("red sports car at sunset") and vice versa. A product catalog with CLIP embeddings allows customers to upload a photo of an item they want and retrieve visually similar products — a query type that BM25 cannot handle at all.
Audio embeddings are produced by models like wav2vec or Whisper features. They map audio clips to vectors that are close when the audio is perceptually similar. Applications include music similarity search (Spotify's recommendation pipeline), speaker verification, and audio content moderation.
Multi-modal systems face the same production challenges as text: large vectors (CLIP ViT-L/14 produces 768d float32 vectors), expensive generation (image inference is slower than text inference), and the model-change problem on upgrade. Multi-modal adds one additional complication: cross-modal alignment. If you update the text encoder but not the image encoder, cross-modal queries degrade because the two modalities no longer share the same embedding space.
Pipeline failure modes
Every stage of the embedding pipeline has a characteristic failure mode:
Preprocessing failures: if a document's encoding is misdetected (UTF-16 treated as UTF-8) or HTML entities are left undecoded, the model sees garbage tokens and produces a semantically meaningless vector. The vector is numerically valid and no exception is raised, but it clusters randomly in vector space rather than with semantically related content. Catch this by validating that decoded text is printable and non-empty before passing it to the embedding model.
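A validation gate of this kind might look like the following sketch. The function name and the 95% printable-character threshold are assumptions chosen for illustration, not values from any particular pipeline.

```python
import html


def clean_text_or_none(raw, encoding="utf-8", min_printable_ratio=0.95):
    """Decode bytes and return clean text, or None if it looks like garbage."""
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        return None
    text = html.unescape(text).strip()  # turn &amp; and friends into characters
    if not text:
        return None
    printable = sum(1 for ch in text if ch.isprintable() or ch in "\n\t")
    if printable / len(text) < min_printable_ratio:
        return None
    return text


good = clean_text_or_none(b"Caf&eacute; &amp; bar")
# UTF-16 bytes read as UTF-8 decode without error but are full of NUL characters.
mis_decoded = clean_text_or_none("hello".encode("utf-16-le"))
empty = clean_text_or_none(b"   ")
print(good, mis_decoded, empty)
```

Documents that return None should be routed to a review queue rather than embedded, since the embedding model itself will not complain.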
Model API failures: external embedding APIs (OpenAI, Cohere) rate-limit at token throughput. When hitting rate limits, the correct behavior is exponential backoff with jitter — not immediate retry loops that exhaust the limit again. Set a max retry budget of 5 attempts with delays of 1s, 2s, 4s, 8s, 16s. After exceeding the budget, write the document ID to a dead-letter queue for manual review.
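The retry policy described above can be sketched as follows. RateLimitError is a hypothetical exception standing in for whatever a client wrapper raises on HTTP 429, and the jitter variant shown (a random wait up to the nominal delay, often called full jitter) is one common choice among several.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical exception a client wrapper raises on HTTP 429."""


def embed_with_backoff(embed_fn, text, doc_id, dead_letter,
                       max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call embed_fn with exponential backoff; dead-letter the ID on exhaustion."""
    for attempt in range(max_retries + 1):
        try:
            return embed_fn(text)
        except RateLimitError:
            if attempt == max_retries:
                break  # retry budget exhausted
            cap = base_delay * (2 ** attempt)  # 1s, 2s, 4s, 8s, 16s
            sleep(random.uniform(0, cap))  # full jitter: random wait up to the cap
    dead_letter.append(doc_id)
    return None


# Usage: a fake client that is rate-limited twice, then succeeds.
attempts = {"count": 0}


def flaky_embed(text):
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimitError()
    return [0.1, 0.2, 0.3]


dead_letter_queue = []
vector = embed_with_backoff(flaky_embed, "hello", "doc-42", dead_letter_queue,
                            sleep=lambda seconds: None)  # skip real sleeping in the demo
print(vector, attempts["count"], dead_letter_queue)
```

Passing the sleep function as a parameter keeps the retry logic testable without waiting through real delays.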
Duplicate vectors: if the same document is processed twice (due to a pipeline restart), it gets two nearly identical vectors in the index. This wastes storage and can cause the same document to appear twice in search results. Use an idempotency key (the document's canonical ID) to detect and skip already-indexed documents. Most vector databases support upsert semantics: insert a new vector or update the existing one by ID.
Stale vectors: when a document's content changes, its vector becomes stale. The old semantics remain indexed while the new content is unsearchable. Implement update triggers: when a document is modified, re-embed it and upsert the vector. Track last_embedded_at in metadata and run nightly jobs to re-embed documents whose content has changed since their last embedding.
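Duplicate and stale vectors can both be handled by one keyed upsert. The sketch below uses a plain dictionary as a stand-in for the vector store and adds a content hash to the metadata; the hash field and the function names are assumptions layered on top of the last_embedded_at field mentioned above.

```python
import hashlib
from datetime import datetime, timezone


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync_document(store, doc_id, text, embed_fn):
    """Upsert by canonical ID; skip the embedding call when content is unchanged."""
    digest = content_hash(text)
    existing = store.get(doc_id)
    if existing is not None and existing["content_hash"] == digest:
        return "skipped"  # already indexed with identical content
    store[doc_id] = {
        "vector": embed_fn(text),
        "content_hash": digest,
        "last_embedded_at": datetime.now(timezone.utc),
    }
    return "inserted" if existing is None else "updated"


def toy_embed(text):
    # Placeholder for a real model call.
    return [float(len(text))]


vector_store = {}
first = sync_document(vector_store, "doc-1", "hello", toy_embed)
replay = sync_document(vector_store, "doc-1", "hello", toy_embed)
edited = sync_document(vector_store, "doc-1", "hello world", toy_embed)
print(first, replay, edited, len(vector_store))
```

A pipeline restart replays the same (ID, content) pairs and hits the skipped branch, while an edited document takes the updated branch and overwrites the stale vector in place.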
Metadata schema drift: as the application evolves, new metadata fields are added (a new language field, a quality_score, a region tag). Old vectors in the index have no value for these new fields. This creates inconsistency: pre-filter queries on the new field return only recently indexed documents, silently excluding the older corpus. Handle this by backfilling metadata: after adding a new field, run a background job to set a default value on all existing vectors that lack the field. Most vector databases support metadata updates (payload updates) without requiring re-embedding.
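A backfill job of this kind can be sketched against an in-memory record map. The record layout and function names are hypothetical; a real job would page through the vector database and issue payload updates in batches.

```python
def backfill_field(records, field, default):
    """Set a default for `field` on every record that lacks it; return the count."""
    updated = 0
    for record in records.values():
        metadata = record.setdefault("metadata", {})
        if field not in metadata:
            metadata[field] = default
            updated += 1
    return updated


def filter_by(records, field, value):
    """Toy pre-filter: IDs of records whose metadata field equals value."""
    return sorted(
        doc_id for doc_id, record in records.items()
        if record.get("metadata", {}).get(field) == value
    )


records = {
    "old-1": {"vector": [0.1], "metadata": {}},
    "old-2": {"vector": [0.2]},
    "new-1": {"vector": [0.3], "metadata": {"language": "en"}},
}

before = filter_by(records, "language", "unknown")
updated_count = backfill_field(records, "language", "unknown")
after = filter_by(records, "language", "unknown")
print(before, updated_count, after)
```

After the backfill, filters on the new field can address the older corpus explicitly instead of silently dropping it.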
Monitoring and quality signals
An embedding pipeline has two distinct quality concerns: infrastructure health and retrieval quality. Infrastructure health is easy to monitor: track throughput (vectors/second), error rates, queue depth, and disk write latency. Retrieval quality is harder because it requires labeled data.
Set up a canary query set: 500–1,000 queries with known relevant documents. Run this set daily against your vector index and measure recall@10. A sudden drop in recall@10 — even without any deployment — signals index drift, a corrupt segment, or a model API change. Alert when recall drops more than 2 percentage points below the baseline.
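The daily canary check can be sketched in a few functions. The search function signature and the toy data are assumptions; recall here is computed as hits divided by the number of known relevant documents, which is one of several conventions for recall@k.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of known relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("each canary query needs at least one relevant document")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def canary_recall(search_fn, canary_set, k=10):
    """Mean recall@k over a dict of query -> list of relevant document IDs."""
    scores = [recall_at_k(search_fn(query, k), relevant, k)
              for query, relevant in canary_set.items()]
    return sum(scores) / len(scores)


def should_alert(current, baseline, max_drop=0.02):
    """True when recall fell more than max_drop (2 percentage points) below baseline."""
    return (baseline - current) > max_drop


# Toy index: q1 finds both relevant documents, q2 finds none.
canary_set = {"q1": ["d1", "d2"], "q2": ["d3"]}
fake_results = {"q1": ["d1", "d9", "d2"], "q2": ["d7", "d8"]}


def fake_search(query, k):
    return fake_results[query][:k]


current = canary_recall(fake_search, canary_set)
print(current, should_alert(current, baseline=0.93))
```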
Track the following operational metrics in production:
- Embedding queue depth: how many documents are waiting to be embedded. A growing queue means the embedding pipeline is a bottleneck.
- Indexing lag: the gap between when a document is written to the vector store and when it is fully indexed and searchable with high recall. Normal lag is under 60 seconds for HNSW.
- Index memory usage: HNSW indexes load entirely into RAM. If memory usage approaches the server limit, query latency degrades sharply as the OS starts swapping. Alert at 80% memory.
- Embedding API error rate: for externally hosted embedding models (OpenAI, Cohere), track HTTP 429 (rate limit) and 5xx errors separately. Rate limits require backoff and retry logic; 5xx errors require circuit breaking.
Choosing chunk size
The optimal chunk size depends on how users query. If users ask short, focused questions like "what is the capital of France", small 128-token chunks work well — each chunk answers one fact cleanly. If users ask multi-sentence questions that require synthesizing context across a passage, larger 512-token chunks with 64-token overlap preserve context better.
A common default: 256-token chunks with 32-token overlap. The overlap reduces information loss at chunk boundaries: a sentence that straddles a boundary appears whole in at least one chunk, provided it is shorter than the overlap.
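A fixed-size chunker with overlap is only a few lines. This sketch operates on an already tokenized list; in practice the tokens would come from the embedding model's own tokenizer, which is not shown here.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = list(range(600))  # stand-in for 600 token IDs
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
print(chunks[0][-32:] == chunks[1][:32])
```

The last chunk is usually shorter than the rest; whether to keep it, pad it, or merge it into the previous chunk is a design choice that depends on how short tails affect retrieval in your corpus.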
Some advanced implementations use sentence-level chunking (split on punctuation, target 3–5 sentences per chunk) rather than fixed token counts. This produces semantically coherent chunks at the cost of variable lengths, which adds complexity to the embedding pipeline.
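Sentence-level chunking can be sketched with a regular expression. This splitter is deliberately naive (it will break on abbreviations such as "e.g." and on decimal numbers followed by a space), so treat it as an illustration rather than a replacement for a proper sentence segmenter.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def chunk_sentences(text, sentences_per_chunk=4):
    """Group consecutive sentences into chunks of a target sentence count."""
    sentences = split_sentences(text)
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


sample = "One. Two! Three? Four. Five."
sentence_chunks = chunk_sentences(sample, sentences_per_chunk=4)
print(sentence_chunks)
```

Because chunk lengths now vary, the pipeline still needs a guard that truncates or re-splits any chunk exceeding the embedding model's maximum input length.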
Fine-tuning embedding models
General-purpose embedding models are trained on broad web data. When your domain has specialized vocabulary — legal citations, medical terminology, code identifiers, financial instrument names — a general model may fail to group semantically similar domain-specific items closely in vector space.
Fine-tuning uses contrastive learning: you provide pairs of (query, relevant_document) examples. The training objective is to bring the query vector and relevant document vector close together while pushing unrelated documents apart. The standard approach uses the InfoNCE (NT-Xent) loss with hard negatives — similar-but-wrong documents that the model must learn to distinguish from the correct one.
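The objective for a single query can be written out in plain Python to show what the loss rewards. This is a sketch of the InfoNCE form with cosine similarity; the temperature of 0.05 and the toy two-dimensional vectors are assumptions, and real training would use a deep learning framework over batches.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """-log( exp(s_pos / t) / sum over all candidates of exp(s / t) )."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, negative) / temperature for negative in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


query = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negatives = [[0.0, 1.0], [-1.0, 0.0]]
hard_negatives = [[0.8, 0.2], [-1.0, 0.0]]  # first one is similar but wrong

easy_loss = info_nce_loss(query, positive, easy_negatives)
hard_loss = info_nce_loss(query, positive, hard_negatives)
print(easy_loss, hard_loss)
```

With easy negatives the loss is already near zero and provides almost no gradient, while the hard negative produces a much larger loss. That difference is why hard negative mining matters.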
A practical fine-tuning pipeline:
- Collect 10,000–50,000 (query, positive_document) pairs from user click logs, expert annotations, or synthetic generation.
- Mine hard negatives: for each query, find the top-5 documents by BM25 that are not the positive document. These are the most confusing cases for the model.
- Fine-tune starting from a pretrained checkpoint (e.g., all-mpnet-base-v2) for 3–5 epochs with a learning rate of 2e-5.
- Evaluate on a held-out set using recall@10. Expect 10–30% relative improvement on domain-specific queries.
Fine-tuned models require re-embedding the full corpus when deployed, just like a model upgrade. Budget the GPU time accordingly. Store the training data version alongside the model checkpoint so the fine-tuning run can be reproduced and extended as new training examples accumulate.
Cost modeling at scale
Before committing to an embedding strategy, model the full cost. Embedding cost has three components: generation, storage, and query compute.
Generation cost for OpenAI text-embedding-3-small at $0.02 per million tokens: a corpus of 500M documents with an average of 200 tokens per document = 100 billion tokens = $2,000 to embed the initial corpus. Re-embedding for a model upgrade = another $2,000. By comparison, a self-hosted open-source model on a GPU server ($3/hour for an A100) that processes 10M tokens/minute needs about 167 GPU-hours, or roughly $500, per pass. Once setup and operations costs are included, self-hosting becomes cheaper than API calls after a few re-indexing cycles.
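The generation-cost comparison can be reproduced with a short model. The prices and throughput are the ones quoted in the text; the $5,000 one-time setup cost is a purely hypothetical figure, included only to show how a break-even point would be computed.

```python
import math


def api_cost_usd(total_tokens, usd_per_million_tokens):
    return total_tokens / 1_000_000 * usd_per_million_tokens


def self_hosted_cost_usd(total_tokens, tokens_per_minute, usd_per_gpu_hour):
    gpu_hours = total_tokens / tokens_per_minute / 60
    return gpu_hours * usd_per_gpu_hour


def passes_to_break_even(fixed_setup_usd, api_per_pass, hosted_per_pass):
    """Full-corpus passes needed before self-hosting is cheaper overall."""
    saving_per_pass = api_per_pass - hosted_per_pass
    if saving_per_pass <= 0:
        return None  # self-hosting never pays off at these prices
    return math.ceil(fixed_setup_usd / saving_per_pass)


total_tokens = 500_000_000 * 200  # 500M documents x 200 tokens each
api_pass = api_cost_usd(total_tokens, 0.02)
hosted_pass = self_hosted_cost_usd(total_tokens, 10_000_000, 3.0)
break_even = passes_to_break_even(5_000, api_pass, hosted_pass)  # hypothetical setup cost
print(round(api_pass), round(hosted_pass), break_even)
```

Substituting your own setup and staffing estimate for the hypothetical fixed cost is what turns this from an illustration into a planning tool.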
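Storage cost: 500M × 768d × 4 bytes = 1.5 TB of raw vectors. At $0.023/GB/month (S3 standard), storing them costs about $35/month. The vector index in RAM is far more expensive. HNSW keeps both the vectors and the graph links in memory, so a float32 index needs more than the 1.5 TB of raw vectors; a figure of roughly 600 GB of RAM for 500M vectors with M=32 is only reachable with int8-quantized vectors (about 384 GB) plus graph links. Either way the index spans multiple memory-optimized instances (for example r6g.16xlarge, 512 GiB each) costing a few dollars per hour each.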
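Query cost: each vector query touches the in-memory HNSW index. At 1,000 QPS on a 500M-vector index with ef_search=100, assume approximately 100,000 distance computations per query: 100,000 × 1,000 QPS = 100 million distance computations per second. At 768 dimensions each distance computation costs roughly 1,500 floating-point operations, so the index performs on the order of 150 billion float operations per second. This saturates a single node's compute, requiring horizontal scaling.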
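The query load can be worked through explicitly. In this sketch the per-query distance count is the estimate from the text, and each distance computation is counted as one multiply and one add per dimension, which is an approximation that ignores SIMD and memory effects.

```python
def query_compute(qps, distance_computations_per_query, dims):
    """Return (distance computations per second, float operations per second)."""
    distances_per_second = qps * distance_computations_per_query
    flops_per_distance = 2 * dims  # one multiply and one add per dimension
    return distances_per_second, distances_per_second * flops_per_distance


distances, flops = query_compute(qps=1_000,
                                 distance_computations_per_query=100_000,
                                 dims=768)
print(f"{distances:,} distance computations/s, {flops / 1e9:.0f} GFLOP/s")
```

In practice memory bandwidth is often the tighter limit, because each distance computation also has to read a full vector from RAM.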
These numbers make the case for dimensionality reduction: dropping from 1536d to 768d halves storage and per-query distance compute (generation cost depends on token count and model size, not on output dimension). Dropping from float32 to int8 cuts storage by 4x and query compute by up to 4x with minimal recall impact. Model these trade-offs before finalizing the embedding strategy.
API vs self-hosted embedding models
The decision between API-based and self-hosted embedding models hinges on volume, latency, and data sensitivity.
API-based models (OpenAI, Cohere, Google) are the right choice when: total monthly token volume is under 10 billion tokens, you need multilingual support without training investment, or the team lacks ML operations expertise. The operational overhead is minimal: no GPUs to manage and no model serving infrastructure, only client-side rate-limit handling.
Self-hosted models become justified when: monthly token volume exceeds 50 billion tokens (at which point GPU amortization beats API pricing), data privacy requirements prohibit sending content to external services, or you need to fine-tune on proprietary domain data. The operational requirement is a model serving layer — typically TorchServe, Triton Inference Server, or a simple FastAPI wrapper around the HuggingFace transformers library.
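A hybrid approach: use a hosted API for low-volume real-time embedding (new documents at ingest time) and self-host for high-volume batch jobs (initial corpus embedding, nightly reprocessing). Both paths must run the same model and version, for example an open-weights model that is available both through a hosted inference API and for self-hosting, because vectors from different models are not comparable. This minimizes infrastructure complexity while keeping API costs bounded.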
Model upgrades are the most disruptive operation in a vector system. Treat your embedding model version like a database schema version: plan migrations in advance, and never swap the model without re-indexing.