Search and Ranking Systems

Course

ML Systems & Infrastructure

Search and Ranking Systems

Topics Covered

Information Retrieval Fundamentals

Scoring with TF-IDF

BM25 — The Industry Standard

The Keyword Gap

Learning to Rank

Three Approaches to Ranking

Click Data as Training Signal

Feature Engineering

Query Understanding

Spell Correction

Query Expansion

Query Rewriting

Intent Classification

Result Ranking Pipeline

The Multi-Stage Architecture

Feature Computation

Latency Budget

Search Quality Metrics

Offline Metrics

Why NDCG Is the Standard

Online Metrics

The Feedback Loop Danger

Information Retrieval Fundamentals

The core problem in search is deceptively simple: given a query, find the most relevant documents from billions. The difficulty is not finding matches — it is finding the right matches, fast enough that the user does not notice the system thinking.

The foundational data structure that makes this possible is the inverted index. Instead of scanning every document for your query terms (forward search), an inverted index flips the relationship: it maps each term to the list of documents containing it. When you search for "distributed consensus," the system looks up "distributed" (appears in documents 4, 17, 203, 891) and "consensus" (appears in documents 17, 44, 891, 2033), then intersects the lists to find documents containing both terms (17, 891). This intersection runs in milliseconds even over billions of documents because the posting lists are sorted and the intersection is a linear merge.

Search pipeline showing query flowing through understanding, retrieval from inverted index and vector store in parallel, then ranking

Scoring with TF-IDF

Finding documents that contain the query terms is only the first step. You need to rank them. TF-IDF (Term Frequency-Inverse Document Frequency) is the classic scoring formula. Term frequency measures how often a term appears in a document — a document mentioning "consensus" ten times is probably more about consensus than one mentioning it once. Inverse document frequency measures how rare the term is across all documents — "the" appears everywhere and carries no signal, while "consensus" appears in 0.1% of documents and is highly informative. Multiplying the two gives a score that rewards documents where rare, meaningful terms appear frequently.

BM25 — The Industry Standard

TF-IDF has a flaw: it rewards term frequency without limit. A document mentioning "consensus" 100 times scores 10x higher than one mentioning it 10 times, even though the 100-mention document is probably keyword stuffed. BM25 fixes this with two improvements. First, it adds saturation — after a term appears a few times, additional occurrences contribute diminishing returns. Second, it adds document length normalization — a 10,000-word document naturally contains more term occurrences than a 200-word document, so BM25 adjusts for length. Elasticsearch and Apache Solr both use BM25 as their default scoring function.

Key Insight

BM25 has survived decades because it is remarkably hard to beat with machine learning for simple keyword queries. It already captures the key statistical signal — term importance relative to corpus frequency — and ML models need thousands of labeled training examples to learn what BM25 gets from statistics alone. For most keyword search use cases, BM25 is the right starting point before adding complexity.

The Keyword Gap

Keyword-based retrieval has a fundamental limitation: it only matches exact terms. A search for "laptop" will not return documents about "notebook computer." Synonyms, paraphrases, and conceptual matches are invisible to an inverted index. A user searching "how to fix slow database queries" will miss a document titled "optimizing SQL performance" even though it answers their question perfectly. This gap between lexical matching and semantic understanding is why modern search systems combine keyword retrieval with vector-based semantic search, running both in parallel and merging results.

Learning to Rank

BM25 and TF-IDF use statistical formulas with fixed parameters. They work well for keyword queries but cannot incorporate signals like document freshness, user location, historical click-through rates, or the hundreds of other signals that determine what makes a result relevant for a specific user at a specific moment. Learning to rank (LTR) replaces the fixed formula with a machine learning model trained on relevance judgments.

Three Approaches to Ranking

Pointwise — The simplest approach. Train a model to predict an absolute relevance score for each query-document pair independently. Feed in features (BM25 score, document age, PageRank, query-document similarity) and predict a relevance label (0 = irrelevant, 1 = somewhat relevant, 2 = highly relevant). The problem: ranking is about relative order, not absolute scores. A model that predicts "document A is 0.7 relevant and document B is 0.6 relevant" works, but it was never explicitly trained to place A above B. It optimizes for score accuracy, not ranking quality.

Pairwise — Instead of scoring documents independently, train the model on pairs. Given documents A and B for the same query, predict which is more relevant. RankNet uses a neural network to learn a scoring function where the probability that A ranks above B is a sigmoid of the score difference. LambdaRank improves this by weighting each pair by how much swapping them would change NDCG — a swap between positions 1 and 2 matters far more than a swap between positions 98 and 99. This focus on relative order is why pairwise methods outperform pointwise for ranking tasks.

Listwise — Optimize the entire ranked list directly. LambdaMART (a gradient-boosted tree model using LambdaRank's gradients) is the most successful listwise approach and has won multiple information retrieval competitions. It directly optimizes for ranking metrics like NDCG by computing gradients that measure how moving each document up or down would change the metric.

Learning-to-rank feedback loop showing click data feeding back into ranking model training

Click Data as Training Signal

Explicit relevance labels (human judges rating each query-document pair) are expensive — a team of judges can label maybe 10,000 pairs per week. A search engine handles millions of queries per day. Click-through data fills the gap: if a user searches "python tutorial" and clicks the third result but skips the first two, that suggests the third result was more relevant. At scale, these implicit signals create massive training datasets for free.

But click data is biased. Position bias is the most damaging: users click the first result 30-40% of the time regardless of its relevance. The second result gets 15-20% of clicks, the third 10-12%, and so on. If you train on raw clicks, the model learns that top-ranked documents are the most relevant — which is circular. It reinforces the current ranking rather than improving it.

Common Pitfall

Position bias is the most dangerous trap in learning to rank. Users click the first result 30-40% of the time regardless of relevance. If you train on raw click data without debiasing, your model learns to preserve the current ranking rather than improve it. The ranking becomes frozen — good documents stuck at position 10 never get clicked, so the model never learns they are relevant.

Debiasing techniques include inverse propensity weighting (weight each click by the inverse of its position probability — a click at position 5 counts more than a click at position 1 because the user had to scroll past other options) and randomization experiments (show results in a slightly shuffled order for a small fraction of traffic to measure position-independent click rates).

Feature Engineering

The ranking model consumes features from four categories:

Text match features — BM25 score, exact phrase match, term overlap between query and title versus body, TF-IDF cosine similarity.

Document features — PageRank (authority), publish date (freshness), document length, domain authority, spam score, number of inbound links.

User features — Search history, geographic location, language preference, device type, past click patterns for similar queries.

Interaction features — Historical click-through rate for this specific query-document pair, dwell time on previous clicks, bounce rate, add-to-cart rate for e-commerce.

The best ranking models use hundreds of features. Feature engineering — deciding which signals to compute, how to combine them, and how to keep them fresh — is where most of the practical work in search ranking lives.

Query Understanding

Before a query reaches the retrieval engine, it needs to be understood. Users misspell words, use ambiguous terms, and type short fragments that could mean many things. Query understanding transforms the raw input into a clean, expanded, classified signal that retrieval and ranking can work with effectively.

Spell Correction

Users misspell roughly 10-15% of queries. "pythn tutorial" should become "python tutorial." The simplest approach uses edit distance — find dictionary words within 1-2 character insertions, deletions, or substitutions of the query term. But edit distance alone is noisy: "pythn" is equidistant from "python" and "piton." Context resolves the ambiguity.

N-gram language models score candidate corrections by how likely they are in context. "python tutorial" has a much higher bigram probability than "piton tutorial" because "python tutorial" appears frequently in query logs. The most effective approach combines edit distance candidates with query log frequency: the most common reformulation of "pythn" in historical logs is "python," and that statistical evidence is more reliable than any language model.

Query Expansion

Even correctly spelled queries are often too narrow. A search for "headache medicine" should also retrieve documents about "aspirin," "ibuprofen," and "acetaminophen." Query expansion adds synonyms, related terms, and hypernyms (broader category terms) to the original query. Sources for expansion include curated synonym dictionaries, embedding-based similarity (terms with nearby vectors), and query reformulation mining from logs (users who searched "headache medicine" frequently followed up with "ibuprofen dosage").

Query understanding pipeline showing misspelled query corrected, expanded with synonyms, classified by intent, and dispatched to the appropriate retrieval strategy

Query Rewriting

Query rewriting goes beyond expansion — it transforms ambiguous or poorly structured queries into more effective ones. A query like "apple" is ambiguous: does the user want the company or the fruit? A query like "cheap good laptop 2024" could be rewritten to "best budget laptops 2024" which aligns better with how product review documents are written. Modern systems use sequence-to-sequence models trained on query-to-reformulation pairs mined from session logs (the original query and whatever the user searched next when the first query failed to satisfy them).

Intent Classification

Different queries need different retrieval strategies. Intent classification typically divides queries into three categories:

Navigational — The user wants a specific page. "Facebook login" or "Amazon." The correct result is a single URL, and the system should prioritize exact URL matching and domain authority.

Informational — The user wants to learn something. "How does DNS work" or "symptoms of flu." The system should prioritize content quality, comprehensiveness, and authority.

Transactional — The user wants to buy or do something. "Buy iPhone 15 case" or "book flight to Tokyo." The system should prioritize product pages, pricing, availability, and user reviews.

Why does intent matter? A navigational query for "YouTube" that returns ten blog posts about YouTube instead of youtube.com is a catastrophic failure. An informational query about "how to bake bread" that returns only product pages for bread-making equipment misses the intent. Each intent category triggers different retrieval paths, different ranking feature weights, and different result presentation.

Result Ranking Pipeline

Running a sophisticated ranking model on every document in the corpus is computationally impossible. A corpus of 10 billion documents and a ranking model that takes 1 millisecond per document would require 116 days to score a single query. The solution is a multi-stage pipeline where each stage reduces the candidate set while increasing scoring sophistication.

The Multi-Stage Architecture

L0 — Retrieval (milliseconds, returns thousands). The inverted index and vector store run in parallel. The inverted index returns documents matching keyword terms (using BM25 scoring). The vector store returns documents with embeddings nearest to the query embedding (using approximate nearest neighbor search). The union of both sets forms the candidate pool — typically 5,000 to 50,000 documents. This stage is fast because inverted index lookups and ANN search are sub-linear operations.

L1 — Lightweight scoring (tens of milliseconds, prunes to hundreds). A simple model (logistic regression or a small neural network) scores each candidate using a small feature set: BM25 score, document authority, freshness, and query-document embedding similarity. This stage runs on CPU and prunes the candidate set from thousands to 200-500 documents. The goal is to discard obviously irrelevant candidates cheaply.

L2 — Heavy scoring (tens of milliseconds, returns top 50). A full-featured model (LambdaMART with 200+ features, or a BERT-based cross-encoder) scores the remaining candidates. This model has access to the complete feature set: text match features, user personalization features, real-time interaction features, and cross-document features. It is too expensive to run on 50,000 candidates but produces highly accurate rankings for the 200-500 that survive L1.

L3 — Business logic and reranking (milliseconds, returns final results). The final stage applies non-ML adjustments: diversity (ensure the top 10 are not all from the same source), freshness boosts for time-sensitive queries, demotion of results the user has already seen, and business rules (promoted listings, policy compliance). This stage reshuffles the L2 output rather than rescoring from scratch.

Interview Tip

In interviews, describing the multi-stage ranking architecture immediately signals that you understand the latency-cost trade-off at scale. Saying 'just run BERT on every document' reveals you have not considered that scoring 10 billion documents with a transformer model would take days per query. The staged approach is how every production search engine actually works.

Feature Computation

Some features are precomputed and stored: PageRank, document embeddings, spam scores, document length, entity tags. These are calculated in offline pipelines and stored in feature stores or indexed alongside the documents. They add zero latency at query time.

Other features are computed at query time: BM25 score (depends on the query terms), query-document embedding cosine similarity (depends on the query embedding), user-query affinity (depends on who is searching), real-time click-through rate (depends on recent traffic). These must be calculated during the request and represent the bulk of L1 and L2 computation time.

Latency Budget

A typical search engine targets 200 milliseconds end-to-end. The budget is split roughly as follows: query understanding (10-20ms), L0 retrieval (20-40ms), L1 scoring (20-30ms), L2 scoring (40-80ms), L3 reranking (5-10ms), network overhead and rendering (20-30ms). Every stage has a strict timeout — if L2 scoring exceeds its budget, the system returns L1 results rather than making the user wait. This graceful degradation is critical: a slightly worse ranking delivered in 200ms beats a perfect ranking delivered in 2 seconds.

Search Quality Metrics

Building a ranking model means nothing if you cannot measure whether it is working. Search quality metrics divide into offline metrics (computed on labeled datasets) and online metrics (computed on live traffic). Both are necessary: offline metrics tell you if the model is improving, online metrics tell you if users notice.

Offline Metrics

Precision@K — Of the top K results, what fraction is relevant? If K=10 and 7 of the top 10 results are relevant, Precision@10 = 0.7. Simple and intuitive, but it treats all positions equally — a relevant result at position 1 counts the same as one at position 10.

Recall@K — Of all relevant documents in the corpus, what fraction appears in the top K? If there are 50 relevant documents and 8 appear in the top 10, Recall@10 = 0.16. Recall matters for comprehensive searches (finding all relevant legal precedents) but less for navigational queries where one good result suffices.

MRR (Mean Reciprocal Rank) — The reciprocal of the position of the first relevant result, averaged across queries. If the first relevant result is at position 3, the reciprocal rank is 1/3. MRR captures how quickly users find a relevant result. It is ideal for navigational queries where the user wants exactly one answer.

NDCG (Normalized Discounted Cumulative Gain) — The standard metric for search ranking. Unlike Precision@K, NDCG handles graded relevance (a document can be "highly relevant," "somewhat relevant," or "irrelevant" rather than just relevant/not relevant) and applies a logarithmic position discount (a relevant result at position 1 contributes far more than one at position 10). The score is normalized to 0-1 by dividing by the ideal ranking's score, making it comparable across queries with different numbers of relevant documents.

MAP (Mean Average Precision) — Average precision computed at each position where a relevant document appears, then averaged across queries. It captures both precision and the ranking of relevant documents.

Pure relevance producing homogeneous results on the left, diversity reranking improving coverage on the right

Why NDCG Is the Standard

NDCG has three properties that make it the default choice. First, graded relevance — binary relevant/irrelevant labels lose information. A four-star document and a two-star document are both "relevant" under binary labels, but the user experience of finding the four-star document first is significantly better. NDCG captures this difference. Second, position discounting — users examine results from top to bottom, so the metric should weight top positions more heavily. NDCG's logarithmic discount models this attention decay naturally. Third, normalization — by dividing by the ideal ranking's score, NDCG produces a 0-1 value that is comparable across queries regardless of how many relevant documents exist.

Online Metrics

Offline metrics measure model quality on a frozen dataset. Online metrics measure what actually happens when real users encounter real results:

Click-through rate (CTR) — Fraction of displayed results that users click. Higher CTR generally indicates more relevant results, but can be inflated by clickbait titles.

Abandonment rate — Fraction of queries where the user leaves without clicking anything. High abandonment suggests the results were irrelevant or the user's question was already answered by snippets (which is actually a good outcome).

Time to first click — How long after results appear does the user click something? Faster clicks suggest the top results were visibly relevant. Slow clicks suggest the user had to scan multiple results before finding something useful.

Reformulation rate — Fraction of queries followed by another search within the same session. High reformulation suggests the results did not satisfy the information need and the user is trying different terms.

The Feedback Loop Danger

Optimizing purely for clicks creates a perverse incentive. Clickbait titles ("You won't believe what happens next") generate high CTR but low user satisfaction. Authoritative, well-written content with descriptive titles ("How DNS Resolution Works: A Complete Guide") may get fewer clicks because users can partially answer their question from the snippet alone. If you optimize the ranking model for CTR, clickbait rises and quality content falls. The solution is to combine click signals with satisfaction signals: dwell time (did the user stay on the page for 30+ seconds, suggesting they found value?), return-to-search rate (did the user immediately come back and search again, suggesting the result was unsatisfying?), and long-click rate (clicks where the user does not return to the search results within 30 seconds).

Course

ML Systems & Infrastructure

ML Fundamentals for Engineers

Data Infrastructure

Training Infrastructure

Model Serving

ML Applications

Evaluation and Testing

Production Operations

Specialized Systems and Capstone