Search Relevance vs Latency: Why the Best Ranker Can Tank Your Conversion

March 6, 2026


Search is not one problem, it is two. The first stage is retrieval, where you ask "of the billion documents in the corpus, which thousand are even plausible for this query?" The second stage is ranking, where you ask "of those thousand, which ten should appear in this order?" Retrieval optimizes for recall. Ranking optimizes for precision. They have different data structures, different models, and different latency budgets.

Retrieval has two common shapes. Lexical retrieval uses an inverted index and a score like BM25, which is fast because posting list intersection is mechanical and cache friendly. Dense retrieval embeds the query into a vector and searches an approximate nearest neighbor index like HNSW, which catches paraphrases and semantic matches but pays a graph traversal cost. Most production systems run both and merge candidates, because each catches what the other misses.
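
To make "run both and merge candidates" concrete, here is a minimal sketch of one way to merge. It uses reciprocal rank fusion, which is my choice for illustration and not something the systems described here are known to use. The function name, the constant k=60, and the document ids are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, limit=1000):
    """Merge ranked candidate lists; each list contributes 1 / (k + rank) per doc."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is deterministic.
    merged = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return merged[:limit]


# Hypothetical candidate lists from the two retrievers, best match first.
lexical_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d4", "d1"]

merged = reciprocal_rank_fusion([lexical_hits, dense_hits])
print(merged)
```

Documents that both retrievers return (d1 and d3 here) rise to the top, while documents only one retriever found still make it into the candidate set.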

Ranking is where the learned features live. You take the thousand or so candidates from retrieval and score them with features such as click-through rate, dwell time, freshness, personalization, and whatever else your features team has shipped this quarter. The ranker sees far fewer items than retrieval does, but it does far more work per item, so its cost scales with the number of candidates it scores.

The latency budget is where teams get burned. A reasonable end-to-end target is 100ms. Network and serialization eat 20ms before you do any work. Allot roughly 10ms to retrieval and 30ms to ranking, and about 40ms is left as margin for tail variance. If you blow either number, the margin goes first and then the whole budget collapses. And the budget is per request at p99, not on average. A search that runs in 40ms on a warm cache but 600ms when the HNSW shard cold-starts will show up in user metrics as broken, even if the median dashboard looks healthy.
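
The arithmetic, and the gap between median and p99, are easy to check in a few lines. This is a sketch under assumptions: the stage numbers come from the paragraph above, but the traffic mix (98 warm requests, 2 cold starts) and the helper names are made up for illustration.

```python
import statistics

BUDGET_MS = 100
# Stage allocations taken from the paragraph above.
STAGES_MS = {"network_and_serialization": 20, "retrieval": 10, "ranking": 30}


def margin_ms(budget_ms, stages_ms):
    """Return what is left of the budget after every stage takes its share."""
    return budget_ms - sum(stages_ms.values())


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


# Hypothetical traffic: 98 warm requests at 40ms, 2 cold-start requests at 600ms.
samples_ms = [40] * 98 + [600] * 2

margin = margin_ms(BUDGET_MS, STAGES_MS)
median_ms = statistics.median(samples_ms)
p99_ms = percentile(samples_ms, 99)

print(f"margin={margin}ms median={median_ms}ms p99={p99_ms}ms")
```

With this made-up mix the median sits at 40ms while p99 is 600ms, which is the "healthy dashboard, broken search" situation in miniature.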

The production failure I keep watching unfold: a team upgrades the ranker from a 10-feature linear model to a 200-feature gradient-boosted tree model. Offline NDCG@10 goes up 4%, the launch deck looks great, and the model ships. Then p99 search latency jumps from 80ms to 380ms, far past the 100ms target, because the new ranker scores every one of the 1000 retrieved candidates. Relevance is genuinely better, but conversion drops 7% over the following week. Users abandon slow searches faster than they punish slightly worse ones.
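
A back-of-the-envelope reading of those numbers, as a sketch. It assumes the entire p99 increase comes from the new ranker and that ranker cost is linear in the number of candidates scored. Neither assumption is guaranteed in a real system, and the function names are hypothetical.

```python
def implied_cost_per_candidate_ms(p99_before_ms, p99_after_ms, candidates_scored):
    """Extra latency per scored candidate, assuming a purely linear ranker cost."""
    return (p99_after_ms - p99_before_ms) / candidates_scored


def projected_p99_ms(base_ms, per_candidate_ms, candidates_scored):
    """Projected p99 if the heavy ranker scores only `candidates_scored` items."""
    return base_ms + per_candidate_ms * candidates_scored


# Numbers from the incident above: 80ms -> 380ms while scoring 1000 candidates.
per_candidate_ms = implied_cost_per_candidate_ms(80, 380, 1000)

for n in (1000, 500, 100):
    print(n, round(projected_p99_ms(80, per_candidate_ms, n), 1))
```

Under those assumptions the heavy model adds about 0.3ms per candidate, so the number of candidates it scores is the lever that matters most.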

The fix is a two-stage ranker. Run the cheap model over all 1000 candidates, take the top 100, then run the expensive model only on that shortlist. You spend the heavy compute where it matters and keep the pass over the full candidate set cheap. Relevance metrics hold, latency stays under budget, and the conversion regression disappears. The lesson: in search, your model's score is one input. The user's patience is the other.
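
In code, the two-stage shape might look like the sketch below. The scoring functions are stand-ins rather than real models, and the candidate tuples, the shortlist size default, and the call counter are assumptions added so the saving is visible.

```python
import heapq


def two_stage_rank(candidates, cheap_score, expensive_score,
                   shortlist_size=100, top_k=10):
    """Cheap model over everything, expensive model only over the shortlist."""
    # Stage 1: one cheap call per candidate, keep the best `shortlist_size`.
    shortlist = heapq.nlargest(shortlist_size, candidates, key=cheap_score)
    # Stage 2: one expensive call per shortlisted candidate, then final order.
    reranked = sorted(shortlist, key=expensive_score, reverse=True)
    return reranked[:top_k]


# Hypothetical candidates: (doc_id, cheap_feature, rich_feature).
candidates = [(i, i % 97, (i * 7919) % 1000) for i in range(1000)]
calls = {"expensive": 0}


def cheap_score(candidate):
    return candidate[1]  # stand-in for the small linear model


def expensive_score(candidate):
    calls["expensive"] += 1  # count how often the heavy model runs
    return candidate[2]  # stand-in for the large tree model


top10 = two_stage_rank(candidates, cheap_score, expensive_score)
print(len(top10), calls["expensive"])
```

The heavy scorer runs 100 times instead of 1000. The trade-off is that a document the cheap model ranks outside the top 100 never reaches the expensive model, so the shortlist size is worth tuning against offline relevance metrics.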

Key takeaway

Search is retrieve then rank, with a hard latency budget split across both stages. A smarter ranker is only a win if it stays inside the budget, because users abandon slow searches faster than they tolerate slightly worse results.

Originally posted on LinkedIn.

