C#: Finding relevant document snippets for search result display
Introduction
A good search result snippet does not just show the first sentence of a document. It should show the part of the document that best explains why the result matched the query. That usually means finding a short window of text where the query terms are dense and context is still readable.
In C#, a practical snippet generator usually has three stages: normalize the query, score candidate windows inside the document, and then format the winning window with highlighting or ellipses.
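The first stage, query normalization, can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `QueryNormalizer` class, the `Normalize` method, and the tiny stop-word list are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class QueryNormalizer
{
    // Hypothetical stop-word list; a real system would likely use a fuller one.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "a", "an", "the", "of", "and", "or", "in", "to"
        };

    // Lowercases the query, extracts word tokens, and drops stop words
    // and duplicate terms.
    public static List<string> Normalize(string query)
    {
        return Regex.Matches(query.ToLowerInvariant(), @"\w+")
            .Cast<Match>()
            .Select(m => m.Value)
            .Where(t => !StopWords.Contains(t))
            .Distinct()
            .ToList();
    }
}
```

The resulting term list is what the later stages would match against the document text.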
What Makes a Snippet Relevant
A useful snippet usually has these properties:
- it contains one or more query terms
- it keeps enough surrounding context to be readable
- it prefers windows where multiple important terms appear together
- it avoids cutting words awkwardly or starting in the middle of noise
For example, if the query is “distributed cache invalidation”, a snippet that contains all three words in one paragraph is far better than a snippet that only contains “cache” near the top of the document.
A Simple Window-Scoring Strategy
A practical baseline is to slide a fixed-size window over the document and score each window by the number and quality of query-term hits.
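A minimal sketch of that baseline, assuming a window measured in words and a score equal to the raw number of query-term hits. The `WindowScorer` class and its method names are hypothetical, and the query terms are assumed to be already normalized.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WindowScorer
{
    // Returns the word index where the highest-scoring window starts.
    // Ties keep the earliest window.
    public static int FindBestWindowStart(string[] words, ISet<string> queryTerms, int windowSize)
    {
        int bestStart = 0, bestScore = -1;
        int lastStart = Math.Max(0, words.Length - windowSize);
        for (int start = 0; start <= lastStart; start++)
        {
            int end = Math.Min(words.Length, start + windowSize);
            int score = 0;
            for (int i = start; i < end; i++)
            {
                if (queryTerms.Contains(Clean(words[i]))) score++;
            }
            if (score > bestScore) { bestScore = score; bestStart = start; }
        }
        return bestStart;
    }

    // Splits the document on whitespace and returns the best window as text.
    public static string BestSnippet(string document, IEnumerable<string> queryTerms, int windowSize = 30)
    {
        var words = document.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        var terms = new HashSet<string>(queryTerms, StringComparer.OrdinalIgnoreCase);
        int start = FindBestWindowStart(words, terms, windowSize);
        return string.Join(" ", words.Skip(start).Take(windowSize));
    }

    // Strips punctuation so that "cache," still matches the term "cache".
    private static string Clean(string word)
    {
        return new string(word.Where(c => char.IsLetterOrDigit(c)).ToArray());
    }
}
```

This version rescans every window from scratch, which costs roughly document length times window size. That is acceptable for short documents; a running count updated as the window slides would avoid the repeated work.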
This approach is intentionally simple, but it already produces better snippets than “take the first 160 characters.”
Improve Scoring Beyond Raw Counts
Raw match count is a start, not the finish line. Better scoring can reward:
- distinct query terms, not just repeated copies of one term
- exact phrase matches
- earlier term appearances inside the snippet
- proximity between terms
- matches in titles or headings near the snippet location
If the query is multiword, a window containing “distributed”, “cache”, and “invalidation” once each is usually better than a window containing “cache” three times and nothing else.
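One way to express that preference is a weighted score. The sketch below covers only distinct-term coverage, raw hit count, and an exact-phrase bonus; proximity, term position, and heading matches are left out for brevity. The `SnippetScoring` class and the weights (10, 1, 15) are illustrative assumptions, not tuned values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SnippetScoring
{
    // Scores a window of words against the query terms (in query order).
    // Distinct terms dominate, repeated hits add a little, and an exact
    // phrase match adds a bonus. All weights are assumptions.
    public static double Score(IReadOnlyList<string> windowWords, IReadOnlyList<string> queryTerms)
    {
        var cmp = StringComparer.OrdinalIgnoreCase;
        var querySet = new HashSet<string>(queryTerms, cmp);
        var hits = windowWords.Where(w => querySet.Contains(w)).ToList();

        int totalHits = hits.Count;
        int distinctHits = hits.Distinct(cmp).Count();
        bool phrase = ContainsPhrase(windowWords, queryTerms, cmp);

        return distinctHits * 10.0 + totalHits * 1.0 + (phrase ? 15.0 : 0.0);
    }

    // True when the query terms appear consecutively and in order.
    private static bool ContainsPhrase(IReadOnlyList<string> words, IReadOnlyList<string> phrase, StringComparer cmp)
    {
        if (phrase.Count < 2) return false;
        for (int i = 0; i + phrase.Count <= words.Count; i++)
        {
            bool match = true;
            for (int j = 0; j < phrase.Count && match; j++)
            {
                match = cmp.Equals(words[i + j], phrase[j]);
            }
            if (match) return true;
        }
        return false;
    }
}
```

With these weights, a window that covers all three terms once outranks a window that repeats a single term three times, and a window with the exact phrase outranks both.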
Highlight the Matched Terms
A snippet becomes much more useful when matched terms are emphasized.
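A minimal highlighting sketch, assuming plain-text output where matches are wrapped in `**` markers. The `SnippetHighlighter` class is hypothetical; `Regex.Escape` and `Regex.Replace` are standard .NET calls.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SnippetHighlighter
{
    // Wraps each whole-word, case-insensitive match of a query term in ** markers.
    public static string Highlight(string snippet, IEnumerable<string> queryTerms)
    {
        var terms = queryTerms
            .Where(t => !string.IsNullOrWhiteSpace(t))
            .Select(t => Regex.Escape(t))   // treat terms as literal text, not regex syntax
            .ToList();
        if (terms.Count == 0) return snippet;

        string pattern = @"\b(" + string.Join("|", terms) + @")\b";
        return Regex.Replace(snippet, pattern, "**$1**", RegexOptions.IgnoreCase);
    }
}
```

The `\b` anchors keep the term "cache" from matching inside "caches"; whether partial or stemmed matches should be highlighted is a design choice this sketch does not address.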
For plain-text output, markdown-style markers such as **term** are enough. In a web app you would usually emit HTML markup instead, but the idea is the same.
Handle Boundaries Cleanly
Users notice ugly snippets immediately. Even a well-scored snippet looks bad if it starts mid-word or ends in broken punctuation.
A common refinement is to expand the chosen window outward to the nearest whitespace or sentence boundary. That gives you text that reads more naturally.
You can also prefix or suffix ellipses when the snippet came from the middle of the document rather than the start.
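The boundary handling described above can be sketched like this, assuming the scorer has produced a character offset and length. The `SnippetBoundaries` class is hypothetical, and this version only snaps to whitespace, not to sentence boundaries.

```csharp
using System;

public static class SnippetBoundaries
{
    // Expands the range [start, start + length) outward to the nearest
    // whitespace on each side, then adds ellipses where text was cut off.
    public static string Extract(string text, int start, int length)
    {
        start = Math.Max(0, Math.Min(start, text.Length));
        int end = Math.Max(start, Math.Min(text.Length, start + length));

        // Move the start left to the beginning of the current word.
        while (start > 0 && !char.IsWhiteSpace(text[start - 1])) start--;
        // Move the end right to the end of the current word.
        while (end < text.Length && !char.IsWhiteSpace(text[end])) end++;

        string body = text.Substring(start, end - start).Trim();
        string prefix = start > 0 ? "..." : "";
        string suffix = end < text.Length ? "..." : "";
        return prefix + body + suffix;
    }
}
```

For example, a raw window that begins inside "distributed" is widened to include the whole word, and the ellipses signal that the snippet came from the middle of the text.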
When This Needs a Search Index
For small documents, direct scanning is fine. For large search systems, snippet generation is usually tied to the index. Search engines often store token offsets so they can jump straight to matching regions instead of rescanning the full document body every time.
That is especially important when:
- documents are large
- queries are frequent
- search results must render quickly
The algorithm stays similar, but the match positions come from the index instead of from ad hoc regex scans.
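As a rough sketch of that idea, suppose the index can return, for one document, the token positions of each query term. The `postings` dictionary below is a hypothetical stand-in for that lookup, not the API of any particular search engine. A two-pointer pass over the merged positions then finds the densest window without touching the document body.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OffsetSnippets
{
    // postings: hypothetical per-document map from query term to the token
    // positions where it occurs. Returns the token position at which the
    // window containing the most hits begins.
    public static int BestWindowStart(IReadOnlyDictionary<string, int[]> postings, int windowSize)
    {
        var positions = postings.Values.SelectMany(p => p).OrderBy(p => p).ToArray();
        if (positions.Length == 0) return 0;

        int bestStart = positions[0], bestCount = 0, left = 0;
        for (int right = 0; right < positions.Length; right++)
        {
            // Shrink from the left until all hits fit inside one window.
            while (positions[right] - positions[left] >= windowSize) left++;

            int count = right - left + 1;
            if (count > bestCount) { bestCount = count; bestStart = positions[left]; }
        }
        return bestStart;
    }
}
```

The work here scales with the number of matches rather than the document length, which is the main reason to take positions from the index.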
Common Pitfalls
A common mistake is showing the first sentence of the document regardless of the query. That is easy to implement and often useless to the user.
Another mistake is scoring windows only by total hit count instead of distinct-term coverage and proximity.
Developers also forget formatting quality. A snippet that technically matches the query but begins halfway through a word feels broken.
Finally, do not over-highlight. If every common term is emphasized, the snippet becomes noisy and harder to scan.
Summary
- A good search snippet should explain why the document matched the query.
- A fixed-size window with term-based scoring is a strong baseline.
- Better scoring rewards distinct query terms, proximity, and phrase matches.
- Highlighting and clean boundary handling matter almost as much as scoring.
- At larger scale, use index token offsets to generate snippets efficiently.

