C#: Finding relevant document snippets for search result display

Introduction

A good search result snippet does not just show the first sentence of a document. It should show the part of the document that best explains why the result matched the query. That usually means finding a short window of text where the query terms are dense and context is still readable.

In C#, a practical snippet generator usually has three stages: normalize the query, score candidate windows inside the document, and then format the winning window with highlighting or ellipses.
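The first stage, query normalization, might look like the sketch below. The `QueryNormalizer` class, its `Normalize` method, and the tiny stop-word list are illustrative assumptions rather than part of any library; a real system would tune tokenization and stop words for its language and corpus.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QueryNormalizer
{
    // Hypothetical stop-word list, kept tiny for illustration.
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "a", "an", "and", "of", "or", "the", "to" };

    // Lowercases the query, splits it into word tokens, drops stop words,
    // and removes duplicates while keeping first-seen order.
    public static List<string> Normalize(string query)
    {
        var terms = new List<string>();
        if (string.IsNullOrWhiteSpace(query)) return terms;

        var seen = new HashSet<string>();
        foreach (Match m in Regex.Matches(query.ToLowerInvariant(), @"\w+"))
        {
            string token = m.Value;
            // HashSet.Add returns false for tokens that were already seen.
            if (!StopWords.Contains(token) && seen.Add(token))
                terms.Add(token);
        }
        return terms;
    }
}
```

Splitting on `\w+` rather than on spaces means punctuation in the query (commas, quotes) does not end up glued to a term.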

What Makes a Snippet Relevant

A useful snippet usually has these properties:

it contains one or more query terms
it keeps enough surrounding context to be readable
it prefers windows where multiple important terms appear together
it avoids cutting words awkwardly or starting in the middle of noise

For example, if the query is “distributed cache invalidation”, a snippet that contains all three words in one paragraph is far better than a snippet that contains only “cache” near the top of the document.

A Simple Window-Scoring Strategy

A practical baseline is to slide a fixed-size window over the document and score each window by the number and quality of query-term hits.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SnippetFinder
{
    // Slides a fixed-size character window over the text and returns the
    // window that contains the most query-term hits.
    public static string BestSnippet(string text, string query, int windowSize = 160)
    {
        if (string.IsNullOrEmpty(text)) return string.Empty;

        // Lowercase the query terms once so matching is case-insensitive.
        var terms = (query ?? string.Empty)
                         .Split(' ', StringSplitOptions.RemoveEmptyEntries)
                         .Select(t => t.ToLowerInvariant())
                         .ToHashSet();

        // Fall back to the start of the document if no window scores higher.
        int bestScore = -1;
        string best = text.Length <= windowSize ? text : text[..windowSize];

        // Step 20 characters at a time to limit how many windows are scored.
        for (int i = 0; i < text.Length; i += 20)
        {
            int length = Math.Min(windowSize, text.Length - i);
            string window = text.Substring(i, length);
            int score = Score(window, terms);
            if (score > bestScore)
            {
                bestScore = score;
                best = window;
            }

            // Any later window would only be a shorter suffix of this one.
            if (i + length >= text.Length) break;
        }

        return best.Trim();
    }

    // Counts the words in the window that are query terms (repeats included).
    private static int Score(string window, HashSet<string> terms)
    {
        var words = Regex.Matches(window.ToLowerInvariant(), @"\w+")
                         .Select(m => m.Value);
        return words.Count(terms.Contains);
    }
}
```

This is intentionally simple, but it already produces better snippets than “take the first 160 characters.”

Improve Scoring Beyond Raw Counts

Raw match count is a start, not the finish line. Better scoring can reward:

distinct query terms, not just repeated copies of one term
exact phrase matches
earlier term appearances inside the snippet
proximity between terms
matches in titles or headings near the snippet location

If the query has multiple words, a window containing “distributed”, “cache”, and “invalidation” once each is usually better than a window containing “cache” three times and nothing else.
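Some of these criteria can be combined into one score, as in the sketch below. The `WindowScorer` class and all four weights are assumptions chosen for illustration, not tuned values. It covers distinct-term coverage, exact phrase matches, and proximity, and leaves out position within the snippet and heading matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WindowScorer
{
    // Illustrative weights only; real values would be tuned on real queries.
    private const double DistinctWeight = 10.0;
    private const double RepeatWeight = 1.0;
    private const double PhraseBonus = 15.0;
    private const double ProximityBonus = 5.0;

    // queryTerms are assumed to be lowercase and in query order.
    public static double Score(string window, IReadOnlyList<string> queryTerms)
    {
        var words = Regex.Matches(window.ToLowerInvariant(), @"\w+")
                         .Select(m => m.Value)
                         .ToList();
        var termSet = new HashSet<string>(queryTerms);

        // Every query-term hit with its word position, in document order.
        var hits = new List<(int Pos, string Term)>();
        for (int i = 0; i < words.Count; i++)
            if (termSet.Contains(words[i])) hits.Add((i, words[i]));
        if (hits.Count == 0) return 0;

        // Distinct terms count for much more than repeats of the same term.
        int distinct = hits.Select(h => h.Term).Distinct().Count();
        double score = DistinctWeight * distinct + RepeatWeight * (hits.Count - distinct);

        // Exact phrase: all query terms appear consecutively and in order.
        if (queryTerms.Count > 1 && ContainsPhrase(words, queryTerms))
            score += PhraseBonus;

        // Proximity: the smallest gap (in words) between two different terms.
        // The closest such pair is always a pair of neighbouring hits.
        int minGap = int.MaxValue;
        for (int i = 1; i < hits.Count; i++)
            if (hits[i].Term != hits[i - 1].Term)
                minGap = Math.Min(minGap, hits[i].Pos - hits[i - 1].Pos);
        if (minGap != int.MaxValue) score += ProximityBonus / minGap;

        return score;
    }

    private static bool ContainsPhrase(List<string> words, IReadOnlyList<string> phrase)
    {
        for (int start = 0; start + phrase.Count <= words.Count; start++)
        {
            int k = 0;
            while (k < phrase.Count && words[start + k] == phrase[k]) k++;
            if (k == phrase.Count) return true;
        }
        return false;
    }
}
```

With these weights, a window that mentions each of three query terms once as a phrase scores well above a window that repeats one term three times, which is the ordering the text argues for.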

Highlight the Matched Terms

A snippet becomes much more useful when matched terms are emphasized.

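```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SnippetHighlighter
{
    public static string HighlightTerms(string snippet, IEnumerable<string> terms)
    {
        // Longest terms first so the alternation prefers the longer match.
        var escaped = terms
            .Where(t => !string.IsNullOrWhiteSpace(t))
            .Distinct()
            .OrderByDescending(t => t.Length)
            .Select(Regex.Escape);
        string alternation = string.Join("|", escaped);
        if (alternation.Length == 0) return snippet;

        // One pass over the text with whole-word lookarounds. Replacing term
        // by term could re-match inside text that is already highlighted.
        string pattern = @"(?<!\w)(?:" + alternation + @")(?!\w)";
        return Regex.Replace(snippet, pattern, m => $"**{m.Value}**", RegexOptions.IgnoreCase);
    }
}
```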

In a web app you would usually emit HTML markup instead of markdown-style markers, but the idea is the same.
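A hedged sketch of the HTML variant follows. The `HtmlSnippetHighlighter` class and the choice of the `<mark>` element are assumptions. The main difference from the markdown-style version is that the text between matches is HTML-encoded with `System.Net.WebUtility.HtmlEncode`, so markup inside the document body is displayed rather than interpreted.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class HtmlSnippetHighlighter
{
    public static string HighlightAsHtml(string snippet, IEnumerable<string> terms)
    {
        var escaped = terms
            .Where(t => !string.IsNullOrWhiteSpace(t))
            .OrderByDescending(t => t.Length)
            .Select(Regex.Escape);
        string alternation = string.Join("|", escaped);
        if (alternation.Length == 0) return WebUtility.HtmlEncode(snippet);

        // Whole-word, case-insensitive match on any of the terms.
        var regex = new Regex(@"(?<!\w)(?:" + alternation + @")(?!\w)", RegexOptions.IgnoreCase);

        var html = new StringBuilder();
        int cursor = 0;
        foreach (Match m in regex.Matches(snippet))
        {
            // Encode the plain text between matches so it cannot inject markup.
            html.Append(WebUtility.HtmlEncode(snippet.Substring(cursor, m.Index - cursor)));
            html.Append("<mark>").Append(WebUtility.HtmlEncode(m.Value)).Append("</mark>");
            cursor = m.Index + m.Length;
        }
        // Encode whatever follows the last match.
        html.Append(WebUtility.HtmlEncode(snippet.Substring(cursor)));
        return html.ToString();
    }
}
```

Matching against the raw text and encoding afterwards avoids accidentally highlighting parts of entities such as `&lt;`.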

Handle Boundaries Cleanly

Users notice ugly snippets immediately. Even a well-scored snippet looks bad if it starts mid-word or ends in broken punctuation.

A common refinement is to expand the chosen window outward to the nearest whitespace or sentence boundary. That gives you text that reads more naturally.

You can also prefix or suffix ellipses when the snippet came from the middle of the document rather than the start.
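Both refinements can be sketched in a few lines. The `SnippetBoundaries` class and its `ExpandToWordBoundaries` method are hypothetical names, and this version snaps only to whitespace; sentence-boundary detection is left out.

```csharp
using System;

public static class SnippetBoundaries
{
    // Expands the window [start, start + length) outward to the nearest
    // whitespace so that no word is cut, then adds an ellipsis on each side
    // that does not touch the start or end of the document.
    public static string ExpandToWordBoundaries(string text, int start, int length)
    {
        if (string.IsNullOrEmpty(text)) return string.Empty;

        start = Math.Clamp(start, 0, text.Length);
        int end = Math.Clamp(start + length, start, text.Length);

        // If the window starts inside a word, move left to the word's start.
        if (start < text.Length && !char.IsWhiteSpace(text[start]))
            while (start > 0 && !char.IsWhiteSpace(text[start - 1])) start--;

        // If the window ends inside a word, move right to the word's end.
        if (end > 0 && !char.IsWhiteSpace(text[end - 1]))
            while (end < text.Length && !char.IsWhiteSpace(text[end])) end++;

        string body = text.Substring(start, end - start).Trim();
        string prefix = start > 0 ? "..." : "";
        string suffix = end < text.Length ? "..." : "";
        return prefix + body + suffix;
    }
}
```

Expanding outward rather than shrinking inward keeps any query term that straddles the original window edge.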

When This Needs a Search Index

For small documents, direct scanning is fine. For large search systems, snippet generation is usually tied to the index. Search engines often store token offsets so they can jump straight to matching regions instead of rescanning the full document body every time.

That is especially important when:

documents are large
queries are frequent
search results must render quickly

The algorithm stays similar, but the match positions come from the index instead of from ad hoc regex scans.
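As a rough illustration, suppose the index can return the character offsets at which each query term starts in a document. The `TermHit` shape and `OffsetSnippetLocator` class below are hypothetical stand-ins, not the format of any particular search engine. Given hits sorted by offset, a two-pointer sweep finds the window that covers the most distinct terms without reading the document text at all.

```csharp
using System;
using System.Collections.Generic;

public static class OffsetSnippetLocator
{
    // Hypothetical record of where a query term starts in the document.
    public readonly struct TermHit
    {
        public TermHit(string term, int offset) { Term = term; Offset = offset; }
        public string Term { get; }
        public int Offset { get; }
    }

    // Returns the start offset of the windowSize-character window covering
    // the most distinct terms, or 0 when there are no hits.
    // Assumes hits are sorted by Offset ascending; term length is ignored.
    public static int BestWindowStart(IReadOnlyList<TermHit> hits, int windowSize)
    {
        if (windowSize <= 0) throw new ArgumentOutOfRangeException(nameof(windowSize));

        var counts = new Dictionary<string, int>(); // term -> hits inside the window
        int bestStart = 0, bestDistinct = 0, left = 0;

        for (int right = 0; right < hits.Count; right++)
        {
            counts.TryGetValue(hits[right].Term, out int seen);
            counts[hits[right].Term] = seen + 1;

            // Drop hits from the left until the window fits in windowSize.
            while (hits[right].Offset - hits[left].Offset >= windowSize)
            {
                string leaving = hits[left].Term;
                if (--counts[leaving] == 0) counts.Remove(leaving);
                left++;
            }

            if (counts.Count > bestDistinct)
            {
                bestDistinct = counts.Count;
                bestStart = hits[left].Offset;
            }
        }
        return bestStart;
    }
}
```

The returned offset can then be used to read just that region of the stored document and pass it through the same boundary cleanup and highlighting steps.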

Common Pitfalls

A common mistake is showing the first sentence of the document regardless of the query. That is easy to implement and often useless to the user.

Another mistake is scoring windows only by total hit count instead of distinct-term coverage and proximity.

Developers also forget formatting quality. A snippet that technically matches the query but begins halfway through a word feels broken.

Finally, do not over-highlight. If every common term is emphasized, the snippet becomes noisy and harder to scan.

Summary

A good search snippet should explain why the document matched the query.
A fixed-size window with term-based scoring is a strong baseline.
Better scoring rewards distinct query terms, proximity, and phrase matches.
Highlighting and clean boundary handling matter almost as much as scoring.
At larger scale, use index token offsets to generate snippets efficiently.