Algorithms for compression of set tries

Data Compression

Set Tries

Algorithms

Computational Efficiency

Data Structures

Algorithms for compression of set tries

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

A set trie stores many sets while sharing common prefixes, which makes lookup and subset operations efficient. The downside is memory growth when the dataset contains many near-duplicate branches. Compression algorithms reduce node count and pointer overhead without losing query correctness.

Set Trie Basics and Cost Drivers

In a set trie, each path represents an ordered set. If two sets share early elements, their paths share nodes. The structure is efficient for insertion and membership tests, but space usage can still become large when branching is wide.

The main memory costs are:

Node objects and allocator overhead.
Child maps or arrays for outgoing edges.
Repeated suffixes across different branches.

Compression methods target one or more of these costs while keeping operations like exact lookup, subset checks, and iteration practical.

Path Compression

Path compression merges chains of single-child nodes into one edge label. Instead of storing one element per edge, you store a short sequence. This reduces pointers and improves cache locality.

python

1class Node:
2    def __init__(self):
3        self.children = {}          # key: tuple label, value: Node
4        self.terminal = False
5
6
7def insert_compressed(root, sorted_set):
8    current = root
9    i = 0
10    while i < len(sorted_set):
11        matched = None
12        for label, child in current.children.items():
13            # Find longest common prefix with existing compressed edge.
14            j = 0
15            while j < len(label) and i + j < len(sorted_set) and label[j] == sorted_set[i + j]:
16                j += 1
17            if j > 0:
18                matched = (label, child, j)
19                break
20
21        if matched is None:
22            current.children[tuple(sorted_set[i:])] = Node()
23            current = current.children[tuple(sorted_set[i:])]
24            i = len(sorted_set)
25        else:
26            label, child, prefix_len = matched
27            if prefix_len < len(label):
28                # Split the edge when only part of the label matches.
29                split = Node()
30                split.children[label[prefix_len:]] = child
31                del current.children[label]
32                current.children[label[:prefix_len]] = split
33                current = split
34            else:
35                current = child
36            i += prefix_len
37
38    current.terminal = True

Path compression works best when many sets have long deterministic runs.

Tail Merging with Directed Acyclic Word Graph Style Deduplication

Many branches end with identical suffixes. Tail merging identifies structurally equal subtrees and reuses one copy. The result is a directed acyclic graph rather than a pure tree.

A common implementation strategy is bottom-up hashing:

Build a signature from terminal flag plus child signatures.
Keep a table from signature to canonical node.
Replace duplicate nodes with the canonical instance.

This approach can shrink memory significantly for datasets with recurring endings.

python

1def canonicalize(node, memo):
2    canonical_children = {}
3    for label, child in sorted(node.children.items()):
4        canonical_children[label] = canonicalize(child, memo)
5
6    signature = (
7        node.terminal,
8        tuple((label, id(child)) for label, child in canonical_children.items()),
9    )
10
11    if signature in memo:
12        return memo[signature]
13
14    node.children = canonical_children
15    memo[signature] = node
16    return node

For production code, replace id with a deterministic index generated during canonicalization.

Succinct Encoding for Static Tries

If the trie is mostly read-only, succinct encodings often outperform pointer-heavy layouts. Examples include bit-vector encodings and level-order unary degree sequence techniques. These trade mutation speed for compact representation and fast rank-select traversal.

In practical systems, teams often combine techniques:

Build mutable trie in memory.
Apply path compression.
Canonicalize tails.
Serialize to compact arrays for read-heavy serving.

That staged approach keeps build logic simple while making query-time memory predictable.

Common Pitfalls

One mistake is compressing without preserving element order rules. Set tries usually require a canonical order of elements, and violating it can break lookup semantics.

Another issue is using expensive child maps for tiny branching factors. For low fan-out nodes, a small sorted vector can be faster and smaller than a hash map.

Developers also forget to benchmark mutation cost. Tail merging and canonicalization are excellent for static datasets, but frequent incremental updates can become expensive.

Finally, signatures must include every property that affects equality. If terminal markers are omitted, distinct states may collapse incorrectly and cause false positives during queries.

Summary

Path compression reduces pointer overhead by merging single-child chains.
Tail merging deduplicates repeated suffixes into a directed acyclic structure.
Succinct encodings are strong for static, read-heavy workloads.
Compression choices should match update frequency and query patterns.
Correct canonical ordering and subtree signatures are essential for correctness.