Distributed Counters and Top-K Approximations

Topics Covered

Counting at Scale — Why Distributed Counters

The Single Counter Bottleneck

Sharded Counters

Hierarchical Aggregation

Challenges in Maintaining Accurate Distributed Counts

Lost Updates During Failures

Double Counting

Stale Reads

Consistency vs Availability

Exact vs Approximate Counting

When Exact Counting Fails

The Accuracy-Memory Trade-off

When to Use Exact vs Approximate

Probabilistic Data Structures for Counting

HyperLogLog (HLL)

Count-Min Sketch (CMS)

When to Use Which

Approaches to Top-K: Frequent Elements and Heavy Hitters

Exact Top-K (HashMap + Heap)

Space-Saving Algorithm

Count-Min Sketch + Heap (Approximate Top-K)

Comparison

Putting It All Together: Choosing Accuracy vs Performance

Decision Framework

Architecture Patterns

Common Interview Question Patterns

Common Mistakes

Counting is trivial on a single machine — increment an integer. But at global scale, a single counter becomes a bottleneck that limits your entire system's throughput. Every like, page view, and API call needs to be counted, and millions of these events happen per second across servers in multiple regions.

[Figure: Sharded counter showing write distribution across shards and read aggregation]

The Single Counter Bottleneck

Consider a "like" counter on a viral social media post. With a single counter row in a database:

sql
-- Every like from every user worldwide hits this one row
UPDATE posts SET like_count = like_count + 1 WHERE post_id = 'viral_post';

This creates three problems:

  1. Write contention — Every increment acquires a row lock. With 10,000 likes per second, transactions queue behind each other. Lock wait time dominates, and throughput plateaus regardless of how powerful the database is.
  2. Single point of failure — If the database node holding this row fails, no likes can be recorded until failover completes.
  3. Cross-region latency — Users in Tokyo incrementing a counter in US-East pay 150ms round-trip per like. At scale, this latency is unacceptable.

Sharded Counters

The fundamental solution: split one logical counter into N physical shards. Each write goes to a random shard. Reads aggregate all shards.

python
import random

import redis as redis_lib

redis = redis_lib.Redis()  # assumes a reachable Redis instance
NUM_SHARDS = 16

def increment(counter_id):
    """Write to a random shard — no contention."""
    shard = random.randint(0, NUM_SHARDS - 1)
    key = f"counter:{counter_id}:shard:{shard}"
    redis.incr(key)

def get_count(counter_id):
    """Read from all shards and sum."""
    total = 0
    for shard in range(NUM_SHARDS):
        key = f"counter:{counter_id}:shard:{shard}"
        val = redis.get(key)
        total += int(val or 0)
    return total

With 16 shards, write contention drops by 16x. Each shard receives roughly 1/16th of the writes. The trade-off: reads must fan out to all shards and sum, which is slower than reading a single value. For counters that are written frequently but read occasionally (like counts, view counts), this trade-off is excellent.

Hierarchical Aggregation

For global-scale counting (billions of events across data centers), sharded counters within one region combine with cross-region aggregation:

  1. Edge: Each application server buffers increments in memory for a short interval (1-5 seconds)
  2. Region: Buffered counts flush to a regional aggregator (Kafka + Flink, or a regional Redis)
  3. Global: Regional totals are periodically merged into a global store

This is eventually consistent — the global total lags behind reality by a few seconds. But for counters where exact real-time accuracy is not critical (view counts, like counts), the latency-consistency trade-off is worth the massive throughput gain.
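
A minimal sketch of the edge-buffering step (step 1), assuming the caller supplies the upstream flush function; the class name, the 1-second interval, and the flush target are illustrative, not a specific product's API:

python
import threading
from collections import defaultdict

class EdgeCounterBuffer:
    """Buffer increments in memory and flush them upstream on a timer."""

    def __init__(self, flush_fn, interval_seconds=1.0):
        self.flush_fn = flush_fn          # e.g. publish to Kafka or a regional Redis
        self.interval = interval_seconds
        self.pending = defaultdict(int)
        self.lock = threading.Lock()

    def increment(self, counter_id, amount=1):
        with self.lock:
            self.pending[counter_id] += amount   # no network call on the hot path

    def flush(self):
        with self.lock:
            batch, self.pending = dict(self.pending), defaultdict(int)
        if batch:
            self.flush_fn(batch)                 # one upstream call for many increments

    def start(self):
        # Re-arm the timer, then flush; increments buffered since the last
        # flush are lost if the process crashes before the next one.
        threading.Timer(self.interval, self.start).start()
        self.flush()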

Key Insight

The core insight of distributed counting is that a single counter creates a serialization point — every writer waits for every other writer. Sharding breaks this serialization by giving each writer its own counter to increment without contention. The read-time aggregation cost is the price you pay for write scalability.

Distributing counters introduces accuracy challenges that do not exist with a single counter. Every design choice trades accuracy for something else — throughput, latency, availability, or memory.

Lost Updates During Failures

When a server buffers increments in memory and crashes before flushing, those increments are lost. If 500 increments were buffered and the server dies, the count is permanently 500 lower than reality.

Mitigations:

  • Shorter flush intervals — Reduce the buffer window from 5 seconds to 100ms. Limits maximum loss to 100ms of traffic. Cost: more frequent network calls.
  • Write-ahead log — Append each increment to a local WAL before buffering, and replay the WAL on recovery (a minimal sketch follows this list). Cost: disk I/O per increment.
  • Accept the loss — For non-critical counters (page views, impressions), losing a few hundred counts during a rare crash is acceptable. The operational simplicity of not having a WAL may be worth more than perfect accuracy.
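
A minimal sketch of the write-ahead-log mitigation; the WAL path, line format, and helper names are hypothetical:

python
import os

WAL_PATH = "/var/lib/counter/wal.log"  # hypothetical location

def record_increment(counter_id, buffer):
    """Append to the WAL before buffering so a crash cannot silently drop the increment."""
    with open(WAL_PATH, "a") as wal:
        wal.write(f"{counter_id}\n")
        wal.flush()
        os.fsync(wal.fileno())  # the per-increment disk cost mentioned above
    buffer[counter_id] = buffer.get(counter_id, 0) + 1

def replay_wal(buffer):
    """On restart, re-apply logged increments. The WAL would be truncated after
    each successful upstream flush (truncation omitted here for brevity)."""
    if not os.path.exists(WAL_PATH):
        return
    with open(WAL_PATH) as wal:
        for line in wal:
            counter_id = line.strip()
            if counter_id:
                buffer[counter_id] = buffer.get(counter_id, 0) + 1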

Double Counting

The opposite problem: a server flushes its buffer, the aggregator processes it, but the acknowledgment is lost. The server retries, and the same batch is counted twice.

Mitigations:

  • Idempotent writes — Assign each batch a unique ID. The aggregator deduplicates by batch ID before applying (sketched below). Cost: additional state for dedup tracking.
  • At-least-once with reconciliation — Accept occasional double-counting, then periodically reconcile against a source of truth (e.g., count distinct events in a data warehouse).
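
A sketch of batch-ID deduplication, assuming an in-memory set of seen IDs (a real aggregator would persist this state, for example in Redis with a TTL); the helper names are illustrative:

python
import uuid

seen_batch_ids = set()   # in practice persisted with a TTL
totals = {}

def make_batch(counts):
    """Sender side: tag the batch so a retry is recognizable as a duplicate."""
    return {"batch_id": str(uuid.uuid4()), "counts": counts}

def apply_batch(batch):
    """Aggregator side: apply each batch at most once."""
    if batch["batch_id"] in seen_batch_ids:
        return  # duplicate delivery from a retry; already counted
    seen_batch_ids.add(batch["batch_id"])
    for counter_id, amount in batch["counts"].items():
        totals[counter_id] = totals.get(counter_id, 0) + amount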

Stale Reads

With buffered writes and regional aggregation, the global count always lags behind reality:

 
Actual count:   1,000,000
Global store:     999,200  (800 increments still in flight)
User sees:        999,200  (or cached value from 2 seconds ago)

For most counters, this is fine. YouTube shows "1.2M views" — nobody expects that to be exact to the unit. But for some use cases (inventory count, remaining seats), staleness causes real problems (overselling).

Consistency vs Availability

Under a network partition between regions:

  • Choose consistency: Stop accepting increments in the partitioned region until connectivity is restored. The count is accurate but the service is degraded.
  • Choose availability: Continue counting locally in each region. When the partition heals, merge the counts. The count may temporarily show different values in different regions.

Most counting systems choose availability because the cost of an inaccurate count (off by a few percent temporarily) is far lower than the cost of refusing to count (broken user experience, lost data).

Note that the failure modes are asymmetric: lost updates make the count too low (undercounting), while retry-related double counting makes it too high (overcounting). Most systems bias toward overcounting because undercounting loses real data permanently, while overcounting can be corrected by periodic reconciliation against a source of truth.

The previous sections assumed you need the actual count. But many counting problems do not require exact numbers — they need answers to questions like "roughly how many unique users visited today?" or "which URLs are accessed most frequently?" Approximate counting trades bounded error for dramatic reductions in memory and computation.

When Exact Counting Fails

Counting the exact number of unique visitors to a website requires storing every unique identifier you have seen:

python
# Exact unique count: store every visitor
unique_visitors = set()

def count_visitor(visitor_id):
    unique_visitors.add(visitor_id)

def get_unique_count():
    return len(unique_visitors)

# Problem: 100 million unique visitors = 100M entries in a set
# At 36 bytes per UUID: ~3.6 GB of memory per counter

For one counter this is manageable. For 10,000 counters (one per URL, one per campaign, one per region), you need 36 TB of memory. This is where approximate counting becomes essential.

The Accuracy-Memory Trade-off

Approach         | Memory per Counter     | Accuracy                       | Use Case
Exact (HashSet)  | O(n), n = unique items | 100% exact                     | Financial transactions, inventory
HyperLogLog      | ~12 KB (fixed)         | Within 0.81% error             | Unique visitor counts, cardinality
Count-Min Sketch | O(width × depth)       | Overcounts by a bounded amount | Frequency estimation, trending topics
Bloom Filter     | O(n) bits              | False positives only           | "Have I seen this before?"

The key insight: HyperLogLog counts 1 billion unique items in 12 KB of memory with 0.81% error. An exact HashSet would need roughly 36 GB. That is a 3-million-fold memory reduction for less than 1% error.

When to Use Exact vs Approximate

Use exact counting when:

  • The count has financial implications (billing, revenue, inventory)
  • Off-by-one errors cause business logic failures (rate limiting, quota enforcement)
  • The dataset is small enough that exact counting is cheap

Use approximate counting when:

  • The count is displayed to users (view counts, like counts, "1.2M subscribers")
  • The count drives analytics or recommendations (trending topics, popular items)
  • The dataset is too large for exact counting to be feasible (unique visitors across billions of events)

Interview Tip

In system design interviews, the choice between exact and approximate counting is a key decision point. State the trade-off explicitly: 'For view counts we can use HyperLogLog at 12KB per counter with less than 1% error. For inventory counts we need exact atomic decrements because selling more items than we have causes real-world shipping failures.' This shows you understand when approximation is acceptable.

Two probabilistic data structures dominate large-scale counting: HyperLogLog for cardinality estimation ("how many unique items?") and Count-Min Sketch for frequency estimation ("how often does each item appear?"). Understanding their internals reveals why they work and when they fail.

[Figure: HyperLogLog hashing elements into buckets and estimating cardinality]

HyperLogLog (HLL)

HyperLogLog estimates the number of distinct elements in a dataset. The core idea: hash each element to a binary string, and track the maximum number of leading zeros observed. More leading zeros imply more distinct elements (like flipping coins — the more flips, the more likely you are to see a long streak of heads).

python
import hashlib
import math

class HyperLogLog:
    def __init__(self, num_buckets=16384):  # 2^14 = 16384 buckets
        self.p = int(math.log2(num_buckets))  # 14 bits for bucket index
        self.buckets = [0] * num_buckets

    def add(self, element):
        h = int(hashlib.sha256(element.encode()).hexdigest(), 16)
        # First p bits determine the bucket
        bucket = h & ((1 << self.p) - 1)
        # Remaining bits: length of the run of 0s before the first 1-bit
        # (scanned from the low end); long runs imply many distinct hashes
        remaining = h >> self.p
        rank = 1
        while remaining > 0 and (remaining & 1) == 0:
            rank += 1
            remaining >>= 1
        self.buckets[bucket] = max(self.buckets[bucket], rank)

    def count(self):
        # Harmonic mean of 2^bucket_value across all buckets
        # (simplified: omits HLL's small- and large-range corrections)
        m = len(self.buckets)
        alpha = 0.7213 / (1 + 1.079 / m)  # Bias correction
        raw = alpha * m * m / sum(2 ** -b for b in self.buckets)
        return int(raw)

# Usage: 12KB for 16384 buckets, counts billions of uniques
hll = HyperLogLog()
for user_id in stream_of_user_ids:
    hll.add(user_id)
print(f"Unique users: {hll.count()}")  # ~0.81% error

Redis has built-in HLL support with PFADD and PFCOUNT:

 
PFADD daily_visitors:2026-03-06 "user_123" "user_456" "user_789"
PFCOUNT daily_visitors:2026-03-06
-- Returns: (integer) 3

-- Merge multiple HLLs (e.g., daily into weekly)
PFMERGE weekly_visitors daily_visitors:2026-03-01 daily_visitors:2026-03-02 ...

HLL's killer feature: PFMERGE combines multiple HLLs into one. You can count unique visitors per hour, then merge hourly HLLs into daily, weekly, and monthly counts without double-counting users who visited multiple times.
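
Merging works because each bucket stores a maximum: taking the bucket-wise max of two sketches yields exactly the sketch you would have built from the union of both streams. A minimal merge helper for the HyperLogLog class above (illustrative; not part of the original listing):

python
def merge_hll(dest, src):
    """Bucket-wise max merge: an element added to either sketch leaves the same
    trace as adding it to the merged sketch, so repeat visitors are not
    double counted."""
    assert dest.p == src.p, "sketches must use the same number of buckets"
    dest.buckets = [max(a, b) for a, b in zip(dest.buckets, src.buckets)]
    return dest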

[Figure: Count-Min Sketch showing increment and query operations across multiple hash arrays]

Count-Min Sketch (CMS)

Count-Min Sketch estimates how often a specific element appears. It uses D independent hash functions, each mapping to one of W positions in an array:

python
import hashlib

class CountMinSketch:
    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, element):
        """Generate D independent hash values."""
        hashes = []
        for i in range(self.depth):
            h = hashlib.md5(f"{i}:{element}".encode()).hexdigest()
            hashes.append(int(h, 16) % self.width)
        return hashes

    def increment(self, element, count=1):
        for i, pos in enumerate(self._hashes(element)):
            self.table[i][pos] += count

    def estimate(self, element):
        """Return minimum across all hash positions (least overcounted)."""
        return min(
            self.table[i][pos]
            for i, pos in enumerate(self._hashes(element))
        )

# Track frequency of URLs across billions of requests
cms = CountMinSketch(width=10000, depth=7)
for url in stream_of_requests:
    cms.increment(url)
print(f"/api/users frequency: {cms.estimate('/api/users')}")

The CMS never undercounts — the true count is always less than or equal to the estimate. Hash collisions cause other elements to increment the same positions, inflating the estimate. Taking the minimum across D arrays reduces this inflation because a collision in all D arrays simultaneously is unlikely.
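
The overcount is also tunable. A standard sizing rule picks width = ceil(e / epsilon) and depth = ceil(ln(1 / delta)), which bounds the overcount by epsilon * N (N being the total number of increments) with probability at least 1 - delta. A small sizing helper, shown as a sketch:

python
import math

def cms_dimensions(epsilon, delta):
    """Standard CMS sizing: with these dimensions, the estimate exceeds the
    true count by more than epsilon * N with probability at most delta."""
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1 / delta))
    return width, depth

# Example: allow at most 0.01% of total traffic as overcount, 99.9% of the time
print(cms_dimensions(0.0001, 0.001))  # -> (27183, 7)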

When to Use Which

  • "How many unique X?" -> HyperLogLog (cardinality)
  • "How often does X appear?" -> Count-Min Sketch (frequency)
  • "Has X been seen before?" -> Bloom Filter (membership)

Key Insight

HyperLogLog and Count-Min Sketch solve fundamentally different counting problems. HLL answers 'how many distinct items?' (cardinality) — it cannot tell you which items or how often. CMS answers 'how often does this specific item appear?' (frequency) — it cannot tell you the total number of distinct items. Confusing the two is a common interview mistake.

Finding the top-K most frequent elements in a data stream is one of the most common analytics problems. Trending topics, popular products, heavy-hitter detection for abuse prevention — all require identifying which elements appear most often without storing the entire stream.

[Figure: Top-K detection using Count-Min Sketch with min-heap for heavy hitter tracking]

Exact Top-K (HashMap + Heap)

The straightforward approach: count every element in a hashmap, then extract the top K with a size-K heap at query time.

python
import heapq
from collections import defaultdict

class ExactTopK:
    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(int)

    def add(self, element):
        self.counts[element] += 1

    def top_k(self):
        return heapq.nlargest(self.k, self.counts.items(), key=lambda x: x[1])

# Problem: 1 billion distinct URLs = 1 billion hashmap entries
# Memory: ~60 GB (URL string + count per entry)

This works for small cardinalities (thousands of distinct elements) but fails when the number of distinct elements is large.

Space-Saving Algorithm

Space-Saving keeps only K counters and guarantees that the true top-K elements are among them:

python
class SpaceSaving:
    def __init__(self, k):
        self.k = k
        self.counters = {}  # element -> count

    def add(self, element):
        if element in self.counters:
            self.counters[element] += 1
        elif len(self.counters) < self.k:
            self.counters[element] = 1
        else:
            # Replace the element with the minimum count
            min_elem = min(self.counters, key=self.counters.get)
            min_val = self.counters.pop(min_elem)
            self.counters[element] = min_val + 1

    def top_k(self):
        return sorted(self.counters.items(), key=lambda x: -x[1])

Space-Saving uses O(K) memory regardless of stream size. The trade-off: counts for elements that replaced a previous minimum are overestimated (they inherit the evicted element's count + 1). But the true heavy hitters (elements with genuinely high frequency) are rarely evicted because their counts are always above the minimum.
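
The textbook formulation of Space-Saving also records, per tracked element, the count inherited at eviction time; count minus error is then a guaranteed lower bound on the true frequency. A sketch of that variant, extending the class above:

python
class SpaceSavingWithError(SpaceSaving):
    def __init__(self, k):
        super().__init__(k)
        self.errors = {}  # element -> count inherited when it entered

    def add(self, element):
        if element in self.counters:
            self.counters[element] += 1
        elif len(self.counters) < self.k:
            self.counters[element] = 1
            self.errors[element] = 0
        else:
            min_elem = min(self.counters, key=self.counters.get)
            min_val = self.counters.pop(min_elem)
            self.errors.pop(min_elem)
            self.counters[element] = min_val + 1
            self.errors[element] = min_val  # everything inherited may be overcount

    def guaranteed_count(self, element):
        """Lower bound on the true frequency of a tracked element."""
        return self.counters[element] - self.errors[element]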

Count-Min Sketch + Heap (Approximate Top-K)

Combine CMS for frequency estimation with a min-heap for tracking the current top-K:

python
import heapq

class CMSTopK:
    def __init__(self, k, cms_width=10000, cms_depth=7):
        self.k = k
        self.cms = CountMinSketch(width=cms_width, depth=cms_depth)
        self.heap = []  # min-heap of (count, element)
        self.in_heap = set()

    def add(self, element):
        self.cms.increment(element)
        estimated_count = self.cms.estimate(element)

        if element in self.in_heap:
            # Update existing element in heap
            self._update_heap(element, estimated_count)
        elif len(self.heap) < self.k:
            heapq.heappush(self.heap, (estimated_count, element))
            self.in_heap.add(element)
        elif estimated_count > self.heap[0][0]:
            # New element beats current minimum
            _, evicted = heapq.heapreplace(
                self.heap, (estimated_count, element))
            self.in_heap.discard(evicted)
            self.in_heap.add(element)

    def _update_heap(self, element, new_count):
        # Refresh the element's count and restore the heap property
        for i, (_, elem) in enumerate(self.heap):
            if elem == element:
                self.heap[i] = (new_count, element)
                heapq.heapify(self.heap)
                break

    def top_k(self):
        return sorted(self.heap, key=lambda x: -x[0])

This uses CMS memory (fixed, independent of stream size) plus K heap entries. The CMS's overcounting bias means some elements may appear more frequent than they are, potentially displacing a true top-K element. In practice, the overcounting is small relative to heavy hitter counts, so the top-K result is accurate for high-frequency elements.

Comparison

Algorithm       | Memory           | Accuracy                       | Stream-Friendly
Exact (HashMap) | O(n distinct)    | Exact                          | No (unbounded)
Space-Saving    | O(K)             | Guaranteed top-K present       | Yes
CMS + Heap      | O(CMS size + K)  | Approximate (CMS overcounting) | Yes

Every counting system sits on a spectrum from "perfectly accurate but expensive" to "approximately accurate but cheap." The right position on this spectrum depends on your business requirements, not your technical preferences.

Decision Framework

Ask three questions about each counter:

  1. What is the cost of being wrong? An inventory count off by 1 causes a fulfillment failure. A view count off by 1,000 is invisible. The higher the cost, the more you should invest in accuracy.
  2. What is the cardinality of the counted set? Counting 1,000 unique items is cheap with exact methods. Counting 1 billion unique items requires probabilistic approaches unless you have terabytes of memory.
  3. What is the query pattern? Write-heavy counters (billions of increments, occasional reads) benefit from sharding and buffering. Read-heavy counters (displayed on every page load) benefit from caching the aggregated value.

Architecture Patterns

Pattern 1: Sharded counter + cache (social metrics)

 
Write: random_shard(post_id) -> INCR shard -> done
Read:  cache.get(post_id) || SUM(all shards) -> cache.set(ttl=2s)

Best for: like counts, view counts, follower counts. Writes are cheap and distributed. Reads are cached and approximate.
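
A read-path sketch for this pattern, reusing the redis client, NUM_SHARDS, and get_count() from the sharded-counter example earlier; the cache key format and the 2-second TTL are illustrative choices:

python
def get_count_cached(counter_id, ttl_seconds=2):
    """Serve the aggregated count from cache; fan out to the shards only on a miss."""
    cache_key = f"counter_total:{counter_id}"
    cached = redis.get(cache_key)
    if cached is not None:
        return int(cached)                 # stale by at most ttl_seconds
    total = get_count(counter_id)          # SUM across all shards
    redis.setex(cache_key, ttl_seconds, total)
    return total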

Pattern 2: HLL per time window (unique counting)

 
Write: PFADD hll:daily:2026-03-06 user_id
Read:  PFCOUNT hll:daily:2026-03-06
Merge: PFMERGE hll:weekly hll:daily:03-01 ... hll:daily:03-07

Best for: DAU/MAU, unique visitors per page, unique events per campaign. Fixed memory per counter, mergeable across time windows.

Pattern 3: CMS + heap (trending/heavy hitters)

 
Write: cms.increment(element)
Read:  heap.top_k() -> return current top-K
Decay: every hour, halve all CMS counters

Best for: trending topics, popular products, heavy-hitter API clients. Identifies high-frequency elements without tracking all elements.
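
A sketch of the hourly decay step against the CountMinSketch class from earlier; halving is an illustrative decay factor that lets old traffic fade while recent heavy hitters keep their lead:

python
def decay_cms(cms, factor=0.5):
    """Scale every cell down so the sketch reflects recent rather than
    all-time frequency (run periodically, e.g. once per hour)."""
    for row in cms.table:
        for i in range(len(row)):
            row[i] = int(row[i] * factor)
    # Note: a production top-K tracker would also rebuild its heap afterwards,
    # since heap entries still hold pre-decay estimates.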

Pattern 4: Exact atomic counter (financial/inventory)

 
Write: DECR inventory:product_123
       if result < 0: INCR inventory:product_123; return OUT_OF_STOCK
Read:  GET inventory:product_123

Best for: inventory, account balances, rate limit counters. No approximation. Single point of serialization accepted because correctness is non-negotiable.
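
A sketch of the decrement-then-rollback flow above using the redis-py client; the key naming and return values are illustrative:

python
import redis as redis_lib

r = redis_lib.Redis()

def reserve_unit(product_id):
    """Atomically claim one unit; undo the decrement if stock would go negative."""
    key = f"inventory:{product_id}"
    remaining = r.decr(key)      # atomic even with concurrent buyers
    if remaining < 0:
        r.incr(key)              # roll back the over-decrement
        return False             # OUT_OF_STOCK
    return True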

Common Interview Question Patterns

Interviewers test whether you can match the counting approach to the use case:

  • "Design a view counter for YouTube" -> Sharded counter + cache (write-heavy, approximate OK)
  • "Track unique visitors for analytics" -> HyperLogLog (cardinality, fixed memory)
  • "Find trending topics on Twitter" -> CMS + heap (frequency + top-K, streaming)
  • "Track inventory for e-commerce" -> Exact atomic counter (correctness required)

Common Mistakes

Using HyperLogLog for frequency. HLL answers "how many unique?" not "how often does X appear?" Use Count-Min Sketch for frequency estimation. These two structures solve fundamentally different problems.

Using sharded counters for inventory. Sharded counters cannot enforce global constraints like "total must not go negative." Individual shards have no visibility into other shards. Use atomic counters for inventory where correctness is non-negotiable.

Ignoring cache staleness. A cached aggregate that is 5 seconds stale is fine for likes, but causes overselling for inventory. Always match your caching strategy to the accuracy requirement of the business domain.

Over-engineering non-critical counters. Page view counts do not need WAL, idempotency, or strong consistency. Accept occasional loss and keep the system simple. The engineering cost of perfect accuracy often exceeds the business value of the counter. Reserve strong consistency for counters where errors have real-world consequences.

The right question is always: "what happens if this count is wrong?" If the answer is "nothing visible to users," use the simplest approach.

Interview Tip

In system design interviews, always state the accuracy requirement before choosing the counting approach. Say: 'View counts can tolerate eventual consistency and approximation, so I will use sharded counters with a cached aggregate. Inventory counts require strong consistency, so I will use an atomic Redis counter with no buffering.' This demonstrates that you choose tools based on requirements, not habit.