How to map hash function output to Bloom filter indices?

Introduction

A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It can tell you "definitely not in the set" or "probably in the set," but never gives a false negative. The key operation that makes a Bloom filter work is mapping the output of hash functions to specific bit positions in a fixed-size bit array.

Understanding how this mapping works, and how to do it efficiently, is essential for implementing Bloom filters correctly and tuning their false positive rate.

How Modulo Mapping Works

A hash function produces a large integer from an input element. To turn that integer into a valid index in a bit array of size m, you take the modulo:

 
index = hash(element) % m

For example, if your bit array has 10 slots and hash("apple") returns 485051, the index is 485051 % 10 = 1. You set bit 1 to 1.
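The hash-then-modulo step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the helper name bit_index and the choice of SHA-256 are assumptions made here. A hashlib digest is used instead of Python's built-in hash() because string hashes from hash() are randomized per process by default, so they would not give stable indices across runs.

```python
import hashlib

def bit_index(element: str, m: int) -> int:
    """Map an element to a bit position in [0, m) via hash-then-modulo."""
    digest = hashlib.sha256(element.encode("utf-8")).digest()
    # Interpret the 32-byte digest as one large non-negative integer.
    hash_value = int.from_bytes(digest, "big")
    return hash_value % m

m = 10
bits = [0] * m
idx = bit_index("apple", m)
bits[idx] = 1  # mark the slot chosen for "apple"
print(idx, bits)
```

The actual index depends on the hash function, so it will generally differ from the value 1 used in the example above.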

A Bloom filter uses k independent hash functions. Each one produces an index, and you set all k corresponding bits; the indices need not be distinct. To check membership, you hash the query element with the same k functions and verify that every resulting bit is set to 1. If any bit is 0, the element is definitely not in the set.

Here is a concrete example with m = 10 and k = 3:

 
hash_1("apple") = 485051  ->  485051 % 10 = 1
hash_2("apple") = 32841   ->  32841  % 10 = 1
hash_3("apple") = 912584  ->  912584 % 10 = 4

Bit array before: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Bit array after:  [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]

Notice that hash_1 and hash_2 both mapped to index 1. This is a collision, and it is expected behavior. Collisions raise the false positive rate but never cause false negatives, because every bit belonging to an inserted element is still set.
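The worked example can be replayed in code to check the arithmetic. The sketch below reuses the three hash values given for "apple"; the hash values for "cherry" are hypothetical, invented here only to show a negative lookup, and the helper names are likewise illustrative.

```python
m = 10

def indices_for(hash_values, m):
    """Reduce each raw hash value to a bit position with modulo."""
    return [h % m for h in hash_values]

# Hash outputs for "apple" taken from the worked example.
apple_hashes = [485051, 32841, 912584]

bits = [0] * m
for idx in indices_for(apple_hashes, m):
    bits[idx] = 1  # setting an already-set bit changes nothing

print(bits)  # [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]

def might_contain(hash_values):
    """True only if every bit selected by the hash values is set."""
    return all(bits[idx] == 1 for idx in indices_for(hash_values, m))

# Hypothetical hash outputs for "cherry", invented for illustration only.
cherry_hashes = [100003, 77, 5]

print(might_contain(apple_hashes))   # True
print(might_contain(cherry_hashes))  # False: bit 3 is still 0
```

Only two bits end up set for three hash functions, which is the collision at index 1 showing up in the array.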

Double Hashing Optimization

Computing k completely independent hash functions can be expensive. A well-known optimization called double hashing uses only two base hash functions, h1 and h2, and derives all k indices from them with the formula:

 
g_i(x) = (h1(x) + i * h2(x)) % m    for i = 0, 1, ..., k-1

Kirsch and Mitzenmacher (2006) showed that this technique preserves the asymptotic false positive probability of a standard Bloom filter while cutting the number of base hash computations down to just two, regardless of k.
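To make the formula concrete, here is a small sketch that applies it to hand-picked integers. The function name double_hash_indices and the input values are made up for illustration; real inputs would be the outputs of two hash functions.

```python
def double_hash_indices(h1: int, h2: int, m: int, k: int) -> list[int]:
    """Derive k bit positions from two base hash values."""
    return [(h1 + i * h2) % m for i in range(k)]

# Small made-up hash values so the arithmetic is easy to follow.
print(double_hash_indices(h1=7, h2=3, m=10, k=4))   # [7, 0, 3, 6]

# Degenerate case: h2 is a multiple of m, so every index collapses to h1 % m.
print(double_hash_indices(h1=7, h2=20, m=10, k=4))  # [7, 7, 7, 7]
```

The second call shows a corner case worth knowing about: when h2 is a multiple of m, all k indices coincide. With a well-distributed h2 this happens for roughly one input in m, and some implementations guard against it by forcing h2 to be nonzero modulo m, though that guard is not part of the formula above.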

Python Implementation

Here is a complete Bloom filter implementation using double hashing with the hashlib module from Python's standard library:

python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items: int, fp_rate: float = 0.01):
        # Derive the bit array size and hash count from the target parameters
        self.m = self._optimal_size(expected_items, fp_rate)
        self.k = self._optimal_hash_count(self.m, expected_items)
        self.bit_array = [0] * self.m

    def _optimal_size(self, n: int, p: float) -> int:
        """Calculate optimal bit array size m, rounded up."""
        return math.ceil(-n * math.log(p) / (math.log(2) ** 2))

    def _optimal_hash_count(self, m: int, n: int) -> int:
        """Calculate optimal number of hash functions k, rounded to nearest."""
        return max(1, round((m / n) * math.log(2)))

    def _get_indices(self, item: str) -> list[int]:
        """Compute k indices using double hashing."""
        data = item.encode("utf-8")
        # Interpret each digest as a large non-negative integer
        h1 = int.from_bytes(hashlib.md5(data).digest(), "big")
        h2 = int.from_bytes(hashlib.sha256(data).digest(), "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for idx in self._get_indices(item):
            self.bit_array[idx] = 1

    def might_contain(self, item: str) -> bool:
        # Any unset bit proves the item was never added
        return all(self.bit_array[idx] == 1
                   for idx in self._get_indices(item))


# Usage
bf = BloomFilter(expected_items=1000, fp_rate=0.01)
bf.add("apple")
bf.add("banana")

print(bf.might_contain("apple"))   # True
print(bf.might_contain("cherry"))  # Almost certainly False
print(f"Bit array size: {bf.m}, Hash functions: {bf.k}")
# Bit array size: 9586, Hash functions: 7

The _get_indices method is where the double hashing happens. It computes MD5 as h1 and SHA-256 as h2, then derives k indices using the linear combination formula.

False Positive Rate Formula

After inserting n elements into a bit array of size m with k hash functions, the probability of a false positive is approximately:

 
p = (1 - e^(-k * n / m))^k

This formula assumes that the hash functions distribute bits uniformly. From it you can derive the optimal number of hash functions for a given m and n:

 
k_optimal = (m / n) * ln(2)

And the optimal bit array size for a target false positive rate p with n expected elements:

 
m_optimal = -(n * ln(p)) / (ln(2))^2

For example, storing 1000 elements with a 1% false positive rate gives m_optimal ≈ 9585.06, so the bit array needs about 9,586 bits after rounding up (roughly 1.2 KB). The matching k_optimal is about 6.64, which rounds to 7 hash functions.
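The three formulas can be checked against each other numerically. The sketch below is one way to do that; the function names are illustrative, and rounding m up and k to the nearest integer are choices made here rather than something the formulas dictate.

```python
import math

def optimal_size(n: int, p: float) -> int:
    """m_optimal = -(n * ln(p)) / (ln(2))^2, rounded up to a whole bit."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))

def optimal_hash_count(m: int, n: int) -> int:
    """k_optimal = (m / n) * ln(2), rounded to the nearest integer."""
    return max(1, round((m / n) * math.log(2)))

def false_positive_rate(m: int, n: int, k: int) -> float:
    """p = (1 - e^(-k * n / m))^k"""
    return (1 - math.exp(-k * n / m)) ** k

n, target = 1000, 0.01
m = optimal_size(n, target)
k = optimal_hash_count(m, n)

print(m, k)                          # 9586 7
print(false_positive_rate(m, n, k))  # close to the 0.01 target
```

Plugging the rounded m and k back into the false positive formula lands very near the 1% target, which is a quick sanity check that the sizing is consistent.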

Common Pitfalls

  • Choosing a bit array that is too small. If m is too small relative to n, the bit array saturates quickly and the false positive rate climbs far above the target. Always use the optimal size formula.
  • Using correlated hash functions. If your k hash functions produce similar outputs for the same input, they tend to set the same bits, which increases false positives and wastes capacity. Ensure independence or use the double hashing technique.
  • Forgetting that Bloom filters do not support deletion. Setting a bit to 0 to remove one element could unset a bit shared by another element, causing false negatives. Use a Counting Bloom filter if you need deletions.
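  • Ignoring the modulo bias. When the number of possible hash values is not an exact multiple of m, the modulo operation slightly favors lower indices. The bias is negligible when the hash range is far larger than m, as with a 128-bit or 256-bit digest, but it can matter when a narrow hash (for example 16 or 32 bits) feeds a comparatively large array.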
  • Treating "probably in the set" as "definitely in the set." A positive result from a Bloom filter must always be confirmed against the authoritative data source if correctness matters.
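The modulo bias from the list above is easiest to see with a deliberately tiny hash range. The sketch below assumes a toy hash that emits every 8-bit value exactly once, which is an idealization chosen only to make the counts exact.

```python
from collections import Counter

HASH_RANGE = 256  # pretend the hash emits 8-bit values 0..255
m = 10

# Count how many hash values land on each index after the modulo step.
counts = Counter(h % m for h in range(HASH_RANGE))
for index in range(m):
    print(index, counts[index])

# 256 = 25 * 10 + 6, so indices 0-5 are hit 26 times and 6-9 only 25 times.
```

With a 256-bit digest in place of the 8-bit toy range, the same imbalance still exists in principle but is far too small to measure.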

Summary

  • A Bloom filter maps hash function outputs to bit array positions using the modulo operation: index = hash(element) % m.
  • Double hashing, (h1(x) + i * h2(x)) % m, lets you derive k indices from only two hash functions without increasing the asymptotic false positive rate.
  • The false positive rate depends on three variables: bit array size m, number of hash functions k, and number of inserted elements n.
  • Use the optimal size and hash count formulas to configure the filter for your target false positive rate.
  • Bloom filters cannot support deletion without a counting variant, and positive results should always be verified against the source of truth.
