How to map hash function output to Bloom filter indices?
Introduction
A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It can tell you "definitely not in the set" or "probably in the set," but never gives a false negative. The key operation that makes a Bloom filter work is mapping the output of hash functions to specific bit positions in a fixed-size bit array.
Understanding how this mapping works, and how to do it efficiently, is essential for implementing Bloom filters correctly and tuning their false positive rate.
How Modulo Mapping Works
A hash function produces a large integer from an input element. To turn that integer into a valid index in a bit array of size m, you take the modulo:
    index = hash(element) % m
For example, if your bit array has 10 slots and hash("apple") returns 485051, the index is 485051 % 10 = 1. You set bit 1 to 1.
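The modulo step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the helper name hash_to_index is hypothetical, and SHA-256 from hashlib is an assumed choice of hash. Python's built-in hash() is avoided because it is randomized per process for strings, which would make indices unstable across runs.

```python
import hashlib

def hash_to_index(element: str, m: int) -> int:
    """Map an element to a bit-array index in [0, m) using modulo."""
    digest = hashlib.sha256(element.encode("utf-8")).digest()
    # Interpret the 32-byte digest as one large non-negative integer.
    value = int.from_bytes(digest, "big")
    return value % m

# The arithmetic from the text: a hash value of 485051 with m = 10.
print(485051 % 10)  # 1

# The same mapping applied to a real digest (always a value from 0 to 9).
print(hash_to_index("apple", 10))
```

Because the digest is 256 bits wide, the integer is far larger than any practical m, so the modulo result is spread very evenly over the index range.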
A Bloom filter uses k independent hash functions. Each one produces a different index, and you set all k corresponding bits. To check membership, you hash the query element with the same k functions and verify that every resulting bit is set to 1. If any bit is 0, the element is definitely not in the set.
Here is a concrete example with m = 10 and k = 3:
Notice that hash_1 and hash_2 both mapped to index 1. This is a collision, and it is expected behavior. Collisions contribute to false positives but never cause false negatives.
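The insert-and-query procedure with m = 10 and k = 3 can be sketched as follows. The k hash functions are simulated here by prefixing a distinct salt byte to SHA-256, which is an assumption made for illustration, so the specific indices printed are illustrative only. The function names are hypothetical.

```python
import hashlib

M = 10  # bit array size
K = 3   # number of hash functions

def indices(element: str, m: int = M, k: int = K) -> list:
    """Return k indices, one per simulated hash function."""
    result = []
    for i in range(k):
        # A distinct salt byte makes each i behave like a separate hash function.
        digest = hashlib.sha256(bytes([i]) + element.encode("utf-8")).digest()
        result.append(int.from_bytes(digest, "big") % m)
    return result

def add(bits: list, element: str) -> None:
    """Set every bit that the element hashes to."""
    for idx in indices(element, len(bits)):
        bits[idx] = 1

def might_contain(bits: list, element: str) -> bool:
    """False means definitely absent; True means probably present."""
    return all(bits[idx] == 1 for idx in indices(element, len(bits)))

bits = [0] * M
add(bits, "apple")
print(indices("apple"), bits)
print(might_contain(bits, "apple"))  # True: an inserted element is always found
```

If two of the three indices happen to coincide, fewer than three bits are set, which is harmless for lookups of that element.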
Double Hashing Optimization
Computing k completely independent hash functions can be expensive. A well-known optimization called double hashing uses only two base hash functions, h1 and h2, and derives all k indices from them with the formula:
    index_i = (h1(x) + i * h2(x)) % m,  for i = 0, 1, ..., k - 1
Kirsch and Mitzenmacher (2006) showed that this technique preserves the asymptotic false positive probability of a standard Bloom filter while cutting the number of hash computations down to just two, regardless of k.
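A minimal sketch of double hashing in Python is shown below. As an assumption for brevity, h1 and h2 are taken from the two 128-bit halves of a single SHA-256 digest instead of from two separate hash algorithms; the function name double_hash_indices is hypothetical.

```python
import hashlib

def double_hash_indices(element: str, m: int, k: int) -> list:
    """Derive k indices as (h1 + i * h2) % m for i = 0 .. k-1."""
    digest = hashlib.sha256(element.encode("utf-8")).digest()
    # Assumption: treat the two 16-byte halves of one digest as h1 and h2.
    h1 = int.from_bytes(digest[:16], "big")
    h2 = int.from_bytes(digest[16:], "big")
    return [(h1 + i * h2) % m for i in range(k)]

print(double_hash_indices("apple", 1000, 7))
```

One caveat of this scheme: if h2 happens to be a multiple of m, all k indices collapse to the same position. With wide hashes this is very unlikely, and some implementations guard against it by adjusting h2 before use.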
Python Implementation
Here is a complete Bloom filter implementation using double hashing with the hashlib module from Python's standard library:
The _get_indices method is where the double hashing happens. It computes MD5 as h1 and SHA-256 as h2, then derives k indices using the linear combination formula.
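A minimal sketch consistent with that description follows. Only the _get_indices name and the MD5/SHA-256 pairing come from the text; the class layout, the packed bytearray storage, and the other method names are assumptions.

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int) -> None:
        if m <= 0 or k <= 0:
            raise ValueError("m and k must be positive")
        self.m = m  # number of bits
        self.k = k  # number of derived hash indices
        self.bits = bytearray((m + 7) // 8)  # packed bit array, all zeros

    def _get_indices(self, item: str) -> list:
        data = item.encode("utf-8")
        # MD5 and SHA-256 are used only for bit distribution, not for security.
        h1 = int.from_bytes(hashlib.md5(data).digest(), "big")
        h2 = int.from_bytes(hashlib.sha256(data).digest(), "big")
        # Double hashing: index_i = (h1 + i * h2) % m
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for idx in self._get_indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[idx // 8] & (1 << (idx % 8))
            for idx in self._get_indices(item)
        )

bf = BloomFilter(m=9586, k=7)
for word in ["apple", "banana", "cherry"]:
    bf.add(word)

print("apple" in bf)   # True: inserted elements are always reported
print("durian" in bf)  # almost certainly False, though a false positive is possible
```

Storing the bits in a bytearray keeps memory use close to m / 8 bytes, which is the point of the data structure; a plain Python list of integers would be simpler but far larger.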
False Positive Rate Formula
The probability of a false positive in a Bloom filter after inserting n elements into a bit array of size m with k hash functions is approximately:
$$p \approx \left(1 - e^{-kn/m}\right)^{k}$$
This formula assumes that the hash functions distribute bits uniformly and independently. From it you can derive the optimal number of hash functions for a given m and n:
$$k = \frac{m}{n} \ln 2$$
And the optimal bit array size for a target false positive rate p with n expected elements:
$$m = -\frac{n \ln p}{(\ln 2)^{2}}$$
For example, storing 1000 elements with a 1% false positive rate requires a bit array of roughly 9,585 bits (about 1.2 KB) and 7 hash functions.
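The sizing arithmetic can be checked with a short script. The function names are hypothetical, and the formulas are the standard approximations: p ≈ (1 - e^(-kn/m))^k, k = (m/n) ln 2, and m = -n ln p / (ln 2)^2. Rounding m up instead of down is a choice made here, which is why the result is one bit larger than the unrounded figure of about 9,585.

```python
import math

def optimal_size(n: int, p: float) -> int:
    """Bits needed for n elements at target false positive rate p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))

def optimal_hash_count(m: int, n: int) -> int:
    """Number of hash functions that minimizes the false positive rate."""
    return max(1, round((m / n) * math.log(2)))

def false_positive_rate(m: int, k: int, n: int) -> float:
    """Approximate false positive probability after n insertions."""
    return (1 - math.exp(-k * n / m)) ** k

m = optimal_size(1000, 0.01)
k = optimal_hash_count(m, 1000)
print(m, k)  # 9586 7
print(false_positive_rate(m, k, 1000))  # close to 0.01
```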
Common Pitfalls
- Choosing a bit array that is too small. If m is too small relative to n, the bit array saturates quickly and the false positive rate climbs far above the target. Always use the optimal size formula.
- Using correlated hash functions. If your k hash functions produce similar outputs for the same input, they tend to set the same bits, which increases false positives and wastes capacity. Ensure independence or use the double hashing technique.
- Forgetting that Bloom filters do not support deletion. Setting a bit to 0 to remove one element could unset a bit shared by another element, causing false negatives. Use a Counting Bloom filter if you need deletions.
- Ignoring the modulo bias. When m is not a power of two, the modulo operation introduces a slight bias toward lower indices, because the hash output range is not an exact multiple of m. In practice this is negligible when the hash output is much wider than m (for example, a 64-bit or 128-bit hash), but it can matter when a narrow hash is paired with a large array.
- Treating "probably in the set" as "definitely in the set." A positive result from a Bloom filter must always be confirmed against the authoritative data source if correctness matters.
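The deletion pitfall above mentions a Counting Bloom filter. A minimal sketch of the idea follows, replacing each bit with an integer counter. The class and method names are hypothetical, and an unbounded Python list of integers is used for clarity where a real implementation would typically use small fixed-width counters.

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m  # one counter per position instead of one bit

    def _get_indices(self, item: str) -> list:
        data = item.encode("utf-8")
        h1 = int.from_bytes(hashlib.md5(data).digest(), "big")
        h2 = int.from_bytes(hashlib.sha256(data).digest(), "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for idx in self._get_indices(item):
            self.counters[idx] += 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[idx] > 0 for idx in self._get_indices(item))

    def remove(self, item: str) -> None:
        # Refuse to decrement for items the filter says are definitely absent.
        if item not in self:
            raise KeyError(item)
        for idx in self._get_indices(item):
            self.counters[idx] -= 1

cbf = CountingBloomFilter(m=1024, k=3)
cbf.add("apple")
cbf.add("banana")
cbf.remove("apple")
print("banana" in cbf)  # True: shared positions keep a nonzero count
```

Removing an element that was never inserted but passes the membership check as a false positive would still corrupt the counters, so deletions should only be issued for elements known to be present.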
Summary
- A Bloom filter maps hash function outputs to bit array positions using the modulo operation: index = hash(element) % m.
- Double hashing ((h1(x) + i * h2(x)) % m) lets you derive k indices from only two hash functions with no loss in false positive guarantees.
- The false positive rate depends on three variables: bit array size m, number of hash functions k, and number of inserted elements n.
- Use the optimal size and hash count formulas to configure the filter for your target false positive rate.
- Bloom filters cannot support deletion without a counting variant, and positive results should always be verified against the source of truth.

