Bloom Filters: A Probabilistic Permission Slip to Skip Work

February 4, 2026


A bloom filter is a probabilistic membership test. You hand it a key, and it answers either "definitely not in the set" or "maybe in the set." It never returns false negatives, only false positives, and the false positive rate is a knob you size at construction time. The structure is a bit array plus K hash functions. On insert, hash the key K times and set those K bits. On query, hash the key K times and check those bits. If any bit is zero, the key is definitely absent. If all are one, the key is probably present, but some other insert may have set those same bits.

The sizing math is well understood. For a target false positive rate of 1 percent at 10 million keys, you need roughly 10 bits per key and 7 hash functions. That is 12 MB to filter a set that would otherwise take hundreds of megabytes to store as a hash set. The payoff is enormous when the alternative is a disk seek or a network call.

Bloom filters are everywhere once you know to look. Every SSTable in an LSM tree has one, so a read can skip files that definitely do not contain the key. CDN caches use them to check existence before fetching from origin. Web crawlers use them to remember which URLs they have already visited. Anti-spam systems use them as URL blacklists. The pattern is always the same: turn an expensive check into a cheap memory probe, and accept a small false positive rate as the cost.

Deletes are the awkward part. You cannot just unset bits, because those bits might be shared with other keys that are still in the set. Counting bloom filters replace each bit with a small counter, incrementing on insert and decrementing on delete, at the cost of extra memory. LSM trees sidestep the problem by rebuilding the bloom filter from scratch during compaction.

The production failure I have watched twice now: a team sized a bloom filter for 10 million keys at 1 percent false positive rate, hardcoded at deploy time. The keyspace grew to 100 million over a year. False positive rate climbed to roughly 63 percent because the bit array saturated with ones. The filter was returning "maybe" for almost every query, so every read still hit disk. The team only noticed because read latency crept up week over week. The fix: rebuild bloom filters at actual cardinality during compaction, and move to a scalable bloom filter, a series of filters with growing capacity, so the structure absorbs growth without resizing the world.

A bloom filter is a permission slip to skip work. Size it for the keyspace you actually have, not the one you launched with.

Key takeaway

A bloom filter buys you a cheap maybe-or-no answer about set membership, sized by bits per key and target false positive rate. The catch is that the keyspace must stay near the size you planned for, or the optimization collapses.

Originally posted on LinkedIn. View original.


All Rights Reserved.