Last Updated: March 8, 2026
A Bloom Filter is a probabilistic data structure used to test whether an element is a member of a set. It is designed to be very fast and extremely space-efficient, especially when working with large volumes of data where memory is constrained.
Loading simulation...
However, it comes with one trade-off:
In this chapter, we will explore the low-level design of bloom filter in detail.
Lets start by clarifying the requirements:
Before starting the design, it's important to ask thoughtful questions to uncover hidden assumptions, clarify ambiguities, and define the system's scope more precisely.
Here is an example of how a discussion between the candidate and the interviewer might unfold:
Candidate: "What type of elements will the Bloom filter store? Are we working with strings, integers, or arbitrary objects?"
Interviewer: "Let's keep it simple and work with strings. That covers most practical use cases like URLs, usernames, and email addresses."
Candidate: "What operations should the Bloom Filter support? Just add and mightContain, or do we also need to support deletion"
Interviewer: "Just add and mightContain. Deletion is not required in a standard Bloom Filter."
Candidate: "What false positive rate is acceptable? Should the caller be able to configure this?"
Interviewer: "Yes, the caller should be able to specify a desired false positive rate. A reasonable default would be 1%."
Candidate: "Should the filter be thread-safe? Could multiple threads add elements or check membership concurrently?"
Interviewer: " Yes, assume it will be used in a multi-threaded environment, like multiple request handlers checking a URL blacklist."
Candidate: "Should we allow the caller to choose which hash function algorithm to use?"
Interviewer: "That would be nice. Different use cases might benefit from different hash function characteristics."
Based on the discussion, here’s a summary of the functional requirements:
add(element) operation: adds a string element to the filtermightContain(element) operation: returns true if the element might be in the set, false if it is definitely notclear() operation: resets the filter to its initial empty stateadd and mightContain should run in O(k) time, where k is the number of hash functionsadd and mightContain operationsNow that we understand what we're building, let's identify the building blocks of our system.
Unlike systems that model real-world concepts like users or bookings, a Bloom filter is a data structure design problem. The core challenge is about choosing the right internal components and computing optimal parameters to achieve the desired probabilistic guarantees.
"The filter should use significantly less memory than storing all elements explicitly"
The simplest way to test set membership is a HashSet. It gives O(1) lookup and zero false positives. So why not just use one?
Because a HashSet stores every element. If you're tracking 10 million URLs for a safe browsing check, that's hundreds of megabytes of memory.
A Bloom filter can answer the same question, "Is this URL in the blacklist?", using a fraction of the space. The trade-off is that it might occasionally say "yes" when the answer is actually "no" (a false positive), but it will never say "no" when the answer is "yes" (no false negatives).
The foundation of a Bloom filter is a bit array: a fixed-size array where each position is either 0 or 1.
To add an element, we hash it to produce k different positions in the array and set those bits to 1. To check membership, we hash the element the same way and check if all k positions are 1. If any position is 0, the element was definitely never added. If all are 1, the element is probably in the set, but those bits might have been set by other elements.
With a single hash function, many different elements map to the same position. The collision rate is high, and false positives are frequent. By using k independent hash functions, each element maps to k different positions. For a false positive to occur, ALL k positions must have been set by other elements. The probability of this happening drops exponentially with k.
The math behind Bloom filters gives us formulas for the optimal bit array size (m) and number of hash functions (k):
Where n is the expected number of elements and p is the desired false positive probability.
For example, if we want to store 1,000,000 elements with a 1% false positive rate:
Compare that to a HashSet storing 1 million strings, which could easily consume 50-100 MB. The Bloom filter uses roughly 1% of the space.
Here's how these entities relate to each other:
These entities form the core abstractions of our Bloom filter. They separate concerns cleanly: hashing is pluggable, configuration is computed once and immutable, and the main class coordinates everything.
In an actual interview, you are not expected to know or implement specific hash algorithms like MurmurHash, FNV, or DJB2. These are included here for completeness.
In an interview, a simpler approach works just as well: use a basic polynomial hash function with different seeds (e.g., hash = hash * seed + char for each character).
What matters is demonstrating that you understand the Strategy pattern, the role of seeds in producing independent hash functions, and how the hash output maps to a bit position. The specific algorithm is not what the interviewer is evaluating.
With our entities identified, let's define their attributes, behaviors, and relationships.
Now that we know what entities we need, let's flesh out their details. For each class, we'll define what data it holds (attributes) and what it can do (methods). Then we'll look at how these classes connect to each other.
We'll work bottom-up: simple types first, then interfaces, then implementations, then the main class.
HashStrategyWe need a way to hash elements to bit positions. We could hardcode a single hash algorithm, but different scenarios benefit from different algorithms. A URL blacklist might prioritize speed, while a spell checker might prioritize distribution quality. An interface lets us swap algorithms without changing the BloomFilter class.
HashStrategy defines the contract for hashing an element to a position in the bit array.
The seed parameter is what gives us k independent hash values from a single algorithm. Instead of implementing k different hash functions, we call the same function k times with different seeds (0, 1, 2, ..., k-1). Each seed produces a different bit position for the same element.
BitArrayThe bit array is the storage backbone of the Bloom filter. We could use a raw boolean array directly in BloomFilter, but wrapping it in a dedicated class gives us a clean API for set/get/clear operations and keeps the BloomFilter class focused on coordination rather than low-level bit manipulation.
BitArray wraps a fixed-size array of bits with set, get, and clear operations.
BloomFilterConfigConstruction of a Bloom filter requires several related parameters, and some of them (bit array size, number of hash functions) are derived from others (expected elements, false positive rate). We need an object to hold these computed values so the BloomFilter can reference them without recomputing.
BloomFilterConfig is an immutable configuration object that holds both the caller's inputs and the computed optimal parameters.
All fields are read-only. Once the config is created, it never changes. This makes it safe to share across threads without synchronization.
BloomFilterThis is the main class that ties everything together. It coordinates the hash strategy and bit array to provide the probabilistic membership testing API.
BloomFilter accepts string inputs, uses a configurable hash strategy to compute k bit positions, and manages the bit array.
add, mightContain, and clear.add and mightContain should be synchronized to prevent race conditions when multiple threads share a filter.BloomFilter.BuilderThe BloomFilter has several construction parameters: expected elements (required), false positive rate (has a sensible default of 0.01), and hash strategy (has a sensible default of Murmur hashing). Rather than a constructor with many parameters, a Builder lets the caller specify only what they care about and rely on defaults for the rest. It also computes the derived values (bit array size, number of hash functions) automatically.
This problem falls into the data-structure-heavy category. The core challenge is algorithmic (choosing the right data structure, computing optimal parameters), not behavioral (managing states, coordinating observers). So we keep patterns minimal: two patterns that genuinely earn their place.
The Problem: Different hash algorithms offer different trade-offs between speed, distribution quality, and collision resistance. A URL blacklist checking millions of URLs per second might prioritize speed. A spell checker where false positives are more noticeable might prioritize distribution quality.
The Solution: The Strategy pattern lets us define a family of hash algorithms behind a common interface. The BloomFilter delegates hashing to whatever strategy it's given, without knowing which specific algorithm is being used.
Without it, we'd need to hardcode the hash algorithm or use if-else chains to select one. Every new algorithm would require modifying the BloomFilter class. With Strategy, we add new algorithms by creating new classes, following the Open/Closed Principle.
The Problem: Constructing a BloomFilter requires an expected element count (required), a false positive rate (optional, has default), and a hash strategy (optional, has default). On top of that, the bit array size and number of hash functions must be computed from these inputs. A constructor with all these parameters is confusing, and computing derived values inside a constructor violates single responsibility.
The Solution: The Builder pattern separates the construction logic. The caller sets only the parameters they care about, and the builder handles validation, default values, and derived parameter computation.
A telescoping constructor (multiple overloaded constructors) would work but scales poorly. With two optional parameters we'd need four constructors. The Builder makes intent clear: new BloomFilter.Builder(1000000).falsePositiveRate(0.001).build() is self-documenting.
Now let's translate our design into working code. We'll build bottom-up: foundational types first, then interfaces, then implementations, then the main class. This order matters because each layer depends on the ones below it.
The interface defines a single method that takes an element, a seed, and the bit array size, and returns a valid position.
The seed parameter is what differentiates hash function 0 from hash function 1, 2, and so on. Each seed produces a different output for the same input element.
MurmurHash3 is one of the most widely used non-cryptographic hash functions. It's fast, has excellent distribution, and is the default choice in production Bloom filter implementations like Google Guava.
The multiply-xor-shift pattern creates an avalanche effect: small changes in the input produce dramatically different outputs. The finalization mix ensures the final bits are well-distributed.
FNV-1a (Fowler-Noll-Vo) is simpler than Murmur but still provides good distribution. It's popular in hash tables and network protocols.
FNV-1a XORs each byte before multiplying (unlike FNV-1 which multiplies first). This "xor-then-multiply" order produces better avalanche characteristics.
DJB2 is one of the simplest effective hash functions. Created by Daniel J. Bernstein, it uses the magic constant 33 and is often the go-to choice for quick implementations.
The expression (hash << 5) + hash is equivalent to hash * 33, using bit shifting for speed. The constant 5381 was chosen empirically for its good distribution properties.
The BitArray wraps a boolean array and provides a clean API for bit manipulation. All the methods are straightforward, but having them in a dedicated class keeps the BloomFilter code clean.
The clear() method uses Arrays.fill rather than iterating manually. Both achieve the same result, but Arrays.fill is a single method call that's optimized internally.
The config computes optimal parameters from the caller's inputs and stores them as read-only fields. This computation happens exactly once at construction time.
A few things to note:
This is the core class that coordinates everything. It delegates hashing to the strategy, stores bits in the array, and provides the public API.
Let's trace through the two core operations:
The early return in mightContain is an important optimization. As soon as we find one unset bit, we know the answer is "definitely not in the set" and can skip the remaining hash computations.
The following diagrams illustrate what happens during add and mightContain operations:
The first diagram shows the add flow: the filter iterates through all k seeds, computing a position for each and setting the corresponding bit. The second diagram shows a membership check where the second bit position is unset, causing an early return of false. This is the "definitely not in the set" guarantee.
Let's see the complete system in action.
Bloom filters are commonly shared across threads. Think of a web server where multiple request-handling threads check a URL blacklist, or a database where multiple query threads check a Bloom filter before performing expensive disk lookups. These are real scenarios where concurrent access is expected.
In a web server checking a URL blacklist, one thread might be adding newly discovered malicious URLs while another thread is checking whether an incoming request's URL is in the blacklist.
Setup: Thread A adds "malicious.com" (which hashes to positions 3, 7, and 11). Thread B simultaneously checks "malicious.com".
add("malicious.com"), sets bit at position 3mightContain("malicious.com"), checks position 3 - truefalseResult: Thread B gets a false negative. "malicious.com" was being added to the filter, but the check returned false because only some of the k bits were set at the time of the check. In a safe browsing scenario, this means a known malicious URL slips through.
Thread A acquires the lock, sets all three bits atomically, and releases the lock. Thread B then acquires the lock, checks all three bits (all true), and returns true. No false negatives from partial writes.
What is the primary reason for using a bit array in the design of a Bloom Filter?