Last Updated: January 22, 2026
A Bloom Filter is a probabilistic data structure used to test whether an element is a member of a set. It is designed to be very fast and extremely space-efficient, especially when working with large volumes of data where memory is constrained.
However, it comes with one trade-off: it may occasionally report that an element is present when it is not (a false positive), although it will never miss an element that was actually added.
In this chapter, we will explore the low-level design of a Bloom Filter in detail.
Let's start by clarifying the requirements:
Before starting the design, it's important to ask thoughtful questions to uncover hidden assumptions, clarify ambiguities, and define the system's scope more precisely.
Here is an example of how a discussion between the candidate and the interviewer might unfold:
Candidate: Should the Bloom Filter support generic types (e.g., any object), or should we limit it to specific data types like strings or integers?
Interviewer: Let’s assume for this design that we’re working with strings.
Candidate: What operations should the Bloom Filter support? Just add and mightContain, or do we also need to support deletion?
Interviewer: Just add and mightContain. Deletion is not required in a standard Bloom Filter.
Candidate: What kind of false result is acceptable? Are we okay with false positives?
Interviewer: Yes, that’s expected. A Bloom Filter can return false positives, but it should never return false negatives.
Candidate: Should the number of hash functions (k) and size of the bit array (m) be configurable or fixed?
Interviewer: Ideally, they should be configurable at initialization. The values may vary depending on the expected number of elements and acceptable false positive rate.
Candidate: What should be the behavior if the same element is added multiple times?
Interviewer: That’s fine. Bloom Filters are idempotent. Adding the same element again should have no effect on correctness.
Candidate: Should the design support concurrency? For example, multiple threads calling add and mightContain?
Interviewer: Yes. Assume the Bloom Filter will be accessed from a multi-threaded environment.
Based on the discussion, here’s a summary of the functional requirements:
- Support add(element) and mightContain(element) operations on strings; deletion is not required.
- False positives are acceptable, but false negatives are not.
- Configurable k (number of hash functions) and m (size of the bit array).
- add and mightContain operations should be optimized for speed, ideally executing in O(1) time (more precisely O(k), which is constant for a fixed configuration).
- Thread-safe operation in a multi-threaded environment.

Now that we understand what we're building, let's identify the building blocks of our system.
How do you go from a list of requirements to actual classes? The key is to look for nouns in the requirements that have distinct attributes or behaviors.
The design of a Bloom Filter is different from typical object-oriented systems. Instead of modeling real-world entities, we focus on choosing the right abstractions for a probabilistic data structure.
Let's walk through our requirements and identify what needs to exist in our system.
"Support add(element) and mightContain(element)"
The fundamental requirement is space efficiency. Since we don't store actual elements, we need a compact representation. This leads us to a bit array, where each element is a single bit (0 or 1).
A BitSet uses approximately 1 bit per entry, while boolean[] typically uses 1 byte (8 bits) per entry due to JVM memory alignment. For a filter with 1 million bits, that's 125KB vs 1MB.
"Support multiple hash function implementations"
To populate and check the bit array, we need a way to map each element to bit positions. This is the role of our hash functions.
A Bloom Filter uses k independent hash functions, each mapping a string to a bit array index between 0 and m-1. The same hash functions must be used in both add() and mightContain() operations.
This suggests two abstractions: a HashStrategy interface that defines the hashing contract, and concrete strategy implementations (such as FNV-1a and DJB2).
"The filter should be configurable and thread-safe"
Finally, we need a public-facing class that ties everything together. The BloomFilter class exposes the public add() and mightContain() methods, holds the bit array, and applies each hash strategy during both operations.

"Support configurable bit array size (m) and number of hash functions (k)"
Creating a Bloom Filter requires several parameters that must be coordinated. We use a Builder to collect these parameters, validate them, and construct the filter.
Here's how these entities relate to each other:
We've identified several types of entities:
Enums define fixed sets of values for type safety.
Interfaces define contracts that implementations must follow.
Strategy Implementations provide the actual hash function logic.
Factory encapsulates strategy creation logic.
Core Classes contain the main Bloom Filter logic.
With our entities identified, let's define their attributes, behaviors, and relationships.
Now that we know what entities we need, let's flesh out their details. For each class, we'll define what data it holds (attributes) and what it can do (methods). Then we'll look at how these classes connect to each other.
We'll work bottom-up: simple types first, then interfaces, then implementations, then the main class.
HashType identifies the available hash function algorithms.
Why use an enum instead of strings?
Because HashType.FNV1A is self-documenting and type-safe. You can't accidentally pass "FNV1a" (wrong case) or "FNV-1a" (wrong format). The compiler catches invalid types at compile time.
HashStrategy: Defines the contract for hash functions.
The interface is intentionally minimal. It takes a string and returns a long. The Bloom Filter will take the absolute value and modulo by the bit array size to get the actual index.
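As a small illustration (the class and method names here are hypothetical), that long-to-index conversion might look like the following. Note that Math.floorMod is a safer choice than taking the absolute value, because Math.abs(Long.MIN_VALUE) is itself negative:

```java
class IndexMapper {
    // Map a raw 64-bit hash to a bit-array index in [0, m).
    // Math.floorMod always yields a non-negative result, avoiding the
    // edge case where Math.abs(Long.MIN_VALUE) overflows and stays negative.
    static int toIndex(long hash, int m) {
        return (int) Math.floorMod(hash, (long) m);
    }
}
```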
FNV1aHashStrategy: Implements the FNV-1a 64-bit hash algorithm.
FNV-1a is fast, simple, and has good distribution for small strings. The algorithm XORs each byte with the hash, then multiplies by the prime.
DJB2HashStrategy: Implements the DJB2 hash algorithm.
DJB2 uses the formula hash = hash * 33 + c for each character. It's known for simplicity and relatively low collision rates.
HashStrategyFactory: Creates hash strategy instances from enum types.
The factory decouples strategy instantiation from client code. Clients don't need to know class names like FNV1aHashStrategy. They just pass HashType.FNV1A.
BloomFilter: This is the main class that exposes the public API (add and mightContain) and orchestrates the behavior of the internal components.
BloomFilter.Builder: Provides fluent configuration.
The Builder validates all parameters and ensures no invalid BloomFilter is ever created.
Understanding how these classes interact clarifies system behavior and responsibilities.
We apply a few classic design principles and patterns to keep the system modular, flexible, and testable.
The Problem: A Bloom Filter needs multiple hash functions, and different use cases might prefer different algorithms. If we hardcode hash logic directly in the BloomFilter, we can't swap algorithms without modifying the filter class. Adding a new hash function would require changing existing code.
The Solution: The Strategy pattern encapsulates each hash algorithm in its own class. The BloomFilter holds a list of HashStrategy implementations and uses them interchangeably.
Why This Pattern: We could use a simple switch statement, but the Strategy pattern gives us open/closed extensibility (new algorithms can be added without modifying BloomFilter), independent testability of each algorithm, and the ability to mix and match strategies at runtime.
We pass a list of strategies rather than generating multiple hashes from a single strategy. This gives maximum flexibility. Each strategy can use a completely different algorithm. If you wanted multiple hashes from the same algorithm with different seeds, you could create a SeededHashStrategy decorator.
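As a sketch of that idea, such a decorator could wrap any base strategy (the SeededHashStrategy name comes from the text above; the mixing constant and finalizer are assumptions of this sketch):

```java
interface HashStrategy {           // repeated here so this snippet compiles on its own
    long hash(String input);
}

// Hypothetical decorator: derives additional hash functions from one
// base strategy by mixing a per-instance seed into the result.
class SeededHashStrategy implements HashStrategy {
    private final HashStrategy base;
    private final long seed;

    SeededHashStrategy(HashStrategy base, long seed) {
        this.base = base;
        this.seed = seed;
    }

    @Override
    public long hash(String input) {
        long h = base.hash(input) ^ seed;
        // A simple invertible finalizer so different seeds decorrelate
        // the outputs; the constant is a common golden-ratio multiplier.
        h *= 0x9E3779B97F4A7C15L;
        return h ^ (h >>> 32);
    }
}
```

Two decorators with different seeds behave like two independent hash functions while reusing a single underlying algorithm.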
The Problem: Creating hash strategies requires knowing concrete class names like FNV1aHashStrategy. Clients shouldn't need to know these implementation details. If we rename a class or change its package, all client code breaks.
The Solution: The Factory pattern provides a single creation point. Clients pass an enum value, and the factory returns the appropriate implementation.
Why This Pattern: A factory is simpler than full dependency injection but still provides decoupling. It's appropriate here because the set of implementations is small and known in advance, the creation logic is trivial, and clients only ever need to supply an enum value.
The Problem: Creating a BloomFilter requires several parameters that must be coordinated: bit array size, number of hash functions, and the actual strategies. A constructor with all these parameters would be hard to read and easy to misuse.
The Solution: The Builder pattern provides a fluent API for step-by-step configuration. The build() method validates everything before creating the filter.
Why This Pattern: The Builder is appropriate when an object requires several coordinated parameters, when validation should happen once before construction, and when a fluent, readable configuration API is preferable to a long constructor parameter list.
Now let's translate our design into working code. We'll build bottom-up: foundational types first, then interfaces, then implementations, then the main class. This order matters because each layer depends on the ones below it.
We start with the enum that other classes depend on.
Simple and clean. Just two values representing our supported hash algorithms.
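A minimal sketch of such an enum:

```java
// Identifies the supported hash algorithms. An enum is self-documenting
// and type-safe: an invalid value fails at compile time, not at runtime.
enum HashType {
    FNV1A,
    DJB2
}
```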
Now we define the contract that our hash function classes will implement.
The interface is minimal by design. A hash function takes a string and returns a long. The BloomFilter handles converting this to a bit array index. This separation of concerns keeps the strategies focused on hashing only.
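A sketch of that contract:

```java
// Contract for all hash functions: a string in, a 64-bit hash out.
// Mapping the long to a bit-array index is the BloomFilter's job,
// which keeps each strategy focused purely on hashing.
interface HashStrategy {
    long hash(String input);
}
```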
Each strategy encapsulates a specific hash algorithm.
The FNV-1a algorithm is elegant in its simplicity. For each byte, we XOR it with the current hash, then multiply by the prime. The magic numbers come from the FNV specification and are chosen to provide good distribution.
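A sketch of the FNV-1a strategy, using the offset basis and prime from the FNV specification (the interface is repeated so the snippet stands alone):

```java
import java.nio.charset.StandardCharsets;

interface HashStrategy {           // repeated so this snippet compiles alone
    long hash(String input);
}

// FNV-1a 64-bit: XOR each byte into the hash, then multiply by the
// FNV prime. The constants come from the FNV specification.
class FNV1aHashStrategy implements HashStrategy {
    private static final long OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    @Override
    public long hash(String input) {
        long hash = OFFSET_BASIS;
        for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xffL);   // XOR the byte first (the "1a" variant)
            hash *= FNV_PRIME;     // then multiply by the prime
        }
        return hash;
    }
}
```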
DJB2 uses the classic formula hash * 33 + c. The bit shift (hash << 5) + hash is an efficient way to compute hash * 33. This algorithm is known for its simplicity and decent performance.
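A sketch of the DJB2 strategy along the same lines:

```java
interface HashStrategy {           // repeated so this snippet compiles alone
    long hash(String input);
}

// DJB2: hash = hash * 33 + c for each character, starting from 5381.
// (hash << 5) + hash is an efficient way to compute hash * 33.
class DJB2HashStrategy implements HashStrategy {
    @Override
    public long hash(String input) {
        long hash = 5381;
        for (int i = 0; i < input.length(); i++) {
            hash = ((hash << 5) + hash) + input.charAt(i);
        }
        return hash;
    }
}
```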
Each strategy is independently testable. You can unit test FNV1aHashStrategy without creating a BloomFilter. Just create the strategy, call hash(), and verify the output.
The factory decouples strategy creation from client code.
The factory exposes a single static method because it holds no state. Clients call HashStrategyFactory.create(HashType.FNV1A) without needing to instantiate the factory.
The default case might seem unnecessary since we handle all enum values. It's defensive programming, a safety net if someone adds a new enum value but forgets to update the factory.
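A sketch of the factory, with the strategy bodies stubbed out to keep the focus on the creation logic:

```java
// Repeated minimal declarations so this snippet compiles alone.
interface HashStrategy { long hash(String input); }
enum HashType { FNV1A, DJB2 }
class FNV1aHashStrategy implements HashStrategy {
    public long hash(String input) { return 0; }   // stub; real FNV-1a body omitted
}
class DJB2HashStrategy implements HashStrategy {
    public long hash(String input) { return 0; }   // stub; real DJB2 body omitted
}

// Static factory: maps an enum value to a strategy instance, so
// clients never reference concrete class names.
final class HashStrategyFactory {
    private HashStrategyFactory() {}   // no instances; purely static

    static HashStrategy create(HashType type) {
        switch (type) {
            case FNV1A: return new FNV1aHashStrategy();
            case DJB2:  return new DJB2HashStrategy();
            // Defensive default: fails fast if a new enum value is
            // added but the factory is not updated.
            default: throw new IllegalArgumentException("Unknown hash type: " + type);
        }
    }
}
```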
This is the main class where everything comes together. Let's build it piece by piece.
The constructor is private. All BloomFilter instances must be created through the Builder. This ensures validation always happens.
Notice all fields are final. Once a BloomFilter is created, its configuration never changes. The only mutable state is the bit array itself, and we control access to that through synchronized methods.
The add method is synchronized for thread safety. Multiple threads can call add concurrently without corrupting the bit array.
For each hash function, we compute the element's hash, map it to a bit index in the range [0, m) via modulo, and set that bit to true.
The mightContain method uses the same index calculation. If any bit is false, the element was definitely never added (no false negatives). If all bits are true, the element might have been added, or we might have a false positive from hash collisions.
We use synchronized on both methods rather than more granular locking. For a Bloom Filter, this is usually fine because operations are fast (O(k) where k is small). If profiling showed contention, we could use a lock-free approach with AtomicIntegerArray, but that adds complexity.
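Putting those pieces together, a simplified sketch of the core class might look like this (the full design hides the constructor behind the Builder; it is relaxed here so the snippet stands alone):

```java
import java.util.BitSet;
import java.util.List;

interface HashStrategy { long hash(String input); }   // repeated so this compiles alone

// Simplified sketch of the core class. All configuration fields are
// final; the BitSet is the only mutable state, guarded by synchronized
// methods.
class BloomFilter {
    private final BitSet bits;
    private final int bitArraySize;
    private final int numHashFunctions;
    private final List<HashStrategy> strategies;

    BloomFilter(int bitArraySize, int numHashFunctions, List<HashStrategy> strategies) {
        this.bitArraySize = bitArraySize;
        this.numHashFunctions = numHashFunctions;
        this.strategies = List.copyOf(strategies);
        this.bits = new BitSet(bitArraySize);
    }

    // Set one bit per hash function.
    public synchronized void add(String element) {
        for (int i = 0; i < numHashFunctions; i++) {
            bits.set(indexFor(strategies.get(i).hash(element)));
        }
    }

    // If any bit is clear, the element was definitely never added.
    public synchronized boolean mightContain(String element) {
        for (int i = 0; i < numHashFunctions; i++) {
            if (!bits.get(indexFor(strategies.get(i).hash(element)))) {
                return false;
            }
        }
        return true;   // possibly added, or a false positive
    }

    // floorMod keeps the index in [0, m) even for negative hashes.
    private int indexFor(long hash) {
        return (int) Math.floorMod(hash, (long) bitArraySize);
    }
}
```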
Now let's look at the Builder:
Each with* method validates its input and returns this for chaining. The build method performs cross-field validation before constructing the filter.
The critical validation is strategies.size() < numHashFunctions. If you configure 5 hash functions but only provide 2 strategies, that's a bug. We catch it at build time, not when add throws an IndexOutOfBoundsException.
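A sketch of how the Builder might look (method names and validation messages are assumptions of this sketch):

```java
import java.util.ArrayList;
import java.util.List;

interface HashStrategy { long hash(String input); }   // repeated so this compiles alone

// Each with* method validates its own argument and returns this for
// chaining; build() performs the cross-field checks.
class BloomFilter {
    private final int bitArraySize;
    private final int numHashFunctions;
    private final List<HashStrategy> strategies;

    // Private constructor: instances can only come from a validated Builder.
    private BloomFilter(Builder b) {
        this.bitArraySize = b.bitArraySize;
        this.numHashFunctions = b.numHashFunctions;
        this.strategies = List.copyOf(b.strategies);
    }

    static class Builder {
        private int bitArraySize;
        private int numHashFunctions;
        private final List<HashStrategy> strategies = new ArrayList<>();

        Builder withBitArraySize(int m) {
            if (m <= 0) throw new IllegalArgumentException("m must be positive");
            this.bitArraySize = m;
            return this;
        }

        Builder withNumHashFunctions(int k) {
            if (k <= 0) throw new IllegalArgumentException("k must be positive");
            this.numHashFunctions = k;
            return this;
        }

        Builder addStrategy(HashStrategy strategy) {
            strategies.add(strategy);
            return this;
        }

        BloomFilter build() {
            // Cross-field check: k hash functions need at least k strategies.
            // Caught here at build time, not later inside add().
            if (strategies.size() < numHashFunctions) {
                throw new IllegalStateException("Expected at least " + numHashFunctions
                        + " strategies, got " + strategies.size());
            }
            return new BloomFilter(this);
        }
    }
}
```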
Let's see the complete system in action.
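To make this concrete, here is a compact, self-contained sketch that wires everything together end to end (class bodies condensed; names as used throughout this chapter):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

interface HashStrategy { long hash(String input); }

enum HashType { FNV1A, DJB2 }

// FNV-1a 64-bit (constants from the FNV specification).
class FNV1aHashStrategy implements HashStrategy {
    public long hash(String input) {
        long h = 0xcbf29ce484222325L;
        for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xffL);
            h *= 0x100000001b3L;
        }
        return h;
    }
}

// DJB2: hash = hash * 33 + c, starting from 5381.
class DJB2HashStrategy implements HashStrategy {
    public long hash(String input) {
        long h = 5381;
        for (int i = 0; i < input.length(); i++) h = ((h << 5) + h) + input.charAt(i);
        return h;
    }
}

final class HashStrategyFactory {
    private HashStrategyFactory() {}
    static HashStrategy create(HashType type) {
        switch (type) {
            case FNV1A: return new FNV1aHashStrategy();
            case DJB2:  return new DJB2HashStrategy();
            default: throw new IllegalArgumentException("Unknown hash type: " + type);
        }
    }
}

class BloomFilter {
    private final BitSet bits;
    private final int m;
    private final List<HashStrategy> strategies;

    private BloomFilter(Builder b) {
        this.m = b.m;
        this.strategies = List.copyOf(b.strategies);
        this.bits = new BitSet(m);
    }

    public synchronized void add(String element) {
        for (HashStrategy s : strategies) bits.set(indexFor(s.hash(element)));
    }

    public synchronized boolean mightContain(String element) {
        for (HashStrategy s : strategies) {
            if (!bits.get(indexFor(s.hash(element)))) return false;
        }
        return true;
    }

    private int indexFor(long hash) { return (int) Math.floorMod(hash, (long) m); }

    static class Builder {
        private int m;
        private int k;
        private final List<HashStrategy> strategies = new ArrayList<>();
        Builder withBitArraySize(int m) { this.m = m; return this; }
        Builder withNumHashFunctions(int k) { this.k = k; return this; }
        Builder addStrategy(HashStrategy s) { strategies.add(s); return this; }
        BloomFilter build() {
            if (m <= 0 || k <= 0 || strategies.size() < k)
                throw new IllegalStateException("Invalid Bloom Filter configuration");
            return new BloomFilter(this);
        }
    }
}

class Demo {
    public static void main(String[] args) {
        BloomFilter filter = new BloomFilter.Builder()
                .withBitArraySize(1 << 20)   // m = 2^20 bits (~128 KB)
                .withNumHashFunctions(2)     // k = 2
                .addStrategy(HashStrategyFactory.create(HashType.FNV1A))
                .addStrategy(HashStrategyFactory.create(HashType.DJB2))
                .build();
        filter.add("alice@example.com");
        System.out.println(filter.mightContain("alice@example.com")); // true: was added
        System.out.println(filter.mightContain("bob@example.com"));   // almost certainly false; false positives remain possible
    }
}
```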
The following diagram illustrates what happens when an element is added:
What is the primary reason for using a bit array in the design of a Bloom Filter?