AlgoMaster Logo

Design Bloom Filter

Last Updated: March 8, 2026

Ashish

Ashish Pratap Singh

easy

In this chapter, we will explore the low-level design of bloom filter in detail.

Lets start by clarifying the requirements:

1. Clarifying Requirements

Before starting the design, it's important to ask thoughtful questions to uncover hidden assumptions, clarify ambiguities, and define the system's scope more precisely.

Here is an example of how a discussion between the candidate and the interviewer might unfold:

Based on the discussion, here’s a summary of the functional requirements:

1.1 Functional Requirements

  • Support add(element) operation: adds a string element to the filter
  • Support mightContain(element) operation: returns true if the element might be in the set, false if it is definitely not
  • Support clear() operation: resets the filter to its initial empty state
  • The filter should compute optimal bit array size and number of hash functions from the caller's parameters
  • The caller should be able to choose which hash function algorithm to use

1.2 Non-Functional Requirements

  • Time Complexity: Both add and mightContain should run in O(k) time, where k is the number of hash functions
  • Space Efficiency: The filter should use significantly less memory than storing all elements explicitly
  • Thread Safety: The implementation must be thread-safe for concurrent add and mightContain operations
  • Configurability: The caller should be able to specify expected elements, desired false positive rate, and hash strategy

Now that we understand what we're building, let's identify the building blocks of our system.

2. Identifying Core Entities

Unlike systems that model real-world concepts like users or bookings, a Bloom filter is a data structure design problem. The core challenge is about choosing the right internal components and computing optimal parameters to achieve the desired probabilistic guarantees.

2.1 The Need for Space-Efficient Membership Testing

"The filter should use significantly less memory than storing all elements explicitly"

The simplest way to test set membership is a HashSet. It gives O(1) lookup and zero false positives. So why not just use one?

Because a HashSet stores every element. If you're tracking 10 million URLs for a safe browsing check, that's hundreds of megabytes of memory.

A Bloom filter can answer the same question, "Is this URL in the blacklist?", using a fraction of the space. The trade-off is that it might occasionally say "yes" when the answer is actually "no" (a false positive), but it will never say "no" when the answer is "yes" (no false negatives).

2.2 How Bit Arrays Enable Membership Testing

The foundation of a Bloom filter is a bit array: a fixed-size array where each position is either 0 or 1.

To add an element, we hash it to produce k different positions in the array and set those bits to 1. To check membership, we hash the element the same way and check if all k positions are 1. If any position is 0, the element was definitely never added. If all are 1, the element is probably in the set, but those bits might have been set by other elements.

2.3 Why Multiple Hash Functions Reduce False Positives

With a single hash function, many different elements map to the same position. The collision rate is high, and false positives are frequent. By using k independent hash functions, each element maps to k different positions. For a false positive to occur, ALL k positions must have been set by other elements. The probability of this happening drops exponentially with k.

2.4 Optimal Parameters

The math behind Bloom filters gives us formulas for the optimal bit array size (m) and number of hash functions (k):

  • Bit array size: m = -(n * ln(p)) / (ln(2))^2
  • Number of hash functions: k = (m / n) * ln(2)

Where n is the expected number of elements and p is the desired false positive probability.

For example, if we want to store 1,000,000 elements with a 1% false positive rate:

  • m = -(1,000,000 * ln(0.01)) / (ln(2))^2 = approximately 9,585,059 bits (about 1.14 MB)
  • k = (9,585,059 / 1,000,000) * ln(2) = approximately 7 hash functions

Compare that to a HashSet storing 1 million strings, which could easily consume 50-100 MB. The Bloom filter uses roughly 1% of the space.

2.5 Entity Overview

Here's how these entities relate to each other:

Scroll
EntityTypeResponsibility
HashStrategyInterfaceDefines the contract for hashing an element to a bit position
MurmurHashStrategyImplementationMurmur3-inspired hashing with good distribution
FNVHashStrategyImplementationFNV-1a hashing, simple and fast
DJB2HashStrategyImplementationDJB2 hashing, lightweight alternative
BitArrayData ClassWraps a fixed-size boolean array with set/get/clear operations
BloomFilterConfigData ClassImmutable configuration holding computed optimal parameters
BloomFilterCore ClassCoordinates hash functions and bit array for probabilistic membership testing
BloomFilter.BuilderBuilderConstructs a BloomFilter with configurable parameters and sensible defaults

These entities form the core abstractions of our Bloom filter. They separate concerns cleanly: hashing is pluggable, configuration is computed once and immutable, and the main class coordinates everything.

With our entities identified, let's define their attributes, behaviors, and relationships.

3. Designing Classes and Relationships

Now that we know what entities we need, let's flesh out their details. For each class, we'll define what data it holds (attributes) and what it can do (methods). Then we'll look at how these classes connect to each other.

3.1 Class Definitions

We'll work bottom-up: simple types first, then interfaces, then implementations, then the main class.

Interfaces

HashStrategy

We need a way to hash elements to bit positions. We could hardcode a single hash algorithm, but different scenarios benefit from different algorithms. A URL blacklist might prioritize speed, while a spell checker might prioritize distribution quality. An interface lets us swap algorithms without changing the BloomFilter class.

HashStrategy defines the contract for hashing an element to a position in the bit array.

MethodDescription
hash(element, seed, bitArraySize)Computes a hash for the given element using the provided seed, and returns a position within [0, bitArraySize)

The seed parameter is what gives us k independent hash values from a single algorithm. Instead of implementing k different hash functions, we call the same function k times with different seeds (0, 1, 2, ..., k-1). Each seed produces a different bit position for the same element.

Core Class

BitArray

The bit array is the storage backbone of the Bloom filter. We could use a raw boolean array directly in BloomFilter, but wrapping it in a dedicated class gives us a clean API for set/get/clear operations and keeps the BloomFilter class focused on coordination rather than low-level bit manipulation.

BitArray wraps a fixed-size array of bits with set, get, and clear operations.

Scroll
AttributeTypeDescriptionMutable?
bitsbool[]The underlying bit storageYes (individual bits are set)
sizeintThe number of bits in the arrayNo
MethodDescription
BitArray(size)Creates a bit array of the given size, all bits initialized to false
set(position)Sets the bit at the given position to true
get(position)Returns whether the bit at the given position is true
clear()Resets all bits to false

BloomFilterConfig

Construction of a Bloom filter requires several related parameters, and some of them (bit array size, number of hash functions) are derived from others (expected elements, false positive rate). We need an object to hold these computed values so the BloomFilter can reference them without recomputing.

BloomFilterConfig is an immutable configuration object that holds both the caller's inputs and the computed optimal parameters.

Scroll
AttributeTypeDescriptionMutable?
expectedElementsintThe number of elements the filter is designed forNo
falsePositiveRatedoubleThe desired probability of false positives (e.g., 0.01 for 1%)No
bitArraySizeintComputed optimal bit array size (m)No
numHashFunctionsintComputed optimal number of hash functions (k)No
MethodDescription
BloomFilterConfig(expectedElements, falsePositiveRate)Computes and stores optimal m and k from the given parameters

All fields are read-only. Once the config is created, it never changes. This makes it safe to share across threads without synchronization.

BloomFilter

This is the main class that ties everything together. It coordinates the hash strategy and bit array to provide the probabilistic membership testing API.

BloomFilter accepts string inputs, uses a configurable hash strategy to compute k bit positions, and manages the bit array.

Scroll
AttributeTypeDescriptionMutable?
configBloomFilterConfigImmutable configuration with optimal parametersNo
bitArrayBitArrayThe bit storage for membership trackingYes (bits are set on add)
hashStrategyHashStrategyThe pluggable hash function algorithmNo
MethodDescription
add(element)Hashes the element k times and sets the corresponding bits
mightContain(element)Hashes the element k times and returns true only if all bits are set
clear()Resets the bit array to all zeros

Key Design Principles:

  1. Single Responsibility: BloomFilter coordinates operations but delegates hashing to HashStrategy and bit storage to BitArray.
  2. Encapsulation: The internal bit array and hash computations are hidden. Callers only see addmightContain, and clear.
  3. Thread Safety: Both add and mightContain should be synchronized to prevent race conditions when multiple threads share a filter.

BloomFilter.Builder

The BloomFilter has several construction parameters: expected elements (required), false positive rate (has a sensible default of 0.01), and hash strategy (has a sensible default of Murmur hashing). Rather than a constructor with many parameters, a Builder lets the caller specify only what they care about and rely on defaults for the rest. It also computes the derived values (bit array size, number of hash functions) automatically.

MethodDescription
Builder(expectedElements)Creates a builder with the required expected elements count
falsePositiveRate(rate)Sets the desired false positive rate (default: 0.01)
hashStrategy(strategy)Sets the hash function algorithm (default: MurmurHashStrategy)
build()Validates parameters, computes optimal m and k, and constructs the BloomFilter

3.2 Class Relationships

Composition (Strong Ownership)

  • BloomFilter owns BitArray: The filter creates and manages the bit array's lifecycle. The bit array doesn't exist outside the filter.
  • BloomFilter owns BloomFilterConfig: The config is created during construction and belongs entirely to the filter.

Association (Uses)

  • BloomFilter uses HashStrategy: The filter calls the hash strategy but doesn't own it. The same strategy instance could theoretically be shared across multiple filters.

Creates

  • BloomFilter.Builder creates BloomFilter: The builder computes optimal parameters and constructs a fully configured filter.

Implements

  • MurmurHashStrategy, FNVHashStrategy, DJB2HashStrategy implement HashStrategy: Each provides a different hashing algorithm behind the same interface.

3.3 Key Design Patterns

This problem falls into the data-structure-heavy category. The core challenge is algorithmic (choosing the right data structure, computing optimal parameters), not behavioral (managing states, coordinating observers). So we keep patterns minimal: two patterns that genuinely earn their place.

Strategy Pattern (Hash Strategy)

The Problem: Different hash algorithms offer different trade-offs between speed, distribution quality, and collision resistance. A URL blacklist checking millions of URLs per second might prioritize speed. A spell checker where false positives are more noticeable might prioritize distribution quality.

The Solution: The Strategy pattern lets us define a family of hash algorithms behind a common interface. The BloomFilter delegates hashing to whatever strategy it's given, without knowing which specific algorithm is being used.

Without it, we'd need to hardcode the hash algorithm or use if-else chains to select one. Every new algorithm would require modifying the BloomFilter class. With Strategy, we add new algorithms by creating new classes, following the Open/Closed Principle.

Builder Pattern (BloomFilter.Builder)

The Problem: Constructing a BloomFilter requires an expected element count (required), a false positive rate (optional, has default), and a hash strategy (optional, has default). On top of that, the bit array size and number of hash functions must be computed from these inputs. A constructor with all these parameters is confusing, and computing derived values inside a constructor violates single responsibility.

The Solution: The Builder pattern separates the construction logic. The caller sets only the parameters they care about, and the builder handles validation, default values, and derived parameter computation.

A telescoping constructor (multiple overloaded constructors) would work but scales poorly. With two optional parameters we'd need four constructors. The Builder makes intent clear: new BloomFilter.Builder(1000000).falsePositiveRate(0.001).build() is self-documenting.

3.4 Full Class Diagram

4. Code Implementation

Now let's translate our design into working code. We'll build bottom-up: foundational types first, then interfaces, then implementations, then the main class. This order matters because each layer depends on the ones below it.

4.1 Hash Strategies

HashStrategy Interface

The interface defines a single method that takes an element, a seed, and the bit array size, and returns a valid position.

The seed parameter is what differentiates hash function 0 from hash function 1, 2, and so on. Each seed produces a different output for the same input element.

MurmurHashStrategy

MurmurHash3 is one of the most widely used non-cryptographic hash functions. It's fast, has excellent distribution, and is the default choice in production Bloom filter implementations like Google Guava.

The multiply-xor-shift pattern creates an avalanche effect: small changes in the input produce dramatically different outputs. The finalization mix ensures the final bits are well-distributed.

FNVHashStrategy

FNV-1a (Fowler-Noll-Vo) is simpler than Murmur but still provides good distribution. It's popular in hash tables and network protocols.

FNV-1a XORs each byte before multiplying (unlike FNV-1 which multiplies first). This "xor-then-multiply" order produces better avalanche characteristics.

DJB2HashStrategy

DJB2 is one of the simplest effective hash functions. Created by Daniel J. Bernstein, it uses the magic constant 33 and is often the go-to choice for quick implementations.

The expression (hash << 5) + hash is equivalent to hash * 33, using bit shifting for speed. The constant 5381 was chosen empirically for its good distribution properties.

4.2 BitArray

The BitArray wraps a boolean array and provides a clean API for bit manipulation. All the methods are straightforward, but having them in a dedicated class keeps the BloomFilter code clean.

The clear() method uses Arrays.fill rather than iterating manually. Both achieve the same result, but Arrays.fill is a single method call that's optimized internally.

4.3 BloomFilterConfig

The config computes optimal parameters from the caller's inputs and stores them as read-only fields. This computation happens exactly once at construction time.

A few things to note:

  • Math.ceil for bit array size: We round up to avoid underestimating, which would increase the actual false positive rate beyond what the caller requested.
  • Math.max(1, ...) for hash functions: Even with extreme parameters, we always use at least one hash function. Zero hash functions would make the filter useless.
  • All fields are final: The config is immutable, safe to share across threads without synchronization.

4.4 BloomFilter

This is the core class that coordinates everything. It delegates hashing to the strategy, stores bits in the array, and provides the public API.

Let's trace through the two core operations:

add(element):

  1. Validate the element is not null
  2. Loop k times (one per hash function), using the loop index as the seed
  3. For each iteration, compute a bit position using the hash strategy
  4. Set that bit to true in the bit array

mightContain(element):

  1. Validate the element is not null
  2. Loop k times with the same seeds used during add
  3. For each iteration, compute the bit position and check if it's set
  4. If ANY bit is not set, return false immediately (the element was definitely never added)
  5. If ALL bits are set, return true (the element is probably in the set)

The early return in mightContain is an important optimization. As soon as we find one unset bit, we know the answer is "definitely not in the set" and can skip the remaining hash computations.

Operation Sequence Diagrams

The following diagrams illustrate what happens during add and mightContain operations:

add(element)

mightContain(element) - Definite No

The first diagram shows the add flow: the filter iterates through all k seeds, computing a position for each and setting the corresponding bit. The second diagram shows a membership check where the second bit position is unset, causing an early return of false. This is the "definitely not in the set" guarantee.

4.6 Demo Code

Let's see the complete system in action.

5. Concurrency and Thread Safety

Bloom filters are commonly shared across threads. Think of a web server where multiple request-handling threads check a URL blacklist, or a database where multiple query threads check a Bloom filter before performing expensive disk lookups. These are real scenarios where concurrent access is expected.

Concern 1: Concurrent add() and mightContain()

In a web server checking a URL blacklist, one thread might be adding newly discovered malicious URLs while another thread is checking whether an incoming request's URL is in the blacklist.

Setup: Thread A adds "malicious.com" (which hashes to positions 3, 7, and 11). Thread B simultaneously checks "malicious.com".

Without synchronization:

  1. Thread A: Calls add("malicious.com"), sets bit at position 3
  2. Thread B: Calls mightContain("malicious.com"), checks position 3 - true
  3. Thread B: Checks position 7 - false (Thread A hasn't set it yet)
  4. Thread B: Returns false
  5. Thread A: Sets bit at position 7
  6. Thread A: Sets bit at position 11

Result: Thread B gets a false negative. "malicious.com" was being added to the filter, but the check returned false because only some of the k bits were set at the time of the check. In a safe browsing scenario, this means a known malicious URL slips through.

With synchronization:

Thread A acquires the lock, sets all three bits atomically, and releases the lock. Thread B then acquires the lock, checks all three bits (all true), and returns true. No false negatives from partial writes.

6. Quiz

Design Bloom Filter Quiz

1 / 20
Multiple Choice

What is the primary reason for using a bit array in the design of a Bloom Filter?