Analytics systems often need to answer questions like:

How many unique users visited today?
How many different devices called this API?
How many unique search queries did we see this week?

The exact answer can be expensive when the data is spread across shards, time windows, tenants, and experiments.

HyperLogLog makes this cheaper. It estimates the number of distinct items using a small, fixed-size summary called a sketch. Instead of moving raw user IDs around, systems can merge sketches from many producers.

The trade-off is exactness. HyperLogLog gives an approximate count, not an exact one. It also cannot check whether a specific item exists, list the items it has seen, or delete one item.

This chapter explains the core idea behind HyperLogLog, how registers work, how sketches merge, and where it fits in production.

1. The Counting Problem

The exact approach is a set: add every ID you see, then report the size of the set.

This is exact, but memory grows with the number of unique items.
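As a minimal sketch of that set-based approach, assuming Python and a hypothetical helper named `count_unique_exact` (the name and the sample IDs are illustrative, not taken from any particular system):

```python
def count_unique_exact(ids):
    """Return the exact number of distinct IDs by remembering every one."""
    seen = set()
    for item in ids:
        seen.add(item)  # duplicates are ignored; new IDs grow the set
    return len(seen)


# Illustrative data: five visits from three distinct users.
visits = ["u1", "u2", "u1", "u3", "u2"]
print(count_unique_exact(visits))  # 3
```

The set has to hold every distinct ID until the count is read, which is where the memory growth comes from.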

The real memory cost is usually much higher than the raw ID size. Besides the keys, a set carries hash-table bookkeeping, spare capacity, object overhead, and sometimes the string contents themselves. A billion distinct 8-byte IDs is already 8 GB of raw keys, and the set's overhead pushes the total well beyond that.

Now multiply that by dimensions such as page, tenant, country, device type, hour, experiment variant, and retention period.

Exact sets become expensive quickly.

Question	Exact Set Cost	HyperLogLog Cost
Daily unique users	Grows with unique users	Fixed by sketch precision
Weekly unique users	Requires deduping all days	Merge daily sketches
Unique users per page	One set per page	One sketch per page
Global unique users across shards	Shuffle IDs or sets	Merge sketches

HyperLogLog is useful when you only need the count, not the exact identities behind the count.
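To make the "merge sketches" rows of the table concrete, here is a rough, simplified HyperLogLog in Python. It is a teaching sketch under stated assumptions, not a production implementation. The class name `HyperLogLog`, the methods `add`, `merge`, and `count`, the precision parameter `p`, and the choice of SHA-1 as the hash are all illustrative. It uses the commonly published bias constant and small-range correction. A real library would also pack the registers into a few bits each instead of using a Python list.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        if not 7 <= p <= 16:
            raise ValueError("this sketch assumes 7 <= p <= 16")
        self.p = p
        self.m = 1 << p                # number of registers
        self.registers = [0] * self.m  # fixed size, whatever the input size

    def add(self, item):
        # Assumption: the first 8 bytes of SHA-1 serve as a 64-bit hash.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        index = h >> width                  # top p bits pick a register
        rest = h & ((1 << width) - 1)       # remaining bits
        # Rank = 1-based position of the first 1 bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        # Merging is a register-wise max, so no raw IDs are exchanged.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)    # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)  # small-range (linear counting) fix
        return raw


# Illustrative data: two daily sketches with overlapping users.
monday, tuesday = HyperLogLog(), HyperLogLog()
for i in range(5000):
    monday.add(f"user-{i}")
for i in range(2500, 7500):
    tuesday.add(f"user-{i}")

monday.merge(tuesday)  # now approximates the two-day union (7,500 distinct users)
print(round(monday.count()))
```

With `p=12` the sketch keeps 4,096 registers no matter how many IDs are added, and the usual rule of thumb for the relative standard error is about 1.04 / sqrt(m), or roughly 1.6% here. The printed estimate should land near 7,500 but will not match it exactly.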

2. The Core Idea
