Last Updated: May 26, 2026
Many systems need to compare sets without comparing every element.
MinHash is a probabilistic technique for estimating Jaccard similarity between sets. It creates compact signatures that let systems compare documents, users, queries, or item sets efficiently.
It is useful when the question is not "how many distinct items exist?" but "how similar are these two sets?"
Suppose a system has millions of documents and needs to find near-duplicates. Comparing every pair directly is too expensive.
For two sets, Jaccard similarity is:
If two documents share many shingles, their Jaccard similarity is high. If they share few shingles, it is low.
The exact calculation requires the full sets. MinHash compresses each set into a fixed-size signature that approximates this similarity.