AlgoMaster Logo

HashSet<T>

High Priority31 min readUpdated June 6, 2026

HashSet<T> is the .NET collection for storing a group of unique values with fast membership tests. It uses a hash table internally, so checking whether an item is already in the set, adding a new item, or removing one all take constant time on average. The trade-off is that the set doesn't preserve any order and won't accept duplicates: a second Add of the same value is silently rejected.

Why a Set Instead of a List

A List<T> fits when you care about order, you might have duplicates, or you need positional access by index. Asking "is this email already in the list?" forces the runtime to scan every entry until it finds a match or runs off the end. For a cart of five items it doesn't matter. For ten thousand customer emails it does.

HashSet<T> flips the priority. It discards ordering and indexing in exchange for O(1) average-case Contains, Add, and Remove. Building a deduplication step on a long list turns an O(n^2) loop into O(n).

Two distinct emails went in, the duplicate ada@shop.com was ignored, and Contains answered both questions in constant time. There is no index, no seenEmails[0], no "the first email I added." The set knows what's inside and nothing more.

HashSet<T> trades a small per-item memory overhead (the hash table buckets and entry records) for O(1) average lookup. If you only need to iterate and don't need fast Contains, a List<T> is cheaper. Use HashSet<T> when membership tests or deduplication are the point.

HashSet<T> lives in System.Collections.Generic, the same namespace as List<T> and Dictionary<TKey, TValue>. Under .NET 6 and later, that namespace is part of the implicit global usings for new console projects, so the using System.Collections.Generic; directive is not required. The code samples in this lesson include it explicitly so the snippets work even outside a default template.

Constructing a HashSet

Four constructors cover most cases. The default constructor creates an empty set with the standard equality rules for T. Passing an integer pre-sizes the hash table so it doesn't have to grow as you add. Passing an IEqualityComparer<T> swaps in custom equality (covered later in this lesson). Passing any IEnumerable<T> builds the set from an existing sequence, deduplicating as it goes.

The fromList set started from four strings but contains three, because the second "SKU-1" was a duplicate the constructor dropped. Count always returns the number of unique values currently stored, not the number of values that were ever offered.

The capacity constructor is a hint, not a hard limit. The set can grow beyond it. Pre-sizing matters when you know roughly how many items you're going to add, because growing a hash table requires allocating a larger bucket array and re-hashing every existing entry into it. Doing that once during construction is cheaper than doing it a few times as the set fills up.

Each resize re-hashes every entry. If you're about to add 100,000 items and you don't pre-size, the table will resize several times as it grows. Passing capacity: 100000 to the constructor avoids that work.

There's also an IEnumerable<T> plus comparer overload, used to deduplicate a sequence under custom equality. Building a case-insensitive set from a list of emails is a common shape:

Four input strings collapsed to two unique customers because the comparer treats "Ada@Shop.com" and "ada@shop.com" as the same value. The set keeps the first form it saw for each equivalence class, which is why the output shows Ada@Shop.com rather than ada@shop.com. The custom comparer changes the rules for the whole life of the set: every subsequent Add, Contains, and Remove uses it too.

Collection expressions (C# 12) also work, though there's a subtlety. The target type matters because the compiler picks the constructor based on the variable's declared type:

The collection expression ["new", "sale", "limited", "new"] produces a HashSet<string> because the declared type on the left says so. The duplicate "new" is dropped at construction time. If you'd written var tags = ["new", "sale", ...], the compiler wouldn't know which collection type you meant and would refuse to infer one.

Add Returns a Bool

Add is the standard way to put a value into the set. The return value is a bool: the set tells you whether the value was actually inserted. true means "this was new and I added it." false means "I already had this, nothing changed."

The third call returned false because "SKU-101" was already in the set. The set didn't throw and didn't replace anything; it reported back that no insertion occurred. Count stays at 2.

That return value lets you write deduplication logic without an explicit Contains check, which is both shorter and faster (one hash lookup instead of two):

The if (!seenOrderIds.Add(orderId)) line does both the check and the insertion in one shot. If the order is new, Add returns true, the negation is false, and processing continues. If the order is a repeat, Add returns false, the negation is true, and the early return fires. This is the canonical "process each thing once" pattern in C#, and it's a strict improvement over the two-step if (!set.Contains(x)) { set.Add(x); ... }.

A combined Add+Contains is one hash lookup, not two. Use Add's return value when you'd otherwise check before inserting. The runtime is the same, the code is shorter, and there's no window between the check and the insert.

The set's equality rules decide what counts as "already there." For value types and strings, the default rules are usually correct. For your own classes, the default is reference equality, which means two different objects with the same content count as distinct. The "Custom Equality" section later in this lesson shows how to change that.

Contains, Remove, Count, Clear

These four members cover the rest of the basic set API. Contains answers a yes/no membership question. Remove deletes one specific value and returns whether it was there. Count reports the current size. Clear empties the set.

Contains returns true only for values that are present according to the set's equality rules. Remove returns true if it actually removed something and false if the value wasn't there to begin with. Count is a property, not a method, and it's already maintained as items go in and come out, so reading it is O(1). Clear empties the table but keeps the internal bucket array, so the set is ready to accept new items without re-allocating right away.

The collection initializer in new HashSet<string> { "Coffee Maker", "Headphones", "Notebook" } is shorthand for an empty set plus three Add calls. It uses the same deduplication rules as everything else, so writing { "x", "x" } in an initializer ends up with one "x" in the set, not two.

Contains is O(1) on average. The worst case is O(n) when every value hashes to the same bucket, but that almost never happens with standard equality on strings, integers, or value types. A custom GetHashCode that returns a constant produces the worst case quickly.

Removing while iterating throws. HashSet<T>, like the other generic collections, doesn't allow mutation during a foreach. To remove a batch of items, either collect them first and remove afterward, or use RemoveWhere:

RemoveWhere walks the set once, applies the predicate to each item, and removes the matches. It returns the number of items removed. This is the way to do bulk deletion. Writing a foreach that calls Remove inside the loop body fails at runtime with InvalidOperationException saying the collection was modified.

The substring sku["SKU-".Length..] is a range expression: it takes everything from index 4 to the end of the string. Combined with int.Parse, it pulls the numeric part out of "SKU-101" so the predicate can check whether it's even. Ranges arrived in C# 8 and apply here because the prefix length is fixed.

Iteration Order is Unspecified

A foreach over a HashSet<T> visits every element exactly once, but the order it visits them in is an implementation detail. It is not insertion order. It is not sorted. It can change between .NET versions, and it can change between two sets that received the same inserts in a different sequence.

Possible output:

This particular run happens to print insertion order on .NET 8 because the hash table buckets fall out that way for these specific strings. That's a coincidence, not a guarantee. Inserting the same four products on a different machine, or on a different .NET version, or after a few additions and removals that perturb the bucket layout, can produce a different sequence. Production code must never depend on it.

If you need a specific order, use a different collection or sort the set's contents at the point you need them:

OrderBy returns an IEnumerable<string> sorted in ascending order, so the foreach walks the sorted projection rather than the underlying set. The set itself is unchanged. When this is done on every read, SortedSet<T> is a better fit because it keeps the elements ordered all the time at the cost of O(log n) operations instead of O(1).

OrderBy on a HashSet<T> allocates a temporary sorted projection on each call. To sort the same set repeatedly, cache the result, or switch to a collection that keeps order.

The same rule applies to ToList(), ToArray(), and any other materialization. They preserve whatever order the source enumerator happened to produce, which is the bucket-walk order for a HashSet<T>. The result is deterministic for that particular set instance at that particular moment, but it isn't a meaningful order to rely on.

Set Algebra: Mutating Methods

A set type is not only for deduplication; it also supports set algebra. HashSet<T> has four methods for the standard binary set operations, and they all mutate the set in place.

MethodEquivalentWhat it does
UnionWith(other)A ∪ BAdds every element of other not already in this.
IntersectWith(other)A ∩ BKeeps only elements also in other.
ExceptWith(other)A − BRemoves every element of other from this.
SymmetricExceptWith(other)A ▵ BKeeps elements in exactly one of the two, not both.

Each takes an IEnumerable<T>, so the argument doesn't have to be another HashSet<T>. Passing a list, array, or any sequence works fine.

Output (order within each line may vary):

Each operation starts by copying cartA into a fresh set, then mutates the copy. UnionWith and friends modify the receiver. Running cartA.UnionWith(cartB) directly would cause the original cartA to gain SKU-104 and SKU-105, which is rarely the goal when computing a derived view. Copy first, mutate second.

The same relationship drawn as a Venn diagram, with the four results highlighted by which region each one selects:

The diagram emphasizes a useful frame: each set operation is a choice of which Venn regions to keep. Union keeps all three. Intersect keeps the center. Except keeps the left lobe. Symmetric except keeps both lobes but drops the center.

In an e-commerce flow, these operations come up constantly. "Which SKUs are in both the customer's wishlist and the current sale?" is wishlist.IntersectWith(saleSkus). "What does my cart look like if I merge in the recommended items?" is cart.UnionWith(recommended). "Which items in the cart are out of stock right now?" is cart.IntersectWith(outOfStockSkus). Set algebra replaces nested foreach loops for these questions.

Output (order may vary):

Two items from the wishlist intersect with the current sale. The wishlist has four items and the sale has four, and the answer is the overlap. The implementation is two lines (copy + intersect), and the cost is O(n) in the smaller set: for each item in the smaller set, the runtime does one O(1) lookup in the larger set.

All four mutating set operations take an IEnumerable<T>. When the argument is a HashSet<T> with the same comparer, the runtime optimizes the operation to walk the smaller side and do O(1) lookups in the larger. When the argument is an arbitrary sequence, it falls back to iterating the sequence and doing the lookups on each item. Both are O(n) in total, but the path matters when the sequence is expensive to iterate (a database query, a streamed file).

The mutating methods can also filter incrementally. Given a running set of "products the customer might want" and a new constraint, calling IntersectWith with the constraint set narrows the working set in place without any intermediate allocations.

Set Algebra: Read-Only Methods

The other six set methods don't mutate anything. They answer yes/no questions about how two sets relate.

MethodTrue when
IsSubsetOf(other)Every element of this is in other.
IsSupersetOf(other)Every element of other is in this.
IsProperSubsetOf(other)this is a subset of other AND they're not equal.
IsProperSupersetOf(other)this is a superset of other AND they're not equal.
Overlaps(other)At least one element is in both.
SetEquals(other)Both sets contain exactly the same elements.

The "proper" variants add the strict requirement that the two sides aren't equal. A set is a subset of itself, but it isn't a proper subset of itself. This distinction matches the standard mathematical one and shows up in tests like "does the customer have strictly more items in their cart than the discount requires?"

The cart contains everything the discount requires, plus an extra item. IsSupersetOf is true because every element of the smaller set is in the cart. IsProperSupersetOf is also true because the cart isn't equal to the discount set (it has the extra cookie). SetEquals is false because the two sets aren't identical, and Overlaps is true because at least one element appears in both.

SetEquals deserves a closer look. It compares contents, not references. Two sets with the same elements in any internal order are equal. Two sets that look identical in their constructor calls are still distinct objects, but SetEquals will report them equal.

ReferenceEquals is false because cartA and cartB are two different objects on the heap. SetEquals is true because both objects hold the same three string values. The order they were declared in doesn't matter, and the underlying bucket order doesn't matter either. Use SetEquals when the question is "do these two sets describe the same group?".

All six read-only operations are at worst O(n + m) where n and m are the sizes of the two sets. Most of them short-circuit. Overlaps can return as soon as it finds one common element, and IsSubsetOf can return false as soon as it finds an element of this that's not in other. Passing the smaller set as this tends to be faster.

These methods are also useful as guards. Before applying a "buy three of category X, get one free" discount, the cart needs to contain three items from category X. That is a superset test, not a count loop. Before merging two recommendation feeds, the code checks whether they have any overlap at all. That is Overlaps. The methods read like the question being asked.

Custom Equality with IEqualityComparer&lt;T&gt;

The set's default equality rules come from T's Equals and GetHashCode methods. For strings, that's case-sensitive ordinal equality. For value types and records, that's structural equality on all the members. For your own classes, that's reference equality by default, which means two different Customer objects with the same email count as distinct unless the class overrides equality.

When the default isn't desired, pass an IEqualityComparer<T> to the constructor. The set will use it for every operation: Add, Contains, Remove, all the set algebra methods, everything. The contract for IEqualityComparer<T> itself is covered in detail in the _IComparer & IEqualityComparer_ lesson. The short version: it's an interface with two methods, Equals(x, y) for comparison and GetHashCode(x) for bucket placement, and both have to agree (if Equals says two values are equal, GetHashCode must produce the same hash for them).

Case-insensitive string comparison is the most common case, and the BCL ships a ready-made comparer:

The set was constructed with StringComparer.OrdinalIgnoreCase, so all three casings of Ada's email map to the same hash and compare equal. The second Add returns false because the comparer treats "ADA@shop.com" as already-present, and Contains returns true for any casing of the same address. The set keeps the first form it saw.

StringComparer has several built-in members:

ComparerBehavior
StringComparer.OrdinalCase-sensitive, byte-by-byte. The default.
StringComparer.OrdinalIgnoreCaseCase-insensitive, byte-by-byte. Fast.
StringComparer.CurrentCultureUses the current locale's collation rules.
StringComparer.CurrentCultureIgnoreCaseLocale-aware and case-insensitive.
StringComparer.InvariantCultureLocale-independent culture-aware comparison.
StringComparer.InvariantCultureIgnoreCaseSame but case-insensitive.

For SKUs, email addresses, and any other "user-supplied identifier" string, OrdinalIgnoreCase is almost always correct. The culture-aware comparers are for human-language text where things like Turkish dotted-vs-dotless I matter.

For your own classes, write the comparer. The following comparer treats two Customer objects as equal when their emails match case-insensitively:

The two Customer instances are different objects with different Name values, but the comparer only looks at Email. The second Add returns false because, according to the comparer, the customer is already in the set. The set keeps the first instance it saw, including the original name.

The two methods on the comparer have to agree. If Equals(a, b) is true, then GetHashCode(a) must equal GetHashCode(b). The contract is what lets the hash table find a candidate in O(1): the runtime first jumps to the bucket determined by GetHashCode, then walks that bucket calling Equals to find an actual match. If two equal objects produce different hashes, the runtime will look in the wrong bucket and report the value as absent. The _IComparer & IEqualityComparer_ lesson walks through this contract and the common ways to break it.

GetHashCode runs on every Add, Contains, and Remove. A comparer whose GetHashCode is expensive (allocates, calls into a database, walks a deep object graph) makes the whole set slow. Keep it cheap. For strings, the built-in StringComparer types use highly tuned implementations.

For records, a comparer is usually unnecessary. C# generates structural Equals and GetHashCode based on the record's positional parameters, and the default rules already treat two records with the same field values as equal:

The record Sku(string Code) gives Sku value-like equality. Two Sku instances with the same Code are considered equal, and GetHashCode is implemented in lockstep. The set treats them as the same value without any explicit comparer. This is one of the reasons records are a popular choice for identifier-like types in modern C#.

Inside the Hash Table

The set's performance comes from its internal layout. Knowing roughly how it works helps with two things: predicting cost when writing something unusual, and understanding why operations occasionally slow down (resize) or behave oddly (custom comparers).

A HashSet<T> stores its entries in an array of "buckets," each of which is the head of a short chain of entries. When you call Add(value), the runtime does roughly this:

  1. Compute hash = comparer.GetHashCode(value).
  2. Find a bucket index by taking hash % bucketCount.
  3. Walk the chain at that bucket. For each existing entry, call comparer.Equals(entry, value). If any returns true, the value is already present, return false.
  4. If no match was found, append a new entry to the chain and return true.

Contains and Remove follow the same first three steps and stop when they find a match. Because well-distributed hashes spread entries evenly across buckets, the average chain length stays close to one, and each operation costs O(1).

The flow makes it clear where each part of the comparer contract matters. GetHashCode decides which bucket to look in, so a bad hash distribution sends too many entries to the same bucket and slows everything down. Equals decides whether a chain entry is the target value, so a wrong Equals will either fail to find existing entries (false negatives, leading to duplicates) or wrongly identify unrelated entries as matches (false positives, leading to lost data).

When the set fills up, the bucket array resizes. The runtime allocates a larger array (typically the next prime above twice the current size), recomputes the bucket index for every existing entry, and copies them into the new array. This is the expensive step that capacity pre-sizing avoids. After the resize, operations resume at O(1) average.

HashSet<T> is a hash table, with all the strengths and trade-offs that implies: fast for the common case, dependent on a reasonable hash distribution, and order-free.

Common Patterns

A few patterns come up often enough to call out by name.

Deduplicating a sequence. Pass an IEnumerable<T> to the constructor and take the count, or iterate the resulting set. This is the standard one-liner for "how many unique values are in this list?"

The constructor walks the array once, adds each value to the hash table, and drops duplicates. Count reports the unique total. For a long log, this is dramatically cheaper than a nested loop or a sort-then-scan, and the case-insensitive comparer handles the typical "Ada@Shop.com vs ada@shop.com" problem at the same time.

Fast membership testing in a hot path. Build the set once, query it many times. A common example is "is this SKU on the discount list?" running inside a checkout loop.

The membership test inside the loop is constant time. If discountedSkus were a List<string>, the same check would scan from the beginning every iteration, turning the whole loop from O(n) into O(n * m) where m is the size of the discount list. With small numbers the difference is unnoticeable. With a thousand cart items and ten thousand discount SKUs, the difference is roughly 10,000,000 comparisons versus 1,000.

Finding common elements without nested loops. When the question is "what's in both?", the answer is IntersectWith, not a doubly-nested foreach.

Output (order may vary):

The intersection takes O(n) in the smaller set, with each lookup in the larger set being O(1). The nested-loop alternative is O(n * m). This is a strong argument for keeping membership-tested collections as sets in the first place.

Restricting a working set incrementally. When filters pile up ("must be in stock," "must be in the customer's preferred category," "must be on sale"), apply each as an IntersectWith against the running set. The mutations stack cleanly and there's no allocation overhead beyond the initial copy.

Three filters reduced six SKUs to one. The code reads in the same order the constraints would be described in plain language. Each step mutates working in place, so the only allocation is the initial copy. No temporary lists, no LINQ chains, no per-iteration garbage.

Performance Notes

HashSet<T> performance lives or dies by the quality of GetHashCode. A well-distributed hash spreads entries across buckets so each chain stays short and lookups stay O(1). A bad hash (everything returning the same value, or values that collide for predictable inputs) collapses every operation toward O(n).

For built-in types the runtime handles this well. int.GetHashCode is the integer itself. string.GetHashCode is randomized per process to defend against hash-flooding attacks, which means the same string can hash differently between two runs of the same program, a subtlety that matters when persisting hash codes. Records autogenerate a hash based on their fields. Custom classes with a manual Equals override need a matching GetHashCode, which the compiler will warn about if only one is provided.

OperationAverageWorst case
Add(item)O(1)O(n)
Contains(item)O(1)O(n)
Remove(item)O(1)O(n)
Count (property)O(1)O(1)
Clear()O(n)O(n)
foreachO(n)O(n)
UnionWith(other)O(m)O(n + m)
IntersectWith(other)O(min)O(n + m)
ExceptWith(other)O(m)O(n + m)
SetEquals(other)O(n)O(n + m)

The worst case is what happens when every hash collides. Code using standard equality rules does not hit it. It can be engineered by writing a deliberately bad GetHashCode, which is a useful exercise for understanding the cost model. For correctness, only Equals matters. For speed, GetHashCode matters at least as much.

Re-hashing during resize is a hidden cost. Building a set of a million items without pre-sizing causes the bucket array to resize many times along the way, and each resize is O(current size). Pre-size with the constructor's capacity argument when the final count is approximately known.

Memory usage also matters. Each entry in a HashSet<T> carries a small fixed overhead (the hash code, the next-pointer for chaining, and the value itself), so a set is somewhat heavier per element than a packed array. For sets of value types this overhead is modest; for sets of reference types it's negligible compared to the heap objects the references point to.

For very small sets (a handful of items), a List<T> with a linear Contains can be faster in practice, because the constant factors on the hash table are higher than a tight loop over four entries. The crossover point is small, usually under a dozen elements, but it exists. Benchmark a hot path with a known-small set before assuming HashSet<T> is automatically the answer.

Quiz

HashSet Quiz

10 quizzes