When we think of data structures and performance, the primary metrics to keep in mind are the time complexities of various operations like insertion, deletion, and lookup. Amongst all .NET collections, HashSet stands out as particularly efficient when it comes to membership checks. Here’s why:
1. Underlying Data Structure: The Hash Table
The HashSet<T> in C# is implemented using a hash table. At a high level, a hash table uses a hash function to map keys (or, in the case of HashSet, values) to indices of an array.
For membership checks, this process boils down to:
- Computing the hash code of the item.
- Using this hash code to find an index in the underlying array.
- Checking if the item exists at that location.
Because this process does not depend on the number of elements in the hash table, it has an average time complexity of O(1) thus making it excellent when storing large amounts of data in memory.
2. Collision Resolution
Collisions occur when two different elements have the same hash code. The .NET implementation of HashSet uses a technique called chaining to manage these collisions, which involves maintaining a list of items that hash to the same index.
Even with this, thanks to good hash functions and optimal load factors, HashSet usually keeps the average search time close to O(1).
3. No Extra Data
Unlike other collections like Dictionary or SortedSet, HashSet only deals with unique elements without any additional data like key-value pairs or sorted order. This reduces overhead during membership checks.
4. Dynamic Resizing
As elements are added to a HashSet, it doesn’t just keep growing uncontrollably. Once a certain load factor (ratio of the number of items to the size of the underlying array) is reached, the HashSet resizes itself, ensuring that the average search time remains O(1).
Caveats
It’s important to note that the O(1) time complexity is an average case. In the worst-case scenario (like when all elements collide to the same index), the performance can degrade to O(n). However, with a well-distributed hash function, this is a rare occurrence.
Let’s see a simple benchmark:
[MemoryDiagnoser]
public class ContainsSetVsListVsArray
{
private HashSet<int> _set;
private List<int> _list;
private int[] _array;
[Params(10, 100, 1000, 10000)]
public int Number;
public int ContainedNumber => Number / 2;
[GlobalSetup]
public void GlobalSetup()
{
_set = new HashSet<int>(Enumerable.Range(1, Number));
_list = new List<int>(Enumerable.Range(1, Number));
_array = Enumerable.Range(1, Number).ToArray();
}
[Benchmark]
public bool ContainsSet()
{
return _set.Contains(ContainedNumber);
}
[Benchmark]
public bool ContainsList()
{
return _list.Contains(ContainedNumber);
}
[Benchmark]
public bool ContainsArray()
{
return _array.Contains(ContainedNumber);
}
}
| Method | Number | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|---|
| ContainsSet | 10 | 3.588 ns | 0.0559 ns | 0.0523 ns | – |
| ContainsList | 10 | 3.942 ns | 0.0884 ns | 0.0827 ns | – |
| ContainsArray | 10 | 4.536 ns | 0.0691 ns | 0.0646 ns | – |
| ContainsSet | 100 | 3.667 ns | 0.0283 ns | 0.0236 ns | – |
| ContainsList | 100 | 5.661 ns | 0.1316 ns | 0.1099 ns | – |
| ContainsArray | 100 | 6.050 ns | 0.0553 ns | 0.0490 ns | – |
| ContainsSet | 1000 | 3.661 ns | 0.0357 ns | 0.0317 ns | – |
| ContainsList | 1000 | 21.461 ns | 0.4565 ns | 0.4885 ns | – |
| ContainsArray | 1000 | 23.524 ns | 0.4211 ns | 0.3733 ns | – |
| ContainsSet | 10000 | 3.465 ns | 0.0334 ns | 0.0313 ns | – |
| ContainsList | 10000 | 178.155 ns | 3.3170 ns | 4.3131 ns | – |
| ContainsArray | 10000 | 236.917 ns | 4.1383 ns | 3.8710 ns | – |
Conclusion
HashSet in C# is an excellent tool when you need a collection that supports rapid membership checks. Thanks to its underlying hash table structure, minimized data overhead, and dynamic resizing, it typically performs membership tests in constant time. However, as with any data structure, it’s essential to understand its strengths and limitations to use it effectively.
*Keep in mind that in .NET 8 FrozenSets are going the be the best option when you need to squeeze that extra performance. We ‘ll talk about those when the time comes.
**Don’t forget to set initial capacity for Hashsets since they use an array internally.
Leave a comment