03. Chaining & Open Addressing
When multiple keys map to the same bucket index, a collision occurs. This chapter explores how hash tables resolve this core issue using two fundamentally different memory paradigms.
1. The Collision Reality
There are two main strategies: Separate Chaining (used by Java HashMap and C++ std::unordered_map) and Open Addressing (used by the Python dictionary).
Strategy 1: Separate Chaining
Each bucket in the array holds a Linked List (or a dynamic array). When a collision occurs, simply append the new key to the list at that index.
- Pros:
- Simple to implement.
- Gracefully handles high load factors (performance degrades gradually).
- Deletion is easy (just remove from linked list).
- Cons:
- Requires extra memory for pointers.
- Poor cache locality (linked list nodes are scattered in memory).
Strategy 2: Open Addressing (Linear Probing)
All elements are stored directly in the array. If the target bucket is full, we search for the next empty slot using a Probing Sequence.
- Linear Probing: Check index, then index+1, then index+2, …
- Pros:
- Better memory usage (no pointers).
- Excellent cache locality (sequential access).
- Cons:
- Clustering: Groups of occupied slots tend to merge, creating long “runs” that slow down insertions.
- Performance degrades sharply as load factor approaches 1.
- Deletion is tricky (requires “tombstones” or re-inserting elements).
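The probing sequence, tombstone deletion, and the "tricky" interactions between them can be sketched as follows. This is a minimal illustration, not a production design; the class name, the fixed capacity, and the `RuntimeError` on a full table are assumptions for the example (a real table would rehash instead).

```python
# Minimal open-addressing sketch: linear probing with tombstones.
_TOMBSTONE = object()  # marks a deleted slot so later probes can continue past it


class LinearProbingMap:
    def __init__(self, capacity=8):
        self._slots = [None] * capacity  # None = never occupied

    def _probe(self, key):
        """Yield slot indices in linear-probe order: i, i+1, i+2, ... (wrapping)."""
        i = hash(key) % len(self._slots)
        for _ in range(len(self._slots)):
            yield i
            i = (i + 1) % len(self._slots)

    def put(self, key, value):
        first_free = None  # remember the first tombstone so we can reuse it
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is None:
                # Key is not in the table; insert at the earliest reusable slot.
                self._slots[first_free if first_free is not None else i] = (key, value)
                return
            if slot is _TOMBSTONE:
                if first_free is None:
                    first_free = i
            elif slot[0] == key:
                self._slots[i] = (key, value)  # update existing key in place
                return
        if first_free is not None:
            self._slots[first_free] = (key, value)
        else:
            raise RuntimeError("table full; a real map would rehash here")

    def get(self, key):
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is None:  # a truly empty slot ends the search
                raise KeyError(key)
            if slot is not _TOMBSTONE and slot[0] == key:
                return slot[1]
        raise KeyError(key)

    def delete(self, key):
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is None:
                raise KeyError(key)
            if slot is not _TOMBSTONE and slot[0] == key:
                # Setting None here would break lookups for keys further along
                # the probe run, so we leave a tombstone instead.
                self._slots[i] = _TOMBSTONE
                return
        raise KeyError(key)
```

Note why `delete` cannot simply write `None`: a `None` slot is the signal that a probe run has ended, so clearing a slot in the middle of a run would make every key stored past it unreachable.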
Hardware Truth: CPU Cache vs. Pointers
Modern software is bottlenecked by memory latency, not CPU cycles.
The Chaining Latency
In Separate Chaining, the table contains a pointer to a Node object.
- CPU reads bucket → Pointer address.
- CPU must fetch that address (Pointer Chase).
- Cache Miss Probability: High. Each jump in the linked list likely hits a different “cache line,” forcing a slow trip to main RAM.
The Open Addressing Advantage
In Linear Probing, the data is inline.
- CPU reads bucket i.
- CPU pre-fetches the next 64 bytes (the “Cache Line”) automatically.
- If a collision exists at i+1, the data is already in the L1/L2 Cache.
[!IMPORTANT] Because of this, Linear Probing is often 3x-10x faster than Chaining for small-to-medium datasets, even if it performs more comparisons. In modern systems, a comparison is nearly free; a memory fetch is expensive.
2. Load Factor & Rehashing
The Load Factor (α) is defined as:
α = (Number of Elements) / (Table Size)
- For Chaining, α can be > 1.
- For Open Addressing, α must be ≤ 1.
When α exceeds a threshold (e.g., 0.75), the hash table must Rehash:
- Create a new array double the size.
- Iterate through all old elements.
- Re-insert them into the new array (their new index will be different!).
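The three steps above can be sketched for a chaining table, where each bucket is a list of (key, value) pairs. The function name `rehash` and the `hash_fn` parameter are illustrative choices, not part of any particular library:

```python
def rehash(buckets, hash_fn):
    """Double the table and re-insert every entry. Each entry's index is
    recomputed because it depends on the new, larger table size."""
    new_buckets = [[] for _ in range(2 * len(buckets))]  # step 1: double capacity
    for bucket in buckets:                               # step 2: walk old entries
        for key, value in bucket:
            new_index = hash_fn(key) % len(new_buckets)  # step 3: new index
            new_buckets[new_index].append((key, value))
    return new_buckets
```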
3. Implementation (Chaining)
Here is how you might build a simple Hash Map from scratch using chaining.
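One minimal version, with the 0.75 load-factor threshold and doubling rehash described above, might look like this. The class name and default capacity are arbitrary choices for the sketch:

```python
# A simple chaining hash map: each bucket is a Python list of (key, value)
# pairs, and the table rehashes when the load factor exceeds 0.75.

class ChainedHashMap:
    def __init__(self, capacity=8):
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0

    def _index(self, key):
        return hash(key) % len(self._buckets)

    def put(self, key, value):
        bucket = self._buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))       # collision? just append to the chain
        self._size += 1
        if self._size / len(self._buckets) > 0.75:  # α > 0.75 → grow
            self._rehash()

    def get(self, key):
        for k, v in self._buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key):
        bucket = self._buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket.pop(i)  # deletion is a plain list removal
                self._size -= 1
                return
        raise KeyError(key)

    def __len__(self):
        return self._size

    def _rehash(self):
        old = self._buckets
        self._buckets = [[] for _ in range(2 * len(old))]
        for bucket in old:
            for k, v in bucket:
                # _index now uses the doubled size, so entries move buckets.
                self._buckets[self._index(k)].append((k, v))
```

Notice how the trade-offs from earlier show up directly: insertion and deletion are simple list operations, but every chain traversal follows pointers to list entries scattered across memory.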