Hash Functions & Collisions

[!NOTE] This module explores the core principles of Hash Functions & Collisions, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. The Hook: Instant Retrieval

How do we find data instantly without scanning every element? Answer: Deterministic Mapping.

We use a Hash Function to convert a “Key” (String, Object) into an “Index” (Integer). This allows us to jump directly to the memory address where the data is stored. It is the “teleportation” of data structures.


2. The 4 Pillars of Hashing Mastery

Pillar Focus Why it matters
1. Intuition Space-Time Tradeoff Memory is cheap; time is expensive.
2. Rigor Birthday Paradox Understanding why collisions happen so early (O(√N)).
3. Hardware Open Addressing Exploiting CPU caches with linear probing vs. pointer chasing.
4. Patterns Sliding & Summing Solving Subarray Sums and String Search in O(N).

3. The Core Concept

Imagine a library where instead of searching for a book alphabetically (O(log N)), you have a magic formula. You feed the book title into the formula, and it tells you exactly which shelf and slot the book is in.


4. Interactive: Hash Table Visualizer

See how keys map to buckets in a small hash table (Size 8).


5. The Mathematical Anchor: The Birthday Paradox

Why do we need collision handling so early? If we have 1000 buckets, shouldn’t we be safe until we have 1000 items? No.

The Probability Trap

In a room of 23 people, there is a 50% chance that two people share a birthday.

  • There are 365 days (buckets).
  • We only have 23 people (items).
  • Intuition (Pigeonhole) says we are safe. Math (Probability) says we are colliding.

For Hash Tables

The number of items k needed to find a collision with probability P in N buckets is approximately: k ≈ √(2N ln(1/(1-P)))

For a table of size M, you can expect a collision after roughly √M insertions.

  • For M=1,000, collisions start around 32 items.
  • This is why we use a Load Factor (typically 0.75) and why robust collision resolution isn’t “nice to have”—it’s a mathematical requirement.

6. The Collision Problem

The Pigeonhole Principle states that if you have N items and M buckets, and N > M, at least one bucket must contain more than one item. Since the universe of keys is infinite and memory is finite, collisions are inevitable.

Collision Resolution Strategies

  1. Separate Chaining: Each bucket holds a linked list. If a collision happens, append to the list. (Most common).
  2. Open Addressing: If the bucket is full, find the next available slot (Linear/Quadratic probing).

7. Implementation: Standard Map Usage

Java (HashMap)

Map<String, Integer> map = new HashMap<>();

// Insert: O(1) Average
map.put("Alice", 25);

// Lookup: O(1) Average
if (map.containsKey("Alice")) {
  int age = map.get("Alice");
}

Go (map)

m := make(map[string]int)

// Insert: O(1) Average
m["Alice"] = 25

// Lookup: O(1) Average
age, exists := m["Alice"]

8. Summary & Complexity

  • Time Complexity: Average O(1), Worst Case O(N) (if all keys collide).
  • Hash Qualities: Deterministic, Uniformly Distributed, Fast.
  • Load Factor: Usually 0.75. Beyond this, the table resizes (doubles) to maintain O(1) performance.