Bloom Filters
[!TIP] The Problem: Checking if an item exists in a massive set (1 Billion items) is slow (Disk I/O) or expensive (RAM). The Solution: A data structure that uses very little memory but gives you a Probabilistic answer. “Is X in the set?” → “No” (100% Certain) OR “Maybe” (Small chance of error).
1. How it Works
A Bloom Filter is built around a Bit Array of size m and k independent Hash Functions, as visualized in the pipeline above.
Operations
- Insertion: To add an item (e.g., “Apple”), we pass it through all k hash functions. Each function outputs an index (e.g., Bit 2, Bit 5, Bit 9). We set the value at those indices to
1. - Lookup: To check if “Apple” exists, we re-hash it to find the same indices.
- Yes: If all associated bits are
1, the item Maybe exists. - No: If any bit is
0, the item Definitely Not exists.
- Yes: If all associated bits are
2. False Positives
[!NOTE] War Story: The Chrome Safe Browsing Dilemma Google Chrome uses a local Bloom Filter to check if a URL is potentially malicious before making an expensive network request to the Safe Browsing API. Early on, they realized that keeping an exact copy of millions of bad URLs in memory on every client device was impossible. Instead, they store a tiny, 2MB Bloom Filter. When you visit a site, it hashes the URL. If the filter says “Maybe”, Chrome makes the network call to verify. If it says “No”, it loads the page immediately. This architecture reduced network overhead by 99% while ensuring complete security and keeping Chrome’s memory footprint low.
Why “Maybe”?
- Imagine we add “Banana”. It sets bits
2, 5, 8. - Imagine we add “Car”. It sets bits
9. - Now we check “Apple” (
2, 5, 9). All bits are1! - But “Apple” was never added. This is a False Positive.
Note: You can never have a False Negative. If the bit is 0, it’s 0.
Interactive Demo: The Collision Visualizer
Visualize False Positives.
- Add “Cat” (blue).
- Add “Dog” (blue).
- Check “Fox” (orange). If it collides with existing bits, you get a “False Positive”.
3. Sizing and Math
You can tune the error rate (p) by changing the size (m) and number of hashes (k).
- More Bits (m) = Lower Error.
- More Hashes (k) = Slower, but better precision (up to a point).
- Formula: m = - (n ln p) / (ln 2)2 where n is number of items.
Interactive Sizing Calculator
Estimate your memory usage for a production system.
4. Counting Bloom Filters (CBF)
Standard Bloom Filters do not support Deletion. If you delete “Apple” by setting bits 2, 5, 9 back to 0 (as seen in our diagram), you might accidentally delete “Banana” if it also shared one of those bits.
Solution: Counting Bloom Filter
Instead of the Bit Array (0/1) shown in the visual, a Counting Bloom Filter uses an Integer Array of counters.
- Add: Increment counters (
2: 1->2,5: 1->2). - Delete: Decrement counters (
2: 2->1,5: 2->1). - Trade-off: Takes up 3-4x more space (8 bits per counter vs 1 bit).
| Feature | Standard Bloom Filter | Counting Bloom Filter |
|---|---|---|
| Storage | Bit Array (1 bit/entry) | Integer Array (4-8 bits/entry) |
| Operation | Set Bit to 1 | Increment Counter |
| Deletion | Impossible | Supported (Decrement) |
| Space | Minimal | 4x Larger |
5. Walkthrough: The Math (Dry Run)
Let’s check “Apple” against a small filter.
- Size (m): 10 bits.
- Hashes (k): 2.
Step 1: Insertion (“Apple”)
- Hash 1 (“Apple”) % 10 = 3.
- Hash 2 (“Apple”) % 10 = 7.
- Action: Set Bit 3 and Bit 7 to
1. - Array:
[0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
Step 2: Insertion (“Banana”)
- Hash 1 (“Banana”) % 10 = 3 (Collision!).
- Hash 2 (“Banana”) % 10 = 9.
- Action: Set Bit 3 (already 1) and Bit 9 to
1. - Array:
[0, 0, 0, 1, 0, 0, 0, 1, 0, 1]
Step 3: Check (“Cherry”)
- Hash 1 (“Cherry”) % 10 = 3. (Bit 3 is 1).
- Hash 2 (“Cherry”) % 10 = 9. (Bit 9 is 1).
- Result: “Maybe”.
- Reality: FALSE POSITIVE. Cherry was never added, but its hashes collided with Apple and Banana.
6. Requirements Traceability Matrix
| Requirement | Architectural Solution |
|---|---|
| Disk IO Reduction | Bloom Filter checks RAM before touching Disk. |
| Low Memory | Bit Array uses ~10 bits per item (1% error rate). |
| Speed | Constant Time O(k) for insertion and lookup. |
| Deletion Support | Counting Bloom Filter (Trade-off: More RAM). |
7. Interview Gauntlet
- Can you resize a Bloom Filter?
- No. Hashing depends on modulo m. If m changes, all indices change. You must rebuild it from scratch.
- What is a False Negative?
- Saying an item is NOT in the set when it IS. Bloom Filters never do this.
- Why not use a HashSet?
- Memory. Storing “Apple” takes 5 bytes. A Bloom Filter takes 5 bits. For 1 Billion items, the difference is GBs vs MBs.
- Optimal number of hash functions?
- k = (m/n) × ln 2. Usually around 7 for 1% error.
- Where is this used in real life?
- Cassandra/HBase: Avoid reading SSTables.
- Chrome: Malicious URL check (local filter before calling server).
- CDN: Cache Summary (What URLs does this cache node have?).
8. Summary: The Whiteboard Strategy
1. Core Concepts
- Probabilistic: Trade accuracy for space.
- Bit Array: Dense storage (0/1).
- Hashing: k independent functions.
- One-Way: Can't retrieve items.
2. Operations
* False Positive: Unlucky collision.
3. Use Cases
4. Constraints
- No Deletion: Requires Counting BF.
- Fixed Size: Cannot grow dynamically (Scalable BF exists but complex).
- Saturation: As it fills, error rate approaches 100%.