Caching Strategies
In 2013, Justin Bieber posted a tweet and celebrity.twitter.com returned 503 errors for 20 minutes. The root cause: a single Redis key for his follower count expired. All 1.2 million followers simultaneously fetched the page, all got a cache miss, and all queried the database simultaneously — a Cache Stampede. Twitter engineers called this the “Celebrity Problem.” The fix required two lines of code: IF TTL < 5s: async_refresh(). Caching seems trivial until it fails at scale. Then it becomes the highest-leverage fix in your entire system.
[!IMPORTANT] In this lesson, you will master:
- The Memory Hierarchy: Why L1 is 200x faster than RAM and 100,000x faster than SSD.
- Temporal & Spatial Locality: The physical reason why “for r in rows: for c in cols” is faster than the reverse.
- The 3 Demons: Defeating Cache Penetration, Avalanche, and Stampede with Mutexes and Jitter.
1. The “Open Book Exam” Analogy
Imagine you are taking a difficult Open Book Exam.
- Scenario A (No Cache): For every question, you open the textbook, look up the index, find the page, and read the paragraph. This takes 5 minutes per question.
- Scenario B (With Cache): After finding an answer, you write it on a “Cheat Sheet” next to you. If the same question appears again, you glance at the cheat sheet. This takes 5 seconds.
Caching is your Cheat Sheet. It stores the result of an expensive operation (Textbook Search / DB Query) in a temporary, fast storage layer (Cheat Sheet / RAM) so you don’t have to repeat the work.
2. Why Caching Works: Latency Numbers
To understand why we cache, we must look at the “Latency Ladder”. The difference between Memory (RAM) and Disk/Network is astronomical.
[!TIP] Try it yourself: Click “Switch to Human Scale” to see how vast the difference really is.
The Latency Ladder
Takeaway: Reading from RAM (Cache) is orders of magnitude faster than reading from Disk (Database) or Network. A cache hit is the difference between a website feeling “snappy” and “sluggish”.
Effective Latency Calculation
Even a small improvement in Hit Ratio can massively drop effective latency.
[!TIP] Try it yourself: Adjust the sliders to see how Cache Hit Ratio impacts the total latency.
Effective Latency Calculator
Effective Latency = (HitRate × CacheLatency) + (MissRate × DBLatency)
Worked example: with a 90% hit ratio, a 1 ms cache, and a 100 ms database, effective latency is 0.9 × 1 ms + 0.1 × 100 ms ≈ 10.9 ms. Raise the hit ratio to 99% and it drops to roughly 2 ms.
3. The Hardware Reality: Cache Locality
Software engineers often forget that hardware matters. Caching isn’t just about key-value stores like Redis; it happens inside the CPU (L1/L2/L3).
CPU Cache relies on Locality of Reference:
- Temporal Locality: If you access data at address X, you will likely access X again soon. (Loop variables).
- Spatial Locality: If you access data at address X, you will likely access X+1 soon. (Arrays).
The 2D Array Trap
Traversing a matrix in the “wrong” direction can be 10x slower due to Cache Misses (breaking Spatial Locality).
import time
import numpy as np
# Create a large matrix (10k x 10k)
size = 10000
matrix = np.ones((size, size), dtype=np.int8)
# Case A: Row-Major (Good Spatial Locality)
# NumPy lays the matrix out row by row: [Row1][Row2][Row3]...
start = time.time()
sum_a = 0
for r in range(size):
    for c in range(size):
        sum_a += matrix[r][c]  # Access column c, then c+1: neighbours in memory
print(f"Row-Major Time: {time.time() - start:.4f}s")
# Case B: Column-Major (Bad Spatial Locality)
# Each inner step jumps a full row (10,000 bytes) ahead in memory
start = time.time()
sum_b = 0
for c in range(size):
    for r in range(size):
        sum_b += matrix[r][c]  # Jumping memory rows constantly!
print(f"Col-Major Time: {time.time() - start:.4f}s")
Result: Row-Major is significantly faster because it hits the CPU Cache line.
[!NOTE] Hardware-First Intuition: The CPU doesn’t move data byte by byte; it moves it in Cache Lines (typically 64 bytes). When you read matrix[0][0], the hardware pulls the surrounding 64-byte line into the L1 cache for “free”. If you access matrix[0][1] next, it’s an instant L1 Hit. But if you jump to matrix[1][0], you force the CPU to fetch a completely new 64-byte line from slow RAM.
4. Where to Cache? (Layers of Caching)
Caching is not just one thing. It happens at every layer of the request lifecycle.
- Browser Cache: Cache-Control headers tell Chrome to store images locally. (Zero Network). See the header sketch after this list.
- CDN (Content Delivery Network): Cloudflare/AWS CloudFront stores static assets close to the user. (Short Network).
- Load Balancer: Nginx/HAProxy can cache entire HTML pages. See Load Balancing.
- Application Cache: Redis/Memcached stores expensive query results. (Fast Network).
- Database Cache: Postgres has an internal Buffer Pool (Shared Buffers) to keep hot rows in RAM. See Database Basics.
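To make the Browser/CDN layers concrete, here is a minimal sketch of setting a Cache-Control header from application code. It assumes a Flask app (Flask and the /logo.png route are illustrative only, not part of this lesson); any web framework exposes the same header.
from flask import Flask, send_file, make_response
app = Flask(__name__)
@app.route("/logo.png")
def logo():
    resp = make_response(send_file("static/logo.png"))  # hypothetical static asset
    # "public" lets shared caches (CDNs) store the response; max-age tells the
    # browser to reuse its local copy for an hour without re-fetching
    resp.headers["Cache-Control"] = "public, max-age=3600"
    return resp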
5. Implementation: Cache-Aside vs Read-Through
A. Cache-Aside (Lazy Loading) - Most Common
The Application is responsible for orchestrating the flow. The cache is “dumb”.
- Read: App asks Cache.
- Miss: If Cache is empty, App asks Database.
- Set: App writes the DB result into Cache.
Pros: Resilient to cache failure (app just hits DB). Cons: Code complexity (logic in app), potential for stale data.
B. Read-Through
The Cache sits between the App and the DB. The App treats the Cache as the main data store.
- Read: App asks Cache.
- Miss: Cache (not App) fetches from DB, updates itself, and returns data to App.
Pros: Simpler app code. Cons: Requires a smarter cache (plugin/provider support), single point of failure.
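To make the difference in ownership concrete, here is a minimal in-process sketch of Read-Through. Everything in it is illustrative: a plain dict stands in for a real read-through provider, and loader is a hypothetical callback that knows how to fetch from the DB.
import time
class ReadThroughCache:
    def __init__(self, loader, ttl=60):
        self.loader = loader      # hypothetical DB-fetch callback, e.g. lambda key: db.query(key)
        self.store = {}           # in-process dict standing in for Redis/Memcached
        self.expires = {}
        self.ttl = ttl
    def get(self, key):
        now = time.time()
        if key in self.store and self.expires[key] > now:
            return self.store[key]       # Hit: the app never sees the DB
        value = self.loader(key)         # Miss: the CACHE (not the app) calls the DB
        self.store[key] = value
        self.expires[key] = now + self.ttl
        return value
The application only ever calls cache.get(key); on a miss the cache itself runs the loader. That inversion of responsibility is the whole difference from Cache-Aside.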
Python Decorator Example (Cache-Aside)
import redis
import json
import logging
from functools import wraps
# Connect to Redis
cache = redis.Redis(host='localhost', port=6379, db=0)
def cached(ttl_seconds=60):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Build the key from positional args (kwargs ignored for simplicity)
            key = f"{func.__name__}:{str(args)}"
            # 1. Check Cache (Read)
            try:
                result = cache.get(key)
                if result is not None:
                    return json.loads(result)
            except redis.RedisError as e:
                # FAIL OPEN: If cache is down, just hit the DB. Don't crash.
                logging.error(f"Cache Read Error: {e}")
            # 2. Compute/Fetch (DB Call)
            val = func(*args, **kwargs)
            # 3. Set Cache (Write)
            try:
                if val is not None:
                    cache.setex(key, ttl_seconds, json.dumps(val))
            except redis.RedisError as e:
                logging.error(f"Cache Write Error: {e}")
            return val
        return wrapper
    return decorator
@cached(ttl_seconds=10)
def get_user_profile(user_id):
    # Simulate DB Call
    print(f"Fetching User {user_id} from DB...")
    return {"id": user_id, "name": "Alice", "role": "Admin"}
6. Interactive Demo: Cache Hit vs Miss & DB Load
Experience the latency difference and the impact on your Database.
- Cache Hit (Green): Data found in Memory. Latency ~1ms. DB Load: 0%.
- Cache Miss (Red): Data not found. Must fetch from DB. Latency ~100ms. DB Load Increases.
- Capacity: Watch the bar fill up. When full, we need an eviction policy.
[!TIP] Try it yourself: Enter a key (e.g., “A”) and click FETCH. Do it again to see a Cache Hit.
7. The Three Cache Demons
When designing large-scale systems, watch out for these three failures:
A. Cache Penetration
- The Problem: A user keeps asking for a key that doesn’t exist in the DB (e.g., id=-1).
- The Impact: The cache misses, the DB misses. If 10,000 bots do this, your DB dies.
- The Fix (see the sketch after this list):
  - Cache Nulls: If the DB returns null, cache that null for a short time (e.g., 30s).
  - Bloom Filters: A fast probabilistic check to see if the id might exist before hitting the cache/DB.
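A minimal sketch of the Cache Nulls fix, reusing the Redis cache client from the decorator example. The db_find_user helper is hypothetical; the point is that a confirmed miss gets cached briefly, so repeated requests for fake keys never reach the DB.
import json
NULL_MARKER = json.dumps(None)  # lets us tell "cached nothing" apart from "not cached at all"
def get_user(user_id, ttl=60, null_ttl=30):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                # may legitimately be None (a cached miss)
    row = db_find_user(user_id)                  # hypothetical DB lookup
    if row is None:
        cache.setex(key, null_ttl, NULL_MARKER)  # cache the miss for a short time
        return None
    cache.setex(key, ttl, json.dumps(row))
    return row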
B. Cache Avalanche
- The Problem: Thousands of cache keys expire at the exact same time (e.g., you rebooted the server at midnight).
- The Impact: Suddenly, requests fall through to the DB simultaneously.
- The Fix: Add Jitter (Randomness) to TTL, as in the sketch after this list.
  - Bad: TTL = 3600s (Everyone expires at 1:00 PM).
  - Good: TTL = 3600s + random(0, 300s) (Expiry spread between 1:00 PM and 1:05 PM).
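A minimal sketch of TTL jitter, again reusing the Redis client from earlier:
import random
import json
def set_with_jitter(key, value, base_ttl=3600, max_jitter=300):
    # Spread expirations so keys written together don't all expire together
    ttl = base_ttl + random.randint(0, max_jitter)
    cache.setex(key, ttl, json.dumps(value))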
C. Thundering Herd vs Cache Stampede
These terms are often used interchangeably, but there is a nuance:
- Thundering Herd: Happens when many processes (clients) wake up simultaneously to compete for a resource (like a lock).
- Cache Stampede: Specifically refers to the event where a Hot Key expires, and thousands of parallel requests hit the Database to recompute it.
- The Impact: 10,000 users try to fetch “Justin Bieber’s Profile”. All get a “Miss”. All hit the DB.
- The Fix:
- Mutex Lock: Only allow ONE thread to rebuild the cache for that key. Others wait.
- Refresh-Ahead: Update the cache in the background before it expires.
Deep Dive: Mutex Lock with Redis SETNX
The problem: When 1000 threads all get a cache miss simultaneously, they all try to recompute the expensive value and set it in cache. This creates a stampede on your database.
Solution: Use a distributed lock so only ONE thread recomputes. Others wait.
import redis
import time
cache = redis.Redis()
def get_with_mutex(key, ttl=60, lock_timeout=10):
    # Try cache first
    value = cache.get(key)
    if value is not None:
        return value.decode()  # redis-py returns bytes; decode for consistency
    # Cache miss - try to acquire lock
    lock_key = f"lock:{key}"
    lock_acquired = cache.setnx(lock_key, "1")
    if lock_acquired:
        # WE won the race! Compute the value.
        cache.expire(lock_key, lock_timeout)  # Prevent deadlock if this thread crashes
        try:
            # Expensive computation
            result = expensive_db_query(key)
            cache.setex(key, ttl, result)
            return result
        finally:
            # Release lock
            cache.delete(lock_key)
    else:
        # Someone else is computing. Wait and retry.
        time.sleep(0.1)
        return get_with_mutex(key, ttl, lock_timeout)
def expensive_db_query(key):
    # Simulate slow DB
    time.sleep(2)
    return f"Data for {key}"
How SETNX works:
- SETNX = “SET if Not eXists”
- Returns 1 if the key didn’t exist (lock acquired)
- Returns 0 if the key already exists (someone else has the lock)
- Atomic operation: no race condition possible
Critical: Always set an expiration on the lock key (expire) to prevent deadlock if the thread crashes.
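Note that there is a small gap between setnx and expire above: if the process crashes in between, the lock never expires. redis-py can combine both into one atomic command; a sketch using the same cache, lock_key, and lock_timeout as above:
# Atomic "set if not exists" WITH a TTL, in a single Redis SET command
lock_acquired = cache.set(lock_key, "1", nx=True, ex=lock_timeout)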
Alternative: Refresh-Ahead Pattern
Instead of waiting for expiration, refresh the cache proactively.
import threading
def get_with_refresh_ahead(key, ttl=60, refresh_threshold=0.8):
    value = cache.get(key)
    ttl_remaining = cache.ttl(key)  # seconds left; negative if the key is missing
    if value is not None:
        # Check if we're in the "danger zone": less than 80% of the TTL remains
        if ttl_remaining < (ttl * refresh_threshold):
            # Trigger async refresh (non-blocking); a production version would
            # also take a lock so only one thread refreshes
            threading.Thread(
                target=async_refresh,
                args=(key, ttl)
            ).start()
        return value.decode()
    # Cache miss - compute synchronously
    result = expensive_db_query(key)
    cache.setex(key, ttl, result)
    return result
def async_refresh(key, ttl):
    result = expensive_db_query(key)
    cache.setex(key, ttl, result)
Trade-off:
- [+] Pros: No stampede, users never wait
- [!] Cons: Wasted computation if key is rarely used
Best Practice: Use refresh-ahead for hot keys (high traffic). Use mutex for normal keys.
[!WARNING] Pro Tip: Always set a TTL (Time To Live). A cache without expiration is just a memory leak waiting to happen.
Staff Engineer Tip: When caching, consider the Write Tax on SSDs. Frequent cache updates can contribute to Write Amplification, wearing out physical flash memory cells. If your cache has a very short TTL (e.g., 1 second) and high write volume, it might be better to keep it entirely in Volatile RAM (e.g., Redis with persistence disabled) rather than writing it through to a persistent block store.
Mnemonic for the 3 Demons — “PAS”: Penetration (fake keys → fix with Bloom Filter or cache nulls), Avalanche (mass expiry → fix with TTL Jitter), Stampede (hot key expires → fix with Mutex/Refresh-Ahead). If you can recite PAS fixes from memory, you’ve nailed the hardest caching question in any system design interview.