Write Through vs Write Back

In 2015, an AWS region suffered a power spike that took down a cluster of EC2 instances mid-write. A gaming company lost 8 hours of player progress for 400,000 users: their leaderboard service used Write-Back caching with a 1-hour sync interval. Social media exploded. The engineering post-mortem mandated a maximum Write-Back sync interval of 5 minutes, backed by AOF persistence on Redis. Conversely, that same year, a payments company switched from Write-Back to Write-Through after a near-miss incident in which unsynced balance updates sat in RAM for 3 seconds. The difference between these two strategies isn't just performance; it's the risk profile of your entire data layer.

[!IMPORTANT] In this lesson, you will master:

  1. The Consistency Trade-off: Choosing between the “Slow & Steady” (Through) and “Fast & Risky” (Back).
  2. Write Coalescing: How buffering 1000 updates in RAM and flushing once saves your physical disk.
  3. Hardware Intuition: Understanding battery-backed RAID controllers and the power of fsync().

1. The Write Problem

Reading from a cache is simple: check cache, if miss, check DB. Writing is harder. You have two copies of data (Cache and DB). How do you keep them in sync?


2. The Four Strategies

A. Write-Through (The “Safe” Way)

  • How it works: The application writes to the Cache AND the Database synchronously. The write is confirmed only when both succeed.
  • Pros:
      • Strong Consistency: Cache and DB are always identical. See ACID Transactions.
      • Reliability: No data loss if the cache crashes.
  • Cons:
      • High Latency: You pay the DB write penalty on every single request.
      • Double Write: Every write still hits the DB, so it doesn't reduce write load.
      • Atomic Failure: If the app writes to the DB but fails to update the Cache, they become inconsistent. Best solved via the Transactional Outbox pattern or CDC.
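The flow above can be sketched in a few lines. This is a minimal illustration, not a real library API: plain dicts stand in for the cache and the database, and the names `write_through` and `read` are invented for the example.

```python
# Write-Through sketch: both stores are updated synchronously.
cache = {}   # stand-in for Redis/Memcached (RAM)
db = {}      # stand-in for the database (disk)

def write_through(key, value):
    db[key] = value      # 1. durable DB write first (the slow part)
    cache[key] = value   # 2. then update the cache
    # If step 2 fails after step 1 succeeds, cache and DB diverge:
    # the "Atomic Failure" con described above.

def read(key):
    if key in cache:          # hit: served from RAM
        return cache[key]
    value = db.get(key)       # miss: fall back to the DB
    if value is not None:
        cache[key] = value    # populate for next time
    return value

write_through("balance:42", 100)
```

After the write, both copies agree by construction; the cost is that every write waits on the slower of the two stores.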

B. Write-Back / Write-Behind (The “Fast” Way)

  • Terminology: These terms are often used interchangeably. Write-Back usually refers to the CPU/hardware cache policy, while Write-Behind refers to software patterns (databases, Redis).
  • How it works: The application writes only to the Cache. The Cache returns "Success" immediately and asynchronously syncs the data to the DB later (e.g., every 5 seconds, or when the item is evicted).
  • Pros:
      • Low Latency: Write speed = RAM speed (~100ns).
      • Write Coalescing: If you update a counter 100 times in 1 second, the DB only sees ONE write (the final value). This massively reduces DB pressure.
  • Cons:
      • Data Loss Risk: If the Cache crashes before syncing, that data is gone forever.
      • Complexity: Implementing this correctly is hard (it requires a queue or a write-ahead log).
  • Decoupling Solution: Instead of the app handling the write-back, use Change Data Capture (CDC). The app writes only to the DB, and a background process (e.g., Debezium) "tails" the DB logs and updates the cache. You keep asynchronous cache updates without buffering unsynced writes in the app layer.
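Write coalescing is easiest to see in code. The sketch below (illustrative names; a real implementation would flush on a timer and persist the buffer) keeps dirty keys in RAM and pushes only the latest value of each to the DB in a single batch:

```python
import threading

# Write-Behind sketch with coalescing: writes land in an in-RAM
# buffer; flush() (normally driven by a timer, e.g. every 5s, or by
# eviction) pushes the LATEST value of each dirty key in one batch.
db = {}
dirty = {}                   # pending writes: key -> latest value
lock = threading.Lock()

def write_behind(key, value):
    with lock:
        dirty[key] = value   # overwriting coalesces older pending values

def flush():
    with lock:
        pending = dict(dirty)
        dirty.clear()
    db.update(pending)       # one batched DB write
    return len(pending)      # how many DB writes actually happened

for i in range(1, 101):      # 100 rapid updates to one counter...
    write_behind("likes:post1", i)

db_writes = flush()          # ...coalesce into a single DB write
```

Note the risk baked into the design: everything in `dirty` is lost if the process dies before `flush()` runs.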

[!NOTE] Hardware-First Intuition: The "Physical Safety Net". Professional servers often have a Battery-Backed Write Cache (BBWC) on their RAID controllers. When the OS performs a "Write-Back", the data lands in this hardware RAM. The controller ACKs the write instantly (nanoseconds), and even if the building loses power, the battery keeps the RAM alive long enough for the controller to flush the data to disk as soon as power returns. This gives you Write-Back speed with Write-Through safety.

C. Write-Around (The “Big Data” Way)

  • How it works: Write directly to the DB, bypassing the cache. The cache is only populated when the data is read (on a Miss).
  • Use Case: Writing massive data (e.g., Log files, Video uploads) that won’t be read immediately. Prevents the cache from being flooded with useless data (Cache Pollution).
  • Trade-off: Read latency for recently written data is high (Cache Miss).
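A minimal sketch of the same idea (dicts as stand-ins, names invented for the example): the write never touches the cache, so the first read pays a miss and populates it.

```python
# Write-Around sketch: writes bypass the cache entirely; the cache
# fills only when the key is later read.
cache = {}
db = {}

def write_around(key, value):
    db[key] = value               # straight to the DB; cache untouched

def read(key):
    if key in cache:
        return cache[key]
    value = db.get(key)           # first read after a write is a MISS
    if value is not None:
        cache[key] = value        # only now does the cache populate
    return value

write_around("log:123", "GET /index 200")
was_cold = "log:123" not in cache   # True: the write left the cache empty
value = read("log:123")             # miss -> DB -> cache populated
```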

D. Refresh-Ahead

  • How it works: If a cached item is accessed and is close to expiring (e.g., within 10 seconds of TTL), the cache automatically refreshes the data from the DB in the background.
  • Benefit: The next user never sees a cache miss or latency spike. Excellent for preventing Thundering Herd.
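A sketch of Refresh-Ahead under the 10-second window from the text. All names (`load_from_db`, `get`, `refresh`) are hypothetical; a production version would also deduplicate concurrent refreshes.

```python
import time
import threading

# Refresh-Ahead sketch: a hit within REFRESH_WINDOW seconds of expiry
# triggers a background reload, so the next caller never sees a miss.
TTL = 60
REFRESH_WINDOW = 10
cache = {}                       # key -> (value, expires_at)

def load_from_db(key):
    return f"row-for-{key}"      # pretend DB query

def refresh(key):
    cache[key] = (load_from_db(key), time.time() + TTL)

def get(key):
    entry = cache.get(key)
    now = time.time()
    if entry and entry[1] > now:
        value, expires_at = entry
        if expires_at - now < REFRESH_WINDOW:
            # Near expiry: refresh in the background and return the
            # current value instantly (no latency spike for this caller).
            threading.Thread(target=refresh, args=(key,)).start()
        return value
    refresh(key)                 # cold miss: synchronous load
    return cache[key][0]
```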

Visual Data Flow

  • Write-Through: 📱 App → Cache + DB. Wait for BOTH to confirm.
  • Write-Back: 📱 App → Cache (instant ACK) → DB (async). Fast ACK, lazy sync.
  • Write-Around: 📱 App → DB only (skip cache). Good for Logs.

3. Decision Tree: Which one to choose?

[!TIP] Try it yourself: Answer the questions to find the perfect write strategy for your use case.

Is Data Loss Acceptable? (Can you afford to lose the last 5 seconds of data if the server crashes?)

  • NO → 🛡️ Write-Through: Strong Consistency. Zero Data Loss. Slower writes, but safe for financial data.
  • YES → Is the system Write-Heavy? (Thousands of writes per second?)
      • YES → Write-Back: Maximum Performance. Uses Write Coalescing to reduce DB load. Risk of data loss on crash.
      • NO → 📦 Write-Around: Avoids Cache Pollution. Writes go directly to the DB. Good for large files or logs that are rarely read immediately.

4. Interactive Demo: The Power of Write Coalescing

This demo visualizes the massive advantage of Write-Back: Coalescing.

  1. Select Write-Back.
  2. Rapidly click WRITE (+1). Notice the “Pending Updates” count increases in RAM.
  3. Observe that the Database is NOT touched for every click.
  4. Wait for the Async Flush to see one single big write to the DB.
  5. Try Write-Through to see the pain of slow DB writes.

[!TIP] Try it yourself: Select “Write-Back” and click “WRITE” rapidly. Watch the “Pending Updates” count. Then click “CRASH SERVER” to see data loss.


5. Cache Warming: The Cold Start Problem

When you deploy a new cache server (or restart an existing one), the cache is empty. Every request is a MISS, and your database gets hammered. This is called a Cold Start.

Strategies

A. Lazy Loading (Default - Reactive)

  • How: Cache is empty. Wait for users to request data. Populate cache on MISS.
  • Simple: No special code needed
  • Slow Start: First users experience high latency
  • DB Spike: Database gets hit hard during cold start
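The DB spike is easy to demonstrate. In this sketch (illustrative names; `db` stands in for a real datastore), every first read during the cold start hits the database; once warm, the same keys cost nothing:

```python
# Lazy-loading (cache-aside) sketch showing the cold-start DB spike.
db = {f"user:{i}": {"name": f"user-{i}"} for i in range(5)}
cache = {}
db_hits = 0

def get_lazy(key):
    global db_hits
    if key in cache:             # warm: served from RAM
        return cache[key]
    db_hits += 1                 # cold: the DB takes the hit
    cache[key] = db[key]
    return cache[key]

for key in list(db):             # first pass: cold start, every read hits the DB
    get_lazy(key)
first_pass_hits = db_hits        # 5 DB hits for 5 keys

for key in list(db):             # second pass: all cache hits, zero DB load
    get_lazy(key)
```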

B. Eager Loading (Proactive)

  • How: Preload the cache with predictable data before taking traffic
  • Fast Start: Users get instant cache hits
  • Complexity: Requires knowing what to preload
  • Wasted Memory: May load data that never gets accessed

Implementation Example

Bulk Warming Script (Python + Redis):

import redis
import json
from concurrent.futures import ThreadPoolExecutor

cache = redis.Redis()
db = get_database_connection()  # Your DB

def warm_user_profiles():
  """Load top 10K active users into cache"""
  users = db.execute("""
    SELECT id, name, avatar
    FROM users
    ORDER BY last_active DESC
    LIMIT 10000
  """)

  # Batch insert with pipeline (reduces RTT)
  pipe = cache.pipeline()
  for user in users:
    key = f"user:{user['id']}"
    pipe.setex(key, 3600, json.dumps(user))  # 1hr TTL
  pipe.execute()
  print("✅ Warmed 10K user profiles")

def warm_product_catalog():
  """Load entire product catalog (static data)"""
  products = db.execute("SELECT * FROM products WHERE active = true")

  pipe = cache.pipeline()
  for product in products:
    key = f"product:{product['id']}"
    pipe.set(key, json.dumps(product))  # No TTL (static)
  pipe.execute()
  print("✅ Warmed product catalog")

# Run in parallel
with ThreadPoolExecutor(max_workers=5) as executor:
  executor.submit(warm_user_profiles)
  executor.submit(warm_product_catalog)

Production Deployment Pattern

Blue-Green Deployment with Cache Warming:

1. Deploy new server (Green)
2. Run warming script on Green (while still OFF traffic)
3. Wait for cache to populate (monitor hit ratio)
4. Flip traffic from Blue → Green
5. Keep Blue alive for 5min (rollback safety)
6. Decommission Blue

Validation:

# Check cache hit ratio before going live (you need both hits AND misses)
redis-cli INFO stats | grep -E 'keyspace_(hits|misses)'

# Target: hits / (hits + misses) > 0.80 before taking production traffic
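Computing the ratio from the `INFO stats` output takes a few lines. The sample text below mimics the two relevant lines of real INFO output; in practice you would feed in the actual command's stdout.

```python
# Parse `redis-cli INFO stats` output and compute the hit ratio.
sample_info = """keyspace_hits:8500
keyspace_misses:1500"""

def hit_ratio(info_text):
    stats = {}
    for line in info_text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            stats[key.strip()] = value.strip()
    hits = int(stats.get("keyspace_hits", 0))
    misses = int(stats.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else 0.0

ratio = hit_ratio(sample_info)   # 8500 / 10000 = 0.85, above the 80% bar
```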

What to Warm?

| Data Type | Strategy | Example |
|---|---|---|
| User Sessions | Don't warm (short-lived) | Login tokens expire quickly |
| Product Catalog | Warm fully | Static data, predictable access |
| Top Users | Warm top 1% | Celebrities, power users (80/20 rule) |
| Homepage Content | Warm fully | Everyone sees this |
| Old Articles | Don't warm | Low traffic, unpredictable |

Interview Insight: Facebook warms their edge caches with “trending posts” before routing traffic. This prevents cold-start slowness during global events.


6. Summary

| Strategy | Write Latency | Data Loss Risk | Best For |
|---|---|---|---|
| Write-Through | High (Slow) | Zero | Financial info, Passwords |
| Write-Back | Low (Fast) | High | Counters, Likes, Views |
| Write-Around | Low for the cache tier (every write pays a DB write) | Zero (DB is written directly) | Archival Data, Large Media |

Staff Engineer Tip: When using Write-Back, you are essentially creating Dirty Pages. In Linux, you can tune how aggressive the system is about flushing these via sysctl vm.dirty_ratio. A higher ratio allows more coalescing (less disk wear) but increases the “Blast Radius” of a sudden crash.

Mnemonic, "Through is Safe, Back is Fast": Write-Through = Trustworthy (both Cache + DB confirmed, zero data loss; the default for payments). Write-Back = Blazing speed (RAM only, async sync; the default for counters, metrics, likes). Ask: "Can I lose data?" NO → Through. YES, and write-heavy → Back. YES, but the data is rarely re-read → Around. Anything financial: always Through.