Write-Ahead Logging (WAL): The Immortal Log
In 2012, a major cloud provider’s database cluster lost 6 hours of customer data in a regional outage. Post-mortem: their storage engine used O_SYNC on the OS page cache but failed to account for the disk controller’s own write buffer: a separate volatile RAM cache on the physical drive, with neither battery nor capacitor backing. The fsync() call returned success, but the data was still sitting in the controller buffer when power failed. The fix: battery-backed RAID controllers, or NVMe drives with power-loss protection. WAL is not just a software concept. It is a contract involving the OS, the disk controller, and the physical storage medium, and understanding where each layer in this chain can lie determines whether your “committed” data actually survives a power cut.
[!IMPORTANT] In this lesson, you will master:
- Sequential I/O Speed: Why appending is 100x faster than updating in-place.
- The ARIES Algorithm: The industry standard for Analysis, Redo, and Undo recovery.
- Hardware Intuition: Moving from RAM buffer caches to the physical persistent platter via fsync().
1. The Golden Rule
“Never modify a data page in memory until you have written the change record to the log on disk.”
Why?
Imagine you update a user’s balance in RAM.
- Without WAL: Power fails. RAM is wiped. Data is lost.
- With WAL:
- Append “User A: +$10” to the Log File on Disk.
- Update RAM.
- Power fails.
- Reboot: Read the Log. “Ah, I see I need to add $10 to User A.” Replay the action.
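The steps above can be sketched in a few lines. This is a minimal toy, not any real engine’s API: the MiniWAL class, the JSON record format, and the credit/recover names are all illustrative.

```python
import json
import os

class MiniWAL:
    """Toy write-ahead log: append the change record first, then touch memory."""

    def __init__(self, path):
        self.path = path
        self.balances = {}            # in-memory state (wiped on power failure)
        self.log = open(path, "a")    # append-only log file on disk

    def credit(self, user, amount):
        # 1. Golden Rule: write the change record to the log BEFORE touching RAM.
        self.log.write(json.dumps({"user": user, "delta": amount}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())   # force the record to stable storage
        # 2. Only now is it safe to update the in-memory state.
        self.balances[user] = self.balances.get(user, 0) + amount

    def recover(self):
        # After a crash, RAM is empty; replay the log to rebuild state.
        self.balances = {}
        with open(self.path) as f:
            for line in f:
                rec = json.loads(line)
                self.balances[rec["user"]] = (
                    self.balances.get(rec["user"], 0) + rec["delta"]
                )
```

To simulate the power cut: make a few credits, clear `balances` (the “RAM wipe”), then call `recover()` and watch the balances come back from the log.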
2. Why WAL is Fast: Rotational Latency
You might ask: “If we have to write to disk anyway, why not just write to the main table?” The Answer: Sequential vs Random I/O.
- Table Update (Random I/O): Data pages are scattered across the disk. Each update forces the disk arm to seek to a different location.
- WAL Append (Sequential I/O): The log is just an “append-only” file. The disk arm never moves; the platter just spins under it. This is ~100x faster.
[!NOTE] Hardware-First Intuition: Think of an HDD like a record player. To play a specific song (Random I/O), you must pick up the needle and move it to a different radius (Seek Time), then wait for the disk to spin until the start of the song reaches the needle (Rotational Latency). WAL avoids the “Seek” entirely by keeping the needle down and streaming data continuously. This is why sequential throughput is measured in MB/s while random is measured in IOPS.
[!TIP] SSDs? Even on SSDs (which have no moving parts), WAL is beneficial because it reduces “Write Amplification” and extends the life of the drive by batching random writes.
3. The OS Lie: fsync and O_DIRECT
When you tell the OS to write to a file (write(fd, data)), the OS says “Okay, done!”.
It is lying.
The OS actually writes the data to its own RAM Buffer Cache (Page Cache) and lazily writes it to the physical disk seconds later. If the power fails, you lose data.
The Solution: fsync()
To guarantee Durability, databases must call a special system command: <span class="term-tooltip" tabindex="0" data-definition="File Sync: A system call that transfers all modified in-core data to the disk device.">fsync(fd)</span>.
This forces the OS to flush its buffer to the physical disk platter immediately.
- The Cost: fsync is incredibly slow (milliseconds vs nanoseconds). It blocks the thread until the disk confirms “I have it”.
- The Cheat Code (O_DIRECT): Advanced databases (like ScyllaDB or modern Postgres) use <span class="term-tooltip" tabindex="0" data-definition="Direct I/O: A flag that instructs the operating system to bypass the page cache and write directly to the disk.">O_DIRECT</span>. This bypasses the OS RAM cache entirely and talks directly to the disk controller, giving the database full control over when data is persisted.
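A minimal sketch of a durable append on a POSIX-style filesystem: write() alone leaves the bytes in the page cache, so we follow it with fsync() on the file, and, for newly created files, on the parent directory so the directory entry itself survives a crash. The durable_append name is hypothetical.

```python
import os

def durable_append(path, data: bytes):
    """Append data and force it through the OS page cache to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, data)   # lands in the OS page cache: NOT yet durable
        os.fsync(fd)         # block until the kernel reports it reached the disk
    finally:
        os.close(fd)
    # If the file was just created, the directory entry pointing at it is
    # also only in the page cache; fsync the parent directory as well.
    dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

The directory fsync is the step most homegrown WALs forget: without it, a crash can leave you with durable bytes in a file that no directory entry points to.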
4. Interactive Demo: Group Commit vs Sync Commit
Disk I/O is expensive. There is a fundamental trade-off between Throughput and Safety.
- Sync Commit (Strict): Call fsync for every transaction.
  - Pro: Zero data loss.
  - Con: Low TPS (Throughput). Every single write waits on a full disk round-trip.
- Group Commit (Fast): Buffer transactions in RAM and fsync once per batch (e.g., every 5 ms or 10 txns).
  - Pro: Massive TPS.
  - Con: If power fails, you lose the entire buffer (Window of Data Loss).
[!TIP] Try it yourself: Switch between Sync (Safe) and Group Commit (Fast) modes. Click “SIMULATE POWER CUT” to see if any data is lost in the RAM buffer.
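The trade-off can be made concrete with a toy buffer. GroupCommitLog is an illustrative sketch, not a real engine’s API; it counts fsync calls so the batching saving is visible, and its in-RAM buffer is exactly the “window of data loss”.

```python
import os

class GroupCommitLog:
    """Toy group commit: buffer records in RAM, fsync once per batch."""

    def __init__(self, path, batch_size=10):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.buffer = []              # the "window of data loss"
        self.batch_size = batch_size
        self.fsync_calls = 0          # instrumented to show the saving

    def commit(self, record: bytes):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        os.write(self.fd, b"".join(self.buffer))
        os.fsync(self.fd)             # ONE disk round-trip for the whole batch
        self.fsync_calls += 1
        self.buffer.clear()           # only now may the server ack these clients
```

Twenty-five commits with a batch size of 10 cost two fsyncs instead of twenty-five; a power cut before `flush()` loses everything still sitting in `buffer`.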
5. Deep Dive: ARIES Recovery Algorithm
So the log is on disk. The database crashes. Now what? We use ARIES (Algorithms for Recovery and Isolation Exploiting Semantics). It has 3 phases:
The 3 Phases of Recovery
Phase 1: Analysis (The “What happened?” phase)
The DB reads the WAL from the last Checkpoint forward.
- It builds a “Dirty Page Table” (Which pages were modified in RAM but not Disk?).
- It identifies “Loser Transactions” (Transactions that were active at crash time).
Phase 2: Redo (The “Repeating History” phase)
The DB replays the log forward from the oldest change recorded in the Dirty Page Table.
- It re-applies ALL changes (even for Loser Transactions).
- Why? This restores the database to the exact state of memory the millisecond before the crash. This is called “Repeating History”.
Phase 3: Undo (The “Rollback” phase)
The DB walks the log backward and reverses every change belonging to a Loser Transaction, writing a Compensation Log Record (CLR) for each undo so that a crash during recovery never repeats the same rollback.
- Result: Committed data is saved. Uncommitted data is rolled back. Atomicity is preserved.
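The three passes can be sketched over an in-memory log. This is a deliberately simplified model: the record shapes, the recover function, and the CLR format are illustrative, and real ARIES additionally tracks Dirty Page Tables and pageLSN comparisons that this sketch omits.

```python
def recover(log):
    """Toy ARIES-style recovery. Records are dicts (illustrative format):
    {"lsn": int, "tx": str, "type": "update"|"commit",
     "key": str, "old": value, "new": value}."""
    # Phase 1: Analysis — which transactions committed before the crash?
    committed = {r["tx"] for r in log if r["type"] == "commit"}
    losers = {r["tx"] for r in log if r["type"] == "update"} - committed

    # Phase 2: Redo — repeat history: re-apply EVERY update, even losers'.
    db = {}
    for r in log:
        if r["type"] == "update":
            db[r["key"]] = r["new"]

    # Phase 3: Undo — roll back losers in reverse LSN order, appending a
    # Compensation Log Record (CLR) for each undo so a second crash
    # mid-recovery never redoes the same rollback.
    for r in sorted(log, key=lambda r: r["lsn"], reverse=True):
        if r["type"] == "update" and r["tx"] in losers:
            db[r["key"]] = r["old"]
            log.append({"lsn": log[-1]["lsn"] + 1, "tx": r["tx"],
                        "type": "clr", "undoes": r["lsn"]})
    return db
```

Feed it a log where T1 committed and T2 did not: T1’s update survives, T2’s is undone, and the log gains a CLR recording the undo.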
Staff Engineer Tip: Physiological Logging. ARIES doesn’t just store “logical” changes (like UPDATE val=10). It stores Physiological logs. This means the log specifies the physical page affected but identifies the change logically within that page. Why? Because replaying a purely logical change on a row that moved due to internal fragmentation is impossible. Physiological logging is the perfect balance between high-performance physical logs and portable logical logs.
War Story: The “Infinite” Crash Loop At a high-frequency trading firm, a database crashed during peak market hours. Recovery started (Analysis, Redo, Undo) but the system crashed again during the Undo phase. If the database didn’t have ARIES’s concept of Compensation Log Records (CLRs), it would have kept attempting to undo the same transaction forever, entering an infinite crash loop. Because ARIES logs its own undo operations, the firm’s database cleanly recovered on the second reboot in seconds instead of hours.
6. LSN and Checkpoints
If the log grows forever, recovery would take 100 years. We need shortcuts.
LSN (Log Sequence Number)
Every entry in the WAL gets a unique, increasing ID (e.g., LSN: 100, LSN: 101).
- Data Page Header: Every data page on disk stores the pageLSN of the last change that touched it.
- The Check: If pageLSN >= logLSN, we know the page is already up to date. We can skip replaying that record.
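The pageLSN check can be sketched as a redo loop that skips records a page already reflects. The data shapes here are assumed purely for illustration.

```python
def redo_pass(pages, log):
    """Replay log records, skipping any a page has already seen.
    pages: {page_id: {"pageLSN": int, "data": ...}}  (toy shape)
    log:   [{"lsn": int, "page": page_id, "apply": callable}]"""
    replayed = []
    for rec in log:
        page = pages[rec["page"]]
        if page["pageLSN"] >= rec["lsn"]:
            continue                  # change already on disk: skip the replay
        rec["apply"](page)            # re-apply the change to the page
        page["pageLSN"] = rec["lsn"]  # stamp the page with this record's LSN
        replayed.append(rec["lsn"])
    return replayed
```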
Checkpointing (Taking a Snapshot)
Periodically, the DB forces all dirty pages from RAM to Disk.
- Block Writers: Stop all incoming writes (Stop the world).
- Flush: Write dirty pages to disk.
- Truncate: Delete old WAL files.
- Resume: Allow writes.
(This is the simple stop-the-world scheme; ARIES itself specifies fuzzy checkpoints that avoid pausing writers, but the mental model is the same.)
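The four steps map onto a short sketch (toy, stop-the-world version with illustrative data shapes; a real engine records the checkpoint in the WAL rather than clearing it wholesale).

```python
def checkpoint(buffer_pool, wal):
    """Toy stop-the-world checkpoint.
    buffer_pool: {page_id: {"dirty": bool, "data": ..., "on_disk": ...}}
    wal: list of {"lsn": int, ...} records, mutated in place."""
    # 1. Block Writers: implicit here, since this toy is single-threaded.
    # 2. Flush: every dirty page goes to stable storage.
    for page in buffer_pool.values():
        if page["dirty"]:
            page["on_disk"] = page["data"]
            page["dirty"] = False
    # 3. Truncate: records before this point are now redundant for recovery.
    checkpoint_lsn = wal[-1]["lsn"] if wal else 0
    wal.clear()
    # 4. Resume: writers continue; the next recovery starts at checkpoint_lsn.
    return checkpoint_lsn
```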
7. Case Study: Designing a Durable Ledger
Scenario: You are building a core banking ledger for a startup. Requirement: Zero data loss. High throughput is secondary.
The Design
- Storage Engine: Use an Append-Only Log (WAL) as the source of truth.
- Durability Level: fsync on every transaction.
  - Trade-off: Max 100-200 TPS per disk (limited by rotational latency).
- Optimization: Use Batching (Group Commit) but with a smart client:
  - Client sends transaction.
  - Server puts it in a buffer.
  - Server waits 5 ms or until 10 txns accumulate.
  - Server calls fsync.
  - Server replies “Success” to all clients in the batch.
- Result: If a crash happens mid-batch, clients never received Success, so they can retry safely. No data loss from the client’s perspective (idempotency required).
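The client contract hinges on idempotency keys. A toy sketch of the server side, with hypothetical names, showing that a retry of an already-committed key is deduplicated rather than applied twice:

```python
class LedgerServer:
    """Toy group-commit ledger with idempotency keys (hypothetical design)."""

    def __init__(self, batch_size=10):
        self.applied = {}        # idempotency_key -> result, checked on retry
        self.buffer = []         # pending, NOT yet acknowledged
        self.durable = []        # stands in for the fsync'd on-disk log
        self.batch_size = batch_size

    def submit(self, idempotency_key, txn):
        # A retried key that already committed must NOT be written again.
        if idempotency_key in self.applied:
            return self.applied[idempotency_key]
        self.buffer.append((idempotency_key, txn))
        if len(self.buffer) >= self.batch_size:
            self._flush()
            return "SUCCESS"
        return "PENDING"         # client only treats SUCCESS as committed

    def _flush(self):
        # Real code would write + os.fsync() here before acknowledging.
        self.durable.extend(self.buffer)
        for key, _txn in self.buffer:
            self.applied[key] = "SUCCESS"
        self.buffer.clear()
```

Without the `applied` table, a client retrying after a crash-then-recover would double-post its transaction: the “window of duplication” the tip below warns about.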
8. Summary
- WAL converts Random I/O to Sequential I/O (fast).
- fsync is the only way to guarantee data has left the OS page cache, though, as the opening story shows, the drive’s own write cache must also be trustworthy.
- Group Commit increases throughput by batching fsync calls, at the risk of losing the buffer if not handled carefully.
- ARIES uses Analysis, Redo, and Undo to recover state.
Mnemonic for ARIES: “Are you Recovered? — Analyze, Redo, Undo” (ARU). Three passes: forward to identify what happened, forward again to repeat committed work, backward to undo uncommitted transactions.
Staff Engineer Tip: Group Commit is a Safety Feature, Not Just Performance. Engineers implement Group Commit purely for throughput. But the real safety guarantee is the client contract: if fsync is called every 5 transactions and the server crashes before responding, the client never receives Success, so it safely retries. This requires strict idempotency keys on the client side. Without idempotency, Group Commit’s “window of loss” becomes a “window of duplication” on retry — worse than data loss in financial systems.