Write-Ahead Logging (WAL): The Immortal Log
Write-Ahead Logging (WAL): The Immortal Log
RAM is fast but volatile. Hard Disks are slow but permanent. How do we get the speed of RAM with the safety of a Disk? The Write-Ahead Log (WAL).
1. The Golden Rule
“Never modify a data page in memory until you have written the change record to the log on disk.”
Why?
Imagine you update a user’s balance in RAM.
- Without WAL: Power fails. RAM is wiped. Data is lost.
- With WAL:
- Append “User A: +$10” to the Log File on Disk.
- Update RAM.
- Power fails.
- Reboot: Read the Log. “Ah, I see I need to add $10 to User A.” Replay the action.
2. Why WAL is Fast: Rotational Latency
You might ask: “If we have to write to disk anyway, why not just write to the main table?” The Answer: Sequential vs Random I/O.
- Main Table Update (Random I/O): To update Row ID 500, the disk arm must physically move (“Seek”) to Sector 500. This takes ~5-10ms on an HDD. It’s painfully slow. (See Latency Numbers).
- WAL Append (Sequential I/O): The log is just an “append-only” file. The disk arm never moves; the platter just spins under it. This is ~100x faster.
[!TIP] SSDs? Even on SSDs (which have no moving parts), WAL is beneficial because it reduces “Write Amplification” and extends the life of the drive by batching random writes.
3. The OS Lie: fsync and O_DIRECT
When you tell the OS to write to a file (write(fd, data)), the OS says “Okay, done!”.
It is lying.
The OS actually writes the data to its own RAM Buffer Cache (Page Cache) and lazily writes it to the physical disk seconds later. If the power fails, you lose data.
The Solution: fsync()
To guarantee Durability, databases must call a special system command: <span class="term-tooltip" tabindex="0" data-definition="File Sync: A system call that transfers all modified in-core data to the disk device.">fsync(fd)</span>.
This forces the OS to flush its buffer to the physical disk platter immediately.
- The Cost:
fsyncis incredibly slow (milliseconds vs nanoseconds). It blocks the thread until the disk confirms “I have it”. - The Cheat Code (O_DIRECT): Advanced databases (like ScyllaDB or modern Postgres) use
<span class="term-tooltip" tabindex="0" data-definition="Direct I/O: A flag that instructs the operating system to bypass the page cache and write directly to the disk.">O_DIRECT</span>. This bypasses the OS RAM cache entirely and talks directly to the disk controller, giving the database full control over when data is persisted.
4. Interactive Demo: Group Commit vs Sync Commit
Disk I/O is expensive. There is a fundamental trade-off between Throughput and Safety.
- Sync Commit (Strict): Call
fsyncfor every transaction.- Pro: Zero data loss.
- Con: Low TPS (Throughput). The disk spins for every single write.
- Group Commit (Fast): Buffer transactions in RAM and
fsynconce per batch (e.g., every 5ms or 10 txns).- Pro: Massive TPS.
- Con: If power fails, you lose the entire buffer (Window of Data Loss).
5. Deep Dive: ARIES Recovery Algorithm
So the log is on disk. The database crashes. Now what? We use ARIES (Algorithms for Recovery and Isolation Exploiting Semantics). It has 3 phases:
Phase 1: Analysis (The “What happened?” phase)
The DB reads the WAL from the last Checkpoint forward.
- It builds a “Dirty Page Table” (Which pages were modified in RAM but not Disk?).
- It identifies “Loser Transactions” (Transactions that were active at crash time).
Phase 2: Redo (The “Repeating History” phase)
The DB replays the log from the start of the Analysis point.
- It re-applies ALL changes (even for Loser Transactions).
- Why? This restores the database to the exact state of memory the millisecond before the crash.
- This is called “Repeating History”.
Phase 3: Undo (The “Clean up” phase)
The DB scans backwards from the end of the log.
- For every “Loser Transaction”, it undoes the changes.
- It writes “Compensation Log Records” (CLR) to track this undoing.
- Result: Committed data is saved. Uncommitted data is rolled back. Atomicity is preserved.
6. LSN and Checkpoints
If the log grows forever, recovery would take 100 years. We need shortcuts.
LSN (Log Sequence Number)
Every entry in the WAL gets a unique, increasing ID (e.g., LSN: 100, LSN: 101).
- Data Page Header: Every data page on disk stores the
pageLSNof the last change that touched it. - The Check: If
pageLSN >= logLSN, we know the page is up to date. We can skip replaying.
Checkpointing (Taking a Snapshot)
Periodically, the DB forces all dirty pages from RAM to Disk.
- Block Writers: Stop all incoming writes (Stop the world).
- Flush: Write dirty pages to disk.
- Truncate: Delete old WAL files.
- Resume: Allow writes.
7. Case Study: Designing a Durable Ledger
Scenario: You are building a core banking ledger for a startup. Requirement: Zero data loss. High throughput is secondary.
The Design
- Storage Engine: Use an Append-Only Log (WAL) as the source of truth.
- Durability Level:
fsyncon every transaction.- Trade-off: Max 100-200 TPS per disk (limited by rotational latency).
- Optimization: Use Batching (Group Commit) but with a smart client.
- Client sends transaction.
- Server puts in buffer.
- Server waits 5ms or until 10 txns accumulate.
- Server
fsync. - Server replies “Success” to all clients in the batch.
- Result: If crash happens, clients never received Success, so they can retry safely. No data loss from client perspective (Idempotency required).
8. Summary
- WAL converts Random I/O to Sequential I/O (fast).
- fsync is the only way to guarantee data is physically on the platter.
- Group Commit increases throughput by batching
fsynccalls, at the risk of losing the buffer if not handled carefully. - ARIES uses Analysis, Redo, and Undo to recover state.