Write-Ahead Logging (WAL): The Immortal Log
In 2012, a major cloud provider’s database cluster lost 6 hours of customer data in a regional outage. Post-mortem: their storage engine used O_SYNC on the OS page cache but failed to account for the disk controller’s own write buffer: a separate volatile RAM cache on the physical drive, with neither battery nor capacitor backing. The fsync() call returned success, but the data was still sitting in the controller buffer when power failed. The fix: battery-backed RAID controllers, or NVMe drives with power-loss protection. WAL is not just a software concept. It is a contract involving the OS, the disk controller, and the physical storage medium, and understanding where each layer in this chain can lie determines whether your “committed” data actually survives a power cut.
[!IMPORTANT] In this lesson, you will master:
- Sequential I/O Speed: Why appending is 100x faster than updating in-place.
- The ARIES Algorithm: The industry standard for Analysis, Redo, and Undo recovery.
- Hardware Intuition: Moving from RAM buffer caches to the physical persistent platter via fsync().
1. The Golden Rule
“Never modify a data page in memory until you have written the change record to the log on disk.”
Why?
Imagine you update a user’s balance in RAM.
- Without WAL: Power fails. RAM is wiped. Data is lost.
- With WAL:
- Append “User A: +$10” to the Log File on Disk.
- Update RAM.
- Power fails.
- Reboot: Read the Log. “Ah, I see I need to add $10 to User A.” Replay the action.
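The steps above can be sketched in a few lines. This is a minimal toy, not any real engine’s API: the MiniWAL class, the JSON record format, and the credit/recover names are all illustrative.

```python
import json
import os

class MiniWAL:
    """Toy write-ahead log: append the change record first, then touch memory."""

    def __init__(self, path):
        self.path = path
        self.balances = {}            # in-memory state (wiped on power failure)
        self.log = open(path, "a")    # append-only log file on disk

    def credit(self, user, amount):
        # 1. Golden Rule: write the change record to the log BEFORE touching RAM.
        self.log.write(json.dumps({"user": user, "delta": amount}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())   # force the record to stable storage
        # 2. Only now is it safe to update the in-memory state.
        self.balances[user] = self.balances.get(user, 0) + amount

    def recover(self):
        # After a crash, RAM is empty; replay the log to rebuild state.
        self.balances = {}
        with open(self.path) as f:
            for line in f:
                rec = json.loads(line)
                self.balances[rec["user"]] = (
                    self.balances.get(rec["user"], 0) + rec["delta"]
                )
```

To simulate the power cut: make a few credits, clear `balances` (the “RAM wipe”), then call `recover()` and watch the balances come back from the log.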
2. Why WAL is Fast: Rotational Latency
You might ask: “If we have to write to disk anyway, why not just write to the main table?” The Answer: Sequential vs Random I/O.
- Table Update (Random I/O): Data pages are scattered across the disk. Each update forces the disk arm to seek to a different location.
- WAL Append (Sequential I/O): The log is just an “append-only” file. The disk arm never moves; the platter just spins under it. This is ~100x faster.
[!NOTE] Hardware-First Intuition: Think of an HDD like a record player. To play a specific song (Random I/O), you must pick up the needle and move it to a different radius (Seek Time), then wait for the disk to spin until the start of the song reaches the needle (Rotational Latency). WAL avoids the “Seek” entirely by keeping the needle down and streaming data continuously. This is why sequential throughput is measured in MB/s while random is measured in IOPS.
[!TIP] SSDs? Even on SSDs (which have no moving parts), WAL is beneficial because it reduces “Write Amplification” and extends the life of the drive by batching random writes.
3. The OS Lie: fsync and O_DIRECT
When you tell the OS to write to a file (write(fd, data)), the OS says “Okay, done!”.
It is lying.
The OS actually writes the data to its own RAM Buffer Cache (Page Cache) and lazily writes it to the physical disk seconds later. If the power fails, you lose data.
The Solution: fsync()
To guarantee Durability, databases must call a special system command: <span class="term-tooltip" tabindex="0" data-definition="File Sync: A system call that transfers all modified in-core data to the disk device.">fsync(fd)</span>.
This forces the OS to flush its buffer to the physical disk platter immediately.
- The Cost: fsync is incredibly slow (milliseconds vs nanoseconds). It blocks the thread until the disk confirms “I have it”.
- The Cheat Code (O_DIRECT): Advanced databases (like ScyllaDB or modern Postgres) use <span class="term-tooltip" tabindex="0" data-definition="Direct I/O: A flag that instructs the operating system to bypass the page cache and write directly to the disk.">O_DIRECT</span>. This bypasses the OS RAM cache entirely and talks directly to the disk controller, giving the database full control over when data is persisted.
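A minimal sketch of a durable append on a POSIX-style filesystem: write() alone leaves the bytes in the page cache, so we follow it with fsync() on the file, and, for newly created files, on the parent directory so the directory entry itself survives a crash. The durable_append name is hypothetical.

```python
import os

def durable_append(path, data: bytes):
    """Append data and force it through the OS page cache to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, data)   # lands in the OS page cache: NOT yet durable
        os.fsync(fd)         # block until the kernel reports it reached the disk
    finally:
        os.close(fd)
    # If the file was just created, the directory entry pointing at it is
    # also only in the page cache; fsync the parent directory as well.
    dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

The directory fsync is the step most homegrown WALs forget: without it, a crash can leave you with durable bytes in a file that no directory entry points to.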
4. Interactive Demo: Group Commit vs Sync Commit
Disk I/O is expensive. There is a fundamental trade-off between Throughput and Safety.
- Sync Commit (Strict): Call fsync for every transaction.
  - Pro: Zero data loss.
  - Con: Low TPS (Throughput). Every single write waits on a full disk round-trip.
- Group Commit (Fast): Buffer transactions in RAM and fsync once per batch (e.g., every 5 ms or 10 txns).
  - Pro: Massive TPS.
  - Con: If power fails, you lose the entire buffer (Window of Data Loss).
[!TIP] Try it yourself: Switch between Sync (Safe) and Group Commit (Fast) modes. Click “SIMULATE POWER CUT” to see if any data is lost in the RAM buffer.
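The trade-off can be made concrete with a toy buffer. GroupCommitLog is an illustrative sketch, not a real engine’s API; it counts fsync calls so the batching saving is visible, and its in-RAM buffer is exactly the “window of data loss”.

```python
import os

class GroupCommitLog:
    """Toy group commit: buffer records in RAM, fsync once per batch."""

    def __init__(self, path, batch_size=10):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.buffer = []              # the "window of data loss"
        self.batch_size = batch_size
        self.fsync_calls = 0          # instrumented to show the saving

    def commit(self, record: bytes):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        os.write(self.fd, b"".join(self.buffer))
        os.fsync(self.fd)             # ONE disk round-trip for the whole batch
        self.fsync_calls += 1
        self.buffer.clear()           # only now may the server ack these clients
```

Twenty-five commits with a batch size of 10 cost two fsyncs instead of twenty-five; a power cut before `flush()` loses everything still sitting in `buffer`.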
5. Deep Dive: ARIES Recovery Algorithm
So the log is on disk. The database crashes. Now what? We use ARIES (Algorithms for Recovery and Isolation Exploiting Semantics). It has 3 phases:
The 3 Phases of Recovery
Phase 1: Analysis (The “What happened?” phase)
The DB reads the WAL from the last Checkpoint forward.
- It builds a “Dirty Page Table” (Which pages were modified in RAM but not Disk?).
- It identifies “Loser Transactions” (Transactions that were active at crash time).
Phase 2: Redo (The “Repeating History” phase)
The DB replays the log forward from the oldest change recorded in the Dirty Page Table.
- It re-applies ALL changes (even for Loser Transactions).
- Why? This restores the database to the exact state of memory the millisecond before the crash. This is called “Repeating History”.
Phase 3: Undo (The “Rollback” phase)
The DB walks the log backward and reverses every change belonging to a Loser Transaction, writing a Compensation Log Record (CLR) for each undo so that a crash during recovery never repeats the same rollback.
- Result: Committed data is saved. Uncommitted data is rolled back. Atomicity is preserved.
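The three passes can be sketched over an in-memory log. This is a deliberately simplified model: the record shapes, the recover function, and the CLR format are illustrative, and real ARIES additionally tracks Dirty Page Tables and pageLSN comparisons that this sketch omits.

```python
def recover(log):
    """Toy ARIES-style recovery. Records are dicts (illustrative format):
    {"lsn": int, "tx": str, "type": "update"|"commit",
     "key": str, "old": value, "new": value}."""
    # Phase 1: Analysis — which transactions committed before the crash?
    committed = {r["tx"] for r in log if r["type"] == "commit"}
    losers = {r["tx"] for r in log if r["type"] == "update"} - committed

    # Phase 2: Redo — repeat history: re-apply EVERY update, even losers'.
    db = {}
    for r in log:
        if r["type"] == "update":
            db[r["key"]] = r["new"]

    # Phase 3: Undo — roll back losers in reverse LSN order, appending a
    # Compensation Log Record (CLR) for each undo so a second crash
    # mid-recovery never redoes the same rollback.
    for r in sorted(log, key=lambda r: r["lsn"], reverse=True):
        if r["type"] == "update" and r["tx"] in losers:
            db[r["key"]] = r["old"]
            log.append({"lsn": log[-1]["lsn"] + 1, "tx": r["tx"],
                        "type": "clr", "undoes": r["lsn"]})
    return db
```

Feed it a log where T1 committed and T2 did not: T1’s update survives, T2’s is undone, and the log gains a CLR recording the undo.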
Staff Engineer Tip: Physiological Logging. ARIES doesn’t just store “logical” changes (like UPDATE val=10). It stores Physiological logs. This means the log specifies the physical page affected but identifies the change logically within that page. Why? Because replaying a purely logical change on a row that moved due to internal fragmentation is impossible. Physiological logging is the perfect balance between high-performance physical logs and portable logical logs.
War Story: The “Infinite” Crash Loop At a high-frequency trading firm, a database crashed during peak market hours. Recovery started (Analysis, Redo, Undo) but the system crashed again during the Undo phase. If the database didn’t have ARIES’s concept of Compensation Log Records (CLRs), it would have kept attempting to undo the same transaction forever, entering an infinite crash loop. Because ARIES logs its own undo operations, the firm’s database cleanly recovered on the second reboot in seconds instead of hours.
6. LSN and Checkpoints
If the log grows forever, recovery would take 100 years. We need shortcuts.
LSN (Log Sequence Number)
Every entry in the WAL gets a unique, increasing ID (e.g., LSN: 100, LSN: 101).
- Data Page Header: Every data page on disk stores the pageLSN of the last change that touched it.
- The Check: If pageLSN >= logLSN, we know the page is already up to date. We can skip replaying that record.
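The pageLSN check can be sketched as a redo loop that skips records a page already reflects. The data shapes here are assumed purely for illustration.

```python
def redo_pass(pages, log):
    """Replay log records, skipping any a page has already seen.
    pages: {page_id: {"pageLSN": int, "data": ...}}  (toy shape)
    log:   [{"lsn": int, "page": page_id, "apply": callable}]"""
    replayed = []
    for rec in log:
        page = pages[rec["page"]]
        if page["pageLSN"] >= rec["lsn"]:
            continue                  # change already on disk: skip the replay
        rec["apply"](page)            # re-apply the change to the page
        page["pageLSN"] = rec["lsn"]  # stamp the page with this record's LSN
        replayed.append(rec["lsn"])
    return replayed
```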
Checkpointing (Taking a Snapshot)
Periodically, the DB forces all dirty pages from RAM to Disk.
- Block Writers: Stop all incoming writes (Stop the world).
- Flush: Write dirty pages to disk.
- Truncate: Delete old WAL files.
- Resume: Allow writes.
(This is the simple stop-the-world scheme; ARIES itself specifies fuzzy checkpoints that avoid pausing writers, but the mental model is the same.)
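The four steps map onto a short sketch (toy, stop-the-world version with illustrative data shapes; a real engine records the checkpoint in the WAL rather than clearing it wholesale).

```python
def checkpoint(buffer_pool, wal):
    """Toy stop-the-world checkpoint.
    buffer_pool: {page_id: {"dirty": bool, "data": ..., "on_disk": ...}}
    wal: list of {"lsn": int, ...} records, mutated in place."""
    # 1. Block Writers: implicit here, since this toy is single-threaded.
    # 2. Flush: every dirty page goes to stable storage.
    for page in buffer_pool.values():
        if page["dirty"]:
            page["on_disk"] = page["data"]
            page["dirty"] = False
    # 3. Truncate: records before this point are now redundant for recovery.
    checkpoint_lsn = wal[-1]["lsn"] if wal else 0
    wal.clear()
    # 4. Resume: writers continue; the next recovery starts at checkpoint_lsn.
    return checkpoint_lsn
```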
7. Case Study: Designing a Durable Ledger
Scenario: You are building a core banking ledger for a startup. Requirement: Zero data loss. High throughput is secondary.
The Design
- Storage Engine: Use an Append-Only Log (WAL) as the source of truth.
- Durability Level: fsync on every transaction.
  - Trade-off: Max 100-200 TPS per disk (limited by rotational latency).
- Optimization: Use Batching (Group Commit) but with a smart client:
  - Client sends transaction.
  - Server puts it in a buffer.
  - Server waits 5 ms or until 10 txns accumulate.
  - Server calls fsync.
  - Server replies “Success” to all clients in the batch.
- Result: If a crash happens mid-batch, clients never received Success, so they can retry safely. No data loss from the client’s perspective (idempotency required).
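The client contract hinges on idempotency keys. A toy sketch of the server side, with hypothetical names, showing that a retry of an already-committed key is deduplicated rather than applied twice:

```python
class LedgerServer:
    """Toy group-commit ledger with idempotency keys (hypothetical design)."""

    def __init__(self, batch_size=10):
        self.applied = {}        # idempotency_key -> result, checked on retry
        self.buffer = []         # pending, NOT yet acknowledged
        self.durable = []        # stands in for the fsync'd on-disk log
        self.batch_size = batch_size

    def submit(self, idempotency_key, txn):
        # A retried key that already committed must NOT be written again.
        if idempotency_key in self.applied:
            return self.applied[idempotency_key]
        self.buffer.append((idempotency_key, txn))
        if len(self.buffer) >= self.batch_size:
            self._flush()
            return "SUCCESS"
        return "PENDING"         # client only treats SUCCESS as committed

    def _flush(self):
        # Real code would write + os.fsync() here before acknowledging.
        self.durable.extend(self.buffer)
        for key, _txn in self.buffer:
            self.applied[key] = "SUCCESS"
        self.buffer.clear()
```

Without the `applied` table, a client retrying after a crash-then-recover would double-post its transaction: the “window of duplication” the tip below warns about.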
8. Summary
- WAL converts Random I/O to Sequential I/O (fast).
- fsync is the only way to guarantee data has left the OS page cache, though, as the opening story shows, the drive’s own write cache must also be trustworthy.
- Group Commit increases throughput by batching fsync calls, at the risk of losing the buffer if not handled carefully.
- ARIES uses Analysis, Redo, and Undo to recover state.
Mnemonic for ARIES: “Are you Recovered? — Analyze, Redo, Undo” (ARU). Three passes: forward to identify what happened, forward again to repeat committed work, backward to undo uncommitted transactions.
Staff Engineer Tip: Group Commit is a Safety Feature, Not Just Performance. Engineers implement Group Commit purely for throughput. But the real safety guarantee is the client contract: if fsync is called every 5 transactions and the server crashes before responding, the client never receives Success, so it safely retries. This requires strict idempotency keys on the client side. Without idempotency, Group Commit’s “window of loss” becomes a “window of duplication” on retry — worse than data loss in financial systems.