Isolation Levels: Controlling Chaos

In 2019, a hospital booking system double-assigned 3 ICU beds during a weekend surge. The root cause: a “check-then-act” race condition under Snapshot Isolation — the exact Write Skew anomaly described in this chapter. Two nurses simultaneously checked availability, saw 1 bed free, and both assigned it to different patients. The fix required a single line of SQL: SELECT ... FOR UPDATE. In the same year, a fintech startup discovered phantom reads in their tax calculation reports — the same transaction would return different sums because background jobs were inserting new transactions mid-report. These aren’t rare edge cases. They are the day-to-day consequences of choosing the wrong isolation level.

[!IMPORTANT] In this lesson, you will master:

  1. Concurrency Anomalies: Why Snapshot Isolation (MySQL/Postgres) can still lead to “Write Skew”.
  2. MVCC Costs: Understanding the CPU/RAM trade-off of versioning vs. traditional locking.
  3. Hardware Intuition: How row-level locks trigger context switching and CPU thrashing.

1. The Concurrency Anomalies

To understand Isolation, you must first understand the bugs it prevents.

A. Dirty Read

  • Scenario: Transaction A writes a value but hasn’t committed yet. Transaction B reads that uncommitted value.
  • Danger: If A rolls back, B is working with data that “never existed”.

B. Non-Repeatable Read

  • Scenario: Transaction A reads a row (Balance: 100). Transaction B updates it to 200 and commits. Transaction A reads it again and sees $200.
  • Problem: Math is broken. “I just checked, and it was $100!”

C. Phantom Read

  • Scenario: Transaction A runs SELECT * FROM Users WHERE Age > 18 (Gets 10 rows). Transaction B inserts a new 20-year-old user. Transaction A runs the same query again and gets 11 rows.
  • Problem: “Where did this ghost come from?”
  • Note: Non-Repeatable Read is about modifying existing rows. Phantom Read is about adding/removing rows.

2. Deep Dive: Write Skew (The Silent Killer)

There is a 4th anomaly that most people forget, but it causes huge bugs in Snapshot Isolation (used by Repeatable Read).

The Scenario: The “Doctor On Call” Problem

  • Rule: At least one doctor must be on call at the hospital.
  • Current State: Alice (On Call), Bob (On Call).
  • The Goal: Both Alice and Bob want to leave (go Off Call).

The Race Condition:

  1. Alice (Txn A) checks the schedule. She sees 2 doctors on call. “Okay, 2 > 1. I can leave.”
  2. Bob (Txn B) checks the schedule at the same time. He also sees 2 doctors on call (because Alice hasn’t committed yet). “Okay, 2 > 1. I can leave.”
  3. Alice updates her status to “Off Call” and commits.
  4. Bob updates his status to “Off Call” and commits.
  5. Result: Zero doctors on call. The invariant is broken.

This is Write Skew. Both transactions read the same data, made a decision based on it, but their updates were disjoint (different rows), so the database didn’t detect a direct conflict.


3. The Four Levels (SQL Standard)

Databases let you choose your safety level. Higher safety = Lower performance.

Isolation Level Dirty Read? Non-Repeatable? Phantom? Write Skew? Performance
Read Uncommitted ✅ Yes ✅ Yes ✅ Yes ✅ Yes 🚀 Fastest
Read Committed ❌ No ✅ Yes ✅ Yes ✅ Yes ⚡ Fast (Default: Postgres)
Repeatable Read ❌ No ❌ No ✅ Yes ✅ Yes 🐢 Moderate (Default: MySQL)
Serializable ❌ No ❌ No ❌ No ❌ No 🐌 Slowest

Deep Dive: MVCC (Multi-Version Concurrency Control)

Most modern databases (Postgres, MySQL InnoDB) achieve isolation without locking readers using MVCC.

[!NOTE] Hardware-First Intuition: Traditional locking (2PL) is an Availability Killer. If Txn A locks a row, Txn B’s CPU thread must “Sleep”, forcing a Context Switch. MVCC avoids this by using RAM to store multiple versions of a row. This trades memory (for version storage) and background CPU (for Vacuum/GC) to keep your “Foreground” application threads running at full speed without blocking.

The Core Idea: Instead of overwriting data, create a new version.

MVCC: MULTI-VERSION CONCURRENCY CONTROL
ROW VERSION: v1 Balance: $100 Created By: TxID 90 Expired At: TxID 100 VISIBLE TO TxIDs < 100 Pointer to New ROW VERSION: v2 (Latest) Balance: $80 Created By: TxID 100 Expired At: ∞ VISIBLE TO TxIDs ≥ 100 CONCURRENT READERS SCENARIO 👤 TxID 99 (Analyst) Sees v1 (Bal: $100) 📉 TxID 105 (Reporting) Sees v2 (Bal: $80)
Timeline: User Balance Updates

T1: BEGIN TRANSACTION (TxID = 100)
  Row: [id=1, balance=$100, created_by_tx=90, visible_to_tx=∞]

T2: UPDATE balance = $80 (TxID = 100)
  Old Row: [id=1, balance=$100, created_by_tx=90, visible_to_tx=100]  ← Still exists!
  New Row: [id=1, balance=$80, created_by_tx=100, visible_to_tx=∞]  ← New version

T3: Another transaction (TxID = 99) reads
  → Sees $100 (because 99 < 100, it reads the old version)

T4: COMMIT (TxID = 100)
  → Now all new transactions (TxID ≥ 101) see $80

Version Chain: Each row can have multiple versions linked by transaction IDs.

Postgres vs MySQL MVCC Implementation

Aspect Postgres MySQL (InnoDB)
Storage Old versions stored in-place (TOAST for large) Old versions in Undo Log
Read Overhead Low (Direct tuple access) Medium (May need undo log lookup)
Garbage Collection VACUUM (Background process) Purge Thread (Automatic)
Version Visibility Based on snapshot XID Based on read view
Write Performance Slower (In-place bloat) Faster (Undo log is sequential)
Snapshot Creation Per-transaction (Copy XID list) Per-statement (Lightweight)

Interview Insight: Postgres VACUUM is notorious for causing performance issues if not tuned. MySQL’s purge thread is more predictable.

Staff Engineer Tip: The SIREAD Lock. Serializable Snapshot Isolation (SSI) doesn’t use standard locks that block other writers. instead, it records SIREAD locks (special bookmarks) in a central graph. If a transaction tries to commit and it would create a cycle (two transactions relying on data that the other just modified), the database aborts it with a “Serialization Failure”. It is an optimistic strategy that assumes conflicts are rare.


How Serializable Works (The Nuclear Option)

  1. 2PL (Two-Phase Locking): The classic approach. It locks everything you read. If you read 10 rows, no one else can touch them. (Very slow).
  2. SSI (Serializable Snapshot Isolation): The modern approach (Postgres). It uses MVCC but tracks “dependencies”. If it detects a conflict (like the Doctor scenario), it aborts one transaction. It is optimistic and much faster than 2PL.

4. Interactive Demo: Isolation Arena

Experience the concurrency bugs first-hand. Select a mode to simulate specific anomalies.

[!TIP] Try it yourself: Select a mode (Dirty Read, Phantom Read, Write Skew) and follow the steps to trigger the bug.

⚔️ ISOLATION ARENA: ANOMALY EXPLORER
Txn A READER/WRITER
> INITIALIZING TRANSACTION A...
Txn B THE DISTURBANCE
> INITIALIZING TRANSACTION B...

5. Case Study: E-commerce Inventory (Lost Updates)

How do massive sites like Amazon prevent selling 1 item to 2 people? (See also Flash Sales).

The Problem

Item X has Quantity = 1.

  1. User A adds to cart. System checks inventory: 1. OK.
  2. User B adds to cart. System checks inventory: 1. OK.
  3. User A checks out. UPDATE items SET qty = 0 WHERE id = X.
  4. User B checks out. UPDATE items SET qty = 0 WHERE id = X.
  5. Result: 2 Items Sold. 1 in stock. Overselling.

[!NOTE] War Story: The $10,000 Flash Sale Oversell A popular sneaker company once ran a flash sale without proper isolation levels. They used Read Committed, and allowed 500 users to concurrently read an inventory count of 1 for a rare shoe. All 500 users proceeded to checkout, their transactions committed successfully, and the database recorded -499 in stock. The company lost over $10,000 in refunds and apology credits. The fix? Using SELECT ... FOR UPDATE to lock the inventory row, forcing subsequent reads to wait until the inventory was updated and correctly reflected 0.

The Solution: Isolation Levels vs Locking

Approach A: Pessimistic Locking (SELECT FOR UPDATE)

Lock the row when reading.

BEGIN;
SELECT * FROM items WHERE id = 'X' FOR UPDATE;
-- User B blocks here until User A commits.
UPDATE items SET qty = 0 WHERE id = 'X';
COMMIT;
  • Pros: 100% Safe.
  • Cons: Slow. Kills concurrency.

Approach B: Optimistic Locking (Versioning)

Add a version column.

-- User A reads: Qty=1, Ver=5
-- User B reads: Qty=1, Ver=5

-- User A updates:
UPDATE items SET qty=0, ver=6 WHERE id='X' AND ver=5;
-- Success (1 row updated).

-- User B updates:
UPDATE items SET qty=0, ver=6 WHERE id='X' AND ver=5;
-- FAIL (0 rows updated) because version is now 6!
  • Pros: Fast. No locks.
  • Cons: User B gets an error (“Sorry, item sold out”).

6. Summary

  • Write Skew: A dangerous bug in Repeatable Read/Snapshot Isolation where concurrent checks pass but the final state is invalid.

6.1 The Deadlock Trap

When using strict isolation (like Serializable), you will eventually hit a Deadlock.

  • The Scenario: Txn A locks Row 1 and waits for Row 2. Txn B locks Row 2 and waits for Row 1.
  • The Hardware Conflict: Both CPU threads are at a standstill.
  • The Resolution: The database engine runs a background process to detect cycles in the Wait-For Graph. It “kills” the younger transaction (Rollback) to let the older one finish. Your application must be designed to retry!

Mnemonic for anomalies: “Dirty Phantoms Write Skewed Reality” — Dirty Read (prevented by Read Committed), Phantom Read (prevented by Serializable), Write Skew (prevented by SSI). Most production apps run at Read Committed. Serializable is for financial systems.

Staff Engineer Tip: Default to Read Committed, Use SELECT FOR UPDATE Surgically. Using Serializable everywhere has significant overhead and causes retry storms under contention. The correct approach: use Read Committed (PostgreSQL default) for 99% of operations, and upgrade specific critical sections to SELECT ... FOR UPDATE (pessimistic locking) or Optimistic Locking (version column) only where Write Skew is a real risk. Document every FOR UPDATE with a comment explaining the anomaly it prevents.