The Structures That Hold The World

When you do SELECT * FROM users WHERE id = 5, what happens? The database doesn’t scan the whole disk. It uses a Tree. But not a Binary Tree.

1. The Hardware Constraint: Disk I/O

  • RAM Access: Nanoseconds.
  • Disk Access: Milliseconds (1,000,000x slower).
  • Block Size: Disks read data in blocks (e.g., 4KB or 16KB).

A standard Binary Tree scatters its nodes across memory, one tiny node per allocation. A million-key binary tree is about 20 levels deep, and every pointer hop (Node -> Left -> Left -> ...) can land on a different disk page, so a single lookup might trigger ~20 disk seeks. Too slow.

2. B-Trees and B+Trees

A B-Tree is a “fat” tree. Each node contains many keys (e.g., 100) and many children.

  • Height: Very low. A height-3 B-Tree can store millions of rows.
  • Locality: One node = One Disk Page. Reading a node gets 100 keys in 1 seek.
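As a back-of-the-envelope check of how "fat" a node can be, here is a rough fanout estimate. The 16 KB page matches InnoDB's default page size, but the 8-byte key and pointer sizes are simplifying assumptions, not any engine's real on-disk layout:

```python
# Rough fanout estimate: how many (key, child-pointer) pairs fit in one page?
PAGE_SIZE = 16 * 1024  # bytes; InnoDB's default page size
KEY_SIZE = 8           # e.g. a BIGINT primary key (assumed size)
POINTER_SIZE = 8       # child page number (assumed size)

fanout = PAGE_SIZE // (KEY_SIZE + POINTER_SIZE)
print(fanout)  # 1024 -> one disk seek fetches a 1024-way routing decision
```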

B+Tree (The Standard)

Used by MySQL (InnoDB) and PostgreSQL.

  • Internal Nodes: Only keys (for routing).
  • Leaf Nodes: Actual Data. Linked together for fast range scans.
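The linked leaves are what make range scans cheap: once the first leaf is located, the scan just follows sibling pointers instead of re-descending from the root. A minimal sketch, with a toy `Leaf` class and a three-leaf chain wired up by hand:

```python
class Leaf:
    """A B+Tree leaf: sorted keys plus a pointer to the next leaf."""
    def __init__(self, keys, next_leaf=None):
        self.keys = keys
        self.next = next_leaf

# Three leaves chained together (a toy tree, built by hand).
l3 = Leaf([70, 80, 90])
l2 = Leaf([40, 50, 60], l3)
l1 = Leaf([10, 20, 30], l2)

def range_scan(leaf, lo, hi):
    """Collect keys in [lo, hi] by walking sibling links."""
    out = []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out        # keys are sorted: we can stop early
            if k >= lo:
                out.append(k)
        leaf = leaf.next          # hop to the next leaf, no re-descent
    return out

print(range_scan(l1, 25, 65))  # [30, 40, 50, 60]
```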

[!NOTE] The Math of Fanout: If a B+Tree has a branching factor (fanout) of 100, a tree of height 3 can store $100^3 = 1,000,000$ keys. This means finding any record out of a million takes at most 3 disk seeks!
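The same arithmetic in code, via a hypothetical helper that finds the smallest height whose capacity covers N keys (integer arithmetic, to avoid floating-point log rounding):

```python
def min_height(n_keys, fanout):
    """Smallest h such that fanout ** h >= n_keys."""
    h, capacity = 0, 1
    while capacity < n_keys:
        capacity *= fanout
        h += 1
    return h

print(min_height(1_000_000, 100))      # 3 seeks for a million keys
print(min_height(1_000_000_000, 100))  # 5 seeks for a billion
```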

3. Interactive: B+Tree Split

Visualize how a node “splits” when full, pushing the median up.

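A leaf split can be sketched in a few lines. One simplification worth noting: in a B+Tree, a splitting leaf *copies* its separator key up into the parent (the key must stay in the leaf, because the data lives there), whereas a splitting internal node truly pushes its median up and out:

```python
def split_leaf(keys):
    """Split an overfull leaf into two halves, returning the key
    copied up into the parent (it also stays in the right leaf)."""
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    separator = right[0]  # copied up for routing, not removed
    return left, right, separator

print(split_leaf([10, 20, 30, 40, 50]))  # ([10, 20], [30, 40, 50], 30)
```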

4. LSM Trees (Log-Structured Merge)

Used by Cassandra, RocksDB, and LevelDB. (Kafka's log segments rely on the same append-only principle, though Kafka itself is not an LSM tree.)

  • Problem: B-Trees update pages in place, which means random writes — seek-bound on HDDs, write-amplifying on SSDs. At millions of writes per second, the B-Tree itself becomes the bottleneck.
  • Solution: Write sequentially to an append-only log.

Anatomy of an LSM Tree Read/Write

  1. Write-Ahead Log (WAL): Every write is immediately appended to a WAL on disk for durability (crash recovery).
  2. MemTable: The write is then stored in an in-memory balanced tree (like a Red-Black tree).
  3. Flush to SSTable: When the MemTable gets full (e.g., a few Megabytes), it is flushed sequentially to disk as an immutable SSTable (Sorted String Table).
  4. Compaction: Over time, you accumulate many SSTables. A background process periodically merges and sorts them into larger SSTables, removing deleted keys (Tombstones).
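The four steps above can be sketched as a toy, in-memory LSM engine. Dicts and lists stand in for disk files, compaction is omitted for brevity, and names like `MiniLSM` are invented for illustration:

```python
class MiniLSM:
    """Toy LSM write path: WAL append -> MemTable insert -> flush to SSTable."""

    def __init__(self, memtable_limit=3):
        self.wal = []       # append-only log (durability / crash recovery)
        self.memtable = {}  # a balanced tree or skip list in real engines
        self.sstables = []  # newest first; each is a sorted, immutable run
        self.limit = memtable_limit

    def put(self, key, value):
        self.wal.append((key, value))  # 1. WAL first, for crash recovery
        self.memtable[key] = value     # 2. then the in-memory table
        if len(self.memtable) >= self.limit:
            self._flush()

    def _flush(self):
        # 3. Sequential write: dump the MemTable as one sorted SSTable.
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable.clear()

    def get(self, key):
        if key in self.memtable:       # newest data wins
            return self.memtable[key]
        for table in self.sstables:    # then newest SSTable to oldest
            for k, v in table:
                if k == key:
                    return v
        return None

db = MiniLSM()
for k, v in [("a", 1), ("b", 2), ("c", 3), ("a", 99)]:
    db.put(k, v)
print(db.get("a"))  # 99 -- the newer MemTable write shadows the flushed one
```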

[!TIP] War Story: Discord's Migration. Discord originally stored billions of messages in MongoDB (B-Tree indexes). As they grew, random writes crippled performance. They migrated to Cassandra (LSM Tree) because appending to a sequential log is orders of magnitude faster than updating random B-Tree pages for their heavily write-skewed workload.

How do we Read Fast? (Bloom Filters)

Since a read might have to check the MemTable and multiple SSTables, LSM Trees use Bloom Filters (a probabilistic data structure). The Bloom Filter can tell us instantly: “This key is definitely not in this SSTable,” saving us a costly disk read.
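A minimal Bloom filter sketch. The bit-array size and the hash scheme (salted SHA-256, truncated to an index) are arbitrary choices for illustration, not what any particular engine uses:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hashed bit positions per key.
    False positives are possible; false negatives never are."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0  # an int used as a bit array

    def _positions(self, key):
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits & (1 << p) for p in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))  # True -- no false negatives, ever
# might_contain("user:999") is almost certainly False (tiny false-positive
# probability), which lets a read skip that SSTable's disk seek entirely.
```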

B+Tree vs LSM Tree Comparison

| Feature | B+Tree (MySQL, Postgres) | LSM Tree (Cassandra, RocksDB) |
| --- | --- | --- |
| Optimized for | Read-heavy workloads | Write-heavy workloads |
| Write pattern | Random writes (in-place updates) | Sequential writes (append-only) |
| Read speed | Very fast ($O(\log N)$ seeks) | Slower (may check multiple SSTables) |
| Space amplification | High (fragmentation in pages) | Low (compaction cleans it up) |