Database Sharding

In 2012, Facebook acquired Instagram for $1 billion when it had just 13 employees serving roughly 30 million users, almost all of it on PostgreSQL. Scale exposed the limits of any single node: when Justin Bieber joined and accumulated a million followers within hours, the follower writes for that one account could pin a single disk's I/O queue at 100% utilization while other machines sat idle. Instagram's engineers later moved their highest-write workloads to Cassandra, migrating hundreds of millions of rows while serving live traffic with no outage, using the dual-write migration pattern that has since become an industry standard. Sharding isn't optional at scale; it's a reckoning.

[!IMPORTANT] In this lesson, you will master:

  1. Horizontal Splits: Why ripping your database into pieces is the only way to scale write-heavy workloads.
  2. Hotspot Hardware Intuition: How range-based sharding can saturate a single SSD’s I/O queue while others sit idle.
  3. The Resharding Storm: Quantifying the hardware cost (CPU + Network) of moving terabytes of data between nodes.

1. What is Sharding?

Sharding is Horizontal Scaling applied to a Database. Imagine your database is a 10,000-page phone book.

  • Vertical Scaling: Buy a thicker book (impossible after a certain point).
  • Sharding: Rip the book into 4 smaller volumes (A-G, H-M, N-S, T-Z) and give each volume to a separate librarian (Server).

This is a necessity for massive SQL Databases (like MySQL/Postgres) when they exceed the storage or write capacity of a single machine. NoSQL databases (Cassandra, DynamoDB) often shard automatically.

[!NOTE] Hardware-First Intuition: The “I/O Queue Depth” problem. In range-based sharding, a “Write Hotspot” doesn’t just fill up a disk; it saturates the Disk Controller’s I/O Queue. Even if the CPU is at 10%, if 50,000 requests hit the same SSD controller simultaneously, the Request Latency spikes because the hardware physically cannot process more than a few hundred concurrent I/O operations. This is why “Hash Sharding” is the preferred hardware strategy to scatter I/O across as many independent physical controllers as possible.

Sharding vs Partitioning: What’s the difference?

Though often used interchangeably, they are distinct:

| Feature | Sharding | Partitioning |
|---|---|---|
| Scope | Distributed across multiple servers (Horizontal) | Logical split within one server (Vertical/Horizontal) |
| Goal | Scale Write Throughput & Storage | Improve Query Performance (Scan less data) |
| Complexity | High (Network calls, Distributed Transactions) | Low (Handled by DB engine) |
| Example | Users 1-1M on DB1, 1M-2M on DB2 | Sales table split by Year (2023, 2024) |

2. Sharding Strategies

1. Key-Based (Hash) Sharding

This is the most common strategy for massive scale.

Concept: Use a deterministic hash function to assign data.

Formula: Shard_ID = Hash(Partition_Key) % Number_of_Shards

Example:

  • Hash(“User_123”) -> 9812
  • N = 4 Shards
  • Target = 9812 % 4 = Shard 0
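The formula above fits in a few lines of Python. This is a minimal sketch (the key and shard count are illustrative), using MD5 only for its uniform spread, not for security:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Key-based sharding: Shard_ID = Hash(Partition_Key) % Number_of_Shards."""
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return digest % num_shards

shard_for("User_123", 4)  # deterministic: same key, same shard, every time
```

Because the mapping is pure arithmetic, any client or proxy can route a request with no lookup table at all.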

Pros:

  • Uniform Distribution: Good hash functions (like MD5, MurmurHash) scatter keys randomly, preventing hotspots.
  • Simple Client Logic: The client (or proxy) can calculate the target shard without a lookup table.

Cons:

  • Resharding is Expensive: Changing N (e.g., 4 to 5) changes the modulo result for almost every key.
  • Result: You must move ≈80% of your data to new nodes. This is a “Resharding Storm”.
  • Fix: Use Consistent Hashing to minimize data movement.
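You can measure the storm directly: simulate a population of keys and count how many land on a different shard when N goes from 4 to 5. A sketch, again using MD5 as the hash:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_shards

keys = [f"user_{i}" for i in range(100_000)]
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print(f"{moved / len(keys):.0%} of keys must move")  # ≈ 80%
```

The math behind the result: a key stays put only when its hash has the same residue mod 4 and mod 5, which (by the Chinese Remainder Theorem) happens for 4 of every 20 hash values, so about 80% of keys move.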

2. Range-Based Sharding

You split data based on ranges of the key.

  • Shard 0: IDs 1 - 1,000,000
  • Shard 1: IDs 1,000,001 - 2,000,000

  • Pros: Efficient Range Queries. Scanning “Users created in Jan 2024” hits only one shard.
  • Cons: Write Hotspots. If IDs are sequential (Auto-Increment), all new traffic hits the last shard (Shard 1), while Shard 0 sits idle.
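Range routing is just a sorted-boundary lookup. A sketch with hypothetical boundaries, using the standard library's binary search:

```python
import bisect

# Inclusive upper bound of each shard's ID range (hypothetical layout).
UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def range_shard(user_id: int) -> int:
    """Return the index of the shard whose range contains user_id."""
    return bisect.bisect_left(UPPER_BOUNDS, user_id)

range_shard(42)         # Shard 0
range_shard(1_500_000)  # Shard 1
```

Note that every freshly issued auto-increment ID falls into the highest range, which is exactly the write-hotspot failure mode described above.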

3. Directory-Based Sharding

A central “Lookup Service” stores the mapping: User_123 -> Shard_A.

  • Pros: Flexible. You can move individual tenants/users without moving others.
  • Cons: The Lookup Service becomes a Single Point of Failure (SPOF) and a performance bottleneck.
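A directory is just an explicit key-to-shard mapping, which is what makes moving a single tenant cheap. A minimal sketch (names are hypothetical; a production directory would live in a replicated store to mitigate the SPOF):

```python
# Central lookup table: key -> shard.
directory = {"User_123": "Shard_A", "User_456": "Shard_B"}

def lookup(key: str) -> str:
    return directory[key]

# Moving one tenant is a single mapping update; nobody else is touched.
directory["User_123"] = "Shard_C"
lookup("User_123")  # "Shard_C"
```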

4. Geo-Sharding (Data Residency)

You shard based on the user’s location.

  • Shard US: Stores USA users.
  • Shard EU: Stores European users.

Why?

    1. Latency: Data is close to the user.
    2. Compliance: Laws like the GDPR restrict where EU users' data may be stored and transferred; some national laws go further and require in-country storage.
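Geo routing can be sketched as a simple country-to-shard map (the region names and the default fallback here are hypothetical):

```python
# Hypothetical country -> shard routing for latency and data residency.
REGION_SHARDS = {
    "US": "db-us-east",
    "DE": "db-eu-frankfurt",  # EU users' data stays on EU hardware
    "FR": "db-eu-frankfurt",
}

def shard_for_user(country_code: str) -> str:
    return REGION_SHARDS.get(country_code, "db-us-east")  # default region

shard_for_user("DE")  # "db-eu-frankfurt"
```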

3. Interactive Demo: Hash vs Range

Visualize how data lands on servers.

  1. Hash Mode: Type “user1”, “user2”. Notice they scatter randomly.
  2. Range Mode: Type numbers “10”, “50”, “150”. Notice they group together.
  3. Reshard: Change the shard count and watch the Resharding Storm (Hash Sharding only).

4. Deep Dive: Virtual Buckets

The main problem with standard Sharding is Hot Partitions. If you shard by User ID and one of those users is Justin Bieber, the shard that holds his row will melt. You can't easily move just him to a new shard because the mapping Hash(Key) % N is rigid.

Solution: Virtual Buckets, used by Couchbase (vBuckets); Cassandra's virtual nodes (vnodes) apply the same idea. Instead of mapping Key -> Node, you map Key -> Bucket -> Node.

  1. Map Key to Bucket: Hash(Key) % 1024 Buckets. (Fixed).
  2. Map Bucket to Node: Bucket_Table[Bucket_ID] = Node_IP. (Flexible).

Example:

  • Keys 1-100 map to Bucket 1.
  • Bucket 1 maps to Node A.
  • If Bucket 1 gets hot, you can change the Bucket_Table to point Bucket 1 to Node B (a larger server).
  • You move the data in Bucket 1 to Node B.
  • Result: You solved the hot spot without changing the Hash Function or reshuffling the other 1023 buckets.
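The two-level mapping can be sketched in a few lines (node names and the three-node layout are hypothetical):

```python
import hashlib

NUM_BUCKETS = 1024  # fixed forever; only the bucket -> node routing changes

# Flexible routing table: bucket -> node.
bucket_table = {b: f"node-{b % 3}" for b in range(NUM_BUCKETS)}

def bucket_for(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_BUCKETS

def node_for(key: str) -> str:
    return bucket_table[bucket_for(key)]

# A bucket runs hot: repoint just that one bucket to a bigger node.
hot = bucket_for("justin_bieber")
bucket_table[hot] = "node-big"
node_for("justin_bieber")  # "node-big"; the other 1023 buckets never moved
```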

5. The Hidden Costs of Sharding

Sharding is not a “Free Lunch”. It introduces massive complexity.

1. Cross-Shard Joins

Imagine you shard Users by ID:

    SELECT * FROM Orders JOIN Users ON Orders.user_id = Users.id

If Orders and Users are on different machines, the database cannot do this join efficiently. Solutions:

  1. Application Join: Fetch User from Shard A, then fetch Orders from Shard B. Combine in your app.
  2. Denormalization: Store user_name inside the Orders table so you don’t need to join.
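An application join looks like this in practice. The dictionaries below are in-memory stand-ins for two physical shards, with hypothetical data; in production each access would be a network round trip:

```python
# Stand-ins for two physical shards (hypothetical data).
users_shard_a = {42: {"id": 42, "name": "Ada"}}
orders_shard_b = [{"order_id": 1, "user_id": 42, "total": 9.99}]

def user_with_orders(user_id: int) -> dict:
    """Application join: one query per shard, merged in the app layer."""
    user = dict(users_shard_a[user_id])            # round trip to Shard A
    user["orders"] = [o for o in orders_shard_b    # round trip to Shard B
                      if o["user_id"] == user_id]
    return user
```

The cost is visible in the code: two sequential round trips where a single-node database would do one, which is why denormalization is often preferred on hot paths.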

2. Distributed Transactions

Maintaining ACID properties across shards is hard. If you deduct money from User A (Shard 1) and add to User B (Shard 2), you need Two-Phase Commit (2PC), which is slow and complex.


6. Shard Key Selection: The Most Critical Decision

Choosing the wrong shard key can cripple your system. Here’s how to choose wisely.

Decision Matrix

| Shard Key | Distribution | Range Queries | Hot Shard Risk | Best For |
|---|---|---|---|---|
| User ID (Hash) | Even | ❌ No | Low | User-centric apps (Twitter, Instagram) |
| Timestamp | ❌ Skewed | ✅ Yes | [!] HIGH | Append-only logs (avoid as primary key) |
| Geography | Even | ✅ Yes | Medium | Multi-region (Uber, Airbnb) |
| Tenant ID | Varies | ✅ Yes | [!] HIGH | B2B SaaS (large tenant imbalance) |
| Composite (User+Time) | Even | Partial | Low | Activity feeds, messaging |

Anti-Patterns to Avoid

❌ 1. Monotonic IDs as Shard Key

Problem: Auto-increment IDs send ALL new writes to the last shard.

Shard 0: IDs 1-1M     (Cold, no writes)
Shard 1: IDs 1M-2M    (Cold, no writes)
Shard 2: IDs 2M-3M    ([!] HOT, all writes)

Solution: Use hash of ID or UUID.

❌ 2. The Celebrity Problem

Problem: Even with perfect hashing, some keys are just naturally huge.

  • Justin Bieber on Twitter: Has 100M followers.
  • Shard Key: user_id.
  • Scenario: Justin tweets.
  • Fanout: The system must insert 100M rows into the “Home Timeline” table.
  • Impact: If all those rows land on Shard X (because Hash(Justin) -> X), Shard X dies.

Solution:

  • Isolate: Put Justin Bieber on a dedicated shard.
  • Scatter: For the timeline, don’t store “Justin’s Tweet ID” 100M times. Store it once, and make followers fetch it from him (Pull Model).

Resharding Strategies

When you must add shards to an existing system:

1. Stop-the-World (Downtime)

  • Take database offline
  • Re-hash and move data
  • Bring back online
  • Cost: Hours of downtime
  • Use: Small databases only

2. Dual-Write (Zero Downtime)

  • Phase 1: Write to OLD + NEW shards simultaneously
  • Phase 2: Background job migrates old data to NEW
  • Phase 3: Switch reads to NEW shards
  • Phase 4: Remove OLD shards
  • Cost: Complex application logic
  • Use: Production systems (Facebook, Instagram)
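Phase 1 is conceptually tiny; the dictionaries below stand in for the old and new shard layouts (a sketch, not a real driver):

```python
def dual_write(key, value, old_shards: dict, new_shards: dict) -> None:
    """Phase 1: the application writes to BOTH layouts on every request."""
    old_shards[key] = value  # still the source of truth for reads
    new_shards[key] = value  # new layout fills in as live traffic flows

old, new = {}, {}
dual_write("user_1", {"name": "Ada"}, old, new)
# Phase 3 then flips reads from `old` to `new` once the backfill catches up.
```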

3. Vitess / Citus Automation

  • Use managed sharding layer
  • Vitess (YouTube) or Citus (Postgres) handle resharding
  • Cost: Additional infrastructure
  • Use: Large-scale MySQL/Postgres

Interview Tip: Don’t shard until you have to. Start with Read Replicas and Caching. Sharding is the “nuclear option” because of the operational complexity it adds.

Staff Engineer Tip: The Ghost Write Hazard. When using Dual-Write for zero-downtime resharding, you must handle the order of operations strictly. If your background migration job overwrites a fresh write from the application, you create “Ghost Data”. The solution is to use Version Stamps or Vector Clocks. The NEW shard should only accept a background migration write if the row version is higher than what is already there.
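A version-stamped migration write might look like the following sketch (the row shape and the per-row integer version are illustrative assumptions):

```python
def migrate_write(new_shard: dict, key, row, version: int) -> bool:
    """Background backfill: apply only if the NEW shard has nothing newer."""
    existing = new_shard.get(key)
    if existing is not None and existing["version"] >= version:
        return False  # a fresh application dual-write already won
    new_shard[key] = {"row": row, "version": version}
    return True

new_shard = {}
migrate_write(new_shard, "u1", {"bio": "old"}, version=1)    # applied
new_shard["u1"] = {"row": {"bio": "fresh"}, "version": 5}    # live dual-write
migrate_write(new_shard, "u1", {"bio": "stale"}, version=1)  # rejected
```

The check-then-write must be atomic per row in a real system (e.g. a conditional update), otherwise the race the tip describes reappears between the read and the write.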

Mnemonic — “Hash for Writes, Range for Reads”: Choose Hash Sharding when write throughput is priority (uniform I/O distribution). Choose Range Sharding when range queries are priority (time-series, analytics). For Celebrity/Hot Keys: use Virtual Buckets (key → bucket → node) for fine-grained mobility. Order of attack: 1) Read Replicas → 2) Caching → 3) Sharding (nuclear option last).

7. Case Study: Sharding a High-Concurrency Ticket Booking System (PEDALS Framework)

P - Process Requirements

Goal: Design the database layer for a system like TicketMaster handling global concert ticket sales.

  • Core Action: Users search for events, view available seats, and book tickets.
  • Constraints: Flash sales generate massive bursts of traffic. Double-booking is catastrophic.

E - Estimate

  • Traffic: 100,000 queries per second (QPS) during a flash sale (e.g., Taylor Swift tickets).
  • Storage: Moderate. Event metadata is small, but transaction logs and user reservations grow.
  • Read/Write Ratio: 100:1 (Thousands check availability, few successfully book).

D - Data Model

We need absolute ACID guarantees for financial transactions, making SQL (PostgreSQL/MySQL) the natural choice.

  • Events Table: event_id, venue_id, date
  • Tickets Table: ticket_id, event_id, seat_number, status (Available, Locked, Booked)

A - Architecture

A single PostgreSQL instance maxes out around 10k-20k writes/sec. We need to handle 100k+ QPS. We must shard the Tickets table.

  • Primary Shard Key Decision: We cannot shard by User ID. Many users compete for seats at the same event simultaneously, so the seat locks for one event would be scattered across shards, forcing cross-shard transactions.
  • Better Shard Key: Shard by Event ID.
    • Advantage: All tickets for a specific event live on the same database node. A single node can use standard local ACID row-level locks to prevent double-booking. No distributed transactions (2PC) needed.

L - Localized Details (Handling the Hotspot)

The Celebrity Problem: Sharding by Event ID means a Taylor Swift concert (Event 99) sends 100% of its traffic to Shard 3, instantly melting it.

  • Solution: We must introduce Hybrid Sharding (Composite Key) or an In-Memory Locking Queue.
    • We can further shard hot events by Event ID + Venue Section. This distributes the locks across multiple nodes. (e.g., Floor seats on Node A, Balcony on Node B).
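The composite key is a small variation on plain hash sharding. A sketch (the event ID, section names, and shard count are hypothetical):

```python
import hashlib

def composite_shard(event_id: int, section: str, num_shards: int) -> int:
    """Spread one hot event across nodes by hashing (event_id, section)."""
    key = f"{event_id}:{section}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % num_shards

# One concert's sections can now land on different nodes:
{s: composite_shard(99, s, 8) for s in ("floor", "balcony", "mezzanine")}
```

The trade-off: a query spanning the whole event (e.g. "how many seats are left?") now becomes a scatter-gather across the sections' shards.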

S - Scale

As the system scales to handle multiple concurrent global events, we utilize Virtual Buckets.

  • Instead of hashing Event ID directly to a physical server, we hash it to one of 1,024 Virtual Buckets.
  • We run a daemon that monitors node health. If the physical node holding the Taylor Swift virtual bucket approaches 90% CPU, the system triggers a dynamic routing update, isolating that bucket on a dedicated, read-replica-heavy hardware cluster just for the duration of the 15-minute flash sale.