Dynamo: The Leaderless Revolution
[!TIP] Why this matters: Dynamo (2007) is the grandfather of modern NoSQL. It inspired Cassandra, Riak, and Voldemort. Understanding Dynamo means understanding Distributed Consistency and Availability at scale.
1. What is Dynamo?
In 2007, Amazon had a critical problem: The Shopping Cart service had to be Always Writeable.
- The Nightmare Scenario: It’s Black Friday. Traffic spikes 100x. A database node fails.
- The Constraint: You cannot stop a user from adding an item to their cart. “Sorry, we can’t take your money” is unacceptable.
- The Solution: Dynamo. A highly available, distributed key-value store that sacrifices strong consistency for availability (AP in CAP Theorem).
2. Requirements & Goals
Functional Requirements
- get(key): Returns a list of objects (due to potential conflicts) and a context.
- put(key, context, object): Writes the object to the store.
Non-Functional Requirements
- Always Writeable: Writes must succeed even if disks are failing or network partitions occur.
- Incremental Scalability: Add nodes one by one without downtime.
- Symmetry: No special “Master” node. All nodes are equal (Peer-to-Peer).
- Decentralization: No central coordinator.
3. Capacity Estimation (Context)
Dynamo was built to handle Amazon’s peak loads.
- Scale: Tens of millions of users.
- Requests: Millions of requests per second during peak.
- Latency: Measured at the 99.9th percentile (P99.9). Average latency is irrelevant; the slowest requests (stragglers) kill the user experience.
4. System APIs
Dynamo exposes a simple Key-Value interface.
```java
// Read
List<Object> get(String key)

// Write
void put(String key, Context context, Object data)
```
- Context: Contains system metadata like the Vector Clock version, essential for merging conflicts later.
- Object: An opaque binary blob (Dynamo doesn’t care what’s inside).
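The Context is what turns get/put into a read-modify-write cycle. Below is a minimal sketch of that cycle; the `DynamoClient` interface, `Versioned` record, and `CartUpdater` helper are illustrative names, not Dynamo's real client library.

```java
import java.util.List;
import java.util.function.BinaryOperator;

// Hypothetical client interface mirroring the API above; not Dynamo's actual code.
interface DynamoClient {
    Versioned get(String key);                          // may return several conflicting siblings
    void put(String key, Object context, byte[] value); // context carries vector-clock metadata
}

// The siblings plus the opaque context returned by get().
record Versioned(List<byte[]> siblings, Object context) {}

class CartUpdater {
    // Read all siblings, merge them with the new item, and write back under the same context
    // so the store can tell this put supersedes the versions we just read.
    static void addItem(DynamoClient db, String cartKey, byte[] newItem, BinaryOperator<byte[]> merge) {
        Versioned current = db.get(cartKey);
        byte[] merged = current.siblings().stream().reduce(newItem, merge);
        db.put(cartKey, current.context(), merged);
    }
}
```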
5. Database Design
Dynamo is the database. It stores data as raw bytes.
- Primary Key: Hashed (MD5) to a 128-bit identifier.
- Value: Binary object (usually < 1MB).
6. High-Level Design: The Ring
Dynamo uses Consistent Hashing to distribute data across nodes arranged in a logical ring.
Physical Server A is mapped to positions A1, A2, etc. This ensures uniform data distribution and makes adding/removing nodes easier.
A key K is stored on the first node clockwise (B1), and automatically replicated to the next N-1 distinct physical nodes (C1, B2).
- The Ring (Hash Space): A logical circular space from 0 to 2^128 - 1. Every key and every node is hashed onto this ring.
- Virtual Nodes (vNodes): Instead of one point on the ring, a physical server (e.g., Server A) is assigned multiple points (A1, A2). If a server fails, its load is distributed across many other physical servers, preventing a “cascade” failure.
- Clockwise Mapping: A key K is hashed into the ring and stored on the first node encountered moving clockwise (the Coordinator).
- Preference List (N=3): To ensure durability, the coordinator replicates the data to the next N-1 distinct physical nodes on the ring. In our diagram, if Key K lands before node B1, it is replicated to B1, C1, and B2.
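To make the clockwise lookup and the Preference List concrete, here is a minimal consistent-hashing sketch with virtual nodes. The `HashRing` class and its method names are illustrative, not Dynamo's actual implementation.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// Minimal consistent-hash ring with virtual nodes (illustrative sketch).
class HashRing {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>(); // ring position -> physical node

    // Place one physical server at several positions ("virtual nodes") on the ring.
    void addNode(String node, int vnodes) throws Exception {
        for (int i = 0; i < vnodes; i++) ring.put(hash(node + "#" + i), node);
    }

    // Preference list: the first n DISTINCT physical nodes clockwise from the key's position.
    List<String> preferenceList(String key, int n) throws Exception {
        BigInteger start = hash(key);
        List<String> walk = new ArrayList<>(ring.tailMap(start).values()); // clockwise from the key...
        walk.addAll(ring.headMap(start).values());                         // ...wrapping around the ring
        List<String> result = new ArrayList<>();
        for (String node : walk) {
            if (!result.contains(node)) result.add(node); // skip extra vnodes of a server already chosen
            if (result.size() == n) break;
        }
        return result;
    }

    // MD5 maps keys and vnodes onto the 128-bit ring space.
    private static BigInteger hash(String s) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, digest);
    }
}
```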
7. Component Design: Tunable Consistency (N, R, W)
Dynamo allows you to configure trade-offs per operation using N, R, W.
- N: Number of replicas (the length of the Preference List shown in our diagram).
- W: Write Quorum (How many nodes must acknowledge a write).
- R: Read Quorum (How many nodes must respond to a read).
The Magic Formula:
If R + W > N, the read and write quorums overlap on at least one node, so every read sees the latest acknowledged write: Strong Consistency (Quorum). If R + W ≤ N, a read can miss the newest write: Eventual Consistency (risk of stale reads).
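A tiny sketch of that rule (`QuorumConfig` is a hypothetical helper, not part of Dynamo's API): strong consistency holds only when the read and write quorums are forced to overlap.

```java
// Encodes the R/W/N rule: reads are guaranteed to overlap the latest write only when R + W > N.
record QuorumConfig(int n, int r, int w) {
    boolean isValid()              { return r >= 1 && w >= 1 && r <= n && w <= n; }
    boolean isStronglyConsistent() { return r + w > n; } // read and write quorums must overlap
}

class QuorumDemo {
    public static void main(String[] args) {
        System.out.println(new QuorumConfig(3, 2, 2).isStronglyConsistent()); // true: quorum reads/writes
        System.out.println(new QuorumConfig(3, 1, 1).isStronglyConsistent()); // false: fast, but stale reads possible
        System.out.println(new QuorumConfig(3, 3, 1).isStronglyConsistent()); // true: write-one, read-all
    }
}
```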
Interactive Decision Visualizer: Tunable Consistency
Adjust N, R, and W. Simulate a Write and a Read. See if you get the latest data or stale data.
8. Data Partitioning & Conflict Resolution
Since Dynamo is leaderless, writes can happen on any node. This leads to Conflicts.
- Scenario: User A adds “Book” to cart (v1). User B (on a different node) adds “DVD” to cart (v1).
- Result: Two versions of the cart exist.
- Vector Clocks: Dynamo uses Vector Clocks to track the “ancestry” of data.
- If Clock A < Clock B, B overwrites A.
- If A and B are concurrent (diverged), Dynamo keeps BOTH and asks the client to merge (e.g., “Book + DVD”).
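A minimal sketch of how such a comparison can work (illustrative `VectorClock` class, not Dynamo's internal code): one clock supersedes another only if every per-node counter is at least as large; otherwise the versions are concurrent siblings.

```java
import java.util.HashMap;
import java.util.Map;

class VectorClock {
    final Map<String, Integer> counters = new HashMap<>(); // node -> number of updates it coordinated

    void increment(String node) { counters.merge(node, 1, Integer::sum); }

    enum Order { BEFORE, AFTER, CONCURRENT, EQUAL }

    Order compare(VectorClock other) {
        boolean greater = false, less = false;
        Map<String, Integer> all = new HashMap<>(counters);
        other.counters.forEach((node, c) -> all.merge(node, c, Math::max));
        for (String node : all.keySet()) {
            int a = counters.getOrDefault(node, 0);
            int b = other.counters.getOrDefault(node, 0);
            if (a > b) greater = true;
            if (a < b) less = true;
        }
        if (greater && less) return Order.CONCURRENT; // diverged: keep both versions as siblings
        if (greater) return Order.AFTER;              // this clock supersedes the other
        if (less) return Order.BEFORE;                // the other clock supersedes this one
        return Order.EQUAL;
    }
}
```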
Visualizer: Vector Clock Divergence
Simulate a network partition where two nodes update the same object independently.
9. Reliability, Caching, & Load Balancing
Hinted Handoff (Temporary Failure)
If Node A is down, Node B takes the write temporarily with a “Hint”: “This belongs to A”. When A comes back, B hands it over. This ensures writes rarely fail.
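A rough sketch of the idea, assuming simple in-memory bookkeeping (the `HintedHandoff` class and its names are illustrative):

```java
import java.util.*;

class HintedHandoff {
    record Hint(String intendedOwner, String key, byte[] value) {}

    // fallback node -> writes it is holding on behalf of a down replica
    final Map<String, List<Hint>> hintsByHolder = new HashMap<>();

    void write(String key, byte[] value, String owner, String fallback, Set<String> aliveNodes) {
        if (aliveNodes.contains(owner)) {
            sendTo(owner, key, value);                         // normal replica write
        } else {
            hintsByHolder.computeIfAbsent(fallback, n -> new ArrayList<>())
                         .add(new Hint(owner, key, value));    // park the write with a hint
        }
    }

    // Called when a node rejoins: replay parked writes to their real owner, then drop the hints.
    void handOff(String recoveredNode) {
        hintsByHolder.values().forEach(hints -> hints.removeIf(h -> {
            if (!h.intendedOwner().equals(recoveredNode)) return false;
            sendTo(h.intendedOwner(), h.key(), h.value());
            return true;
        }));
    }

    private void sendTo(String node, String key, byte[] value) {
        System.out.printf("replica write: key=%s (%d bytes) -> %s%n", key, value.length, node);
    }
}
```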
Merkle Trees (Permanent Failure)
To sync data between nodes efficiently (Anti-Entropy), Dynamo uses Merkle Trees (Hash Trees).
- Instead of comparing 1TB of data bit-by-bit, they compare Root Hashes.
- If Root Hashes match, data is identical.
- If not, they recurse down the tree to find the specific differing block (minimizing data transfer).
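A minimal sketch of the tree comparison, assuming both replicas build trees over the same fixed key ranges (`MerkleNode` is an illustrative class, not Dynamo's code):

```java
import java.util.List;

class MerkleNode {
    final String keyRange;           // label for the key range this node covers
    final String hash;               // hash of all data in that range
    final List<MerkleNode> children; // empty for a leaf

    MerkleNode(String keyRange, String hash, List<MerkleNode> children) {
        this.keyRange = keyRange; this.hash = hash; this.children = children;
    }

    // Collect the leaf ranges where two replicas disagree; identical subtrees are never descended.
    static void diff(MerkleNode a, MerkleNode b, List<String> outOfSync) {
        if (a.hash.equals(b.hash)) return;                // hashes match -> whole subtree is identical
        if (a.children.isEmpty() || b.children.isEmpty()) {
            outOfSync.add(a.keyRange);                    // differing leaf: only this range is re-synced
            return;
        }
        for (int i = 0; i < a.children.size(); i++) {
            diff(a.children.get(i), b.children.get(i), outOfSync);
        }
    }
}
```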
Gossip Protocol (Membership)
How do nodes know who is alive?
- They Gossip.
- Every second, a node picks a random peer and exchanges state: “I’m alive, and I heard Node C is dead”.
- Information propagates exponentially (O(log N)).
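A toy sketch of heartbeat-style gossip (illustrative `GossipNode` class; real implementations add failure detection and anti-entropy of the membership list):

```java
import java.util.*;

class GossipNode {
    final String name;
    final Map<String, Long> membership = new HashMap<>(); // peer name -> highest heartbeat seen
    final Random rng = new Random();

    GossipNode(String name) { this.name = name; membership.put(name, 0L); }

    // One gossip round: bump my own heartbeat, then exchange tables with one random peer.
    void gossipRound(List<GossipNode> peers) {
        membership.merge(name, 1L, Long::sum);
        GossipNode peer = peers.get(rng.nextInt(peers.size()));
        exchange(peer);
    }

    // Both sides keep the freshest heartbeat they have seen for every known node;
    // a node whose heartbeat stops advancing is eventually suspected dead.
    void exchange(GossipNode other) {
        other.membership.forEach((node, beat) -> membership.merge(node, beat, Long::max));
        membership.forEach((node, beat) -> other.membership.merge(node, beat, Long::max));
    }
}
```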
10. System Walkthrough: A Write Operation (Dry Run)
Let’s trace a request to save a user’s Shopping Cart.
- Key: `cart_123`
- Value: `{"items": ["book"]}`
- Vector Clock: `[]` (initial, empty)
Step 1: Coordinator Selection
- Client sends `put(cart_123)` to a Load Balancer.
- LB forwards to any node in the cluster (e.g., Node X).
- Node X hashes the key: `MD5("cart_123") = 0x4a...`
- Node X looks up the Ring: this key belongs to Node A (the Coordinator).
- Node X forwards the request to Node A.
Step 2: Replication (N=3, W=2)
- Node A writes to its local storage.
- Node A creates the Vector Clock: `[A:1]`.
- Node A forwards the write to the next 2 nodes in the Preference List: Node B and Node C.
- Node B responds: “ACK”.
- Node C responds: “ACK”.
Step 3: Response
- Node A receives 2 ACKs (meeting W=2).
- Node A responds to the Client: `200 OK`.
Step 4: Conflict (The “Split Brain”)
- Network partition happens. Node A cannot talk to Node B.
- Client 2 talks to Node B to add “DVD”.
- Node B cannot talk to A, but it accepts the write (Always Writeable).
- Node B updates its Vector Clock: `[B:1]`.
- Result: We now have two versions: `[A:1]` (Book) and `[B:1]` (DVD).
Step 5: Read Repair (Merge)
- Network heals. Client reads `cart_123`.
- Client receives both versions: `{ "val": ["book"], "vc": "[A:1]" }` and `{ "val": ["dvd"], "vc": "[B:1]" }`.
- Client logic merges them: `["book", "dvd"]`.
- Client sends a `put` with the merged Vector Clock `[A:1, B:1]`.
- System converges.
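A small sketch of the client-side merge in Step 5 (illustrative code): union the cart contents and combine the vector clocks by taking the maximum counter per node, so the merged write `[A:1, B:1]` supersedes both siblings.

```java
import java.util.*;

class CartMerge {
    public static void main(String[] args) {
        Set<String> cartA = new TreeSet<>(List.of("book"));    // sibling with clock [A:1]
        Set<String> cartB = new TreeSet<>(List.of("dvd"));     // sibling with clock [B:1]
        Map<String, Integer> clockA = Map.of("A", 1);
        Map<String, Integer> clockB = Map.of("B", 1);

        Set<String> mergedCart = new TreeSet<>(cartA);
        mergedCart.addAll(cartB);                               // ["book", "dvd"]

        Map<String, Integer> mergedClock = new TreeMap<>(clockA);
        clockB.forEach((node, c) -> mergedClock.merge(node, c, Math::max)); // [A:1, B:1]

        System.out.println(mergedCart + " " + mergedClock);    // [book, dvd] {A=1, B=1}
    }
}
```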
11. Requirements Traceability Matrix
| Requirement | Design Choice |
|---|---|
| Always Writeable | Hinted Handoff + Tunable Consistency (W < N). |
| Incremental Scale | Consistent Hashing (vNodes) allows adding nodes with minimal reshuffling. |
| Decentralization | Leaderless Replication + Gossip Protocol. |
| Conflict Resolution | Vector Clocks (Client-side merging). |
| Performance (P99.9) | Zero-hop routing (client library knows the ring) avoids an extra network hop versus Coordinator-based routing. |
| Durability | Replication (N=3) across distinct physical racks. |
12. Interview Gauntlet
- What happens if N < W?
- Impossible configuration. You can’t require more writes than replicas available.
- What is the difference between Vector Clocks and Lamport Timestamps?
- Lamport timestamps give a single counter that can impose an ordering but cannot tell causality from coincidence. Vector Clocks detect concurrency (A and B happened independently), which is exactly what conflict detection needs.
- Why use Merkle Trees?
- To minimize bandwidth during data synchronization. We only transfer the hashes that differ.
- How does Dynamo handle “Hot Keys”?
- Consistent Hashing spreads keys evenly, but a single hot ("whale") key still maps to one preference list. Mitigations are mostly application-level (caching the key or splitting it into sub-keys) rather than part of the core design.
- What is “Sloppy Quorum”?
- If the Preference List nodes are down, Dynamo writes to any healthy node (Hinted Handoff), breaking the strict quorum membership to preserve availability.
- Why MD5?
- It’s fast and distributes uniformly. Collisions don’t matter much for distribution buckets.
- Push vs Pull Gossip?
- Dynamo uses both. Push to announce life, Pull to sync membership lists.
- What is the “Anti-Entropy” process?
- The background process that uses Merkle Trees to detect and fix data inconsistencies between replicas.
- Why P2P over Master-Slave?
- A master is a SPOF (Single Point of Failure) and a write bottleneck. In a P2P design every node can coordinate writes, so write capacity grows with the cluster.
- Explain “Read Repair”.
- When a client reads data, it fetches from N nodes. If one node returns stale data, the coordinator updates it immediately with the latest version.
13. Summary: The Whiteboard Strategy
1. Core Concepts
- Consistent Hashing: Ring topology with vNodes.
- Leaderless: Any node can coordinate a write.
- CAP Theorem: AP (Availability + Partition Tolerance).
- Gossip: Membership protocol.
2. Architecture Ring
A -> B -> C -> D
Replication: N=3 (A, B, C)
Request -> LB -> Coordinator
* Coordinator: First node in preference list.
* Sloppy Quorum: Write to D if A is down.
3. Handling Conflicts
[A:1] < [A:2] (Overwrite)
[A:1] vs [B:1] (Conflict -> Client Merge)
Read Repair: Fix stale replicas on read.
4. Trade-offs (N, R, W)
- R + W > N: Strong Consistency (Slow).
- R + W ≤ N: Eventual Consistency (Fast).
- Hinted Handoff: High Availability but risk of data loss if hinted node crashes before handoff.