Chubby (Distributed Lock Service)
[!TIP] The Referee: In distributed systems, nodes constantly disagree. “I’m the Master!” “No, I am!” Chubby is the arbiter. It uses Paxos to provide a consistent source of truth. It is the spiritual ancestor of ZooKeeper and etcd.
1. The Problem: Who is in Charge?
In GFS and BigTable, we have a Single Master. But what if that machine dies?
- We need to elect a new Master.
- We need to ensure the old Master knows it’s fired (so it doesn’t corrupt data).
- Split Brain: The nightmare scenario where two nodes both think they are Master and both write to the disk.
2. Coarse-Grained Locking
Chubby is optimized for Coarse-Grained locks (held for hours/days), not Fine-Grained locks (seconds).
- Why? Running Paxos for every database transaction is too slow.
- Pattern: Clients use Chubby to elect a Leader (heavy operation, infrequent), and then the Leader manages the data (fast, frequent).
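A minimal sketch of that pattern, using a hypothetical Chubby client API (`open`, `acquire`, `get_sequencer`); the names and path are illustrative, not the real Chubby library:

```python
# Sketch of the coarse-grained "lock once, lead for a long time" pattern.
# The chubby client API used here is hypothetical and only for illustration.

def run_as_potential_leader(chubby, do_leader_work):
    # Every candidate opens the same well-known lock file.
    lock = chubby.open("/ls/cell/service/leader-lock")

    # Block until this process wins the election by acquiring the lock.
    lock.acquire(exclusive=True)

    # The lock is held for the leader's whole lifetime (hours or days),
    # so consensus runs once per election, not once per request.
    sequencer = lock.get_sequencer()   # fencing token -- see section 5
    do_leader_work(sequencer)          # serve traffic directly, bypassing Chubby
```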
3. Architecture: The Cell
A Chubby Cell is a fault-tolerant cluster that uses Paxos to maintain a consistent filesystem state across replicas.
- 5 Replicas: A typical cell has 5 nodes, so a Paxos quorum (3 nodes) survives up to 2 failures.
- Chubby Master: All client RPCs (reads and writes) go to the Master. The Master acts as the Paxos Proposer.
- Strict Consistency & Invalidation: To ensure all clients see the same data, the Master implements a Cache Invalidation flow (sketched after this list):
- Write Request: A client sends a write.
- Invalidate: Before committing, the Master sends invalidation signals to all clients caching that file.
- Ack: The Master waits for all acknowledgements before allowing the write to proceed.
- Trade-off: This makes writes slow but allows for instant, local reads for the majority of clients.
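A minimal sketch of that invalidate-before-commit write path, assuming an in-memory master and a hypothetical session object (`send_invalidate`, `wait_for_ack_or_expiry`); illustrative only, not Chubby's actual implementation:

```python
# Sketch of the write path: invalidate every cacher, wait for acks, then commit.
# Data structures and method names are illustrative, not Chubby's real code.

class ChubbyMaster:
    def __init__(self):
        self.files = {}      # path -> content
        self.cachers = {}    # path -> set of client sessions caching that path

    def write(self, path, new_content):
        # 1. Invalidate: tell every client caching this file to drop its copy.
        pending = set(self.cachers.get(path, set()))
        for session in pending:
            session.send_invalidate(path)

        # 2. Ack: block until every cacher acks (or its session expires).
        for session in pending:
            session.wait_for_ack_or_expiry(path)

        # 3. Commit: no client can now serve the old value from its cache.
        self.files[path] = new_content
        self.cachers[path] = set()   # caches repopulate on the next read
```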
4. Sessions and KeepAlives
Chubby uses Sessions and a KeepAlive heartbeat to manage client health.
- Handshake: Client connects, gets a Session ID.
- KeepAlive: Client sends a heartbeat every few seconds.
- Jeopardy: If the Master stops replying (perhaps it crashed), the client enters “Jeopardy” and waits out a Grace Period (e.g., 45s); see the sketch after this list.
- If a new Master is elected within 45s, the session is saved.
- If not, the session (and all locks) are lost.
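A rough sketch of the client-side session loop, with illustrative method and attribute names (`handshake`, `keep_alive`, `wait_for_new_master`) rather than the real Chubby client API:

```python
import time

# Sketch of the session / KeepAlive / jeopardy behaviour described above.
# Method names and timeout values are illustrative assumptions.

GRACE_SECONDS = 45

def session_loop(client):
    session = client.handshake()            # returns a session ID plus a lease
    lease_expiry = time.monotonic() + session.lease_seconds

    while True:
        try:
            # The KeepAlive RPC is held at the Master until the lease is nearly
            # up; the reply extends the lease.
            reply = client.keep_alive(session,
                                      timeout=lease_expiry - time.monotonic())
            lease_expiry = time.monotonic() + reply.lease_extension_seconds
        except TimeoutError:
            # No reply: enter "jeopardy" and wait out the grace period, hoping
            # a new Master is elected and honours the old session.
            if client.wait_for_new_master(session, GRACE_SECONDS):
                continue                          # session saved, keep going
            client.drop_all_locks_and_handles()   # session (and locks) lost
            return
```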
5. Preventing Split Brain: Sequencers
The most dangerous problem in distributed systems is Split Brain.
- The Scenario: Client A acquires the lock. It freezes (Stop-the-World GC) for 1 minute.
- The Result: Chubby thinks A is dead. It gives the lock to Client B.
- The Conflict: Client A wakes up, thinking it still owns the lock, and tries to write to the database (GFS). If GFS accepts it, data is corrupted.
The Solution: Fencing Tokens (Sequencers)
- Chubby attaches a Monotonic Version Number to every lock (e.g., Token 10).
- When Client B takes the lock, the version increments (Token 11).
- When Client A (Zombie) tries to write with Token 10, the storage system (GFS) checks:
if (10 < 11) REJECT.
Example: Fencing the Zombie
Walk through a Zombie Client trying to write with an old token:
- Storage: tracks the Highest Token seen so far (currently 10).
- Client B: Active Leader (Token 11).
- Client A: Zombie Leader (Token 10).
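A minimal sketch of the storage-side check, with a small in-memory store standing in for GFS:

```python
# Sketch of fencing-token enforcement at the storage layer.

class FencedStore:
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        # Reject any writer holding a token older than one already seen.
        if token < self.highest_token:
            raise PermissionError(f"stale token {token} < {self.highest_token}")
        self.highest_token = token      # tokens only move forward
        self.data[key] = value

store = FencedStore()
store.write(11, "row", "written by Client B")      # active leader: accepted
try:
    store.write(10, "row", "written by Client A")  # zombie with old token
except PermissionError as err:
    print("rejected:", err)                        # stale token 10 < 11
```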
6. Consensus Deep Dive: Paxos vs Raft
Chubby (and Google) famously uses Paxos, while modern open-source systems like etcd use Raft.
Why Raft?
Paxos is notoriously difficult to understand and implement correctly. Raft was designed specifically to be Understandable.
- Leader Election: Raft builds a strong leader into the protocol. Basic Paxos has no built-in leader; practical Multi-Paxos deployments elect one anyway.
- Log Replication: Raft enforces a stricter log model: entries flow only from the Leader to followers, and the Leader never overwrites its own log.
Comparison Table
| Feature | Chubby (Paxos) | ZooKeeper (ZAB) | etcd (Raft) |
|:---|:---|:---|:---|
| Algorithm | Paxos (Complex) | ZAB (Atomic Broadcast) | Raft (Understandable) |
| Abstraction | File System (Files) | Directory (ZNodes) | Key-Value Store |
| Caching | Heavy (Client Invalidation) | None (Read from Replicas) | None (Read from Leader/Follower) |
| Consistency | Strict (Linearizable) | Sequential (Can be stale) | Strict (Linearizable) |
7. Observability & Tracing
Chubby is critical infrastructure. If it goes down, Google goes down.
RED Method
- Rate: KeepAlive RPCs/sec.
- Errors: Session Lost count. This is a critical alert.
- Duration: Paxos Round Latency (Commit Time). High latency usually points to disk issues on the Master.
Health Checks
- Quorum Size: Alert if < 3 replicas are alive.
- Snapshot Age: Alert if DB snapshot is older than 1 hour.
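The checks above can be expressed as simple threshold rules. A minimal sketch, with made-up metric names and an illustrative latency threshold (the text does not specify one):

```python
# Sketch of the RED / health-check alert rules as threshold checks.
# Metric names are assumptions; thresholds mirror the text where it gives them.

def evaluate_alerts(metrics):
    alerts = []
    if metrics["sessions_lost_per_min"] > 0:              # Errors
        alerts.append("CRITICAL: clients are losing Chubby sessions")
    if metrics["paxos_commit_latency_p99_ms"] > 500:      # Duration (illustrative cutoff)
        alerts.append("WARN: slow Paxos commits -- check the Master's disk")
    if metrics["live_replicas"] < 3:                      # Quorum size
        alerts.append("CRITICAL: quorum at risk (<3 replicas alive)")
    if metrics["snapshot_age_seconds"] > 3600:            # Snapshot age
        alerts.append("WARN: DB snapshot older than 1 hour")
    return alerts

print(evaluate_alerts({
    "sessions_lost_per_min": 0,
    "paxos_commit_latency_p99_ms": 40,
    "live_replicas": 5,
    "snapshot_age_seconds": 600,
}))   # -> [] for a healthy cell
```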
8. Deployment Strategy
Updating the Lock Service is scary. We use Cell Migration.
- New Binary: Deploy to 1 replica.
- Wait: Ensure it rejoins quorum and catches up.
- Repeat: Deploy to remaining replicas one by one.
- Master Failover: Force a master election to ensure the new binary can lead.
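A minimal sketch of that rollout loop; the helper functions are stubs standing in for real deployment tooling:

```python
# Sketch of the one-replica-at-a-time upgrade described above.
# The helpers are hypothetical stand-ins for real deployment tooling.

def deploy_binary(replica, binary):
    print(f"pushing {binary} to {replica}")

def wait_until_caught_up(replica):
    print(f"{replica} rejoined the quorum and replayed the log")

def force_master_election():
    print("forcing a Master failover onto an upgraded replica")

def rolling_upgrade(replicas, new_binary):
    for replica in replicas:              # one replica at a time
        deploy_binary(replica, new_binary)
        wait_until_caught_up(replica)     # never take a second node down early
    force_master_election()               # prove the new binary can lead

rolling_upgrade([f"replica-{i}" for i in range(1, 6)], "chubby-v2")
```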
9. Requirements Traceability Matrix
| Requirement | Architectural Solution |
|---|---|
| Consistency | Paxos Consensus Algorithm. |
| Availability | 5-Node Cell (survives 2 failures) + Master Failover. |
| Scalability | Heavy Client-Side Caching (Invalidation). |
| Correctness | Fencing Tokens to prevent Split Brain. |
10. Interview Gauntlet
I. Consensus
- Why Paxos? To agree on values (lock ownership) even if nodes fail.
- Why 5 nodes? To survive 2 failures. 3 nodes only survive 1.
- What is a Proposer? The node that suggests a value. In Chubby, only the Master proposes.
II. Caching
- Why Invalidation? To ensure Strict Consistency. A client cannot read stale data.
- What if a client misses an Invalidation? The Master waits for Acks. If a client never acks, the Master lets that client’s session expire before committing the write.
III. System Design
- Can I use Chubby for a high-speed queue? No. Writes are slow (Paxos). Use Kafka.
- What is “Jeopardy”? A state where the client has lost contact with the Master but the Session is still valid (Grace Period).
11. Summary: The Whiteboard Strategy
If asked to design ZooKeeper/Chubby, draw this 4-Quadrant Layout:
1. Requirements
- Func: Lock, Elect Leader, Store Config.
- Non-Func: CP (Consistency), High Reliability.
- Scale: Small Data, Read Heavy.
2. Architecture
      [Clients]
          |
  [Master (Proposer)]
     /    |    \
[Replica] [Replica] [Replica]
- Consensus: Paxos/Raft.
- Session: KeepAlives.
3. Metadata
Content: "10.0.0.1:8080"
Stat: {version: 10, ephemeral: true}
4. Reliability
- Split Brain: Fencing Tokens (Sequencers).
- Performance: Client Caching.
- Availability: Session Grace Periods.
This concludes the Deep Dive into Infrastructure. Now, let’s learn how to operate these monsters in Module 17: Ops Excellence.