Cassandra: The Best of Both Worlds
[!TIP] Hybrid DNA: Cassandra was born at Facebook (2008) for their Inbox Search. It took the Data Model from Google BigTable (Wide Column) and the Distributed Architecture from Amazon Dynamo (Leaderless Ring).
1. What is Cassandra?
Cassandra is a Wide Column Store. It is designed for handling massive amounts of data across many commodity servers, providing high availability via a decentralized Gossip Mesh (as shown in our architecture diagram).
- Write-Optimized: It can sustain millions of writes per second across a cluster thanks to an append-only, sequential write path (the LSM tree, covered below).
- Linearly Scalable: Doubling the number of nodes roughly doubles throughput.
- Tunable Consistency: You decide per-query if you want Speed (AP) or Consistency (CP).
2. Requirements & Goals
- High Availability: No master node. Any node can handle any request.
- Partition Tolerance: Works across multiple data centers (Multi-Region Active-Active).
- Eventual Consistency: Data will eventually sync across all replicas.
- Flexible Schema: Add columns on the fly.
3. Data Model: The Wide Column
It’s not a relational database, and it’s not just a Key-Value store.
- Think: A 2-Dimensional Key-Value store.
- Row Key (Partition Key): Determines which node holds the data.
- Clustering Key (Sort Key): Determines the order of data within that node.
Visualizing the Data Model
Unlike SQL tables (fixed columns), a Wide Column row is a sorted map of columns.
Logical View:
```
// Partition Key: "User_123"
{
  "User_123": {
    // Clustering Key (Time): Column Value
    "2023-01-01 10:00": "Hello World",
    "2023-01-01 10:05": "How are you?",
    "2023-01-02 09:00": "Good Morning"
  }
}
```
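To make the "2-D map" idea concrete, here is a minimal Python sketch (the `WideColumnStore` class and its methods are illustrative, not Cassandra internals) of a store that keeps every partition's columns sorted by clustering key:

```python
from bisect import insort
from typing import Any

class WideColumnStore:
    """Toy wide-column layout: partition key -> columns sorted by clustering key."""

    def __init__(self) -> None:
        # partition key -> list of (clustering_key, value), kept in sorted order
        self.partitions: dict[str, list[tuple[str, Any]]] = {}

    def write(self, partition_key: str, clustering_key: str, value: Any) -> None:
        row = self.partitions.setdefault(partition_key, [])
        insort(row, (clustering_key, value))  # columns stay ordered on insert

    def read_range(self, partition_key: str, start: str, end: str) -> list[tuple[str, Any]]:
        # A range scan inside ONE partition is cheap: the columns are already ordered.
        return [(k, v) for k, v in self.partitions.get(partition_key, []) if start <= k <= end]

store = WideColumnStore()
store.write("User_123", "2023-01-01 10:00", "Hello World")
store.write("User_123", "2023-01-02 09:00", "Good Morning")
store.write("User_123", "2023-01-01 10:05", "How are you?")
print(store.read_range("User_123", "2023-01-01 00:00", "2023-01-01 23:59"))
```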
4. System APIs: CQL
Cassandra uses CQL (Cassandra Query Language), which looks like SQL but behaves differently.
Primary Key Anatomy
The PRIMARY KEY is composed of:
- Partition Key: Responsible for data distribution (Consistent Hashing; see the sketch at the end of this section).
- Clustering Key: Responsible for data sorting (on disk).
| Feature | Partition Key | Clustering Key |
|---|---|---|
| Purpose | Distribution (Which Node?) | Sorting (Which Order?) |
| Analogy | The “File Cabinet” | The “Folder” inside the cabinet |
| Example | channel_id | created_at |
| Query Rule | MUST be provided for fast lookups | Can be used for Range Scans |
```sql
CREATE TABLE messages (
    channel_id  uuid,       -- Partition Key
    created_at  timestamp,  -- Clustering Key
    message_id  uuid,
    content     text,
    PRIMARY KEY ((channel_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
```
[!WARNING] Query Anti-Pattern:
`SELECT * FROM messages WHERE created_at > '2023-01-01'` without a `channel_id`. This forces Cassandra to scan ALL nodes (Scatter-Gather), which is incredibly slow. Always query by Partition Key first!
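For intuition on how the Partition Key drives distribution, here is a rough consistent-hashing sketch. The token positions, node names, and the use of MD5 are stand-ins for illustration; real Cassandra uses Murmur3 tokens, vnodes, and the replication strategy to pick replicas:

```python
import hashlib
from bisect import bisect_right

# Hypothetical ring: three nodes, each owning one token range.
RING = [(0, "NodeA"), (2**127 // 3, "NodeB"), (2 * 2**127 // 3, "NodeC")]
TOKENS = [token for token, _ in RING]

def token_for(partition_key: str) -> int:
    # Hash the partition key to a ring position (MD5 here; Murmur3 in Cassandra).
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16) % 2**127

def replicas_for(partition_key: str, rf: int = 3) -> list[str]:
    # Walk clockwise from the key's token and take the next `rf` nodes.
    start = bisect_right(TOKENS, token_for(partition_key)) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]

print(replicas_for("channel-42"))  # every query for this channel lands on the same replicas
```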
5. Cassandra Architecture
Distributed Architecture: Peer-to-Peer Mesh and Local Storage Pipeline.
6. The Write Path: LSM Tree (Log-Structured Merge Tree)
Cassandra is optimized for Heavy Writes. It never overwrites data in place (unlike MySQL/B-Trees). This is achieved through the Internal Storage Pipeline visualized above:
- Commit Log: Append-only file on disk for durability.
- MemTable: In-memory sorted structure (RAM).
- Flush: When MemTable is full, it flushes to disk as an SSTable.
- SSTable: Immutable disk files.
Why is this fast?
- No random I/O. Everything is a Sequential Write.
- Disk heads don't need to seek; they just keep appending (and SSDs likewise prefer sequential writes).
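Below is a minimal sketch of that pipeline, assuming a toy commit-log file and a deliberately tiny MemTable threshold so the flush is visible; none of the names correspond to real Cassandra configuration:

```python
import json
import time

COMMIT_LOG = "commitlog.log"   # illustrative append-only durability log
MEMTABLE_LIMIT = 3             # tiny threshold so the flush happens quickly
memtable: dict[str, tuple[float, str]] = {}
sstables: list[dict] = []      # stands in for immutable files on disk

def write(key: str, value: str) -> None:
    record = {"key": key, "value": value, "ts": time.time()}
    with open(COMMIT_LOG, "a") as log:      # 1. sequential append for durability
        log.write(json.dumps(record) + "\n")
    memtable[key] = (record["ts"], value)   # 2. update the in-memory table
    if len(memtable) >= MEMTABLE_LIMIT:     # 3. flush when the MemTable is full
        flush()

def flush() -> None:
    # 4. write one immutable, key-sorted SSTable and start a fresh MemTable
    sstables.append({k: memtable[k] for k in sorted(memtable)})
    memtable.clear()

for i in range(5):
    write(f"msg-{i}", f"payload {i}")
print(f"{len(sstables)} SSTable(s) on disk, {len(memtable)} entries still in RAM")
```

Note that every disk operation here is an append or a bulk write of a sorted file; nothing is updated in place.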
7. The Read Path & Reliability
Reading is harder than writing because data might be fragmented.
- Check MemTable: Is it in RAM?
- Check Row Cache: Is the row cached?
- Check Bloom Filter: Use the bit-array guard (seen in the diagram) to skip SSTables that definitely don’t have the key.
- Check SSTables: Read relevant files and Merge the data (Latest timestamp wins).
This is why Cassandra reads are slower than writes.
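A sketch of that lookup order, with a hand-rolled Bloom filter standing in for Cassandra's (the data layout, a `{key: (timestamp, value)}` dict per SSTable, is simplified):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: can say 'definitely absent', never gives a false negative."""
    def __init__(self, size: int = 1024, hashes: int = 3) -> None:
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key: str):
        for i in range(self.hashes):
            yield int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16) % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

def read(key, memtable, sstables):
    """memtable: {key: (ts, value)}; sstables: list of (BloomFilter, {key: (ts, value)})."""
    versions = []
    if key in memtable:                       # 1. the newest data usually lives in RAM
        versions.append(memtable[key])
    for bloom, table in sstables:             # 2./3. skip files the Bloom filter rules out
        if bloom.might_contain(key) and key in table:
            versions.append(table[key])
    return max(versions)[1] if versions else None   # 4. latest timestamp wins
```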
Compaction Strategies
Over time, you have 1000s of SSTables. Reads become slow. Compaction merges these files.
| Strategy | Use Case | Description |
|---|---|---|
| Size Tiered (STCS) | Write-Heavy | Merges ~4 similar-sized SSTables into one larger one. Good for inserts. High space overhead. |
| Leveled (LCS) | Read-Heavy | Keeps data sorted in “Levels” (L0, L1…). Guarantees a key exists in only ONE file per level. Fast reads. |
| Time Window (TWCS) | Time Series | Groups data by time buckets (e.g., 1 day). Old buckets expire cheaply. |
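As a toy illustration of the size-tiered idea, the sketch below buckets SSTables into rough size bands and proposes a merge once a band holds four files (4 is the real STCS default min_threshold; everything else is invented):

```python
from collections import defaultdict

MIN_THRESHOLD = 4  # STCS default: compact once ~4 similarly sized SSTables exist

def size_tiers(sstable_sizes_mb: list[int]) -> dict[int, list[int]]:
    """Group SSTables whose sizes fall into the same power-of-two band."""
    tiers: dict[int, list[int]] = defaultdict(list)
    for size in sstable_sizes_mb:
        tiers[size.bit_length()].append(size)  # crude 'similar size' bucketing
    return tiers

def pick_compaction(sstable_sizes_mb: list[int]):
    for bucket in size_tiers(sstable_sizes_mb).values():
        if len(bucket) >= MIN_THRESHOLD:
            return bucket  # these files would be merged into one larger SSTable
    return None

print(pick_compaction([10, 12, 11, 9, 300]))  # -> the four ~10 MB tables, not the 300 MB one
```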
The Mystery of Deletion (Tombstones)
In LSM Trees, you cannot “delete” a line from an immutable SSTable file.
- Deletion: Instead of deleting, Cassandra writes a Tombstone (a marker saying “Key X is Dead”).
- Read: If you read X, Cassandra sees the Tombstone (with a newer timestamp) and returns “Not Found”.
- Compaction: The data is actually deleted only when Compaction runs and merges the Tombstone with the original data.
Visualizing Compaction & Tombstones:
```mermaid
graph TD
    subgraph "Before Compaction"
        A["SSTable 1: {Key: A, Val: 10}"]
        B["SSTable 2: {Key: A, Val: Tombstone}"]
    end
    A --> C{Compaction Process}
    B --> C
    C --> D["New SSTable: {Key: A is GONE}"]
    style B fill:#f85149,color:white
    style D fill:#238636,color:white
```
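A sketch of what the merge does with that tombstone: keep only the newest version of each key, then drop keys whose surviving version is a tombstone (real Cassandra only purges tombstones after gc_grace_seconds, which this toy skips):

```python
TOMBSTONE = object()  # sentinel meaning "this key was deleted"

def compact(*sstables: dict) -> dict:
    """Merge SSTables of {key: (timestamp, value)}: newest timestamp wins,
    and keys whose newest version is a tombstone disappear entirely."""
    merged: dict = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return {k: v for k, v in merged.items() if v[1] is not TOMBSTONE}

sstable_1 = {"A": (100, 10)}          # original write: A = 10
sstable_2 = {"A": (200, TOMBSTONE)}   # later delete: tombstone for A
print(compact(sstable_1, sstable_2))  # -> {}  (A is gone after compaction)
```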
8. Trade-offs: Write vs Read Amplification
In Storage Engines, you can’t have it all.
| Metric | Definition | High in… |
|---|---|---|
| Write Amplification | 1 user write = X disk writes | B-Trees (Random updates) or Leveled Compaction (Constant re-sorting). |
| Read Amplification | 1 user read = Y disk reads | LSM Trees (Must check MemTable + multiple SSTables). |
| Space Amplification | Data stored / Actual data size | Size Tiered Compaction (Needs 50% temp space). |
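A back-of-the-envelope example of how these metrics are computed; all the byte counts below are invented purely to show the arithmetic:

```python
GiB = 2**30

# Hypothetical one-day totals for a single node.
user_bytes_written = 10 * GiB   # written by the application
disk_bytes_written = 60 * GiB   # commit log + flushes + compaction rewrites
user_bytes_read    = 8 * GiB
disk_bytes_read    = 40 * GiB   # MemTable misses touching several SSTables
live_data_on_disk  = 25 * GiB   # includes duplicates waiting for compaction
actual_data_size   = 15 * GiB

print("write amplification:", disk_bytes_written / user_bytes_written)  # 6.0x
print("read amplification:", disk_bytes_read / user_bytes_read)         # 5.0x
print("space amplification:", live_data_on_disk / actual_data_size)     # ~1.67x
```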
9. System Walkthrough: The Life of a Query
A Write Operation (INSERT)
Query: `INSERT INTO messages (channel_id, created_at, content) VALUES ('uuid-1', '2023-01-01 10:00', 'Hello');`
- Coordinator Node: The client hits any node; that node acts as the Coordinator for the request.
- Partitioner: Hashes `uuid-1` to find the replicas (e.g., Node A, Node B, Node C).
- Local Write (Node A):
  - Append to the CommitLog (Disk).
  - Add to the MemTable (RAM).
- Replication: The Coordinator sends the write to Node B and Node C concurrently.
- Quorum Check: If `Consistency=QUORUM` (N=3), wait for 2 ACKs.
- Response: Return `200 OK` to the client.
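A sketch of the Coordinator's quorum bookkeeping for that write. The replica call is a stub, and real coordinators send the requests asynchronously with timeouts, but the ack-counting logic is the point:

```python
import random

def send_to_replica(node: str, record: dict) -> bool:
    # Stub for the real network call (CommitLog append + MemTable insert on the replica).
    return random.random() > 0.1  # pretend ~10% of replicas are slow or down

def coordinate_write(record: dict, replicas: list[str], consistency: str = "QUORUM") -> str:
    required = {"ONE": 1, "QUORUM": len(replicas) // 2 + 1, "ALL": len(replicas)}[consistency]
    acks = sum(send_to_replica(node, record) for node in replicas)  # write goes to ALL replicas
    # The Coordinator only waits for `required` acks; the rest finish in the background.
    return "200 OK" if acks >= required else "UnavailableException"

print(coordinate_write({"key": "uuid-1", "value": "Hello"}, ["NodeA", "NodeB", "NodeC"]))
```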
A Read Operation (SELECT)
Query: `SELECT * FROM messages WHERE channel_id='uuid-1';`
- Coordinator Node: Client hits any node.
- Partitioner: Finds replicas (A, B, C).
- Digest Reads: Send a "Full Read" request to the closest node (A) and "Digest Read" (hash check) requests to the others (B, C).
- Local Read (Node A):
  - Check the MemTable (found? return).
  - Check the Bloom Filter (maybe in SSTable 1, 2?).
  - Scan SSTable 1 and SSTable 2.
  - Merge the results (latest timestamp wins).
- Response: Return data to Coordinator.
- Consistency Check: The Coordinator compares the data from A with the hash digests from B and C.
  - If they mismatch, trigger a Read Repair (write the latest data to the stale nodes).
- Final Response: Return data to Client.
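A sketch of the digest comparison and repair step. The per-node data is modelled as `{key: (timestamp, value)}` dicts and the hashing is plain SHA-256, all of which is illustrative:

```python
import hashlib

def digest(record) -> str:
    return hashlib.sha256(repr(record).encode()).hexdigest()

def coordinate_read(key: str, replicas: dict[str, dict]) -> str:
    nodes = list(replicas)
    full_node, digest_nodes = nodes[0], nodes[1:]   # full read from the "closest" replica
    data = replicas[full_node][key]
    stale = [n for n in digest_nodes
             if digest(replicas[n].get(key, (0, None))) != digest(data)]
    for node in stale:                              # read repair: push the newest version
        newest = max(data, replicas[node].get(key, (0, None)))
        replicas[node][key] = newest
        data = newest
    return data[1]

cluster = {
    "NodeA": {"uuid-1": (200, "Hello v2")},
    "NodeB": {"uuid-1": (200, "Hello v2")},
    "NodeC": {"uuid-1": (100, "Hello v1")},   # stale replica
}
print(coordinate_read("uuid-1", cluster))     # "Hello v2", and NodeC gets repaired
```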
10. Requirements Traceability Matrix
| Requirement | Architectural Solution |
|---|---|
| Massive Writes | LSM Tree (Sequential Writes) + CommitLog. |
| High Availability | Leaderless Replication + Gossip Protocol. |
| Multi-Region | NetworkTopologyStrategy (Aware of Data Centers). |
| Flexible Schema | Wide Column Model (Map<Col, Val>). |
| Tunable Consistency | Per-query CL=ONE/QUORUM/ALL. |
| Disk Failure | Replication Factor (RF=3) prevents data loss. |
11. Interview Gauntlet
- Why are Cassandra writes faster than reads?
- Writes are append-only (Sequential I/O). Reads require merging data from the MemTable and multiple SSTables (Random I/O).
- What is a Tombstone?
- A marker that indicates a row is deleted. It suppresses older data during reads.
- What happens if you have too many Tombstones?
- Reads slow down because Cassandra must scan past every tombstone, and the query is aborted with a TombstoneOverwhelmingException once a single read scans more than the tombstone failure threshold (100,000 by default).
- Explain Leveled Compaction (LCS).
- Optimized for reads. Creates “Levels” where each key exists in exactly one SSTable per level.
- How does Cassandra handle “Hot Partitions”?
- Poorly. You must design your Partition Key to spread data evenly. If a partition gets too big (>100MB), performance degrades.
- Why use a Bloom Filter?
- To avoid hitting disk for keys that don’t exist in an SSTable.
- What is the “Coordinator Node”?
- The node that receives the client request. It acts as a proxy for that specific request.
- Consistency Level: ONE vs QUORUM?
- ONE: Fast, High Availability, potential Stale Read. QUORUM: Strong Consistency, lower availability (needs majority).
- What is “Hinted Handoff”?
- If a replica is down, the Coordinator stores the write locally (a "hint") and replays it when the replica returns (see the sketch after this list).
- Does Cassandra support ACID transactions?
- No (mostly). It supports “Lightweight Transactions” (LWT) using Paxos (CAS), but they are slow.
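A sketch of the hinted-handoff idea from the list above: the Coordinator buffers writes destined for a dead replica and replays them when gossip reports the node alive again (structure and names are illustrative):

```python
from collections import defaultdict

hints: dict[str, list[dict]] = defaultdict(list)   # target node -> buffered writes

def write_with_hints(record: dict, replica_status: dict[str, bool]) -> None:
    """replica_status: node name -> is the node currently up?"""
    for node, is_up in replica_status.items():
        if is_up:
            print(f"applied on {node}: {record}")
        else:
            hints[node].append(record)             # store a hint instead of failing the write

def on_node_recovered(node: str) -> None:
    # Gossip reports the node is back: replay everything buffered for it.
    for record in hints.pop(node, []):
        print(f"replayed on {node}: {record}")

write_with_hints({"key": "uuid-1", "value": "Hello"},
                 {"NodeA": True, "NodeB": True, "NodeC": False})
on_node_recovered("NodeC")
```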
12. Summary: The Whiteboard Strategy
1. Core Concepts
- Wide Column: 2D Map (Row -> Col -> Val).
- LSM Tree: Log-Structured Merge Tree (Write Heavy).
- Gossip: Peer-to-Peer state exchange.
- Tunable: AP or CP per request.
2. Write Path (LSM)
```
Write --> CommitLog (Disk, Append-Only)
   |
   v
MemTable (RAM)
   | (Flush)
   v
SSTables (Disk, Immutable)
```
* Compaction: Merges SSTables to clean up.
* Bloom Filter: Skips irrelevant SSTables.
3. Primary Key
- Partition Key: Which Node? (Hashing)
- Clustering Key: Which Order? (Sorting)
4. Critical Gotchas
- Tombstones: Deletes are expensive (markers).
- Partition Sizing: Keep partitions < 100MB.
- Query Model: Must query by Partition Key. No Joins.