Cassandra: The Best of Both Worlds
[!TIP] Hybrid DNA: Cassandra was born at Facebook (2008) for their Inbox Search. It took the Data Model from Google BigTable (Wide Column) and the Distributed Architecture from Amazon Dynamo (Leaderless Ring).
1. What is Cassandra?
Cassandra is a Wide Column Store. It is designed for handling massive amounts of data across many commodity servers, providing high availability via a decentralized Gossip Mesh (as shown in our architecture diagram).
- Write-Optimized: It can sustain millions of writes per second across a cluster thanks to an append-only, sequential write path (the LSM tree, covered below).
- Linearly Scalable: Doubling the number of nodes roughly doubles throughput.
- Tunable Consistency: You decide per-query if you want Speed (AP) or Consistency (CP).
2. Requirements & Goals
- High Availability: No master node. Any node can handle any request.
- Partition Tolerance: Works across multiple data centers (Multi-Region Active-Active).
- Eventual Consistency: Data will eventually sync across all replicas.
- Flexible Schema: Add columns on the fly.
3. Data Model: The Wide Column
It’s not a relational database, and it’s not just a Key-Value store.
- Think: A 2-Dimensional Key-Value store.
- Row Key (Partition Key): Determines which node holds the data.
- Clustering Key (Sort Key): Determines the order of data within that node.
Visualizing the Data Model
Unlike SQL tables (fixed columns), a Wide Column row is a sorted map of columns.
Logical View:
```
// Partition Key: "User_123"
{
  "User_123": {
    // Clustering Key (Time): Column Value
    "2023-01-01 10:00": "Hello World",
    "2023-01-01 10:05": "How are you?",
    "2023-01-02 09:00": "Good Morning"
  }
}
```
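To make the "2-D map" idea concrete, here is a minimal Python sketch (the `WideColumnStore` class and its methods are illustrative, not Cassandra internals) of a store that keeps every partition's columns sorted by clustering key:

```python
from bisect import insort
from typing import Any

class WideColumnStore:
    """Toy wide-column layout: partition key -> columns sorted by clustering key."""

    def __init__(self) -> None:
        # partition key -> list of (clustering_key, value), kept in sorted order
        self.partitions: dict[str, list[tuple[str, Any]]] = {}

    def write(self, partition_key: str, clustering_key: str, value: Any) -> None:
        row = self.partitions.setdefault(partition_key, [])
        insort(row, (clustering_key, value))  # columns stay ordered on insert

    def read_range(self, partition_key: str, start: str, end: str) -> list[tuple[str, Any]]:
        # A range scan inside ONE partition is cheap: the columns are already ordered.
        return [(k, v) for k, v in self.partitions.get(partition_key, []) if start <= k <= end]

store = WideColumnStore()
store.write("User_123", "2023-01-01 10:00", "Hello World")
store.write("User_123", "2023-01-02 09:00", "Good Morning")
store.write("User_123", "2023-01-01 10:05", "How are you?")
print(store.read_range("User_123", "2023-01-01 00:00", "2023-01-01 23:59"))
```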
4. System APIs: CQL
Cassandra uses CQL (Cassandra Query Language), which looks like SQL but behaves differently.
Primary Key Anatomy
The PRIMARY KEY is composed of:
- Partition Key: Responsible for data distribution (Consistent Hashing; see the sketch at the end of this section).
- Clustering Key: Responsible for data sorting (on disk).
| Feature | Partition Key | Clustering Key |
|---|---|---|
| Purpose | Distribution (Which Node?) | Sorting (Which Order?) |
| Analogy | The “File Cabinet” | The “Folder” inside the cabinet |
| Example | channel_id | created_at |
| Query Rule | MUST be provided for fast lookups | Can be used for Range Scans |
```sql
CREATE TABLE messages (
    channel_id  uuid,       -- Partition Key
    created_at  timestamp,  -- Clustering Key
    message_id  uuid,
    content     text,
    PRIMARY KEY ((channel_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
```
[!WARNING] Query Anti-Pattern:
`SELECT * FROM messages WHERE created_at > '2023-01-01'` without a `channel_id`. This forces Cassandra to scan ALL nodes (Scatter-Gather), which is incredibly slow. Always query by Partition Key first!
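For intuition on how the Partition Key drives distribution, here is a rough consistent-hashing sketch. The token positions, node names, and the use of MD5 are stand-ins for illustration; real Cassandra uses Murmur3 tokens, vnodes, and the replication strategy to pick replicas:

```python
import hashlib
from bisect import bisect_right

# Hypothetical ring: three nodes, each owning one token range.
RING = [(0, "NodeA"), (2**127 // 3, "NodeB"), (2 * 2**127 // 3, "NodeC")]
TOKENS = [token for token, _ in RING]

def token_for(partition_key: str) -> int:
    # Hash the partition key to a ring position (MD5 here; Murmur3 in Cassandra).
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16) % 2**127

def replicas_for(partition_key: str, rf: int = 3) -> list[str]:
    # Walk clockwise from the key's token and take the next `rf` nodes.
    start = bisect_right(TOKENS, token_for(partition_key)) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]

print(replicas_for("channel-42"))  # every query for this channel lands on the same replicas
```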
5. Cassandra Architecture
Distributed Architecture: Peer-to-Peer Mesh and Local Storage Pipeline.
6. The Write Path: LSM Tree (Log-Structured Merge Tree)
Cassandra is optimized for Heavy Writes. It never overwrites data in place (unlike MySQL/B-Trees). This is achieved through the Internal Storage Pipeline visualized above:
- Commit Log: Append-only file on disk for durability.
- MemTable: In-memory sorted structure (RAM).
- Flush: When MemTable is full, it flushes to disk as an SSTable.
- SSTable: Immutable disk files.
Why is this fast?
- No random I/O. Everything is a Sequential Write.
- Disk heads don't need to seek; they just keep appending (and SSDs likewise prefer sequential writes).
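Below is a minimal sketch of that pipeline, assuming a toy commit-log file and a deliberately tiny MemTable threshold so the flush is visible; none of the names correspond to real Cassandra configuration:

```python
import json
import time

COMMIT_LOG = "commitlog.log"   # illustrative append-only durability log
MEMTABLE_LIMIT = 3             # tiny threshold so the flush happens quickly
memtable: dict[str, tuple[float, str]] = {}
sstables: list[dict] = []      # stands in for immutable files on disk

def write(key: str, value: str) -> None:
    record = {"key": key, "value": value, "ts": time.time()}
    with open(COMMIT_LOG, "a") as log:      # 1. sequential append for durability
        log.write(json.dumps(record) + "\n")
    memtable[key] = (record["ts"], value)   # 2. update the in-memory table
    if len(memtable) >= MEMTABLE_LIMIT:     # 3. flush when the MemTable is full
        flush()

def flush() -> None:
    # 4. write one immutable, key-sorted SSTable and start a fresh MemTable
    sstables.append({k: memtable[k] for k in sorted(memtable)})
    memtable.clear()

for i in range(5):
    write(f"msg-{i}", f"payload {i}")
print(f"{len(sstables)} SSTable(s) on disk, {len(memtable)} entries still in RAM")
```

Note that every disk operation here is an append or a bulk write of a sorted file; nothing is updated in place.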
7. The Read Path & Reliability
Reading is harder than writing because data might be fragmented.
- Check MemTable: Is it in RAM?
- Check Row Cache: Is the row cached?
- Check Bloom Filter: Use the bit-array guard (seen in the diagram) to skip SSTables that definitely don’t have the key.
- Check SSTables: Read relevant files and Merge the data (Latest timestamp wins).
This is why Cassandra reads are slower than writes.
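A sketch of that lookup order, with a hand-rolled Bloom filter standing in for Cassandra's (the data layout, a `{key: (timestamp, value)}` dict per SSTable, is simplified):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: can say 'definitely absent', never gives a false negative."""
    def __init__(self, size: int = 1024, hashes: int = 3) -> None:
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key: str):
        for i in range(self.hashes):
            yield int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16) % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

def read(key, memtable, sstables):
    """memtable: {key: (ts, value)}; sstables: list of (BloomFilter, {key: (ts, value)})."""
    versions = []
    if key in memtable:                       # 1. the newest data usually lives in RAM
        versions.append(memtable[key])
    for bloom, table in sstables:             # 2./3. skip files the Bloom filter rules out
        if bloom.might_contain(key) and key in table:
            versions.append(table[key])
    return max(versions)[1] if versions else None   # 4. latest timestamp wins
```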
Compaction Strategies
Over time, you have 1000s of SSTables. Reads become slow. Compaction merges these files.
| Strategy | Use Case | Description |
|---|---|---|
| Size Tiered (STCS) | Write-Heavy | Merges ~4 similar-sized SSTables into one larger one. Good for inserts. High space overhead. |
| Leveled (LCS) | Read-Heavy | Keeps data sorted in “Levels” (L0, L1…). Guarantees a key exists in only ONE file per level. Fast reads. |
| Time Window (TWCS) | Time Series | Groups data by time buckets (e.g., 1 day). Old buckets expire cheaply. |
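As a toy illustration of the size-tiered idea, the sketch below buckets SSTables into rough size bands and proposes a merge once a band holds four files (4 is the real STCS default min_threshold; everything else is invented):

```python
from collections import defaultdict

MIN_THRESHOLD = 4  # STCS default: compact once ~4 similarly sized SSTables exist

def size_tiers(sstable_sizes_mb: list[int]) -> dict[int, list[int]]:
    """Group SSTables whose sizes fall into the same power-of-two band."""
    tiers: dict[int, list[int]] = defaultdict(list)
    for size in sstable_sizes_mb:
        tiers[size.bit_length()].append(size)  # crude 'similar size' bucketing
    return tiers

def pick_compaction(sstable_sizes_mb: list[int]):
    for bucket in size_tiers(sstable_sizes_mb).values():
        if len(bucket) >= MIN_THRESHOLD:
            return bucket  # these files would be merged into one larger SSTable
    return None

print(pick_compaction([10, 12, 11, 9, 300]))  # -> the four ~10 MB tables, not the 300 MB one
```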
The Mystery of Deletion (Tombstones)
In LSM Trees, you cannot “delete” a line from an immutable SSTable file.
- Deletion: Instead of deleting, Cassandra writes a Tombstone (a marker saying “Key X is Dead”).
- Read: If you read X, Cassandra sees the Tombstone (with a newer timestamp) and returns “Not Found”.
- Compaction: The data is actually deleted only when Compaction runs and merges the Tombstone with the original data.
Visualizing Compaction & Tombstones:
```mermaid
graph TD
    subgraph "Before Compaction"
        A["SSTable 1: {Key: A, Val: 10}"]
        B["SSTable 2: {Key: A, Val: Tombstone}"]
    end
    A --> C{Compaction Process}
    B --> C
    C --> D["New SSTable: {Key: A is GONE}"]
    style B fill:#f85149,color:white
    style D fill:#238636,color:white
```
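A sketch of what the merge does with that tombstone: keep only the newest version of each key, then drop keys whose surviving version is a tombstone (real Cassandra only purges tombstones after gc_grace_seconds, which this toy skips):

```python
TOMBSTONE = object()  # sentinel meaning "this key was deleted"

def compact(*sstables: dict) -> dict:
    """Merge SSTables of {key: (timestamp, value)}: newest timestamp wins,
    and keys whose newest version is a tombstone disappear entirely."""
    merged: dict = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return {k: v for k, v in merged.items() if v[1] is not TOMBSTONE}

sstable_1 = {"A": (100, 10)}          # original write: A = 10
sstable_2 = {"A": (200, TOMBSTONE)}   # later delete: tombstone for A
print(compact(sstable_1, sstable_2))  # -> {}  (A is gone after compaction)
```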
8. Trade-offs: Write vs Read Amplification
In Storage Engines, you can’t have it all.
| Metric | Definition | High in… |
|---|---|---|
| Write Amplification | 1 user write = X disk writes | B-Trees (Random updates) or Leveled Compaction (Constant re-sorting). |
| Read Amplification | 1 user read = Y disk reads | LSM Trees (Must check MemTable + multiple SSTables). |
| Space Amplification | Data stored / Actual data size | Size Tiered Compaction (Needs 50% temp space). |
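A back-of-the-envelope example of how these metrics are computed; all the byte counts below are invented purely to show the arithmetic:

```python
GiB = 2**30

# Hypothetical one-day totals for a single node.
user_bytes_written = 10 * GiB   # written by the application
disk_bytes_written = 60 * GiB   # commit log + flushes + compaction rewrites
user_bytes_read    = 8 * GiB
disk_bytes_read    = 40 * GiB   # MemTable misses touching several SSTables
live_data_on_disk  = 25 * GiB   # includes duplicates waiting for compaction
actual_data_size   = 15 * GiB

print("write amplification:", disk_bytes_written / user_bytes_written)  # 6.0x
print("read amplification:", disk_bytes_read / user_bytes_read)         # 5.0x
print("space amplification:", live_data_on_disk / actual_data_size)     # ~1.67x
```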
9. System Walkthrough: The Life of a Query
A Write Operation (INSERT)
Query: `INSERT INTO messages (channel_id, created_at, content) VALUES ('uuid-1', '2023-01-01 10:00', 'Hello');`
- Coordinator Node: The client hits any node; that node acts as the Coordinator for the request.
- Partitioner: Hashes `uuid-1` to find the replicas (e.g., Node A, Node B, Node C).
- Local Write (Node A):
  - Append to the CommitLog (Disk).
  - Add to the MemTable (RAM).
- Replication: The Coordinator sends the write to Node B and Node C concurrently.
- Quorum Check: If `Consistency=QUORUM` (N=3), wait for 2 ACKs.
- Response: Return `200 OK` to the client.
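A sketch of the Coordinator's quorum bookkeeping for that write. The replica call is a stub, and real coordinators send the requests asynchronously with timeouts, but the ack-counting logic is the point:

```python
import random

def send_to_replica(node: str, record: dict) -> bool:
    # Stub for the real network call (CommitLog append + MemTable insert on the replica).
    return random.random() > 0.1  # pretend ~10% of replicas are slow or down

def coordinate_write(record: dict, replicas: list[str], consistency: str = "QUORUM") -> str:
    required = {"ONE": 1, "QUORUM": len(replicas) // 2 + 1, "ALL": len(replicas)}[consistency]
    acks = sum(send_to_replica(node, record) for node in replicas)  # write goes to ALL replicas
    # The Coordinator only waits for `required` acks; the rest finish in the background.
    return "200 OK" if acks >= required else "UnavailableException"

print(coordinate_write({"key": "uuid-1", "value": "Hello"}, ["NodeA", "NodeB", "NodeC"]))
```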
A Read Operation (SELECT)
Query: `SELECT * FROM messages WHERE channel_id='uuid-1';`
- Coordinator Node: Client hits any node.
- Partitioner: Finds replicas (A, B, C).
- Digest Reads: Send a "Full Read" request to the closest node (A) and "Digest Read" (hash check) requests to the others (B, C).
- Local Read (Node A):
  - Check the MemTable (found? return).
  - Check the Bloom Filter (maybe in SSTable 1, 2?).
  - Scan SSTable 1 and SSTable 2.
  - Merge the results (latest timestamp wins).
- Response: Return data to Coordinator.
- Consistency Check: The Coordinator compares the data from A with the hash digests from B and C.
  - If they mismatch, trigger a Read Repair (write the latest data to the stale nodes).
- Final Response: Return data to Client.
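A sketch of the digest comparison and repair step. The per-node data is modelled as `{key: (timestamp, value)}` dicts and the hashing is plain SHA-256, all of which is illustrative:

```python
import hashlib

def digest(record) -> str:
    return hashlib.sha256(repr(record).encode()).hexdigest()

def coordinate_read(key: str, replicas: dict[str, dict]) -> str:
    nodes = list(replicas)
    full_node, digest_nodes = nodes[0], nodes[1:]   # full read from the "closest" replica
    data = replicas[full_node][key]
    stale = [n for n in digest_nodes
             if digest(replicas[n].get(key, (0, None))) != digest(data)]
    for node in stale:                              # read repair: push the newest version
        newest = max(data, replicas[node].get(key, (0, None)))
        replicas[node][key] = newest
        data = newest
    return data[1]

cluster = {
    "NodeA": {"uuid-1": (200, "Hello v2")},
    "NodeB": {"uuid-1": (200, "Hello v2")},
    "NodeC": {"uuid-1": (100, "Hello v1")},   # stale replica
}
print(coordinate_read("uuid-1", cluster))     # "Hello v2", and NodeC gets repaired
```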
10. Requirements Traceability Matrix
| Requirement | Architectural Solution |
|---|---|
| Massive Writes | LSM Tree (Sequential Writes) + CommitLog. |
| High Availability | Leaderless Replication + Gossip Protocol. |
| Multi-Region | NetworkTopologyStrategy (Aware of Data Centers). |
| Flexible Schema | Wide Column Model (Map<Col, Val>). |
| Tunable Consistency | Per-query CL=ONE/QUORUM/ALL. |
| Disk Failure | Replication Factor (RF=3) prevents data loss. |
11. Interview Gauntlet
- Why are Cassandra writes faster than reads?
- Writes are append-only (Sequential I/O). Reads require merging data from the MemTable and multiple SSTables (Random I/O).
- What is a Tombstone?
- A marker that indicates a row is deleted. It suppresses older data during reads.
- What happens if you have too many Tombstones?
- Reads slow down because Cassandra must scan past every tombstone, and the query is aborted with a TombstoneOverwhelmingException once a single read scans more than the tombstone failure threshold (100,000 by default).
- Explain Leveled Compaction (LCS).
- Optimized for reads. Creates “Levels” where each key exists in exactly one SSTable per level.
- How does Cassandra handle “Hot Partitions”?
- Poorly. You must design your Partition Key to spread data evenly. If a partition gets too big (>100MB), performance degrades.
- Why use a Bloom Filter?
- To avoid hitting disk for keys that don’t exist in an SSTable.
- What is the “Coordinator Node”?
- The node that receives the client request. It acts as a proxy for that specific request.
- Consistency Level: ONE vs QUORUM?
- ONE: Fast, High Availability, potential Stale Read. QUORUM: Strong Consistency, lower availability (needs majority).
- What is “Hinted Handoff”?
- If a replica is down, the Coordinator stores the write locally (a "hint") and replays it when the replica returns (see the sketch after this list).
- Does Cassandra support ACID transactions?
- No (mostly). It supports “Lightweight Transactions” (LWT) using Paxos (CAS), but they are slow.
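A sketch of the hinted-handoff idea from the list above: the Coordinator buffers writes destined for a dead replica and replays them when gossip reports the node alive again (structure and names are illustrative):

```python
from collections import defaultdict

hints: dict[str, list[dict]] = defaultdict(list)   # target node -> buffered writes

def write_with_hints(record: dict, replica_status: dict[str, bool]) -> None:
    """replica_status: node name -> is the node currently up?"""
    for node, is_up in replica_status.items():
        if is_up:
            print(f"applied on {node}: {record}")
        else:
            hints[node].append(record)             # store a hint instead of failing the write

def on_node_recovered(node: str) -> None:
    # Gossip reports the node is back: replay everything buffered for it.
    for record in hints.pop(node, []):
        print(f"replayed on {node}: {record}")

write_with_hints({"key": "uuid-1", "value": "Hello"},
                 {"NodeA": True, "NodeB": True, "NodeC": False})
on_node_recovered("NodeC")
```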
12. Summary: The Whiteboard Strategy
1. Core Concepts
- Wide Column: 2D Map (Row -> Col -> Val).
- LSM Tree: Log-Structured Merge Tree (Write Heavy).
- Gossip: Peer-to-Peer state exchange.
- Tunable: AP or CP per request.
2. Write Path (LSM)
```
Write --> CommitLog (Disk, Append-Only)
   |
   v
MemTable (RAM)
   | (Flush)
   v
SSTables (Disk, Immutable)
```
* Compaction: Merges SSTables to clean up.
* Bloom Filter: Skips irrelevant SSTables.
3. Primary Key
- Partition Key: Which Node? (Hashing)
- Clustering Key: Which Order? (Sorting)
4. Critical Gotchas
- Tombstones: Deletes are expensive (markers).
- Partition Sizing: Keep partitions < 100MB.
- Query Model: Must query by Partition Key. No Joins.