Module Review: Data Modeling
This module review covers the essential principles of Cassandra data modeling, including query-driven design, keys, and denormalization.
Key Takeaways
- Query-First: Always start with your application queries. Map 1 Query → 1 Table.
- Partition Key: Determines which node stores the data. Must have high cardinality to avoid Hot Partitions.
- Clustering Key: Determines the sort order of rows within a partition on disk. Enables efficient range queries.
- Denormalization: Duplicating data is necessary to achieve fast reads.
- Write Amplification: Writes are cheap in Cassandra, so writing the same data to multiple tables costs less than emulating JOINs across nodes at read time (CQL has no JOIN).
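The query-first and denormalization takeaways can be sketched with two tables that hold the same data under different keys. This is a minimal illustration only: the table names, columns, and the two queries ("find a user by id", "find a user by email") are hypothetical and not part of the module.

```cql
-- Hypothetical Q1: find a user by id.
CREATE TABLE users_by_id (
    user_id uuid,
    email text,
    name text,
    PRIMARY KEY (user_id)
);

-- Hypothetical Q2: find a user by email. Same data, different partition key.
CREATE TABLE users_by_email (
    email text,
    user_id uuid,
    name text,
    PRIMARY KEY (email)
);

-- Each query reads a single partition from the table built for it.
SELECT * FROM users_by_id
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

SELECT * FROM users_by_email
WHERE email = 'ada@example.com';
```

The cost is that every user write now goes to both tables, which is the trade described under Write Amplification.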
Cheat Sheet
Primary Key Syntax
| Syntax | Partition Key | Clustering Key |
|---|---|---|
| PRIMARY KEY (a) | a | None |
| PRIMARY KEY (a, b) | a | b |
| PRIMARY KEY ((a, b), c) | a, b | c |
| PRIMARY KEY ((a), b, c) | a | b, c |
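As a sketch of how the syntax above affects queries, the hypothetical table below (its name and columns are assumptions for illustration) uses a composite partition key and two clustering columns. A query must restrict every partition key column by equality, and can then apply a range on the clustering columns.

```cql
-- Hypothetical table: partition key (tenant_id, day), clustering key (event_time, event_id).
CREATE TABLE events_by_tenant_day (
    tenant_id text,
    day date,
    event_time timestamp,
    event_id uuid,
    payload text,
    PRIMARY KEY ((tenant_id, day), event_time, event_id)
);

-- Valid: both partition key columns are fixed, then a range on the first clustering column.
SELECT * FROM events_by_tenant_day
WHERE tenant_id = 'acme'
  AND day = '2024-01-15'
  AND event_time >= '2024-01-15 08:00:00+0000'
  AND event_time <  '2024-01-15 09:00:00+0000';

-- Rejected as written: only part of the partition key is restricted,
-- so Cassandra cannot locate a single partition.
-- SELECT * FROM events_by_tenant_day WHERE tenant_id = 'acme';
```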
Modeling Do’s and Don’ts
| Do | Don’t |
|---|---|
| ✅ Start with Queries | ❌ Start with Tables |
| ✅ Duplicate Data | ❌ Use client-side JOINs |
| ✅ High-Cardinality Partition Key | ❌ Low-Cardinality Partition Key (e.g., Boolean) |
| ✅ Use logged batches to keep denormalized tables in sync | ❌ Use batches for bulk loading |
| ✅ Let the clustering key define sort order | ❌ Sort results on the client |
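The batch guidance can be illustrated with a logged batch that writes one logical record to two denormalized tables. This is a sketch under assumptions: the orders_by_id and orders_by_customer tables and their columns are hypothetical.

```cql
-- Hypothetical denormalized pair: the same order, keyed two ways.
CREATE TABLE orders_by_id (
    order_id uuid,
    customer_id uuid,
    total decimal,
    PRIMARY KEY (order_id)
);

CREATE TABLE orders_by_customer (
    customer_id uuid,
    order_id uuid,
    total decimal,
    PRIMARY KEY (customer_id, order_id)
);

-- BEGIN BATCH is a logged batch by default: if one insert is applied,
-- the other is eventually applied as well, so the tables do not drift apart.
BEGIN BATCH
    INSERT INTO orders_by_id (order_id, customer_id, total)
    VALUES (123e4567-e89b-12d3-a456-426614174000,
            9b2f1c3e-7d4a-4e8b-9c1d-2f3a4b5c6d7e, 42.50);
    INSERT INTO orders_by_customer (customer_id, order_id, total)
    VALUES (9b2f1c3e-7d4a-4e8b-9c1d-2f3a4b5c6d7e,
            123e4567-e89b-12d3-a456-426614174000, 42.50);
APPLY BATCH;
```

A batch like this is kept small on purpose. Packing thousands of unrelated rows into one batch puts the coordination load on a single coordinator node, which is why batches are a poor fit for bulk loading.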
Practice Scenario
Task: Design a schema for an “IoT Sensor Network”.
- We have thousands of sensors.
- We need to see the latest temperature for a specific sensor.
- We need to see all temperature readings for a specific sensor for a specific day.
Solution:
CREATE TABLE sensor_readings_by_day (
    sensor_id uuid,
    date date,
    recorded_at timestamp,
    temperature decimal,
    PRIMARY KEY ((sensor_id, date), recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC);
- Partition Key: (sensor_id, date). Ensures that a single partition doesn’t grow indefinitely. Each sensor starts a new partition every day.
- Clustering Key: recorded_at. Sorts readings within a partition by time, newest first (DESC), so the latest reading of a day is the first row.
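The two required queries could then look like the sketch below. The sensor id and date values are placeholders chosen for illustration.

```cql
-- Latest temperature for a sensor: read the current day's partition.
-- Rows are stored newest first, so LIMIT 1 returns the latest reading.
SELECT recorded_at, temperature
FROM sensor_readings_by_day
WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000
  AND date = '2024-01-15'
LIMIT 1;

-- All readings for a sensor on a specific day: one full partition read.
SELECT recorded_at, temperature
FROM sensor_readings_by_day
WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000
  AND date = '2024-01-15';
```

Because the date is part of the partition key, the "latest" query needs a date supplied by the application. If a sensor has not reported yet today, the application would have to retry with the previous day.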