Module Review: Database Basics
[!IMPORTANT] Staff Engineer Mentorship: Congratulations on completing Module 4! This is one of the most technical and critical modules in System Design. Database fundamentals aren’t just interview trivia; they are the exact mechanisms we use to debug critical production incidents at 3 AM. Let’s solidify your knowledge.
War Story: The 3 AM Data Loss
At a previous company, we assumed our payment service was bulletproof with asynchronous replication. During a massive traffic spike, a hard crash hit our primary database. Because fsync() was delayed to improve throughput (Group Commit), we lost 3 seconds of acknowledged payments. Always verify your durability guarantees (WAL) before going to production.
1. Key Takeaways
- SQL vs NoSQL: SQL (B-Trees) excels at complex reads and ACID, while NoSQL (LSM Trees) dominates high-throughput writes and horizontal scaling.
- ACID Guarantees: Relational databases ensure Atomicity, Consistency, Isolation, and Durability, prioritizing safety over availability.
- BASE Model: Distributed NoSQL systems embrace Basically Available, Soft state, and Eventual consistency for massive scale.
- Isolation Levels: Trade-offs exist between preventing anomalies (Dirty/Phantom Reads) and database performance using MVCC.
- WAL (Write-Ahead Logging): Databases survive crashes by appending changes to a sequential log on disk before updating main memory.
2. Cheat Sheet
| Concept | Definition | Key Trade-off |
|---|---|---|
| SQL (B‑Tree) | Relational, Normalized, ACID. | Great for Reads. Slow Writes (Random I/O). Hard to Shard. |
| NoSQL (LSM Tree) | Flexible, Denormalized, BASE. | Great for Writes (Sequential I/O). Weak for complex Reads/Joins. |
| ACID | Atomicity, Consistency, Isolation, Durability. | Safety vs Performance. |
| BASE | Basically Available, Soft state, Eventual consistency. | Availability vs Consistency. |
| Isolation Levels | Read Uncommitted → Serializable. | Preventing anomalies (Dirty, Phantom, Skew) costs speed. |
| WAL | Log changes to disk before RAM. | Durability. Converts Random I/O to Sequential I/O. |
| Vector Clocks | Tracks causality [A:1, B:2]. |
Detects concurrent updates (conflicts) without data loss. |
| MVCC | Multi-Version Concurrency Control. | Readers don’t block Writers. Prevents Dirty Reads implicitly. |
3. Interactive Flashcards
Test yourself. Click a card to flip it.
What is the "Golden Rule" of Distributed Consistency?
(Hint: R, W, N)
R + W > N
If Read Quorum + Write Quorum > Total Nodes, you are guaranteed Strong Consistency.
Why are LSM Trees faster for writes than B-Trees?
Sequential I/O
LSM Trees append to the end of a file (Sequential), avoiding the slow "Seek Time" (Random I/O) of B-Tree rebalancing.
What is a "Phantom Read"?
A Ghost Row
A transaction reads a set of rows matching a criteria (Age > 10). Another transaction inserts a new row that matches. Re-reading yields a different count.
What system call guarantees data is physically on the disk?
fsync()
It flushes the OS file system buffer to the physical disk platter, ensuring durability against power loss.
What is "Write Skew"?
Logic Bug in Snapshot Isolation
Two transactions read the same state (e.g., "2 Doctors on call"), make a decision, and update different rows, breaking a global invariant (0 Doctors on call).
How do Vector Clocks solve conflicts?
Causality Tracking
By tracking version counters per node [A:1, B:1], we can detect if updates happened concurrently and ask the app to merge them (instead of blindly overwriting).
When should you use NewSQL?
Scale + ACID
When you need the horizontal scaling of NoSQL but the strict financial correctness (ACID Transactions) of SQL. E.g., Global Payments.
Why is 2PC called the "Availability Killer"?
Blocking Protocol
If the Coordinator crashes, all participants are stuck holding locks indefinitely, freezing the system. Use Sagas instead.
Why are LSM Trees slower for reads?
Read Penalty
The DB must check the MemTable (RAM) and potentially multiple SSTables (Disk) to find the key. Bloom Filters help speed this up.
What is the risk of "Group Commit"?
Data Loss
Transactions are buffered in RAM to increase throughput. If power fails before the buffer flushes to disk, those transactions are lost forever.
- Anomalies: Dirty Read, Phantom Read, Write Skew (Snapshot Isolation).
4. System Design Checklist
When choosing a database, ask:
- Read vs Write Core? (B-Tree vs LSM Tree).
- Is
fsyncthe bottleneck? (Durability level). - Conflict Strategy? (Quorums, PACELC, Vector Clocks).
- Anomaly Tolerance? (Snapshot vs Serializable).
5. Staff Engineer Challenge
[!IMPORTANT] Can you calculate Write Amplification? A user updates a single integer in a row (4 bytes). The database uses a 16KB Page Size and writes a 128-byte WAL entry + updates the 16KB page on checkpoint.
- Calculate the initial Write Amplification (WAL).
- Calculate the total amplification after checkpointing. Hint: Physical writes / Logical change.
6. Next Steps
Move on to the next module to learn about traffic distribution: Load Balancing. Review terms in the System Design Glossary.