Module 15: Review & Cheat Sheet

1. Quick Revision

  • Dynamo: Amazon's highly available key-value store and the blueprint for many later NoSQL systems. Prioritizes Availability (AP). Combines Consistent Hashing, Vector Clocks, and Gossip.
  • Cassandra: The hybrid. BigTable Data Model (Wide Column) + Dynamo Architecture (Ring). Optimized for Writes (LSM Trees).
  • BigTable: The structured map. Master-Slave architecture. Used for Google Search/Maps. Uses SSTables on GFS.
  • MapReduce: Distributed computing. Map (Filter/Transform) -> Shuffle (Group by Key) -> Reduce (Aggregate).
  • Bloom Filters: Probabilistic set. “Maybe in set” or “Definitely not”. Used to avoid expensive disk reads.
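
The MapReduce flow summarized above (Map -> Shuffle -> Reduce) can be sketched in a few lines. This is a minimal single-process illustration using word count as the job; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and a real framework would run each phase across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(documents):
    # Run the three phases in order over all input documents.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))

counts = word_count(["the quick fox", "the lazy dog", "The end"])
print(counts["the"])  # 3
```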

2. Cheat Sheet: Database Comparison

Feature        | Dynamo (Amazon)             | Cassandra (Facebook/Apache) | BigTable (Google)
---------------|-----------------------------|-----------------------------|----------------------------
Data Model     | Key-Value (Blob)            | Wide Column (2D Map)        | Wide Column (Sparse Map)
Architecture   | P2P (Leaderless Ring)       | P2P (Leaderless Ring)       | Master-Slave
Consistency    | Eventual (AP)               | Tunable (AP or CP)          | Strong (CP)
Conflict Res.  | Vector Clocks (Client Side) | LWW (Last Write Wins)       | Strong (Single Row Atomic)
Storage Engine | Pluggable (BDB, etc.)       | LSM Tree                    | SSTable (LSM-like)
Gossip?        | Yes                         | Yes                         | No (Uses Chubby/Master)
Primary Use    | Shopping Cart, Session      | Activity Feed, Metrics      | Analytics, Search Index

3. Interactive Flashcards

Test your knowledge.

What is a Tombstone?

A Deletion Marker

In LSM Trees (Cassandra), you can't delete from immutable SSTables. Instead, you write a "Tombstone" that marks the data as deleted. Compaction removes it once the grace period (gc_grace_seconds in Cassandra) has passed.
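
To make the tombstone idea concrete, here is a minimal sketch that models each SSTable as a plain dict, ordered oldest first. The TOMBSTONE sentinel and the read and compact helpers are hypothetical names. The sketch compacts every table at once and ignores the grace period, which a real system would not do.

```python
TOMBSTONE = object()  # sentinel standing in for a deletion marker

def read(key, sstables):
    """Search newest to oldest; a tombstone hides any older value."""
    for table in reversed(sstables):
        if key in table:
            value = table[key]
            return None if value is TOMBSTONE else value
    return None

def compact(sstables):
    """Merge all tables (newer entries win), then drop tombstoned keys."""
    merged = {}
    for table in sstables:  # oldest first, so later tables overwrite
        merged.update(table)
    return {k: v for k, v in merged.items() if v is not TOMBSTONE}

# Oldest SSTable first. The delete of "user:1" lives in a newer table.
sstables = [{"user:1": "alice", "user:2": "bob"}, {"user:1": TOMBSTONE}]

print(read("user:1", sstables))  # None: the tombstone masks "alice"
print(compact(sstables))         # {'user:2': 'bob'}
```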

Vector Clock

Causality Tracker

A list of (Node, Counter) pairs used in Dynamo to detect conflicting updates in a distributed system. e.g., [A:1, B:2].
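
A small sketch of how such clocks can be compared, assuming each clock is a dict from node name to counter. The helper names (increment, descends, compare) are illustrative and not taken from Dynamo itself.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def descends(a, b):
    """True if clock a has seen every event that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    a_ge_b, b_ge_a = descends(a, b), descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "a is newer"
    if b_ge_a:
        return "b is newer"
    return "conflict"  # concurrent updates: neither clock descends from the other

base = {"A": 1, "B": 2}
via_a = increment(base, "A")  # {"A": 2, "B": 2}
via_b = increment(base, "B")  # {"A": 1, "B": 3}

print(compare(via_a, base))   # a is newer
print(compare(via_a, via_b))  # conflict
```

When the result is "conflict", both versions are kept as siblings and the client reconciles them, which matches the client-side conflict resolution listed for Dynamo in the cheat sheet.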

Bloom Filter Guarantee

No False Negatives

If a Bloom Filter says "No", the item is DEFINITELY not in the set. If it says "Yes", it MIGHT be (False Positive).
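
The guarantee follows from how the filter is built: adding an item sets k bits, and a lookup answers "Yes" only if all k bits are set. The sketch below is a minimal version under assumed parameters (1024 bits, 3 hash positions derived from salted SHA-256). The class and method names are hypothetical.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item):
        # Derive k bit positions by salting one hash with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for key in ["row:1", "row:2", "row:3"]:
    bf.add(key)

print(bf.might_contain("row:2"))  # True: added items are never missed
```

A storage engine would consult the filter before touching an SSTable and skip the disk read whenever might_contain returns False.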

Hinted Handoff

Temporary Storage

If a replica is down, another node accepts the write on its behalf and stores a "hint" so it can replay the write when the target node comes back online. Keeps writes available during temporary failures.

MapReduce Combiner

Local Reducer

Runs on the Mapper node to pre-aggregate data (e.g., sum counts) before sending over the network. Reduces bandwidth usage.
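
The bandwidth saving can be seen by counting how many (word, count) pairs leave each mapper with and without a combiner. This is a single-process sketch with hypothetical helper names. It relies on summation being associative and commutative, which is what makes pre-aggregation safe.

```python
from collections import Counter

def map_words(split):
    """Mapper: emit (word, 1) for every word in this input split."""
    return [(word, 1) for word in split.split()]

def combine(pairs):
    """Combiner: pre-sum counts on the mapper before anything is sent."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def reduce_counts(all_pairs):
    """Reducer: sum the (possibly pre-aggregated) counts per word."""
    total = Counter()
    for word, count in all_pairs:
        total[word] += count
    return dict(total)

splits = ["a b a a", "b b a"]                 # one split per mapper
raw = [map_words(s) for s in splits]
combined = [combine(pairs) for pairs in raw]

sent_without = sum(len(pairs) for pairs in raw)       # 7 pairs on the wire
sent_with = sum(len(pairs) for pairs in combined)     # 4 pairs on the wire
result = reduce_counts(pair for pairs in combined for pair in pairs)
print(sent_without, sent_with, result)
```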

MemTable vs SSTable

RAM vs Disk

MemTable is the In-Memory buffer (Mutable). SSTable is the On-Disk file (Immutable). When the MemTable fills up, it is flushed to disk as a new SSTable (MemTable -> SSTable).

Gossip Protocol

Epidemic Failure Detection

Nodes randomly exchange state information to discover failures and membership changes without a central master.
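
A rough simulation of the exchange step, assuming each node's state is a dict of member name to heartbeat counter and that every node starts out knowing only itself. The names (merge, gossip_round, run_until_converged) are hypothetical. The sketch covers only membership spreading, not failure detection via heartbeat timeouts.

```python
import random

def merge(a, b):
    """Keep the highest heartbeat counter known for every member."""
    return {m: max(a.get(m, 0), b.get(m, 0)) for m in a.keys() | b.keys()}

def gossip_round(views, rng):
    """Each node picks one random peer and the two merge their views."""
    nodes = list(views)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        merged = merge(views[node], views[peer])
        views[node] = dict(merged)
        views[peer] = dict(merged)

def run_until_converged(views, rng, max_rounds=100):
    """Return the round in which every node knows every member, else None."""
    everyone = set(views)
    for round_no in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(set(view) == everyone for view in views.values()):
            return round_no
    return None

views = {name: {name: 1} for name in "ABCDEFGH"}  # each node knows only itself
rounds = run_until_converged(views, random.Random(42))
print(f"full membership known everywhere after {rounds} round(s)")
```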

What is YARN?

Resource Negotiator

Yet Another Resource Negotiator, often described as the OS of Hadoop. It allocates cluster resources (CPU/RAM) to applications (MapReduce, Spark) and schedules their work.

BigTable Tablet

A Range of Rows

BigTable shards data into Tablets based on Row Key ranges. Tablets split when they get too big (roughly 100-200 MB by default).

Merkle Tree

Efficient Sync

A hash tree used by Dynamo/Cassandra to find data differences between replicas quickly without transferring all data.
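
The sketch below shows why the comparison is cheap: two replicas compare root hashes first and descend only into subtrees whose hashes differ. It assumes both replicas hash the same key ranges in the same order and that the number of ranges is a power of two. The function names are hypothetical.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(ranges):
    """Return tree levels, from leaf hashes up to the single root hash."""
    level = [h(r.encode()) for r in ranges]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff_ranges(tree_a, tree_b):
    """Walk down from the root, descending only into mismatched subtrees."""
    top = len(tree_a) - 1
    if tree_a[top][0] == tree_b[top][0]:
        return []  # identical roots: replicas agree, nothing to transfer
    mismatched = [0]
    for depth in range(top - 1, -1, -1):
        children = []
        for idx in mismatched:
            for child in (2 * idx, 2 * idx + 1):
                if tree_a[depth][child] != tree_b[depth][child]:
                    children.append(child)
        mismatched = children
    return mismatched  # indices of the leaf ranges that need repair

replica_a = ["k1=v1", "k2=v2", "k3=v3", "k4=v4"]
replica_b = ["k1=v1", "k2=v2", "k3=STALE", "k4=v4"]
stale = diff_ranges(build_tree(replica_a), build_tree(replica_b))
print(stale)  # [2]: only the third range has to be synchronized
```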

Tunable Consistency

R + W > N

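The condition that guarantees Strong Consistency in a quorum-based system: every read quorum overlaps every write quorum, so a read always reaches at least one replica holding the latest write. R=Read Quorum, W=Write Quorum, N=Replication Factor.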
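
The overlap argument can be checked by brute force for small clusters. The sketch below enumerates every possible read set and write set and confirms they always share a replica exactly when R + W > N. The function name quorums_always_overlap is hypothetical.

```python
from itertools import combinations

def quorums_always_overlap(n, r, w):
    """Does every read set of size r share a replica with every write set of size w?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)  # an empty intersection is falsy
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )

print(quorums_always_overlap(3, 2, 2))  # True:  2 + 2 > 3
print(quorums_always_overlap(3, 1, 1))  # False: 1 + 1 <= 3
print(quorums_always_overlap(3, 1, 3))  # True:  1 + 3 > 3 (fast reads, slow writes)
```

Lowering R speeds up reads at the cost of a larger W, and the reverse; any setting that satisfies R + W > N keeps the overlap.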