Module 12 Review & Cheat Sheet

1. Interactive Flashcards

Click on a card to reveal the answer.

Block-Level Deduplication

Tap to Flip

A technique where files are split into chunks (e.g., 4MB). Each chunk is hashed. If the hash exists in the DB, the chunk is not uploaded again. Saves massive bandwidth and storage.

Rolling Hash (Rabin-Karp)

Tap to Flip

A hash function used to find chunk boundaries based on content patterns rather than fixed offsets. Essential for Delta Sync to handle insertions/deletions efficiently.

Adaptive Bitrate Streaming (ABR)

Tap to Flip

The client player dynamically switches video quality (360p ↔ 1080p) based on network bandwidth and buffer health to prevent stalling.

Bloom Filter

Tap to Flip

A space-efficient probabilistic data structure used to test if an element is in a set. Zero false negatives, possible false positives. Used in Crawlers to track visited URLs.

Trie (Prefix Tree)

Tap to Flip

A tree data structure used for Typeahead. It allows O(L) prefix lookups (where L is key length) and stores "Top-K" popular searches at each node for O(1) retrieval.

Namespace Flattening

Tap to Flip

Storing file hierarchy as relational rows (`id`, `parent_id`) instead of nested paths. Allows renaming a folder in O(1) time by just updating one row.

SimHash

Tap to Flip

A locality-sensitive hashing algorithm where similar documents have similar hashes (small Hamming Distance). Used by Crawlers to detect near-duplicate content.

Transcoding DAG

Tap to Flip

A Directed Acyclic Graph pipeline that breaks video processing into parallel tasks: Validation → Metadata → Splitting → Transcoding → Merging.

Presigned URL

Tap to Flip

A secure URL generated by the server that grants temporary permission to upload/download a specific object directly to/from S3, bypassing the application server.

Consistent Hashing

Tap to Flip

A distribution scheme that maps keys and nodes to a ring. Minimizes data movement when nodes are added or removed (only K/N keys move).

Widevine / DRM

Tap to Flip

Digital Rights Management systems used to encrypt video content. The player must authenticate with a License Server to decrypt and play the media.

Head-of-Line Blocking

Tap to Flip

A performance issue in TCP where packet loss blocks all subsequent packets. QUIC (UDP) solves this by allowing independent streams.


2. System Comparison Table

Feature Dropbox YouTube Web Crawler Typeahead
Primary Metric Consistency (Sync) Bandwidth (Throughput) Throughput (Crawl Rate) Latency (< 100ms)
Read/Write Ratio Write-Heavy (Syncing) Extremely Read-Heavy (1:1M) Write-Heavy (Fetching) Read-Heavy (20:1)
Core Storage Block Storage (S3) Blob (S3) + CDN NoSQL (BigTable) In-Memory Trie
Key Algorithm Rolling Hash (Rabin-Karp) ABR / Transcoding Bloom Filter Top-K Ranking
Consistency Strong (Metadata) Eventual Eventual Eventual
Bottleneck Storage Costs Network Bandwidth Politeness / DNS CPU / RAM

3. Cheat Sheet Patterns

1. Storage Optimization

  • Deduplication: Never store the same data twice.
    • Dropbox: Hash blocks (SHA-256). If hash exists, just point to it.
    • Crawler: SimHash documents. If Hamming Distance is small, discard.
  • Cold Storage: Move data to cheaper storage (Glacier) based on access patterns.
    • YouTube: Old videos go to HDD/Tape.
    • Dropbox: Deleted files/Old versions go to Cold Storage.

2. Latency Optimization

  • CDN / Edge: Move data closer to the user.
    • YouTube: Edge servers cache popular videos.
    • Typeahead: Browser caches prefixes (mamax-age=3600).
  • Anycast DNS: Route users to the nearest data center network-wise.
  • Pre-Computation: Do the hard work offline.
    • Typeahead: Aggregate logs and build the Trie offline. The live service just does lookups.

3. Reliability Patterns

  • Thundering Herd: When a cache expires, thousands of requests hit the DB at once.
    • Solution: Jitter (Randomize expiration time) and Request Coalescing (Single-flight).
  • Backpressure: When the system is overloaded, tell clients to slow down.
    • Crawler: Politeness queues ensure we don’t DDoS the target.
    • YouTube: If bandwidth is low, downgrade quality (ABR) instead of pausing.

[!TIP] Interview Tip: Always calculate the Scale first.

  • If it’s PB of data, talk about Partitioning and Cold Storage.
  • If it’s Millions of QPS, talk about Caching and Load Balancing.
  • If it’s Text Search, talk about Inverted Indices or Tries.