MapReduce: Moving Code to Data
[!TIP] The Paradigm Shift: Before MapReduce (2004), we moved Data to Code (download file, process it). With 10 Petabytes, this is impossible. MapReduce moves Code to Data. You send a small script (10KB) to the server holding the data (1TB).
1. What is MapReduce?
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
- Map: Takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
- Reduce: Takes the output from a map as an input and combines those data tuples into a smaller set of tuples.
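A minimal single-process sketch of the model in Python (the names `map_fn` and `reduce_fn` are illustrative, not part of any framework API):

```python
from collections import defaultdict

def map_fn(line):
    # Map: break the input down into (key, value) tuples.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: combine all values for one key into a smaller result.
    return (key, sum(values))

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

# The framework's job: run map, then group the emitted tuples by key.
groups = defaultdict(list)
for line in lines:
    for key, value in map_fn(line):
        groups[key].append(value)

# One reduce call per key.
print(sorted(reduce_fn(k, v) for k, v in groups.items()))
# [('Bear', 2), ('Car', 3), ('Deer', 2), ('River', 2)]
```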
2. Distributed Architecture
The core of MapReduce is orchestrating parallel workers while managing data movement across the network.
3. The Phases
A MapReduce job follows a strict sequence of stages, orchestrated by the Master Node.
Phase 1: Input Splits (Data Locality)
The 10PB file is stored in 64MB blocks on GFS. MapReduce starts a worker on every machine that holds a block. This is called Data Locality—moving the computation to the data rather than the other way around.
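A toy sketch of locality-aware assignment (the block layout and worker names here are hypothetical, not the actual GFS/Hadoop scheduler):

```python
# Hypothetical layout: which workers hold a replica of each 64MB block.
block_locations = {
    "block-001": ["worker-3", "worker-7"],
    "block-002": ["worker-1", "worker-4"],
}

def schedule_map_task(block_id, idle_workers):
    """Prefer a worker that already stores the block (data locality);
    otherwise fall back to any idle worker and pay the network cost."""
    for worker in block_locations.get(block_id, []):
        if worker in idle_workers:
            return worker, "local read"
    return next(iter(idle_workers)), "remote read"

print(schedule_map_task("block-001", {"worker-3", "worker-9"}))  # ('worker-3', 'local read')
print(schedule_map_task("block-002", {"worker-9"}))              # ('worker-9', 'remote read')
```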
Phase 2: Map
The “Mapper” function reads the block and emits Key-Value pairs.
- Input: `Deer Bear River`
- Output: `(Deer, 1), (Bear, 1), (River, 1)`
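In Hadoop Streaming style, a mapper can be just a script that reads lines from stdin and emits tab-separated key/value pairs on stdout; a minimal sketch:

```python
# mapper.py -- emit one "word<TAB>1" pair per word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# $ echo "Deer Bear River" | python mapper.py
# Deer    1
# Bear    1
# River   1
```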
Phase 3: Shuffle & Sort (The Network Boundary)
The system groups all values by Key. This is the Shuffle Boundary, where data moves across the network from the Mappers' local disks to the Reducer workers.
- All `Bear` keys go to Node A. All `Deer` keys go to Node B.
- Result: `Bear: [1, 1, 1]`, `Deer: [1, 1]` (see the grouping sketch below).
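A sketch of what the shuffle conceptually produces, here grouped inside one process (in the real system this grouping spans machines):

```python
from collections import defaultdict

map_output = [("Deer", 1), ("Bear", 1), ("River", 1),
              ("Deer", 1), ("Bear", 1), ("Bear", 1)]

shuffled = defaultdict(list)
for key, value in map_output:
    shuffled[key].append(value)   # all values for a key end up together

print(dict(shuffled))
# {'Deer': [1, 1], 'Bear': [1, 1, 1], 'River': [1]}
```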
Phase 4: Reduce
The “Reducer” function sums the list.
- Input: `Bear: [1, 1, 1]`
- Output: `Bear: 3`
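The matching Streaming-style reducer sketch: because the framework delivers input sorted by key, equal keys arrive adjacent and can be streamed with `itertools.groupby`:

```python
# reducer.py -- input arrives sorted by key, so equal keys are adjacent.
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    print(f"{word}\t{sum(int(count) for _, count in group)}")

# $ printf "Bear\t1\nBear\t1\nBear\t1\n" | python reducer.py
# Bear    3
```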
4. Optimization: The Combiner
Sending (Bear, 1) a million times over the network is wasteful.
- Combiner: Runs locally on the Mapper node. It pre-aggregates data before sending.
- Before: Send `(Bear, 1), (Bear, 1), (Bear, 1)` -> Network -> Reducer.
- After: Combiner sums them locally -> Send `(Bear, 3)` -> Network -> Reducer.
- Benefit: Drastically reduces network bandwidth (see the sketch below).
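A sketch of the effect, counting how many records would cross the network for one hypothetical mapper:

```python
from collections import Counter

# One mapper that saw the word "Bear" a million times.
map_output = [("Bear", 1)] * 1_000_000

# Without a combiner: every single pair is shipped to the reducer.
print(len(map_output))              # 1000000 records over the network

# With a combiner: pre-aggregate locally on the mapper node first.
combined = Counter()
for key, value in map_output:
    combined[key] += value
print(list(combined.items()))       # [('Bear', 1000000)] -> 1 record over the network
```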
5. Modern Evolution: Spark vs Flink
MapReduce (2004) is “Generation 1”. It writes to Local Disk after every step (Map -> Disk -> Reduce -> Disk). This persistent intermediate storage provides fault tolerance but is slow.
Apache Spark (Gen 2 - Batch)
- In-Memory: Keeps data in RAM (RDDs) between steps, avoiding MapReduce's intermediate Disk I/O.
- Speed: Up to 100x faster than Hadoop MapReduce for iterative algorithms (ML, Graphs).
- Lazy Evaluation: Builds a DAG (Graph) of tasks and optimizes execution.
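For comparison, a word count expressed in PySpark (a minimal sketch; it assumes a local Spark installation, and `input.txt` is a placeholder path):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("input.txt")                   # lazy: nothing runs yet
      .flatMap(lambda line: line.split())      # the "Map" step
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)         # shuffle + the "Reduce" step
)

# collect() forces the DAG to execute; intermediate data stays in memory.
print(counts.collect())
spark.stop()
```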
Apache Flink (Gen 3 - Streaming)
- True Streaming: Processes data row-by-row as it arrives (Low Latency).
- Spark Streaming: Actually “Micro-batching” (collects a small window of data, roughly 1 second, then runs a small batch job on it).
- Use Flink: For real-time fraud detection. Use Spark: For nightly reports/ETL.
6. System Walkthrough: WordCount Execution (Dry Run)
Let’s trace a job to count words in 100TB of text files.
Step 1: Input Split
- Master Node divides files into 64MB chunks.
- Master assigns Chunk A to Worker 1 (because Worker 1 has Chunk A on its disk).
Step 2: Map (Worker 1)
- Worker 1 reads Chunk A line by line.
- Line: “Hello World Hello”
- Function: `emit(word, 1)`
- Output (in Memory Buffer): `(Hello, 1), (World, 1), (Hello, 1)`
Step 3: Spill & Sort (Worker 1 Disk)
- Buffer fills up. Worker 1 sorts keys.
- Writes to Local Disk: `(Hello, [1, 1]), (World, [1])`.
- Note: This is the “Map Output”.
Step 4: Shuffle (Network)
- Reducer A is responsible for words starting with A-M.
- Reducer B is responsible for N-Z.
- Reducer A pulls `(Hello, [1, 1])` from Worker 1.
- Reducer B pulls `(World, [1])` from Worker 1 (see the routing sketch below).
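A sketch of the routing used in this dry run, the A-M / N-Z range split (real frameworks default to hash partitioning; see Section 8):

```python
def pick_reducer(key):
    # Range split used in this walkthrough: A-M -> Reducer A, N-Z -> Reducer B.
    return "Reducer A" if key[0].upper() <= "M" else "Reducer B"

map_output = {"Hello": [1, 1], "World": [1]}
for key, values in map_output.items():
    print(f"{key}: {values} -> {pick_reducer(key)}")
# Hello: [1, 1] -> Reducer A
# World: [1] -> Reducer B
```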
Step 5: Sort/Merge (Reducer A)
- Reducer A receives data from all Mappers.
- It merges them into a single stream: `Hello: [1, 1, 1, 1, ...]`.
Step 6: Reduce (Reducer A)
- Function: `sum(values)`
- Output: `(Hello, 4500)`
- Write result to GFS (Output File).
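Steps 5-6 as a sketch: the reducer merges the already-sorted runs pulled from each mapper, then sums per key (the per-worker outputs below are made up for illustration):

```python
from heapq import merge
from itertools import groupby

# Sorted partial outputs pulled from three mappers (hypothetical numbers).
from_worker_1 = [("Hello", 2), ("Hi", 1)]
from_worker_2 = [("Hello", 1), ("Hi", 3)]
from_worker_3 = [("Hello", 1)]

# Merge the sorted streams, then reduce each key group with a sum.
merged = merge(from_worker_1, from_worker_2, from_worker_3)
for word, group in groupby(merged, key=lambda kv: kv[0]):
    print(word, sum(count for _, count in group))
# Hello 4
# Hi 4
```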
7. Requirements Traceability Matrix
| Requirement | Architectural Solution |
|---|---|
| Process Petabytes | Distributed Workers + Data Locality (Move code to data). |
| Fault Tolerance | Intermediate data on disk. If Reducer dies, just restart it. If Mapper dies, re-run map on another node. |
| Scalability | Horizontal scaling: add nodes for near-linear throughput gains. |
| Bandwidth Optimization | Combiner Function (Local aggregation). |
| Straggler Handling | Speculative Execution (Run slow task on backup node). |
8. Interview Gauntlet
- Why does MapReduce write to disk between Map and Reduce?
- Fault Tolerance. If the Reducer crashes, it can fetch the file again from the Mapper’s disk without re-running the entire Map phase.
- What is “Data Locality”?
- Scheduling the Map task on the same physical machine that holds the data block in GFS. Prevents network congestion.
- What is the “Shuffle” phase?
- The process of transferring data from Mappers to Reducers, grouping by Key. It is the most expensive part (Network I/O).
- What is “Speculative Execution”?
- If one node is slow (Straggler), the Master starts a duplicate task on another node. The first one to finish wins.
- Spark vs MapReduce?
- Spark keeps intermediate data in RAM. MapReduce writes to Disk. Spark is up to 100x faster for iterative jobs.
- What is a “Combiner”?
- A “Mini-Reducer” running on the Mapper to pre-aggregate data (e.g., sum counts) to save bandwidth.
- How does the Master know a worker is dead?
- Heartbeats. If no heartbeat for X seconds, mark dead and re-assign tasks.
- Can MapReduce handle real-time data?
- No. It is a Batch processing system with high latency. Use Flink/Kafka Streams for real-time.
- What is YARN?
- The resource negotiator (CPU/RAM) that decoupled scheduling from MapReduce logic in Hadoop 2.0.
- What is the Partitioning Function?
- `Hash(Key) % NumReducers`. Determines which Reducer gets which key (see the sketch below).
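A sketch of that default partitioner, using `zlib.crc32` so the hash is stable across runs (Python's built-in `hash()` is randomized for strings):

```python
import zlib

def partition(key, num_reducers):
    # Hash(Key) % NumReducers: every mapper sends a given key to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

for word in ["Hello", "World", "Bear", "Deer"]:
    print(word, "-> Reducer", partition(word, 4))
```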
9. Summary: The Whiteboard Strategy
1. Core Concepts
- Data Locality: Code moves to Data.
- Shared Nothing: Workers are independent.
- Fault Tolerance: Retry tasks, not jobs.
- Batch: High throughput, high latency.
2. The Pipeline
Input (GFS)
   |
Map (RAM -> Disk)
   | (Shuffle/Sort)
Reduce (Agg)
   |
Output (GFS)
- Shuffle: The bottleneck (Network I/O).
- Sort: Guarantees sorted keys to the Reducer.
3. Optimizations
- Compression: Compress map output (Snappy).
- Speculative Execution: Run backup copies of straggler tasks; the first to finish wins.
4. Evolution (Spark)
- MapReduce: Disk-based. Good for 1-pass ETL.
- Spark: RAM-based (RDD). Good for ML/Iterative.
- Flink: Streaming. Good for Real-time events.