MapReduce: Moving Code to Data
[!TIP] The Paradigm Shift: Before MapReduce (2004), we moved Data to Code (download file, process it). With 10 Petabytes, this is impossible. MapReduce moves Code to Data. You send a small script (10KB) to the server holding the data (1TB).
1. What is MapReduce?
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
- Map: Takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
- Reduce: Takes the output from a map as an input and combines those data tuples into a smaller set of tuples.
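A minimal single-process sketch of the model in Python (the names `map_fn` and `reduce_fn` are illustrative, not part of any framework API):

```python
from collections import defaultdict

def map_fn(line):
    # Map: break the input down into (key, value) tuples.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: combine all values for one key into a smaller result.
    return (key, sum(values))

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

# The framework's job: run map, then group the emitted tuples by key.
groups = defaultdict(list)
for line in lines:
    for key, value in map_fn(line):
        groups[key].append(value)

# One reduce call per key.
print(sorted(reduce_fn(k, v) for k, v in groups.items()))
# [('Bear', 2), ('Car', 3), ('Deer', 2), ('River', 2)]
```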
2. Distributed Architecture
The core of MapReduce is orchestrating parallel workers while managing data movement across the network.
3. The Phases
A MapReduce job follows a strict sequence of stages, orchestrated by the Master Node.
Phase 1: Input Splits (Data Locality)
The 10PB file is stored in 64MB blocks on GFS. MapReduce starts a worker on every machine that holds a block. This is called Data Locality—moving the computation to the data rather than the other way around.
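A toy sketch of locality-aware assignment (the block layout and worker names here are hypothetical, not the actual GFS/Hadoop scheduler):

```python
# Hypothetical layout: which workers hold a replica of each 64MB block.
block_locations = {
    "block-001": ["worker-3", "worker-7"],
    "block-002": ["worker-1", "worker-4"],
}

def schedule_map_task(block_id, idle_workers):
    """Prefer a worker that already stores the block (data locality);
    otherwise fall back to any idle worker and pay the network cost."""
    for worker in block_locations.get(block_id, []):
        if worker in idle_workers:
            return worker, "local read"
    return next(iter(idle_workers)), "remote read"

print(schedule_map_task("block-001", {"worker-3", "worker-9"}))  # ('worker-3', 'local read')
print(schedule_map_task("block-002", {"worker-9"}))              # ('worker-9', 'remote read')
```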
Phase 2: Map
The “Mapper” function reads the block and emits Key-Value pairs.
- Input: `Deer Bear River`
- Output: `(Deer, 1), (Bear, 1), (River, 1)`
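In Hadoop Streaming style, a mapper can be just a script that reads lines from stdin and emits tab-separated key/value pairs on stdout; a minimal sketch:

```python
# mapper.py -- emit one "word<TAB>1" pair per word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# $ echo "Deer Bear River" | python mapper.py
# Deer    1
# Bear    1
# River   1
```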
Phase 3: Shuffle & Sort (The Network Boundary)
The system groups all values by Key. This is the Shuffle Boundary, where data moves across the network from the Mappers' local disks to the Reducer workers.
- All `Bear` keys go to Node A. All `Deer` keys go to Node B.
- Result: `Bear: [1, 1, 1]`, `Deer: [1, 1]` (see the grouping sketch below).
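A sketch of what the shuffle conceptually produces, here grouped inside one process (in the real system this grouping spans machines):

```python
from collections import defaultdict

map_output = [("Deer", 1), ("Bear", 1), ("River", 1),
              ("Deer", 1), ("Bear", 1), ("Bear", 1)]

shuffled = defaultdict(list)
for key, value in map_output:
    shuffled[key].append(value)   # all values for a key end up together

print(dict(shuffled))
# {'Deer': [1, 1], 'Bear': [1, 1, 1], 'River': [1]}
```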
Phase 4: Reduce
The “Reducer” function sums the list.
- Input: `Bear: [1, 1, 1]`
- Output: `Bear: 3`
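The matching Streaming-style reducer sketch: because the framework delivers input sorted by key, equal keys arrive adjacent and can be streamed with `itertools.groupby`:

```python
# reducer.py -- input arrives sorted by key, so equal keys are adjacent.
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    print(f"{word}\t{sum(int(count) for _, count in group)}")

# $ printf "Bear\t1\nBear\t1\nBear\t1\n" | python reducer.py
# Bear    3
```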
4. Optimization: The Combiner
Sending (Bear, 1) a million times over the network is wasteful.
- Combiner: Runs locally on the Mapper node. It pre-aggregates data before sending.
- Before: Send `(Bear, 1), (Bear, 1), (Bear, 1)` -> Network -> Reducer.
- After: Combiner sums them locally -> Send `(Bear, 3)` -> Network -> Reducer.
- Benefit: Drastically reduces network bandwidth (see the sketch below).
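A sketch of the effect, counting how many records would cross the network for one hypothetical mapper:

```python
from collections import Counter

# One mapper that saw the word "Bear" a million times.
map_output = [("Bear", 1)] * 1_000_000

# Without a combiner: every single pair is shipped to the reducer.
print(len(map_output))              # 1000000 records over the network

# With a combiner: pre-aggregate locally on the mapper node first.
combined = Counter()
for key, value in map_output:
    combined[key] += value
print(list(combined.items()))       # [('Bear', 1000000)] -> 1 record over the network
```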
5. Modern Evolution: Spark vs Flink
MapReduce (2004) is “Generation 1”. It writes to Local Disk after every step (Map -> Disk -> Reduce -> Disk). This persistent intermediate storage provides fault tolerance but is slow.
Apache Spark (Gen 2 - Batch)
- In-Memory: Keeps data in RAM (RDDs) between steps, avoiding MapReduce's intermediate Disk I/O.
- Speed: Up to 100x faster than Hadoop MapReduce for iterative algorithms (ML, Graphs).
- Lazy Evaluation: Builds a DAG (Graph) of tasks and optimizes execution.
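For comparison, a word count expressed in PySpark (a minimal sketch; it assumes a local Spark installation, and `input.txt` is a placeholder path):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("input.txt")                   # lazy: nothing runs yet
      .flatMap(lambda line: line.split())      # the "Map" step
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)         # shuffle + the "Reduce" step
)

# collect() forces the DAG to execute; intermediate data stays in memory.
print(counts.collect())
spark.stop()
```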
Apache Flink (Gen 3 - Streaming)
- True Streaming: Processes data row-by-row as it arrives (Low Latency).
- Spark Streaming: Actually “Micro-batching” (collects a small window of data, roughly 1 second, then runs a small batch job on it).
- Use Flink: For real-time fraud detection. Use Spark: For nightly reports/ETL.
6. System Walkthrough: WordCount Execution (Dry Run)
Let’s trace a job to count words in 100TB of text files.
Step 1: Input Split
- Master Node divides files into 64MB chunks.
- Master assigns Chunk A to Worker 1 (because Worker 1 has Chunk A on its disk).
Step 2: Map (Worker 1)
- Worker 1 reads Chunk A line by line.
- Line: “Hello World Hello”
- Function: `emit(word, 1)`
- Output (in Memory Buffer): `(Hello, 1), (World, 1), (Hello, 1)`
Step 3: Spill & Sort (Worker 1 Disk)
- Buffer fills up. Worker 1 sorts keys.
- Writes to Local Disk: `(Hello, [1, 1]), (World, [1])`.
- Note: This is the “Map Output”.
Step 4: Shuffle (Network)
- Reducer A is responsible for words starting with A-M.
- Reducer B is responsible for N-Z.
- Reducer A pulls `(Hello, [1, 1])` from Worker 1.
- Reducer B pulls `(World, [1])` from Worker 1 (see the routing sketch below).
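A sketch of the routing used in this dry run, the A-M / N-Z range split (real frameworks default to hash partitioning; see Section 8):

```python
def pick_reducer(key):
    # Range split used in this walkthrough: A-M -> Reducer A, N-Z -> Reducer B.
    return "Reducer A" if key[0].upper() <= "M" else "Reducer B"

map_output = {"Hello": [1, 1], "World": [1]}
for key, values in map_output.items():
    print(f"{key}: {values} -> {pick_reducer(key)}")
# Hello: [1, 1] -> Reducer A
# World: [1] -> Reducer B
```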
Step 5: Sort/Merge (Reducer A)
- Reducer A receives data from all Mappers.
- It merges them into a single stream: `Hello: [1, 1, 1, 1, ...]`.
Step 6: Reduce (Reducer A)
- Function: `sum(values)`
- Output: `(Hello, 4500)`
- Write result to GFS (Output File).
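Steps 5-6 as a sketch: the reducer merges the already-sorted runs pulled from each mapper, then sums per key (the per-worker outputs below are made up for illustration):

```python
from heapq import merge
from itertools import groupby

# Sorted partial outputs pulled from three mappers (hypothetical numbers).
from_worker_1 = [("Hello", 2), ("Hi", 1)]
from_worker_2 = [("Hello", 1), ("Hi", 3)]
from_worker_3 = [("Hello", 1)]

# Merge the sorted streams, then reduce each key group with a sum.
merged = merge(from_worker_1, from_worker_2, from_worker_3)
for word, group in groupby(merged, key=lambda kv: kv[0]):
    print(word, sum(count for _, count in group))
# Hello 4
# Hi 4
```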
7. Requirements Traceability Matrix
| Requirement | Architectural Solution |
|---|---|
| Process Petabytes | Distributed Workers + Data Locality (Move code to data). |
| Fault Tolerance | Intermediate data on disk. If Reducer dies, just restart it. If Mapper dies, re-run map on another node. |
| Scalability | Horizontal scaling: add nodes for near-linear throughput gains. |
| Bandwidth Optimization | Combiner Function (Local aggregation). |
| Straggler Handling | Speculative Execution (Run slow task on backup node). |
8. Interview Gauntlet
- Why does MapReduce write to disk between Map and Reduce?
- Fault Tolerance. If the Reducer crashes, it can fetch the file again from the Mapper’s disk without re-running the entire Map phase.
- What is “Data Locality”?
- Scheduling the Map task on the same physical machine that holds the data block in GFS. Prevents network congestion.
- What is the “Shuffle” phase?
- The process of transferring data from Mappers to Reducers, grouping by Key. It is the most expensive part (Network I/O).
- What is “Speculative Execution”?
- If one node is slow (Straggler), the Master starts a duplicate task on another node. The first one to finish wins.
- Spark vs MapReduce?
- Spark keeps intermediate data in RAM. MapReduce writes to Disk. Spark is up to 100x faster for iterative jobs.
- What is a “Combiner”?
- A “Mini-Reducer” running on the Mapper to pre-aggregate data (e.g., sum counts) to save bandwidth.
- How does the Master know a worker is dead?
- Heartbeats. If no heartbeat for X seconds, mark dead and re-assign tasks.
- Can MapReduce handle real-time data?
- No. It is a Batch processing system with high latency. Use Flink/Kafka Streams for real-time.
- What is YARN?
- The resource negotiator (CPU/RAM) that decoupled scheduling from MapReduce logic in Hadoop 2.0.
- What is the Partitioning Function?
- `Hash(Key) % NumReducers`. Determines which Reducer gets which key (see the sketch below).
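A sketch of that default partitioner, using `zlib.crc32` so the hash is stable across runs (Python's built-in `hash()` is randomized for strings):

```python
import zlib

def partition(key, num_reducers):
    # Hash(Key) % NumReducers: every mapper sends a given key to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

for word in ["Hello", "World", "Bear", "Deer"]:
    print(word, "-> Reducer", partition(word, 4))
```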
9. Summary: The Whiteboard Strategy
1. Core Concepts
- Data Locality: Code moves to Data.
- Shared Nothing: Workers are independent.
- Fault Tolerance: Retry tasks, not jobs.
- Batch: High throughput, high latency.
2. The Pipeline
Input (GFS)
   |
Map (RAM -> Disk)
   | (Shuffle/Sort)
Reduce (Agg)
   |
Output (GFS)
- Shuffle: The bottleneck (Network I/O).
- Sort: Guarantees sorted keys to the Reducer.
3. Optimizations
- Compression: Compress map output (Snappy).
- Speculative Execution: Run backup copies of straggler tasks; the first to finish wins.
4. Evolution (Spark)
- MapReduce: Disk-based. Good for 1-pass ETL.
- Spark: RAM-based (RDD). Good for ML/Iterative.
- Flink: Streaming. Good for Real-time events.