Google File System (GFS)

[!TIP] The Grandfather of Cloud Storage: GFS (2003) proved that you could build a massive, reliable file system using thousands of unreliable, “garbage” commodity servers. It defined the Master-ChunkServer pattern used by HDFS, BigTable, and Cloud Storage.

1. The “Failure is Normal” Mindset

Before GFS, enterprise storage meant expensive, high-reliability hardware (RAID arrays, SANs). Google couldn’t afford that for the web. They bought cheap disks that failed constantly.

Core Assumptions

  1. Failures are Common: At any moment, a disk is dying, a rack is overheating, or a switch is rebooting. The software must self-heal.
  2. Files are Huge: Optimized for multi-GB files (web crawls, logs), not tiny text files.
  3. Append-Heavy: We rarely overwrite data (like a database). We mostly append new data to the end of a file.
  4. Throughput > Latency: We care about processing 1TB of data quickly (batch), not reading 1KB in 5ms (interactive).

2. Architecture: The Single Master

GFS uses a Single Master architecture. In System Design, “Single Master” usually screams “Bottleneck!” and “SPOF!”. Why did Google do it?

  • Simplicity: Managing metadata in one place is easy.
  • Global Knowledge: The Master knows exactly where every chunk is, making placement decisions (load balancing) optimal.

The Components

GFS separates the Control Plane (Metadata) from the Data Plane (File Chunks), as visualized in the blueprint below.

  1. The Master (The Brain):
    • In-Memory Metadata: Stores the namespace (file system tree) and chunk locations in high-speed RAM.
    • Operation Log (Disk): A persistent record of all metadata changes. If the Master fails, it replays this log to recover the system state.
    • Heartbeats: The Master doesn’t “know” where chunks are at startup; it polls ChunkServers via periodic heartbeats to build its chunk-location mapping.
  2. ChunkServers (The Muscle):
    • Store data as raw Linux files on commodity disks.
    • Each 64MB chunk is replicated (usually 3x) and verified using Checksums to detect bit-rot.
  3. The Client:
    • A library linked into the application (e.g., MapReduce).
    • Metadata Cache: It first asks the Master for locations (Control Path), then caches that info and talks directly to Chunkservers (Data Path) to avoid bottlenecking the Master.
Figure: The GFS Blueprint. A single Master (brain) holds the in-memory metadata (namespace and chunk mappings) plus an on-disk operation log for crash recovery; a cluster of ChunkServers (muscle) stores checksummed 64MB chunks; the client keeps a metadata cache. The control path (1. ask location), data path (2. direct streaming), and heartbeats (3) are decoupled.
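
To make the split between control plane and data plane concrete, here is a minimal Python sketch of the Master's in-memory state. All class, field, and method names are invented for illustration; this is not the real GFS API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

CHUNK_SIZE = 64 * 1024 * 1024  # 64MB

@dataclass
class MasterMetadata:
    """Toy model of the Master's in-RAM state (illustrative, not the real GFS code)."""
    namespace: Dict[str, List[str]] = field(default_factory=dict)        # path -> ordered chunk handles
    chunk_locations: Dict[str, List[str]] = field(default_factory=dict)  # chunk handle -> ChunkServer IDs

    def create_file(self, path: str) -> None:
        # Namespace changes would also be appended to the operation log before applying.
        self.namespace[path] = []

    def handle_heartbeat(self, server_id: str, chunk_handles: List[str]) -> None:
        # Chunk locations are NOT persisted; they are rebuilt from heartbeats at startup.
        for handle in chunk_handles:
            servers = self.chunk_locations.setdefault(handle, [])
            if server_id not in servers:
                servers.append(server_id)

    def lookup(self, path: str, offset: int) -> Tuple[str, List[str]]:
        # Control path: "which chunk holds this offset, and where are its replicas?"
        handle = self.namespace[path][offset // CHUNK_SIZE]
        return handle, self.chunk_locations[handle]

meta = MasterMetadata()
meta.create_file("/logs/crawl.0")
meta.namespace["/logs/crawl.0"].append("chunk-0001")
meta.handle_heartbeat("chunkserver-A", ["chunk-0001"])
print(meta.lookup("/logs/crawl.0", offset=10 * 2**20))  # ('chunk-0001', ['chunkserver-A'])
```

The client caches the result of lookup() and then streams bytes directly from the ChunkServers, which is what keeps the Master off the data path.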

The 64MB Chunk Decision

Standard file systems use small blocks (typically 4KB). GFS uses 64MB chunks.

  • Why?
    1. Metadata Size: A 1PB file system with 4KB blocks would require 250 billion entries. The Master’s RAM would explode. With 64MB, it’s ~16 million entries. Manageable.
    2. Reduce Chatter: Clients ask the Master location once, then stream 64MB without bothering the Master again.
  • The Downside:
    • Internal Fragmentation: A 1KB file still occupies an entire chunk (though GFS lazily allocates the physical space, so the real cost is mostly Master metadata).
    • Hot Spots: If a popular 50MB file fits in one chunk, all clients hit the same ChunkServer.
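
The back-of-the-envelope numbers above are easy to check. A rough calculation (powers of two, ignoring per-entry overhead):

```python
# Rough metadata count for a 1PB namespace at two block sizes.
PB = 2**50
entries_4kb  = PB // (4 * 2**10)    # ~275 billion block entries
entries_64mb = PB // (64 * 2**20)   # ~16.8 million chunk entries

print(f"4KB blocks : {entries_4kb:,}")
print(f"64MB chunks: {entries_64mb:,}")
# At roughly 64 bytes of metadata per entry, 64MB chunks need about 1GB of Master RAM,
# while 4KB blocks would need about 16TB, far beyond a single machine.
```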

3. The Mutation Flow (Leases & Pipelining)

How do we write data to 3 replicas while keeping them consistent? GFS decouples Control Flow (small metadata messages between client, Master, and Primary) from Data Flow (the bulk bytes pushed between replicas).

The Steps

  1. Lease Grant (Control): The Master grants a Lease (60s) to one replica, making it the Primary.
  2. Data Pipelining (Data): The Client pushes data to any replica. That replica forwards it to the next.
    • Optimization: Data flows linearly (A -> B -> C) to utilize the full outbound bandwidth of each machine, rather than a “star” topology.
  3. Commit (Control): Once all replicas have the data in their buffer, the Client tells the Primary: “Execute!”
  4. Ordering: The Primary assigns a serial number to the mutation and orders all Secondaries to write.
  5. Ack: If all succeed, Primary replies “Success”. If any fail, it replies “Failure” (and the client retries).
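
The same flow, condensed into a runnable Python sketch. The Replica class and write() function are toy stand-ins for illustration only, not the real protocol messages.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Replica:
    """Toy stand-in for one chunk replica on a ChunkServer (illustrative only)."""
    name: str
    buffer: bytes = b""
    chunk: bytearray = field(default_factory=bytearray)

    def push(self, data: bytes) -> None:                 # Step 2: data pipelining
        self.buffer = data

    def apply(self, serial: int, offset: int) -> bool:   # Steps 3-4: ordered commit
        self.chunk[offset:offset + len(self.buffer)] = self.buffer
        self.buffer = b""
        return True

def write(primary: Replica, secondaries: List[Replica], data: bytes, offset: int) -> bool:
    # 2. Data flow: in the real system each replica forwards to the next in a chain;
    #    here we simply push to everyone for brevity.
    for replica in [primary, *secondaries]:
        replica.push(data)
    # 3-4. Control flow: the Primary (lease holder) picks a serial number, applies the
    #      mutation, then orders the Secondaries to apply it in the same order.
    serial = 1
    ok = primary.apply(serial, offset)
    ok = ok and all(s.apply(serial, offset) for s in secondaries)
    # 5. If any replica fails, the client sees an error and retries the whole mutation.
    return ok

a, b, c = Replica("A"), Replica("B"), Replica("C")
print(write(a, [b, c], b"hello", offset=0))  # True; all three chunks now hold "hello"
```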

Walkthrough: The Write Pipeline

To recap, the write path separates control messages from data movement:

  1. Get Lease: Master designates a Primary.
  2. Pipeline Data: Data flows A -> B -> C.
  3. Commit: Primary triggers the write.

4. Relaxed Consistency (The “Gotcha”)

Most developers assume “File System” means “Strict Consistency” (like your laptop’s EXT4 or NTFS). GFS breaks this assumption to achieve massive scale.

Why Relaxed?

In a distributed system with thousands of nodes, keeping every replica bit-for-bit identical at all times (Strict Consistency) requires expensive coordination on every write (distributed locking or a consensus protocol such as Paxos). This kills throughput. GFS chooses Availability and Partition Tolerance over strict Consistency (AP in CAP).

The Two Write Modes

  1. Standard Write: “Write data X at Offset Y”.
    • Risk: If multiple clients write to the same offset concurrently, the region becomes “Undefined” (mixed data fragments). Applications must avoid this.
  2. Record Append (The Magic): “Append data X to the file, I don’t care where”.
    • Mechanism: GFS chooses the offset atomically. This is the primary mode for Google’s workloads (crawlers, logs).

The Atomic Record Append Problem

Imagine 3 MapReduce tasks trying to append logs to the same file.

  • Challenge: If they all try to write to offset 100, they will overwrite each other. Locking would be too slow.
  • GFS Solution: The Primary picks the offset. If the append (say 10MB) does not fit in the current chunk (say only 5MB of space remains), the Primary pads the rest of the chunk with garbage, tells the client “Retry at next chunk”, and the client appends to the new chunk.
  • Result: The file may contain Padding (holes) and Duplicates (if a retry succeeds after a partial failure).
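
A minimal sketch of that decision logic on the Primary, assuming invented names and a toy in-memory chunk (the real system also caps a single append at roughly a quarter of the chunk size to limit padding waste):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB

class Chunk:
    """Toy model of the Primary's current chunk (illustrative only)."""
    def __init__(self) -> None:
        self.used = 0        # bytes already written
        self.records = []    # (offset, payload) pairs, including padding

    def record_append(self, payload: bytes):
        """The Primary, not the client, chooses the offset. Returns (status, offset)."""
        if self.used + len(payload) > CHUNK_SIZE:
            # Not enough room: pad the rest of the chunk and tell the client to
            # retry on the next chunk. The padding stays in the file.
            self.records.append((self.used, b"\x00" * (CHUNK_SIZE - self.used)))
            self.used = CHUNK_SIZE
            return "RETRY_NEXT_CHUNK", None
        offset = self.used
        self.records.append((offset, payload))
        self.used += len(payload)
        return "OK", offset

chunk = Chunk()
chunk.used = CHUNK_SIZE - 5 * 1024 * 1024                # only 5MB of space left
print(chunk.record_append(b"x" * (10 * 1024 * 1024)))    # ('RETRY_NEXT_CHUNK', None)
```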

Walkthrough: Atomic Record Append

Step through how GFS handles an append:

  • Scenario: We have a 64MB Chunk.
  • Action: Append Data. If it fits, it writes. If not, it pads and moves to the next chunk.

[!WARNING] Interview Tip: Never say GFS is “Strongly Consistent”. It is Consistent (metadata) but Relaxed (file data). It sacrifices strictness for performance.

5. Shadow Masters (High Availability)

The Master is a SPOF. GFS uses Shadow Masters:

  • Read-only replicas of the Master.
  • Lag behind the real Master slightly (via Operation Log replication).
  • Use Case: If the Master fails, read-only clients can switch to a Shadow Master to keep reading (but cannot write).
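
Crash recovery and Shadow Masters both boil down to replaying the operation log. A sketch, assuming a made-up JSON line format for log records:

```python
import json

def replay_operation_log(log_lines):
    """Rebuild the namespace from an append-only metadata log (format invented here)."""
    namespace = {}
    for line in log_lines:
        op = json.loads(line)
        if op["type"] == "create":
            namespace[op["path"]] = []
        elif op["type"] == "add_chunk":
            namespace[op["path"]].append(op["handle"])
        elif op["type"] == "delete":
            namespace.pop(op["path"], None)
    return namespace

log = [
    '{"type": "create", "path": "/logs/crawl.0"}',
    '{"type": "add_chunk", "path": "/logs/crawl.0", "handle": "chunk-0001"}',
]
print(replay_operation_log(log))  # {'/logs/crawl.0': ['chunk-0001']}
# A Shadow Master tails the same log with a small delay, which is why it can
# serve slightly stale reads but must never accept writes.
```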

6. Observability & Tracing

To keep a massive distributed system running, we need the RED Method (Rate, Errors, Duration).

Key Metrics

  1. Master RAM Usage: Since metadata is in RAM, this is the hard limit of the cluster size.
    • Alert: Warn at 80% usage.
  2. RPC Latency: How long does GetChunkLocation take?
    • Tracing: Use Distributed Tracing (Trace ID) to see if the Master is slow due to Lock Contention or CPU.
  3. ChunkServer Health:
    • Disk Failures: Rate of IO Error.
    • Checksum Failures: Rate of data corruption.
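
As an example of the first metric, the RAM alert could look like the toy check below (the 90% page level is an illustrative assumption; only the 80% warning comes from the list above):

```python
def check_master_ram(used_bytes: int, total_bytes: int) -> str:
    """Illustrative alert rule: Master RAM is the hard ceiling on cluster metadata."""
    ratio = used_bytes / total_bytes
    if ratio >= 0.90:
        return f"PAGE: Master metadata RAM critical ({ratio:.0%})"
    if ratio >= 0.80:
        return f"WARN: Master metadata RAM high ({ratio:.0%})"
    return "OK"

print(check_master_ram(105 * 2**30, 128 * 2**30))  # WARN at ~82%
```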

Dashboarding

A typical Grafana dashboard for GFS would show:

  • Global Throughput: Total Read/Write MB/s.
  • Replication Lag: How far behind are the Shadow Masters?
  • Stragglers: List of ChunkServers with high latency (slow disks).

7. Deployment Strategy

Updating a system with 10,000 nodes requires care.

Rolling Updates

  1. ChunkServers: Update 10% of the fleet at a time.
    • Drain: Mark node as “Maintenance”. Master stops sending new writes.
    • Update: Reboot with new binary.
    • Verify: Check heartbeats.
  2. Master:
    • Failover: Promote a Shadow Master to Primary (if architecture allows) or perform a fast restart (Operation Log replay).
    • Downtime: GFS Master restarts are usually fast (< 1 min), acceptable for batch jobs but not for interactive apps.
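
A simplified driver for the ChunkServer rolling update, with the drain/update/health hooks left as callbacks (all names here are hypothetical stand-ins for real cluster tooling):

```python
import time

def rolling_update(chunkservers, drain, update, healthy, batch_fraction=0.10):
    """Illustrative rolling update: drain, update, verify, one 10% batch at a time."""
    batch_size = max(1, int(len(chunkservers) * batch_fraction))
    for i in range(0, len(chunkservers), batch_size):
        batch = chunkservers[i:i + batch_size]
        for node in batch:
            drain(node)    # Master marks the node "Maintenance"; no new writes land here
            update(node)   # reboot with the new binary
        while not all(healthy(n) for n in batch):
            time.sleep(5)  # wait for heartbeats before touching the next batch

# Toy usage with stub callbacks:
rolling_update(
    chunkservers=[f"cs-{i}" for i in range(20)],
    drain=lambda n: print(f"draining {n}"),
    update=lambda n: print(f"updating {n}"),
    healthy=lambda n: True,
)
```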

8. Requirements Traceability Matrix

| Requirement | Architectural Solution |
| --- | --- |
| High Throughput | Decoupled Control/Data paths + Pipelined Writes |
| Scalability | Single Master (Metadata) + Thousands of ChunkServers (Data) |
| Reliability | 3x Replication + Checksums + Heartbeats |
| Consistency | Relaxed Model (Atomic Append) for performance |
| Cost Efficiency | Runs on commodity hardware (cheap disks) |

9. Interview Gauntlet

I. Architecture

  • Why Single Master? Simplifies metadata management and placement decisions.
  • How to solve Master Bottleneck? Clients cache metadata. Data flows directly between ChunkServers.
  • Why 64MB Chunks? Reduces metadata size in Master RAM. Reduces client-master chatter.

II. Consistency

  • What is Record Append? Atomic append at an offset chosen by GFS.
  • Does GFS guarantee byte-for-byte identity? No. Replicas might have different padding/duplicates.
  • How to handle duplicates? Application-level checksums or unique IDs in the record.
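
The duplicate-handling answer is worth one concrete sketch: writers tag each record with a unique ID, and readers drop repeats left behind by retried appends (the 16-byte ID framing is an invented convention, not a GFS feature):

```python
import uuid

def make_record(payload: bytes) -> bytes:
    # Writer side: prepend a 16-byte unique ID so each record is self-identifying.
    return uuid.uuid4().bytes + payload

def dedupe(records):
    # Reader side: skip any record whose ID has already been seen.
    seen = set()
    for rec in records:
        rec_id, payload = rec[:16], rec[16:]
        if rec_id in seen:
            continue
        seen.add(rec_id)
        yield payload

r = make_record(b"click-event-42")
print(list(dedupe([r, r])))  # the duplicate from a retried append is dropped
```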

III. Failure Modes

  • What if Master dies? System halts (reads might work via Shadow Master). Restarts via Operation Log.
  • What if a ChunkServer is slow? The client might timeout. Pipelining helps, but stragglers are a known issue.
  • Data Corruption? Checksums on every 64KB block. Verified on read and during idle scans.
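
The 64KB-block checksumming can be sketched as follows (CRC32 is used here as a stand-in; the exact checksum function is an implementation detail):

```python
import zlib

BLOCK = 64 * 1024  # each chunk is checksummed in 64KB blocks

def checksum_blocks(chunk: bytes):
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify_read(chunk: bytes, stored) -> bool:
    # Run on every read and during idle background scans. A mismatch means this
    # replica is corrupt; the ChunkServer reports it and the Master re-replicates.
    return checksum_blocks(chunk) == stored

data = b"a" * (3 * BLOCK)
sums = checksum_blocks(data)
corrupted = data[:BLOCK] + b"X" + data[BLOCK + 1:]   # flip one byte in block 1
print(verify_read(data, sums), verify_read(corrupted, sums))  # True False
```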

10. Summary: The Whiteboard Strategy

If asked to design GFS, draw this 4-Quadrant Layout:

1. Requirements

  • Func: Create, Read, Append, Delete.
  • Non-Func: Huge Files, High Throughput, Fault Tolerance.
  • Scale: 10k Nodes, 100PB Data.

2. Architecture

[Client] --(Meta)--> [Master (RAM)]
|
(Data Flow)
|
[ChunkServer 1] -> [ChunkServer 2] -> [ChunkServer 3]

* Single Master: Metadata.
* ChunkServers: Data (Linux FS).

3. Metadata

File: /usr/data/log.txt -> [ChunkID 1, ChunkID 2]
Chunk: ID -> [Server A, Server B, Server C]
Log: OpLog for recovery.

4. Trade-offs

  • Consistency: Relaxed (Append) for performance.
  • Chunk Size: 64MB to reduce Master load.
  • Replication: 3x for reliability on cheap hardware.

Next, let’s see how the open-source world adopted this in HDFS (Hadoop).