Design Amazon S3 (Object Storage)
[!IMPORTANT] In this lesson, you will master:
- Object Storage Fundamentals: Understanding the difference between block, file, and object storage.
- Exabyte-Scale Architecture: Designing the control and data planes to handle trillions of objects.
- Erasure Coding & Durability: How to achieve 11 9s of durability without tripling your storage costs.
- Strong Consistency: The evolution from eventual to strong consistency in distributed metadata systems.
1. What is Amazon S3?
Imagine you are building the next Netflix, Spotify, or Dropbox. You have millions of users uploading petabytes of profile pictures, videos, and documents every single day. A traditional hard drive or even a massive network-attached storage (NAS) array will quickly run out of space, become too expensive, or fail under the sheer throughput of requests. How do you store virtually infinite amounts of unstructured data without ever losing a single byte? Welcome to the world of Object Storage.
Amazon S3 (Simple Storage Service) is an Object Storage service that offers industry-leading scalability, data availability, security, and performance. Unlike a file system (hierarchical, POSIX), S3 is a flat namespace where you store “Objects” (files) inside “Buckets” (containers).
Key Characteristics
- Scale: Exabytes of data. Trillions of objects.
- Durability: 11 9s (99.999999999%). You essentially never lose data.
- Availability: 99.99%.
- Performance: High throughput for large blobs.
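The "11 9s" figure becomes more concrete with a quick expected-loss calculation. This is a back-of-envelope sketch; the 10-million-object fleet size is an assumption chosen for illustration:

```python
# Back-of-envelope: what does 99.999999999% annual durability mean in practice?
annual_loss_rate = 1 - 0.99999999999   # probability of losing a given object in a year
objects_stored = 10_000_000            # assumption: a fleet of 10 million objects

expected_losses_per_year = annual_loss_rate * objects_stored
years_per_single_loss = 1 / expected_losses_per_year

print(f"{expected_losses_per_year:.4f} objects lost per year")   # ~0.0001
print(f"~{years_per_single_loss:,.0f} years to expect one loss")  # ~10,000
```

In other words, at this durability level you would wait on the order of ten thousand years to lose a single object out of ten million.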
Try it yourself: Upload a 10GB file to S3 via CLI. Notice it finishes faster than your single-thread bandwidth? That’s Multipart Upload in action.
2. Process Requirements & Goals
Functional Requirements
- Bucket Operations: Create/Delete Bucket.
- Object Operations: Put, Get, Delete, List Objects.
- Versioning: Support multiple versions of an object.
- Large Files: Support files up to 5TB (via Multipart).
Non-Functional Requirements
- Durability: 11 9s. We must tolerate simultaneous disk/rack/DC failures.
- Availability: The system must always accept writes and reads.
- Scalability: Horizontal scaling for both storage and metadata.
- Consistency: Since 2020, S3 offers Strong Consistency. A successful `PUT` is immediately visible to a subsequent `GET`.
3. Estimate (Capacity)
Let’s design for a massive scale.
Storage
- Total Objects: 100 Billion.
- Avg Size: 1 MB.
- Total Data: 100 Billion * 1 MB = 100 Petabytes (PB).
- Growth: 10% month-over-month.
Throughput
- Read QPS: 100,000 QPS.
- Write QPS: 10,000 QPS.
- Bandwidth: If avg request is 1MB, 100k QPS = 100 GB/s outbound.
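The estimates above can be reproduced (and extended to the growth projection) in a few lines, using only the figures from this section:

```python
# Capacity estimation using the figures above (decimal units: 1 MB = 1e6 bytes).
objects = 100e9          # 100 billion objects
avg_size_bytes = 1e6     # 1 MB average object size

total_bytes = objects * avg_size_bytes
print(total_bytes / 1e15, "PB")       # 100.0 PB

read_qps = 100_000
bandwidth_out = read_qps * avg_size_bytes
print(bandwidth_out / 1e9, "GB/s")    # 100.0 GB/s outbound

# 10% month-over-month growth compounds quickly:
after_one_year = total_bytes * 1.10 ** 12
print(round(after_one_year / 1e15, 1), "PB")   # ~313.8 PB after a year
```

Note how the growth assumption dominates: 10% monthly compounds to roughly 3x the footprint within a single year.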
4. System APIs
S3 uses a RESTful API.
4.1 Bucket Operations
PUT /my-bucket
DELETE /my-bucket
4.2 Object Operations
PUT /my-bucket/photo.jpg
Body: <binary_data>
GET /my-bucket/photo.jpg
Response: 200 OK, Body: <binary_data>
4.3 Multipart Upload
For files > 100MB.
- Initiate: `POST /bucket/file?uploads` → Returns `UploadId`.
- Upload Part: `PUT /bucket/file?partNumber=1&uploadId=xyz`.
- Complete: `POST /bucket/file?uploadId=xyz` (merges parts).
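The client-side bookkeeping behind these three calls is simple: split the file into numbered parts, upload each part (possibly in parallel), then send the part list in the Complete call. A minimal splitter, scaled down to bytes for illustration (the function name is ours, not the real SDK's):

```python
def split_into_parts(data: bytes, part_size: int) -> list[tuple[int, bytes]]:
    """Split a blob into (part_number, chunk) pairs; part numbers start at 1."""
    return [
        (i + 1, data[offset:offset + part_size])
        for i, offset in enumerate(range(0, len(data), part_size))
    ]

# A 250 MB file with 100 MB parts yields parts 1 and 2 (full) plus part 3 (50 MB).
# Scaled down here: 250 bytes with 100-byte parts.
parts = split_into_parts(b"x" * 250, part_size=100)
print([(n, len(chunk)) for n, chunk in parts])   # [(1, 100), (2, 100), (3, 50)]
```

The part numbers matter: the Complete call reassembles chunks in part-number order, which is what lets parts arrive out of order over parallel connections.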
5. Data Model (Metadata & Block Store)
We separate Metadata from Data.
Metadata Store (Key-Value)
Stores attributes: Name, Size, Owner, ACLs, Location (Pointer to Block Store). Think of this as the Librarian’s Index: it doesn’t hold the book itself, just the card telling you exactly which shelf and aisle to check.
- Key: `BucketName + ObjectName`.
- Value: JSON metadata + list of Block IDs.
- Tech Choice: NewSQL (CockroachDB/Spanner) or Sharded KV (Cassandra/DynamoDB) with Paxos for Strong Consistency.
Block Store (Blob)
Stores the immutable bits. Think of this as the Warehouse: massive, cavernous space optimized purely for storing the heavy lifting (the actual files), entirely unaware of what the files contain or who owns them.
- Filesystem: Custom lightweight FS (like Facebook Haystack) optimized for large sequential writes.
- Addressing: Blocks are addressed by `BlockID` (a UUID).
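Under this model, a metadata record is just a small value keyed by `BucketName + ObjectName`, pointing into the Block Store. A toy in-memory version (field names are illustrative assumptions, not S3's actual schema):

```python
from dataclasses import dataclass

@dataclass
class ObjectMeta:
    size: int
    owner: str
    block_ids: list[str]   # pointers into the Block Store ("shelf and aisle")
    version: int = 1

# Stand-in for the sharded, strongly consistent KV store.
metadata_store: dict[str, ObjectMeta] = {}

def meta_key(bucket: str, obj: str) -> str:
    return f"{bucket}/{obj}"

metadata_store[meta_key("my-bucket", "photo.jpg")] = ObjectMeta(
    size=1_048_576, owner="alice", block_ids=["blk-7f3a", "blk-91c2"]
)
print(metadata_store[meta_key("my-bucket", "photo.jpg")].block_ids)
```

A `GET` then has two hops: look up the key in the metadata store, then fetch the listed blocks from the warehouse. The data path never touches the metadata store again.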
6. Architecture (High-Level Design)
Architecture separating Metadata and Data planes.
(Diagram: blocks are striped across storage nodes as data chunks — Data 1–4 — plus parity chunks — Parity 1–2.)
- Client sends `PUT /bucket/file.jpg`.
- API Node authenticates the request.
- Metadata Service checks that the bucket exists and authorizes the user.
- Placement Service allocates a `BlockID` and determines which Storage Nodes to write to.
- API Node streams data to the Storage Nodes (using Erasure Coding).
- Once the data is durable (written to a quorum), the Metadata Service commits the object (mapping `file.jpg` → `BlockID`).
7. Localized Details (Component Deep Dive)
11 9s Durability: Erasure Coding
Storing 3 copies of 100 PB means storing 300 PB. That is too expensive ($).
- Replication: 200% overhead (3 copies). Safe but wasteful.
- Erasure Coding (EC): Breaks data into `N` data chunks and `K` parity chunks.
- Analogy: Imagine you have a top-secret document. Instead of printing 3 full copies (wasting paper), you shred the document into 10 pieces and create 4 mathematical "recovery" pieces. As long as you have any 10 of the 14 pieces, you can recreate the document.
- Reed-Solomon (10, 4): Split file into 10 parts. Calculate 4 parity parts.
- Overhead: Only 40% (vs 200%).
- Durability: Can lose ANY 4 drives and still recover.
- Trade-off: High CPU usage for calculation, but storage savings are worth it.
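Production systems use Reed-Solomon codes, whose Galois-field arithmetic is too long for a snippet. But the core idea — parity lets you *rebuild* a lost chunk instead of storing a full copy — can be shown with the simplest possible code, single-parity XOR (tolerates 1 loss, ~33% overhead in this toy 3+1 layout):

```python
def xor_chunks(chunks: list[bytes]) -> bytes:
    """XOR equal-length chunks together (the same op serves encode and recover)."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

data_chunks = [b"AAAA", b"BBBB", b"CCCC"]   # 3 data chunks
parity = xor_chunks(data_chunks)            # 1 parity chunk (~33% overhead)

# Simulate losing chunk index 1: rebuild it from the survivors plus parity.
survivors = [data_chunks[0], data_chunks[2], parity]
recovered = xor_chunks(survivors)
print(recovered == data_chunks[1])   # True
```

Reed-Solomon generalizes this: with 10+4, *any* 10 of the 14 chunks suffice to reconstruct the data, at the cost of the heavier per-write math the trade-off above mentions.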
Strong Consistency (The 2020 Shift)
For years, S3 was Eventually Consistent: after overwriting an object, a read might briefly return the old version. In 2020, AWS moved S3 to Strong Consistency.
- Practical Example: Previously, if you uploaded a new profile picture and immediately hit refresh, you might temporarily see your old picture because the updated pointer hadn't propagated globally. Now, a successful `PUT` guarantees the next `GET` sees the new image.
- How?: The Metadata layer now uses a Distributed Consensus Algorithm (likely variants of Paxos or Raft) for every single write.
- Why now?: Hardware got faster. Network latency dropped. CPU is cheaper. The overhead of consensus is now negligible compared to the network transfer time of the data blob.
- Cache Coherency: They also implemented a system to actively invalidate caches across the fleet immediately upon commit.
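One way to picture what the consensus-backed metadata layer guarantees is a versioned compare-and-set: a commit only succeeds if it builds on the version the writer last observed. This is a toy single-node model of our design, not AWS's implementation; the real system replicates this decision across nodes via Paxos/Raft:

```python
class MetadataService:
    """Toy strongly consistent metadata store: serialized, versioned commits."""

    def __init__(self) -> None:
        self._table: dict[str, tuple[int, str]] = {}  # key -> (version, block_id)

    def get(self, key: str):
        return self._table.get(key)

    def commit(self, key: str, block_id: str, expected_version: int) -> bool:
        current = self._table.get(key, (0, ""))[0]
        if current != expected_version:
            return False                   # lost the race; caller must re-read
        self._table[key] = (current + 1, block_id)
        return True

svc = MetadataService()
ok_first = svc.commit("b/photo.jpg", "blk-1", expected_version=0)   # wins
ok_stale = svc.commit("b/photo.jpg", "blk-2", expected_version=0)   # stale, rejected
print(ok_first, ok_stale, svc.get("b/photo.jpg"))   # True False (1, 'blk-1')
```

Because every commit passes through this single serialization point, a `GET` that runs after a successful `PUT` can never observe the older pointer.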
Multipart Upload
Uploading a 5GB file in one stream is risky. If it fails at 99%, you retry from zero.
- Parallelism: Break file into 50 chunks of 100MB. Upload them in parallel.
- Resiliency: If chunk 45 fails, retry only chunk 45.
- Throughput: Maximize bandwidth by saturating multiple TCP connections.
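The parallel-with-retry pattern can be sketched with a thread pool. Here `upload_part` is a hypothetical stand-in that fails once per part to simulate a flaky network; the point is that only the failed part is retried, never the whole file:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_failed_once: set = set()
_lock = threading.Lock()

def upload_part(part_number: int) -> int:
    """Hypothetical uploader: simulates a transient failure on each part's first try."""
    with _lock:
        first_try = part_number not in _failed_once
        _failed_once.add(part_number)
    if first_try:
        raise IOError(f"part {part_number} failed mid-transfer")
    return part_number

def upload_with_retry(part_number: int, attempts: int = 3) -> int:
    for _ in range(attempts):
        try:
            return upload_part(part_number)   # retry ONLY this part
        except IOError:
            pass
    raise IOError(f"part {part_number} exhausted retries")

# 50 parts of a 5 GB file (100 MB each), pushed over 8 parallel connections.
with ThreadPoolExecutor(max_workers=8) as pool:
    done = sorted(pool.map(upload_with_retry, range(1, 51)))
print(len(done))   # 50
```

Every part fails once in this simulation, yet the upload completes: the retries are localized and cheap, exactly the resiliency property the bullets above describe.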
[!NOTE] War Story: The “Thundering Herd” of Cloud Storage A major media company once experienced a thundering herd problem when a highly anticipated 10GB video file was published. Thousands of clients tried to fetch it simultaneously, overwhelming the network bandwidth of the specific storage nodes hosting that object. They solved this by introducing an edge CDN cache to collapse the simultaneous requests, and utilizing S3’s Multipart Upload to pre-warm the distributed file chunks.
8. Scale & Requirements Traceability
| Requirement | Design Decision | Justification |
|---|---|---|
| 11 9s Durability | Erasure Coding (10+4) | Tolerates the simultaneous loss of any 4 chunks (disks/nodes) with only 40% storage overhead. |
| Scalability | Separated Control/Data Plane | Metadata scales independently of Storage. Data path bypasses metadata bottleneck. |
| Cost | Tiered Storage (Glacier) | Move cold objects to cheaper, slower media (Tape/HDD) automatically. |
| Performance | Multipart Upload | Parallelizes writes to maximize throughput and fault tolerance. |
| Consistency | Consensus (Paxos) | Ensures Metadata updates are atomic and strongly consistent. |
9. Observability & Metrics
Key Metrics
- Durability: Checksums. Background scrubbers constantly read data to verify integrity.
- Availability: Error Rate (5xx).
- Latency: Time to First Byte (TTFB).
- Storage Efficiency: (Used Space / Raw Space). Monitor overhead of Erasure Coding.
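The scrubbing mentioned above reduces to "recompute the checksum, compare against the one stored at write time." A minimal sketch using SHA-256 (the block IDs and corruption are fabricated for the demo):

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Checksums are recorded at write time, alongside each block.
blocks = {"blk-1": b"hello", "blk-2": b"world"}
stored_sums = {bid: checksum(data) for bid, data in blocks.items()}

# Simulate silent bit rot on blk-2.
blocks["blk-2"] = b"worl\x00"

corrupted = [bid for bid, data in blocks.items()
             if checksum(data) != stored_sums[bid]]
print(corrupted)   # ['blk-2'] -> trigger repair from erasure-coded peers
```

A flagged block is then reconstructed from its surviving data and parity chunks, the same repair path used when a disk dies outright.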
10. Deployment Strategy
Immutable Infrastructure
We never patch storage nodes. We replace them.
- Data Migration: When a disk is retiring, the system treats it as “failed” and reconstructs its data onto a new node using Erasure Coding.
- Zone Deployment: Updates are rolled out one Availability Zone at a time.
11. Interview Gauntlet
Rapid Fire Questions
- Why use Erasure Coding over Replication? Replication (3x) wastes 200% storage. EC (10+4) only wastes 40% for higher durability. At Exabyte scale, this saves billions of dollars.
- How does S3 handle small files? Small files cause metadata bloat and disk fragmentation. S3 aggregates small objects into larger 100MB “containers” or “shards” before writing to disk.
- What happens if two users write the same key at the same time? Last Write Wins. The Metadata service serializes the commit requests. The one processed last overwrites the pointer.
- Is S3 a filesystem? No. It is a Key-Value store. It does not support `rename` (move) efficiently. Renaming a "folder" `foo/` to `bar/` requires rewriting every single object inside with the new key.
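The cost of renaming a "folder" is easy to see in a toy flat keyspace: there is no directory entry to update, so every key under the prefix must be copied to its new key and the old key deleted — O(n) in the number of objects (a sketch, not S3's API):

```python
store = {
    "foo/a.txt": b"1",
    "foo/b.txt": b"2",
    "other/c.txt": b"3",
}

def rename_prefix(store: dict, old: str, new: str) -> int:
    """Rewrite every key under `old` to live under `new`. Touches O(n) objects."""
    moved = 0
    for key in [k for k in store if k.startswith(old)]:
        store[new + key[len(old):]] = store.pop(key)
        moved += 1
    return moved

print(rename_prefix(store, "foo/", "bar/"))   # 2 objects rewritten
print(sorted(store))   # ['bar/a.txt', 'bar/b.txt', 'other/c.txt']
```

On a real filesystem the same rename is a single metadata update regardless of how many files the directory holds, which is exactly why S3 is not a filesystem.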
12. Interactive Decision Visualizer: Erasure Coding
See how Reed-Solomon encoding works. We split data and generate parity. You can “destroy” chunks and see if the data survives.
13. Summary
- Erasure Coding: The key to 11 9s durability without 300% storage cost.
- Strong Consistency: Achieved via Paxos on the Metadata layer.
- Multipart Upload: Essential for performance and reliability on large files.
- Separation: Metadata scaling (LSM/NewSQL) is handled separately from Blob storage.