Design Amazon S3 (Object Storage)
[!IMPORTANT] In this lesson, you will master:
- Object Storage Fundamentals: Understanding the difference between block, file, and object storage.
- Exabyte-Scale Architecture: Designing the control and data planes to handle trillions of objects.
- Erasure Coding & Durability: How to achieve 11 9s of durability without tripling your storage costs.
- Strong Consistency: The evolution from eventual to strong consistency in distributed metadata systems.
1. What is Amazon S3?
Imagine you are building the next Netflix, Spotify, or Dropbox. You have millions of users uploading petabytes of profile pictures, videos, and documents every single day. A traditional hard drive or even a massive network-attached storage (NAS) array will quickly run out of space, become too expensive, or fail under the sheer throughput of requests. How do you store virtually infinite amounts of unstructured data without ever losing a single byte? Welcome to the world of Object Storage.
Amazon S3 (Simple Storage Service) is an Object Storage service that offers industry-leading scalability, data availability, security, and performance. Unlike a file system (hierarchical, POSIX), S3 is a flat namespace where you store “Objects” (files) inside “Buckets” (containers).
Key Characteristics
- Scale: Exabytes of data. Trillions of objects.
- Durability: 11 9s (99.999999999%). You essentially never lose data.
- Availability: 99.99%.
- Performance: High throughput for large blobs.
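The "11 9s" figure becomes more concrete with a quick expected-loss calculation. This is a back-of-envelope sketch; the 10-million-object fleet size is an assumption chosen for illustration:

```python
# Back-of-envelope: what does 99.999999999% annual durability mean in practice?
annual_loss_rate = 1 - 0.99999999999   # probability of losing a given object in a year
objects_stored = 10_000_000            # assumption: a fleet of 10 million objects

expected_losses_per_year = annual_loss_rate * objects_stored
years_per_single_loss = 1 / expected_losses_per_year

print(f"{expected_losses_per_year:.4f} objects lost per year")   # ~0.0001
print(f"~{years_per_single_loss:,.0f} years to expect one loss")  # ~10,000
```

In other words, at this durability level you would wait on the order of ten thousand years to lose a single object out of ten million.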
Try it yourself: Upload a 10GB file to S3 via CLI. Notice it finishes faster than your single-thread bandwidth? That’s Multipart Upload in action.
2. Process Requirements & Goals
Functional Requirements
- Bucket Operations: Create/Delete Bucket.
- Object Operations: Put, Get, Delete, List Objects.
- Versioning: Support multiple versions of an object.
- Large Files: Support files up to 5TB (via Multipart).
Non-Functional Requirements
- Durability: 11 9s. We must tolerate simultaneous disk/rack/DC failures.
- Availability: The system must always accept writes and reads.
- Scalability: Horizontal scaling for both storage and metadata.
- Consistency: Since 2020, S3 offers Strong Consistency. A successful `PUT` is immediately visible to a subsequent `GET`.
3. Estimate (Capacity)
Let’s design for a massive scale.
Storage
- Total Objects: 100 Billion.
- Avg Size: 1 MB.
- Total Data: 100 Billion * 1 MB = 100 Petabytes (PB).
- Growth: 10% month-over-month.
Throughput
- Read QPS: 100,000 QPS.
- Write QPS: 10,000 QPS.
- Bandwidth: If avg request is 1MB, 100k QPS = 100 GB/s outbound.
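The estimates above can be reproduced (and extended to the growth projection) in a few lines, using only the figures from this section:

```python
# Capacity estimation using the figures above (decimal units: 1 MB = 1e6 bytes).
objects = 100e9          # 100 billion objects
avg_size_bytes = 1e6     # 1 MB average object size

total_bytes = objects * avg_size_bytes
print(total_bytes / 1e15, "PB")       # 100.0 PB

read_qps = 100_000
bandwidth_out = read_qps * avg_size_bytes
print(bandwidth_out / 1e9, "GB/s")    # 100.0 GB/s outbound

# 10% month-over-month growth compounds quickly:
after_one_year = total_bytes * 1.10 ** 12
print(round(after_one_year / 1e15, 1), "PB")   # ~313.8 PB after a year
```

Note how the growth assumption dominates: 10% monthly compounds to roughly 3x the footprint within a single year.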
4. System APIs
S3 uses a RESTful API.
4.1 Bucket Operations
PUT /my-bucket
DELETE /my-bucket
4.2 Object Operations
PUT /my-bucket/photo.jpg
Body: <binary_data>
GET /my-bucket/photo.jpg
Response: 200 OK, Body: <binary_data>
4.3 Multipart Upload
For files > 100MB.
- Initiate: `POST /bucket/file?uploads` → Returns `UploadId`.
- Upload Part: `PUT /bucket/file?partNumber=1&uploadId=xyz`.
- Complete: `POST /bucket/file?uploadId=xyz` (merges parts).
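The client-side bookkeeping behind these three calls is simple: split the file into numbered parts, upload each part (possibly in parallel), then send the part list in the Complete call. A minimal splitter, scaled down to bytes for illustration (the function name is ours, not the real SDK's):

```python
def split_into_parts(data: bytes, part_size: int) -> list[tuple[int, bytes]]:
    """Split a blob into (part_number, chunk) pairs; part numbers start at 1."""
    return [
        (i + 1, data[offset:offset + part_size])
        for i, offset in enumerate(range(0, len(data), part_size))
    ]

# A 250 MB file with 100 MB parts yields parts 1 and 2 (full) plus part 3 (50 MB).
# Scaled down here: 250 bytes with 100-byte parts.
parts = split_into_parts(b"x" * 250, part_size=100)
print([(n, len(chunk)) for n, chunk in parts])   # [(1, 100), (2, 100), (3, 50)]
```

The part numbers matter: the Complete call reassembles chunks in part-number order, which is what lets parts arrive out of order over parallel connections.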
5. Data Model (Metadata & Block Store)
We separate Metadata from Data.
Metadata Store (Key-Value)
Stores attributes: Name, Size, Owner, ACLs, Location (Pointer to Block Store). Think of this as the Librarian’s Index: it doesn’t hold the book itself, just the card telling you exactly which shelf and aisle to check.
- Key: `BucketName + ObjectName`.
- Value: JSON metadata + list of Block IDs.
- Tech Choice: NewSQL (CockroachDB/Spanner) or Sharded KV (Cassandra/DynamoDB) with Paxos for Strong Consistency.
Block Store (Blob)
Stores the immutable bits. Think of this as the Warehouse: massive, cavernous space optimized purely for storing the heavy lifting (the actual files), entirely unaware of what the files contain or who owns them.
- Filesystem: Custom lightweight FS (like Facebook Haystack) optimized for large sequential writes.
- Addressing: Blocks are addressed by `BlockID` (a UUID).
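Under this model, a metadata record is just a small value keyed by `BucketName + ObjectName`, pointing into the Block Store. A toy in-memory version (field names are illustrative assumptions, not S3's actual schema):

```python
from dataclasses import dataclass

@dataclass
class ObjectMeta:
    size: int
    owner: str
    block_ids: list[str]   # pointers into the Block Store ("shelf and aisle")
    version: int = 1

# Stand-in for the sharded, strongly consistent KV store.
metadata_store: dict[str, ObjectMeta] = {}

def meta_key(bucket: str, obj: str) -> str:
    return f"{bucket}/{obj}"

metadata_store[meta_key("my-bucket", "photo.jpg")] = ObjectMeta(
    size=1_048_576, owner="alice", block_ids=["blk-7f3a", "blk-91c2"]
)
print(metadata_store[meta_key("my-bucket", "photo.jpg")].block_ids)
```

A `GET` then has two hops: look up the key in the metadata store, then fetch the listed blocks from the warehouse. The data path never touches the metadata store again.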
6. Architecture (High-Level Design)
Architecture separating Metadata and Data planes.
(Diagram: blocks are striped across storage nodes as data chunks — Data 1–4 — plus parity chunks — Parity 1–2.)
- Client sends `PUT /bucket/file.jpg`.
- API Node authenticates the request.
- Metadata Service checks that the bucket exists and authorizes the user.
- Placement Service allocates a `BlockID` and determines which Storage Nodes to write to.
- API Node streams data to the Storage Nodes (using Erasure Coding).
- Once the data is durable (written to a quorum), the Metadata Service commits the object (mapping `file.jpg` → `BlockID`).
7. Localized Details (Component Deep Dive)
11 9s Durability: Erasure Coding
Storing 3 copies of 100 PB means storing 300 PB. That is too expensive ($).
- Replication: 200% overhead (3 copies). Safe but wasteful.
- Erasure Coding (EC): Breaks data into `N` data chunks and `K` parity chunks.
- Analogy: Imagine you have a top-secret document. Instead of printing 3 full copies (wasting paper), you shred the document into 10 pieces and create 4 mathematical "recovery" pieces. As long as you have any 10 of the 14 pieces, you can recreate the document.
- Reed-Solomon (10, 4): Split file into 10 parts. Calculate 4 parity parts.
- Overhead: Only 40% (vs 200%).
- Durability: Can lose ANY 4 drives and still recover.
- Trade-off: High CPU usage for calculation, but storage savings are worth it.
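Production systems use Reed-Solomon codes, whose Galois-field arithmetic is too long for a snippet. But the core idea — parity lets you *rebuild* a lost chunk instead of storing a full copy — can be shown with the simplest possible code, single-parity XOR (tolerates 1 loss, ~33% overhead in this toy 3+1 layout):

```python
def xor_chunks(chunks: list[bytes]) -> bytes:
    """XOR equal-length chunks together (the same op serves encode and recover)."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

data_chunks = [b"AAAA", b"BBBB", b"CCCC"]   # 3 data chunks
parity = xor_chunks(data_chunks)            # 1 parity chunk (~33% overhead)

# Simulate losing chunk index 1: rebuild it from the survivors plus parity.
survivors = [data_chunks[0], data_chunks[2], parity]
recovered = xor_chunks(survivors)
print(recovered == data_chunks[1])   # True
```

Reed-Solomon generalizes this: with 10+4, *any* 10 of the 14 chunks suffice to reconstruct the data, at the cost of the heavier per-write math the trade-off above mentions.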
Strong Consistency (The 2020 Shift)
For years, S3 was Eventually Consistent: after overwriting an object, a read might briefly return the old version. In 2020, AWS moved S3 to Strong Consistency.
- Practical Example: Previously, if you uploaded a new profile picture and immediately hit refresh, you might temporarily see your old picture because the updated pointer hadn't propagated globally. Now, a successful `PUT` guarantees the next `GET` sees the new image.
- How?: The Metadata layer now uses a Distributed Consensus Algorithm (likely variants of Paxos or Raft) for every single write.
- Why now?: Hardware got faster. Network latency dropped. CPU is cheaper. The overhead of consensus is now negligible compared to the network transfer time of the data blob.
- Cache Coherency: They also implemented a system to actively invalidate caches across the fleet immediately upon commit.
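One way to picture what the consensus-backed metadata layer guarantees is a versioned compare-and-set: a commit only succeeds if it builds on the version the writer last observed. This is a toy single-node model of our design, not AWS's implementation; the real system replicates this decision across nodes via Paxos/Raft:

```python
class MetadataService:
    """Toy strongly consistent metadata store: serialized, versioned commits."""

    def __init__(self) -> None:
        self._table: dict[str, tuple[int, str]] = {}  # key -> (version, block_id)

    def get(self, key: str):
        return self._table.get(key)

    def commit(self, key: str, block_id: str, expected_version: int) -> bool:
        current = self._table.get(key, (0, ""))[0]
        if current != expected_version:
            return False                   # lost the race; caller must re-read
        self._table[key] = (current + 1, block_id)
        return True

svc = MetadataService()
ok_first = svc.commit("b/photo.jpg", "blk-1", expected_version=0)   # wins
ok_stale = svc.commit("b/photo.jpg", "blk-2", expected_version=0)   # stale, rejected
print(ok_first, ok_stale, svc.get("b/photo.jpg"))   # True False (1, 'blk-1')
```

Because every commit passes through this single serialization point, a `GET` that runs after a successful `PUT` can never observe the older pointer.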
Multipart Upload
Uploading a 5GB file in one stream is risky. If it fails at 99%, you retry from zero.
- Parallelism: Break file into 50 chunks of 100MB. Upload them in parallel.
- Resiliency: If chunk 45 fails, retry only chunk 45.
- Throughput: Maximize bandwidth by saturating multiple TCP connections.
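The parallel-with-retry pattern can be sketched with a thread pool. Here `upload_part` is a hypothetical stand-in that fails once per part to simulate a flaky network; the point is that only the failed part is retried, never the whole file:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_failed_once: set = set()
_lock = threading.Lock()

def upload_part(part_number: int) -> int:
    """Hypothetical uploader: simulates a transient failure on each part's first try."""
    with _lock:
        first_try = part_number not in _failed_once
        _failed_once.add(part_number)
    if first_try:
        raise IOError(f"part {part_number} failed mid-transfer")
    return part_number

def upload_with_retry(part_number: int, attempts: int = 3) -> int:
    for _ in range(attempts):
        try:
            return upload_part(part_number)   # retry ONLY this part
        except IOError:
            pass
    raise IOError(f"part {part_number} exhausted retries")

# 50 parts of a 5 GB file (100 MB each), pushed over 8 parallel connections.
with ThreadPoolExecutor(max_workers=8) as pool:
    done = sorted(pool.map(upload_with_retry, range(1, 51)))
print(len(done))   # 50
```

Every part fails once in this simulation, yet the upload completes: the retries are localized and cheap, exactly the resiliency property the bullets above describe.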
[!NOTE] War Story: The “Thundering Herd” of Cloud Storage A major media company once experienced a thundering herd problem when a highly anticipated 10GB video file was published. Thousands of clients tried to fetch it simultaneously, overwhelming the network bandwidth of the specific storage nodes hosting that object. They solved this by introducing an edge CDN cache to collapse the simultaneous requests, and utilizing S3’s Multipart Upload to pre-warm the distributed file chunks.
8. Scale & Requirements Traceability
| Requirement | Design Decision | Justification |
|---|---|---|
| 11 9s Durability | Erasure Coding (10+4) | Tolerates the simultaneous loss of any 4 chunks (disks/nodes) with only 40% storage overhead. |
| Scalability | Separated Control/Data Plane | Metadata scales independently of Storage. Data path bypasses metadata bottleneck. |
| Cost | Tiered Storage (Glacier) | Move cold objects to cheaper, slower media (Tape/HDD) automatically. |
| Performance | Multipart Upload | Parallelizes writes to maximize throughput and fault tolerance. |
| Consistency | Consensus (Paxos) | Ensures Metadata updates are atomic and strongly consistent. |
9. Observability & Metrics
Key Metrics
- Durability: Checksums. Background scrubbers constantly read data to verify integrity.
- Availability: Error Rate (5xx).
- Latency: Time to First Byte (TTFB).
- Storage Efficiency: (Used Space / Raw Space). Monitor overhead of Erasure Coding.
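The scrubbing mentioned above reduces to "recompute the checksum, compare against the one stored at write time." A minimal sketch using SHA-256 (the block IDs and corruption are fabricated for the demo):

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Checksums are recorded at write time, alongside each block.
blocks = {"blk-1": b"hello", "blk-2": b"world"}
stored_sums = {bid: checksum(data) for bid, data in blocks.items()}

# Simulate silent bit rot on blk-2.
blocks["blk-2"] = b"worl\x00"

corrupted = [bid for bid, data in blocks.items()
             if checksum(data) != stored_sums[bid]]
print(corrupted)   # ['blk-2'] -> trigger repair from erasure-coded peers
```

A flagged block is then reconstructed from its surviving data and parity chunks, the same repair path used when a disk dies outright.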
10. Deployment Strategy
Immutable Infrastructure
We never patch storage nodes. We replace them.
- Data Migration: When a disk is retiring, the system treats it as “failed” and reconstructs its data onto a new node using Erasure Coding.
- Zone Deployment: Updates are rolled out one Availability Zone at a time.
11. Interview Gauntlet
Rapid Fire Questions
- Why use Erasure Coding over Replication? Replication (3x) wastes 200% storage. EC (10+4) only wastes 40% for higher durability. At Exabyte scale, this saves billions of dollars.
- How does S3 handle small files? Small files cause metadata bloat and disk fragmentation. S3 aggregates small objects into larger 100MB “containers” or “shards” before writing to disk.
- What happens if two users write the same key at the same time? Last Write Wins. The Metadata service serializes the commit requests. The one processed last overwrites the pointer.
- Is S3 a filesystem? No. It is a Key-Value store. It does not support `rename` (move) efficiently. Renaming a "folder" `foo/` to `bar/` requires rewriting every single object inside with the new key.
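The cost of renaming a "folder" is easy to see in a toy flat keyspace: there is no directory entry to update, so every key under the prefix must be copied to its new key and the old key deleted — O(n) in the number of objects (a sketch, not S3's API):

```python
store = {
    "foo/a.txt": b"1",
    "foo/b.txt": b"2",
    "other/c.txt": b"3",
}

def rename_prefix(store: dict, old: str, new: str) -> int:
    """Rewrite every key under `old` to live under `new`. Touches O(n) objects."""
    moved = 0
    for key in [k for k in store if k.startswith(old)]:
        store[new + key[len(old):]] = store.pop(key)
        moved += 1
    return moved

print(rename_prefix(store, "foo/", "bar/"))   # 2 objects rewritten
print(sorted(store))   # ['bar/a.txt', 'bar/b.txt', 'other/c.txt']
```

On a real filesystem the same rename is a single metadata update regardless of how many files the directory holds, which is exactly why S3 is not a filesystem.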
12. Interactive Decision Visualizer: Erasure Coding
See how Reed-Solomon encoding works. We split data and generate parity. You can “destroy” chunks and see if the data survives.
13. Summary
- Erasure Coding: The key to 11 9s durability without 300% storage cost.
- Strong Consistency: Achieved via Paxos on the Metadata layer.
- Multipart Upload: Essential for performance and reliability on large files.
- Separation: Metadata scaling (LSM/NewSQL) is handled separately from Blob storage.