Design a Pastebin
[!IMPORTANT] In this lesson, you will master:
- What is Pastebin?: Building intuition for the problem space.
- Requirements & Goals: Defining functional and non-functional requirements.
- Capacity Estimation: Back-of-the-envelope sizing of traffic and storage.
1. What is Pastebin?
Pastebin is a web service where users can upload snippets of text (usually code) and share them via a link. It is widely used by developers to share logs, config files, or code fragments.
Real-World Examples:
- Pastebin.com: The original.
- GitHub Gist: Version-controlled snippets.
- Hastebin: Minimalist version.
[!TIP] Interview Insight: This problem is very similar to “Design a URL Shortener” but with one major difference: Data Size. URL Shortener stores tiny strings (< 1KB). Pastebin stores potentially large text blobs (up to 10MB). This changes the storage layer completely.
2. Requirements & Goals (P - Problem & Requirements)
Functional Requirements
- Paste: Users can upload a block of text and get a unique URL.
- View: Users can view the paste by accessing the URL.
- Expiration: Pastes should expire after a set time (or never).
- Custom URL: Optional custom alias.
Non-Functional Requirements
- Reliability: Pastes must not be lost (Durability).
- Latency: Fast upload and download.
- Traffic: Read-heavy, but writes can be significant if used for logging pipelines.
Extended Requirements
- Analytics: Track how many times a paste is viewed.
- Edit: Allow users to edit pastes (requires account system).
- Privacy: Password-protected pastes or “Burn on Read”.
3. Capacity Estimation (E - Estimation)
Traffic Estimates
- 1 Million new pastes per day.
- Write QPS: 1,000,000 / 86,400 ≈ 12 writes/sec.
- Read QPS (5:1 read:write ratio): ≈ 60 reads/sec.
- Peak: Assume 5x traffic spikes (e.g., during a major outage where everyone shares logs).
Storage Estimates
- Average paste size: 10 KB. Max size: 10 MB.
- Daily Storage: 1M × 10 KB = 10 GB/day.
- 10-Year Storage: 10 GB × 365 × 10 ≈ 36 TB.
[!WARNING] 36 TB is significant. We cannot store this all in a single MySQL database instance efficiently. We need a Blob Store.
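The estimates above are simple arithmetic; a quick sanity-check script (numbers taken directly from the text):

```python
# Back-of-the-envelope capacity estimation for the Pastebin design.
SECONDS_PER_DAY = 86_400

new_pastes_per_day = 1_000_000
write_qps = new_pastes_per_day / SECONDS_PER_DAY   # ~12 writes/sec
read_qps = write_qps * 5                           # 5:1 read:write ratio
peak_read_qps = read_qps * 5                       # 5x spike headroom

avg_paste_kb = 10
daily_storage_gb = new_pastes_per_day * avg_paste_kb / 1_000_000  # KB -> GB
ten_year_storage_tb = daily_storage_gb * 365 * 10 / 1_000         # GB -> TB

print(f"write QPS ~{write_qps:.0f}, read QPS ~{read_qps:.0f}, peak reads ~{peak_read_qps:.0f}")
print(f"daily: {daily_storage_gb:.0f} GB, 10-year: {ten_year_storage_tb:.1f} TB")
```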
4. System APIs (D - Data Model)
Create Paste
POST /api/v1/paste
Request Body:
{
  "content": "print('hello world')...",  // Only for small pastes
  "expire_minutes": 60,
  "is_private": false
}
Response:
{
  "url": "https://paste.bin/x9K2",
  "key": "x9K2",
  "uploadUrl": "https://s3.aws.com/..."  // For large pastes
}
Get Paste
GET /api/v1/paste/{key}
Response: Returns the raw text or an HTML view.
5. Database Design: The Split Strategy (D - Data Model)
This is the most critical part of the design. We must Separate Metadata from Data.
The Wrong Way: Everything in SQL
| ID | Content (TEXT/BLOB) | Created_At |
|---|---|---|
| 1 | [10MB Data…] | 2023-01-01 |
Why this fails: Buffer Pool Pollution
Databases like MySQL/PostgreSQL use a memory cache called the Buffer Pool to keep frequently accessed data (hot pages) in RAM.
- The Mechanism: The DB reads data in “Pages” (usually 16 KB).
- The Problem: If you store large BLOBs (10MB) in the same table, reading a single row forces the DB to load hundreds of pages into RAM.
- The Consequence: These huge blobs push out hundreds of other small, useful pages (like indexes or user session data).
- Result: Your Cache Hit Ratio drops near zero, Disk I/O spikes to 100%, and the DB slows to a crawl.
[!NOTE] War Story: The E-Commerce Outage A major e-commerce company once stored user profile images directly in their primary user database table as BLOBs. During a marketing push, a sudden spike in profile views forced the database to load massive image files into RAM. This immediately evicted the critical session-state indexes from the Buffer Pool. The result? The entire login system ground to a halt because simple username lookups suddenly required slow disk reads. They fixed it by migrating the images to an S3-compatible Object Store, restoring database performance.
The Right Way: Metadata + Object Store
We split the data into two stores:
- Metadata Database (SQL): Stores lightweight info (User, Expiry, S3 Path).
  - Size: ~100 bytes per row.
  - Speed: Extremely fast lookups.
- Object Store (S3): Stores the actual content as a file.
  - Size: 1 KB to 10 MB.
  - Speed: High throughput, cheaper cost.
Metadata SQL Schema:
CREATE TABLE pastes (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  short_key VARCHAR(7) NOT NULL UNIQUE,  -- The Base62 key
  user_id BIGINT,                        -- Foreign Key to Users
  s3_path VARCHAR(255) NOT NULL,         -- Pointer to S3 bucket (e.g., /2023/10/abc.txt)
  expiration TIMESTAMP,
  created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_user_id ON pastes(user_id);  -- Fast lookups for "My Pastes"
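The short_key column holds a Base62 encoding of a numeric ID (e.g., the auto-increment primary key). A minimal encoder sketch:

```python
# Base62 alphabet: digits, lowercase, uppercase -> 62 symbols total.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def base62_encode(n: int) -> str:
    """Encode a non-negative integer (e.g., an auto-increment ID) as Base62."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))
```

Seven Base62 characters cover 62^7 ≈ 3.5 trillion keys, which is why VARCHAR(7) is enough for billions of pastes.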
6. High-Level Design (A - Architecture)
Interview-Friendly High-Level Diagram
This is the simplified version of the architecture you should draw on the whiteboard.
graph TD
    User([User]) --> LB[Load Balancer]
    LB --> API[API Gateway]
    subgraph Service Layer
        API --> WriteSvc[Write/Upload Service]
        API --> ReadSvc[Read/Download Service]
        WriteSvc --> KGS[Key Gen Service]
    end
    subgraph Storage Layer
        WriteSvc --> DB[(Metadata DB)]
        ReadSvc --> DB
        WriteSvc --> S3[(S3 Object Store)]
        ReadSvc --> Cache[(Redis Cache)]
        Cache --> S3
    end
The core architecture relies on splitting metadata (SQL) from payload (Object Store).
7. Component Design: Presigned URLs (L - Localized Details)
In the design above, the App Server is a bottleneck because it must process the 10MB payload (ingress) and upload it to S3 (egress). This burns CPU and bandwidth.
Optimization: Presigned URLs (Direct Upload). We can let the Client talk directly to S3:
1. Client requests an upload token from the App Server.
2. App Server generates a Presigned URL (a temporary, secure link to S3) and returns it to the Client:
PUT https://my-bucket.s3.amazonaws.com/abc.txt?signature=...
3. Client uploads the 10MB file directly to S3 using that URL.
4. The App Server is bypassed for the heavy lifting.
Trade-off: Slightly more complex client logic, but massive savings on App Server bandwidth.
8. Data Partitioning & Sharding (S - Scale)
As we scale to billions of pastes, the Metadata DB (SQL) needs to be sharded.
Sharding Key
We use the short_key as the sharding key.
- Strategy: Hash Sharding (hash(short_key) % N).
- Why?: Ensures even distribution of metadata across database nodes.
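A sketch of the shard-routing function. It uses CRC32 as a stable hash, since Python's built-in hash() is randomized per process and therefore unusable for routing; the shard count of 8 is illustrative:

```python
import zlib

NUM_SHARDS = 8  # N metadata DB nodes; illustrative


def shard_for(short_key: str) -> int:
    """Map a short_key to a metadata DB shard: hash(short_key) % N."""
    return zlib.crc32(short_key.encode()) % NUM_SHARDS
```

Note the classic trade-off: with plain modulo, changing N remaps most keys, which is why production systems often prefer consistent hashing when shards are added frequently.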
Object Store Sharding
S3 handles sharding automatically. However, to optimize S3 performance (which uses prefixes for partitioning), we should random-prefix our keys.
- Bad: pastes/2023/10/01/abc.txt (sequential dates create hotspots).
- Good: pastes/a1b2/abc.txt (hash prefix spreads load).
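Deriving the prefix from the key itself keeps the mapping deterministic (the same paste always resolves to the same object path). A sketch; the 4-character MD5 prefix is one reasonable choice, not the only one:

```python
import hashlib


def s3_key_for(short_key: str) -> str:
    """Prefix the object key with 4 hex chars of its hash to spread S3 partitions."""
    prefix = hashlib.md5(short_key.encode()).hexdigest()[:4]
    return f"pastes/{prefix}/{short_key}.txt"
```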
9. Reliability, Caching, & Load Balancing
Content Delivery Network (CDN)
For popular pastes (e.g., a viral config file), S3 latency might be too high.
- Solution: Put a CDN (CloudFront/Cloudflare) in front of S3.
- Flow: Client → CDN → S3.
- Cache Headers: Set Cache-Control: public, max-age=86400 for public pastes.
S3 Durability
S3 provides “11 9s” of durability (99.999999999%). It achieves this by replicating data across at least 3 Availability Zones (AZs) automatically. We don’t need to manually replicate blobs.
10. The Split Write Path (Summary)
A single upload fans out into two writes: tiny metadata goes to SQL (fast, synchronous), while the large content blob goes to S3 (slower, but high-throughput and cheap).
11. System Walkthrough (Dry Run)
Scenario: Creating a Paste (Write)
1. Client Request: Client sends a 5MB log file to POST /paste.
2. Auth & Quota: App Server checks that the user is logged in and hasn’t exceeded daily limits.
3. KGS: App Server fetches a unique key abc from the Redis-backed Key Generation Service.
4. Split Write:
   - App Server uploads the 5MB file to the S3 bucket pastes-bucket at key abc.
   - Only after the S3 upload succeeds, App Server inserts metadata (User ID, Expiration, S3 path) into the SQL DB. (Writing S3-first avoids “ghost pastes” whose metadata points at missing content.)
5. Response: App Server returns https://pastebin.com/abc.
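The write ordering matters: blob to S3 first, metadata only on success, so a mid-write failure leaves at worst an orphaned file (cleanable by a lifecycle job) rather than a ghost paste. A sketch with in-memory stand-ins for S3 and the DB; all names here are illustrative:

```python
class UploadError(Exception):
    pass


fake_s3: dict[str, bytes] = {}   # stand-in for the object store
fake_db: dict[str, dict] = {}    # stand-in for the metadata table


def create_paste(key: str, content: bytes, s3_fails: bool = False) -> str:
    # Step 1: write the blob to the object store first.
    if s3_fails:
        raise UploadError("S3 upload failed; no metadata was written")
    fake_s3[key] = content
    # Step 2: insert metadata only after the blob is durable.
    fake_db[key] = {"s3_path": f"pastes/{key}.txt", "size": len(content)}
    return f"https://pastebin.com/{key}"
```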
Scenario: Viewing a Paste (Read)
1. Client Request: User visits https://pastebin.com/abc.
2. CDN Check: Request hits CloudFront. If cached, return immediately (~10ms).
3. App Server: On a CDN miss, the request hits the App Server.
4. Metadata Lookup: App Server queries SQL: SELECT s3_path FROM pastes WHERE short_key = 'abc'.
5. Content Fetch: App Server streams the file from S3 (or generates a signed URL for the client to fetch it directly).
6. Render: App Server wraps the content in HTML (syntax highlighting) and returns it.
12. Interview Gauntlet
- Why not store the text in the Database?
  - Ans: Buffer Pool Pollution. Large blobs evict hot pages (indexes) from RAM, destroying performance.
- How do you handle “Burn on Read” (delete after view)?
  - Ans: Add a burn_count flag. On read, check the flag. If set, return the data and immediately trigger an async job to delete it from the DB and S3.
- How do you encrypt the data?
  - Ans: Encrypt at rest (S3 Server-Side Encryption) and in transit (TLS). For end-to-end encryption, the Client must encrypt before uploading, and the Server never sees the key (e.g., the key lives in the URL hash fragment #key, which is never sent to the server).
- What if the S3 upload fails but the DB insert succeeds?
  - Ans: We end up with a “Ghost Paste” (metadata exists, but content is missing).
  - Fix: Upload to S3 first. Only if successful, insert into the DB. If S3 succeeds but the DB insert fails, we have an orphaned file (handled by S3 Lifecycle rules or a cleanup job).
- How do we handle popular pastes?
  - Ans: Use a CDN. Cache the HTML response.
- Can we use a NoSQL DB instead of SQL for metadata?
  - Ans: Yes, DynamoDB is a great fit. Partition Key = short_key. It scales better than MySQL for simple KV lookups.
- How do you calculate the cost?
  - Ans: S3 is ~$0.023/GB per month. 36 TB ≈ $828/month. Database storage would be 5-10x more expensive.
- How do you prevent malicious uploads (e.g., malware or illegal content)?
  - Ans: Run an async virus scanner (e.g., ClamAV) via Lambda/background worker when an object is uploaded to S3. Implement a reporting system for users.
- What if an S3 bucket is accidentally deleted?
  - Ans: Enable S3 Object Versioning and Cross-Region Replication (CRR) for disaster recovery.
- How to support extremely large pastes (100MB+)?
  - Ans: Use S3 Multipart Upload directly from the client (via Presigned URLs) to bypass our API gateway limits and avoid blocking server threads.
13. Whiteboard Summary (4 Quadrant)
1. P - Core Requirements
- Paste (Write): 12 QPS
- View (Read): 60 QPS
- Data: 10KB - 10MB blobs
- Split Architecture
2. A - Key Architecture
- Metadata: SQL/DynamoDB (Fast)
- Content: S3 Object Store (Cheap)
- Optimization: Presigned URLs
- Caching: CDN + Redis
3. D - Data Models
- Meta Table: `key`, `user_id`, `s3_path`
- S3 Key: `random_prefix/key`
- Cleanup: Passive (TTL) + Active (Cron)
4. S - Scale & Bottlenecks
- Bandwidth: Use Presigned URLs.
- Latency: Use CDN for reads.
- DB Load: Shard by `short_key`.
- Cost: Tiered Storage (S3 Glacier).