Design a Pastebin
1. What is Pastebin?
Pastebin is a web service where users can upload snippets of text (usually code) and share them via a link. It is widely used by developers to share logs, config files, or code fragments.
Real-World Examples:
- Pastebin.com: The original.
- GitHub Gist: Version-controlled snippets.
- Hastebin: Minimalist version.
[!TIP] Interview Insight: This problem is very similar to “Design a URL Shortener” but with one major difference: Data Size. URL Shortener stores tiny strings (< 1KB). Pastebin stores potentially large text blobs (up to 10MB). This changes the storage layer completely.
2. Requirements & Goals
Functional Requirements
- Paste: Users can upload a block of text and get a unique URL.
- View: Users can view the paste by accessing the URL.
- Expiration: Pastes should expire after a set time (or never).
- Custom URL: Optional custom alias.
Non-Functional Requirements
- Reliability: Pastes must not be lost (Durability).
- Latency: Fast upload and download.
- Traffic: Read-heavy, but writes can be significant if used for logging pipelines.
Extended Requirements
- Analytics: Track how many times a paste is viewed.
- Edit: Allow users to edit pastes (requires account system).
- Privacy: Password-protected pastes or “Burn on Read”.
3. Capacity Estimation
Traffic Estimates
- 1 Million new pastes per day.
- Write QPS: $1,000,000 / 86400 \approx 12$ writes/sec.
- Read QPS (5:1 ratio): $\approx 60$ reads/sec.
- Peak: Assume 5x traffic spikes (e.g., during a major outage where everyone shares logs).
Storage Estimates
- Average paste size: 10 KB. Max size: 10 MB.
- Daily Storage: $1M \times 10 \text{ KB} = 10 \text{ GB} / \text{day}$.
- 10-Year Storage: $10 \text{ GB} \times 365 \times 10 \approx 36 \text{ TB}$.
[!WARNING] 36 TB is significant. We cannot store this all in a single MySQL database instance efficiently. We need a Blob Store.
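The arithmetic above can be double-checked with a quick script (decimal units assumed; a back-of-the-envelope sketch, not production code):

```python
# Quick sanity check of the capacity estimates (decimal units, rounded).
DAILY_PASTES = 1_000_000
READ_WRITE_RATIO = 5
AVG_PASTE_KB = 10

write_qps = DAILY_PASTES / 86_400                        # ~11.6, rounded to ~12 writes/sec
read_qps = round(write_qps) * READ_WRITE_RATIO           # ~60 reads/sec
daily_storage_gb = DAILY_PASTES * AVG_PASTE_KB / 1e6     # 10 GB/day
ten_year_storage_tb = daily_storage_gb * 365 * 10 / 1e3  # ~36.5 TB

print(f"write QPS ~{round(write_qps)}, read QPS ~{read_qps} (peak ~5x)")
print(f"storage: {daily_storage_gb:.0f} GB/day, {ten_year_storage_tb:.1f} TB over 10 years")
```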
4. System APIs
Create Paste
POST /api/v1/paste
Request Body:
{
"content": "print('hello world')...", // Only for small pastes
"expire_minutes": 60,
"is_private": false
}
Response:
{
"url": "https://paste.bin/x9K2",
"key": "x9K2",
"uploadUrl": "https://s3.aws.com/..." // For large pastes
}
Get Paste
GET /api/v1/paste/{key}
Response: Returns the raw text or an HTML view.
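A minimal client-side sketch of these two calls using Python's `requests`, assuming the example host `paste.bin` and the field names shown above (both are illustrative):

```python
import requests

BASE = "https://paste.bin/api/v1"   # host taken from the example response; illustrative only

# Create a small paste inline (large pastes would use the returned uploadUrl instead)
created = requests.post(f"{BASE}/paste", json={
    "content": "print('hello world')",
    "expire_minutes": 60,
    "is_private": False,
}).json()
print(created["url"])               # e.g. https://paste.bin/x9K2

# Read it back as raw text
raw = requests.get(f"{BASE}/paste/{created['key']}")
print(raw.text)
```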
5. Database Design: The Split Strategy
This is the most critical part of the design. We must Separate Metadata from Data.
The Wrong Way: Everything in SQL
| ID | Content (TEXT/BLOB) | Created_At |
|:---|:--------------------|:-----------|
| 1  | [10MB Data…]        | 2023-01-01 |
Why this fails (Buffer Pool Pollution): Databases like MySQL/PostgreSQL use a memory cache called the Buffer Pool to keep frequently accessed data (hot pages) in RAM.
- If you store large BLOBs (10MB) in the same table, reading a row drags that huge blob into memory.
- This pushes out hundreds of other small, useful pages (like indexes or user data).
- Result: Your Cache Hit Ratio drops, Disk I/O spikes, and the DB slows to a crawl.
The Right Way: Metadata + Object Store
We split the data into two stores:
- Metadata Database (SQL - labeled "Metadata DB"): Stores lightweight info (User, Expiry, Path).
  - Size: ~100 bytes per row.
  - Speed: Extremely fast lookups.
- Object Store (S3 - labeled "Object Store"): Stores the actual content as a file.
  - Size: 1KB to 10MB.
  - Speed: High throughput, lower storage cost.
Metadata SQL Schema:
CREATE TABLE pastes (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
short_key VARCHAR(7) NOT NULL UNIQUE, -- The Base62 key
user_id BIGINT, -- Foreign Key to Users
s3_path VARCHAR(255) NOT NULL, -- Pointer to S3 bucket (e.g., /2023/10/abc.txt)
expiration TIMESTAMP,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_user_id ON pastes(user_id); -- Fast lookups for "My Pastes"
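The `short_key` column is described as a Base62 key. A minimal sketch of one common way to produce such keys, encoding a numeric ID into a 7-character Base62 string (the walkthrough later assumes a Redis KGS, which could pre-generate keys like this):

```python
import string

# 0-9, a-z, A-Z -> 62 symbols
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(n: int, min_len: int = 7) -> str:
    """Encode a non-negative integer as a Base62 string padded to min_len characters."""
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out)).rjust(min_len, ALPHABET[0])

# 62^7 is roughly 3.5 trillion keys: plenty for 1M pastes/day over 10 years (~3.65B)
print(base62_encode(123_456_789))   # a 7-character key
```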
6. High-Level Design
The core architecture relies on splitting metadata (SQL) from payload (Object Store).
[Architecture diagram: the App Server splits data and handles reads; the Metadata DB stores the ShortKey and S3 Path (URL); the Object Store holds the Content Text at PB+ scale; a cache serves Hot Pastes.]
7. Component Design: Presigned URLs
In the design above, the App Server is a bottleneck because it must process the 10MB payload (ingress) and upload it to S3 (egress). This burns CPU and bandwidth.
Optimization: Presigned URLs (Direct Upload). We can let the Client talk directly to S3.
- Client requests an "Upload Token" from the App Server.
- App Server generates a Presigned URL (a temporary, secure link to S3) and gives it to Client.
PUT https://my-bucket.s3.amazonaws.com/abc.txt?signature=...
- Client uploads the 10MB file directly to S3 using that URL.
- App Server is bypassed for the heavy lifting.
Trade-off: Slightly more complex client logic, but massive savings on App Server bandwidth.
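A hedged sketch of step 2 using boto3's `generate_presigned_url`; the bucket name, key layout, and expiry are placeholders:

```python
import boto3

s3 = boto3.client("s3")

def create_upload_url(short_key: str, bucket: str = "pastes-bucket") -> str:
    """Return a temporary URL the client can PUT the raw paste content to."""
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": f"pastes/{short_key}.txt"},
        ExpiresIn=300,  # link expires after 5 minutes
    )

# The App Server returns this as "uploadUrl"; the client then PUTs the 10MB body
# straight to S3, so the heavy bytes never pass through the App Server.
```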
8. Data Partitioning & Sharding
As we scale to billions of pastes, the Metadata DB (SQL) needs to be sharded.
Sharding Key
We use the short_key as the sharding key.
- Strategy: Hash Sharding (`hash(short_key) % N`).
- Why?: Ensures even distribution of metadata across database nodes.
Object Store Sharding
S3 handles sharding automatically. However, to optimize S3 performance (which uses prefixes for partitioning), we should random-prefix our keys.
- Bad: `pastes/2023/10/01/abc.txt` (sequential dates create hotspots).
- Good: `pastes/a1b2/abc.txt` (hash prefix spreads load).
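A minimal sketch of both ideas, assuming `NUM_SHARDS` metadata shards and a short hash-derived prefix for the S3 key (helper names are illustrative):

```python
import hashlib

NUM_SHARDS = 16   # illustrative number of metadata DB shards

def metadata_shard(short_key: str) -> int:
    """hash(short_key) % N picks the metadata shard for this paste."""
    digest = hashlib.md5(short_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def s3_object_key(short_key: str) -> str:
    """A short hash-derived prefix spreads objects across S3 partitions."""
    prefix = hashlib.md5(short_key.encode()).hexdigest()[:4]
    return f"pastes/{prefix}/{short_key}.txt"

print(metadata_shard("abc"))   # -> 2
print(s3_object_key("abc"))    # -> pastes/9001/abc.txt
```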
9. Reliability, Caching, & Load Balancing
Content Delivery Network (CDN)
For popular pastes (e.g., a viral config file), S3 latency might be too high.
- Solution: Put a CDN (CloudFront/Cloudflare) in front of S3.
- Flow: Client -> CDN -> S3.
- Cache Headers: Set `Cache-Control: public, max-age=86400` for public pastes (see the sketch below).
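One way to apply that policy, sketched here as an assumption rather than the only option, is to stamp `Cache-Control` onto the S3 object at upload time so CloudFront and browsers inherit it:

```python
import boto3

s3 = boto3.client("s3")

# Stamping Cache-Control on the object itself means CloudFront and browsers
# cache public pastes for a day without extra origin configuration.
s3.put_object(
    Bucket="pastes-bucket",                 # placeholder bucket name
    Key="pastes/9001/abc.txt",
    Body=b"print('hello world')",
    ContentType="text/plain",
    CacheControl="public, max-age=86400",   # public pastes only; private ones should not be cached
)
```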
S3 Durability
S3 provides “11 9s” of durability ($99.999999999\%$). It achieves this by replicating data across at least 3 Availability Zones (AZs) automatically. We don’t need to manually replicate blobs.
10. System Walkthrough (Dry Run)
Scenario: Creating a Paste (Write)
- Client Request: Client sends a 5MB log file to `POST /paste`.
- Auth & Quota: App Server checks that the user is logged in and hasn't exceeded daily limits.
- KGS: App Server fetches a unique key `abc` from the Redis KGS.
- Split Write (sketched in code after this walkthrough):
  - Async: App Server uploads the 5MB file to the S3 bucket `pastes-bucket` at key `abc`.
  - Sync: App Server inserts metadata (User ID, Expiration, S3 path) into the SQL DB.
- Response: App Server returns `https://pastebin.com/abc`.
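A condensed sketch of this write path; the Redis and DB handles are placeholders, and the S3 upload is done before the DB insert, matching the ordering recommended in the Interview Gauntlet below:

```python
import boto3
import redis

s3 = boto3.client("s3")
kgs = redis.Redis()   # Key Generation Service: a Redis set of pre-generated free keys

def create_paste(db, content: bytes, user_id: int, expire_minutes: int) -> str:
    """db is any DB-API cursor pointing at the metadata database (placeholder)."""
    short_key = kgs.spop("free_keys").decode()   # 1. take a unique key from the KGS
    s3_path = f"pastes/{short_key}.txt"
    # 2. Upload the (potentially large) content to the object store first
    s3.put_object(Bucket="pastes-bucket", Key=s3_path, Body=content)
    # 3. Only then write the lightweight metadata row (avoids "ghost pastes")
    db.execute(
        "INSERT INTO pastes (short_key, user_id, s3_path, expiration) "
        "VALUES (%s, %s, %s, NOW() + INTERVAL %s MINUTE)",
        (short_key, user_id, s3_path, expire_minutes),
    )
    return f"https://pastebin.com/{short_key}"
```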
Scenario: Viewing a Paste (Read)
- Client Request: User visits `https://pastebin.com/abc`.
- CDN Check: Request hits CloudFront. If cached, return immediately (~10ms).
- App Server: On a CDN miss, the request hits the App Server.
- Metadata Lookup: App Server queries SQL: `SELECT s3_path FROM pastes WHERE short_key = 'abc'`.
- Content Fetch: App Server streams the file from S3 (or generates a signed URL for the client to fetch it directly); see the sketch after this list.
- Render: App Server wraps the content in HTML (syntax highlighting) and returns it.
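And the matching read path on a CDN miss, again as a sketch with a placeholder DB cursor:

```python
import boto3

s3 = boto3.client("s3")

def get_paste(db, short_key: str) -> str:
    """Read path on a CDN miss; db is a placeholder DB-API cursor."""
    # 1. Metadata lookup: a tiny indexed read that never touches the blob
    db.execute("SELECT s3_path FROM pastes WHERE short_key = %s", (short_key,))
    row = db.fetchone()
    if row is None:
        raise KeyError("paste not found or expired")
    (s3_path,) = row
    # 2. Stream the content from the object store (or return a presigned GET URL instead)
    body = s3.get_object(Bucket="pastes-bucket", Key=s3_path)["Body"]
    return body.read().decode()
```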
11. Interview Gauntlet
- Why not store the text in the Database?
  - Ans: Buffer Pool Pollution. Large blobs evict hot pages (indexes) from RAM, destroying performance.
- How do you handle "Burn on Read" (delete after view)?
  - Ans: Add a `burn_count` flag. On read, check the flag. If set, return the data and immediately trigger an async job to delete it from the DB and S3 (see the sketch after this list).
- How do you encrypt the data?
  - Ans: Encrypt at rest (S3 Server-Side Encryption) and in transit (TLS). For end-to-end encryption, the Client must encrypt before uploading, and the Server never sees the key (e.g., the key lives in the URL hash fragment `#key`, which is never sent to the server).
- What if the S3 upload fails but the DB insert succeeds?
  - Ans: We end up with a "Ghost Paste" (metadata exists, but the content is missing).
  - Fix: Upload to S3 first. Only if that succeeds, insert into the DB. If S3 succeeds but the DB insert fails, we have an orphaned file (handled by S3 Lifecycle rules or a cleanup job).
- How do we handle popular pastes?
  - Ans: Use a CDN. Cache the HTML response.
- Can we use a NoSQL DB instead of SQL for metadata?
  - Ans: Yes, DynamoDB is a great fit. Partition Key = `short_key`. It scales better than MySQL for simple KV lookups.
- How do you calculate the cost?
  - Ans: S3 is ~$0.023/GB-month, so 36 TB ≈ $828/month. Database storage would be 5-10x more expensive.
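For the Burn-on-Read answer above, a hedged sketch of the read-side check; the `burn_count` flag follows the answer, and `enqueue_delete` is a hypothetical job-queue hook:

```python
import boto3

s3 = boto3.client("s3")

def enqueue_delete(short_key: str, s3_path: str) -> None:
    """Hypothetical async hook: in a real system this would push a job to a queue
    that deletes the metadata row and the S3 object."""
    ...

def read_paste_with_burn(db, short_key: str) -> str:
    """Return the content; if the burn flag is set, schedule deletion right after."""
    db.execute(
        "SELECT s3_path, burn_count FROM pastes WHERE short_key = %s",
        (short_key,),
    )
    row = db.fetchone()
    if row is None:
        raise KeyError("paste not found")
    s3_path, burn_count = row

    # Fetch the content before anything gets deleted
    content = s3.get_object(Bucket="pastes-bucket", Key=s3_path)["Body"].read().decode()

    if burn_count:
        # Flag is set: hand the data back, then delete asynchronously
        enqueue_delete(short_key=short_key, s3_path=s3_path)

    return content
```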
12. Whiteboard Summary
1. Core Requirements
- Paste (Write): 12 QPS
- View (Read): 60 QPS
- Data: 10KB - 10MB blobs
- Split Architecture
2. Key Architecture
- Metadata: SQL/DynamoDB (Fast)
- Content: S3 Object Store (Cheap)
- Optimization: Presigned URLs
- Caching: CDN + Redis
3. Data Models
- Meta Table: `key`, `user_id`, `s3_path`
- S3 Key: `random_prefix/key`
- Cleanup: Passive (TTL) + Active (Cron)
4. Bottlenecks & Fixes
- Bandwidth: Use Presigned URLs.
- Latency: Use CDN for reads.
- DB Load: Shard by `short_key`.
- Cost: Tiered Storage (S3 Glacier).