Design a Pastebin
[!IMPORTANT] In this lesson, you will master:
- What is Pastebin?: Building intuition for the problem space.
- Requirements & Goals: Defining functional and non-functional requirements.
- Capacity Estimation: Back-of-the-envelope sizing of traffic and storage.
1. What is Pastebin?
Pastebin is a web service where users can upload snippets of text (usually code) and share them via a link. It is widely used by developers to share logs, config files, or code fragments.
Real-World Examples:
- Pastebin.com: The original.
- GitHub Gist: Version-controlled snippets.
- Hastebin: Minimalist version.
[!TIP] Interview Insight: This problem is very similar to “Design a URL Shortener” but with one major difference: Data Size. URL Shortener stores tiny strings (< 1KB). Pastebin stores potentially large text blobs (up to 10MB). This changes the storage layer completely.
2. Requirements & Goals (P - Problem & Requirements)
Functional Requirements
- Paste: Users can upload a block of text and get a unique URL.
- View: Users can view the paste by accessing the URL.
- Expiration: Pastes should expire after a set time (or never).
- Custom URL: Optional custom alias.
Non-Functional Requirements
- Reliability: Pastes must not be lost (Durability).
- Latency: Fast upload and download.
- Traffic: Read-heavy, but writes can be significant if used for logging pipelines.
Extended Requirements
- Analytics: Track how many times a paste is viewed.
- Edit: Allow users to edit pastes (requires account system).
- Privacy: Password-protected pastes or “Burn on Read”.
3. Capacity Estimation (E - Estimation)
Traffic Estimates
- 1 Million new pastes per day.
- Write QPS: 1,000,000 / 86,400 ≈ 12 writes/sec.
- Read QPS (5:1 read:write ratio): ≈ 60 reads/sec.
- Peak: Assume 5x traffic spikes (e.g., during a major outage where everyone shares logs).
Storage Estimates
- Average paste size: 10 KB. Max size: 10 MB.
- Daily Storage: 1M × 10 KB = 10 GB/day.
- 10-Year Storage: 10 GB × 365 × 10 ≈ 36 TB.
[!WARNING] 36 TB is significant. We cannot store this all in a single MySQL database instance efficiently. We need a Blob Store.
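The estimates above are simple arithmetic; a quick sanity-check script (numbers taken directly from the text):

```python
# Back-of-the-envelope capacity estimation for the Pastebin design.
SECONDS_PER_DAY = 86_400

new_pastes_per_day = 1_000_000
write_qps = new_pastes_per_day / SECONDS_PER_DAY   # ~12 writes/sec
read_qps = write_qps * 5                           # 5:1 read:write ratio
peak_read_qps = read_qps * 5                       # 5x spike headroom

avg_paste_kb = 10
daily_storage_gb = new_pastes_per_day * avg_paste_kb / 1_000_000  # KB -> GB
ten_year_storage_tb = daily_storage_gb * 365 * 10 / 1_000         # GB -> TB

print(f"write QPS ~{write_qps:.0f}, read QPS ~{read_qps:.0f}, peak reads ~{peak_read_qps:.0f}")
print(f"daily: {daily_storage_gb:.0f} GB, 10-year: {ten_year_storage_tb:.1f} TB")
```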
4. System APIs (D - Data Model)
Create Paste
POST /api/v1/paste
Request Body:
{
  "content": "print('hello world')...",  // Only for small pastes
  "expire_minutes": 60,
  "is_private": false
}
Response:
{
  "url": "https://paste.bin/x9K2",
  "key": "x9K2",
  "uploadUrl": "https://s3.aws.com/..."  // For large pastes
}
Get Paste
GET /api/v1/paste/{key}
Response: Returns the raw text or an HTML view.
5. Database Design: The Split Strategy (D - Data Model)
This is the most critical part of the design. We must Separate Metadata from Data.
The Wrong Way: Everything in SQL
| ID | Content (TEXT/BLOB) | Created_At |
|---|---|---|
| 1 | [10MB Data…] | 2023-01-01 |
Why this fails: Buffer Pool Pollution
Databases like MySQL/PostgreSQL use a memory cache called the Buffer Pool to keep frequently accessed data (hot pages) in RAM.
- The Mechanism: The DB reads data in “Pages” (usually 16 KB).
- The Problem: If you store large BLOBs (10MB) in the same table, reading a single row forces the DB to load hundreds of pages into RAM.
- The Consequence: These huge blobs push out hundreds of other small, useful pages (like indexes or user session data).
- Result: Your Cache Hit Ratio drops near zero, Disk I/O spikes to 100%, and the DB slows to a crawl.
[!NOTE] War Story: The E-Commerce Outage A major e-commerce company once stored user profile images directly in their primary user database table as BLOBs. During a marketing push, a sudden spike in profile views forced the database to load massive image files into RAM. This immediately evicted the critical session-state indexes from the Buffer Pool. The result? The entire login system ground to a halt because simple username lookups suddenly required slow disk reads. They fixed it by migrating the images to an S3-compatible Object Store, restoring database performance.
The Right Way: Metadata + Object Store
We split the data into two stores:
- Metadata Database (SQL): Stores lightweight info (User, Expiry, S3 Path).
  - Size: ~100 bytes per row.
  - Speed: Extremely fast lookups.
- Object Store (S3): Stores the actual content as a file.
  - Size: 1 KB to 10 MB.
  - Speed: High throughput, cheaper cost.
Metadata SQL Schema:
CREATE TABLE pastes (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  short_key VARCHAR(7) NOT NULL UNIQUE,  -- The Base62 key
  user_id BIGINT,                        -- Foreign Key to Users
  s3_path VARCHAR(255) NOT NULL,         -- Pointer to S3 bucket (e.g., /2023/10/abc.txt)
  expiration TIMESTAMP,
  created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_user_id ON pastes(user_id);  -- Fast lookups for "My Pastes"
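The short_key column holds a Base62 encoding of a numeric ID (e.g., the auto-increment primary key). A minimal encoder sketch:

```python
# Base62 alphabet: digits, lowercase, uppercase -> 62 symbols total.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def base62_encode(n: int) -> str:
    """Encode a non-negative integer (e.g., an auto-increment ID) as Base62."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))
```

Seven Base62 characters cover 62^7 ≈ 3.5 trillion keys, which is why VARCHAR(7) is enough for billions of pastes.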
6. High-Level Design (A - Architecture)
Interview-Friendly High-Level Diagram
This is the simplified version of the architecture you should draw on the whiteboard.
graph TD
    User([User]) --> LB[Load Balancer]
    LB --> API[API Gateway]
    subgraph Service Layer
        API --> WriteSvc[Write/Upload Service]
        API --> ReadSvc[Read/Download Service]
        WriteSvc --> KGS[Key Gen Service]
    end
    subgraph Storage Layer
        WriteSvc --> DB[(Metadata DB)]
        ReadSvc --> DB
        WriteSvc --> S3[(S3 Object Store)]
        ReadSvc --> Cache[(Redis Cache)]
        Cache --> S3
    end
The core architecture relies on splitting metadata (SQL) from payload (Object Store).
7. Component Design: Presigned URLs (L - Localized Details)
In the design above, the App Server is a bottleneck because it must process the 10MB payload (ingress) and upload it to S3 (egress). This burns CPU and bandwidth.
Optimization: Presigned URLs (Direct Upload). We can let the Client talk directly to S3:
1. Client requests an upload token from the App Server.
2. App Server generates a Presigned URL (a temporary, secure link to S3) and returns it to the Client:
PUT https://my-bucket.s3.amazonaws.com/abc.txt?signature=...
3. Client uploads the 10MB file directly to S3 using that URL.
4. The App Server is bypassed for the heavy lifting.
Trade-off: Slightly more complex client logic, but massive savings on App Server bandwidth.
8. Data Partitioning & Sharding (S - Scale)
As we scale to billions of pastes, the Metadata DB (SQL) needs to be sharded.
Sharding Key
We use the short_key as the sharding key.
- Strategy: Hash Sharding (hash(short_key) % N).
- Why?: Ensures even distribution of metadata across database nodes.
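A sketch of the shard-routing function. It uses CRC32 as a stable hash, since Python's built-in hash() is randomized per process and therefore unusable for routing; the shard count of 8 is illustrative:

```python
import zlib

NUM_SHARDS = 8  # N metadata DB nodes; illustrative


def shard_for(short_key: str) -> int:
    """Map a short_key to a metadata DB shard: hash(short_key) % N."""
    return zlib.crc32(short_key.encode()) % NUM_SHARDS
```

Note the classic trade-off: with plain modulo, changing N remaps most keys, which is why production systems often prefer consistent hashing when shards are added frequently.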
Object Store Sharding
S3 handles sharding automatically. However, to optimize S3 performance (which uses prefixes for partitioning), we should random-prefix our keys.
- Bad: pastes/2023/10/01/abc.txt (sequential dates create hotspots).
- Good: pastes/a1b2/abc.txt (hash prefix spreads load).
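Deriving the prefix from the key itself keeps the mapping deterministic (the same paste always resolves to the same object path). A sketch; the 4-character MD5 prefix is one reasonable choice, not the only one:

```python
import hashlib


def s3_key_for(short_key: str) -> str:
    """Prefix the object key with 4 hex chars of its hash to spread S3 partitions."""
    prefix = hashlib.md5(short_key.encode()).hexdigest()[:4]
    return f"pastes/{prefix}/{short_key}.txt"
```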
9. Reliability, Caching, & Load Balancing
Content Delivery Network (CDN)
For popular pastes (e.g., a viral config file), S3 latency might be too high.
- Solution: Put a CDN (CloudFront/Cloudflare) in front of S3.
- Flow: Client → CDN → S3.
- Cache Headers: Set Cache-Control: public, max-age=86400 for public pastes.
S3 Durability
S3 provides “11 9s” of durability (99.999999999%). It achieves this by replicating data across at least 3 Availability Zones (AZs) automatically. We don’t need to manually replicate blobs.
10. The Split Write Path (Summary)
A single upload fans out into two writes: tiny metadata goes to SQL (fast, synchronous), while the large content blob goes to S3 (slower, but high-throughput and cheap).
11. System Walkthrough (Dry Run)
Scenario: Creating a Paste (Write)
1. Client Request: Client sends a 5MB log file to POST /paste.
2. Auth & Quota: App Server checks that the user is logged in and hasn’t exceeded daily limits.
3. KGS: App Server fetches a unique key abc from the Redis-backed Key Generation Service.
4. Split Write:
   - App Server uploads the 5MB file to the S3 bucket pastes-bucket at key abc.
   - Only after the S3 upload succeeds, App Server inserts metadata (User ID, Expiration, S3 path) into the SQL DB. (Writing S3-first avoids “ghost pastes” whose metadata points at missing content.)
5. Response: App Server returns https://pastebin.com/abc.
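The write ordering matters: blob to S3 first, metadata only on success, so a mid-write failure leaves at worst an orphaned file (cleanable by a lifecycle job) rather than a ghost paste. A sketch with in-memory stand-ins for S3 and the DB; all names here are illustrative:

```python
class UploadError(Exception):
    pass


fake_s3: dict[str, bytes] = {}   # stand-in for the object store
fake_db: dict[str, dict] = {}    # stand-in for the metadata table


def create_paste(key: str, content: bytes, s3_fails: bool = False) -> str:
    # Step 1: write the blob to the object store first.
    if s3_fails:
        raise UploadError("S3 upload failed; no metadata was written")
    fake_s3[key] = content
    # Step 2: insert metadata only after the blob is durable.
    fake_db[key] = {"s3_path": f"pastes/{key}.txt", "size": len(content)}
    return f"https://pastebin.com/{key}"
```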
Scenario: Viewing a Paste (Read)
1. Client Request: User visits https://pastebin.com/abc.
2. CDN Check: Request hits CloudFront. If cached, return immediately (~10ms).
3. App Server: On a CDN miss, the request hits the App Server.
4. Metadata Lookup: App Server queries SQL: SELECT s3_path FROM pastes WHERE short_key = 'abc'.
5. Content Fetch: App Server streams the file from S3 (or generates a signed URL for the client to fetch it directly).
6. Render: App Server wraps the content in HTML (syntax highlighting) and returns it.
12. Interview Gauntlet
- Why not store the text in the Database?
  - Ans: Buffer Pool Pollution. Large blobs evict hot pages (indexes) from RAM, destroying performance.
- How do you handle “Burn on Read” (delete after view)?
  - Ans: Add a burn_count flag. On read, check the flag. If set, return the data and immediately trigger an async job to delete it from the DB and S3.
- How do you encrypt the data?
  - Ans: Encrypt at rest (S3 Server-Side Encryption) and in transit (TLS). For end-to-end encryption, the Client must encrypt before uploading, and the Server never sees the key (e.g., the key lives in the URL hash fragment #key, which is never sent to the server).
- What if the S3 upload fails but the DB insert succeeds?
  - Ans: We end up with a “Ghost Paste” (metadata exists, but content is missing).
  - Fix: Upload to S3 first. Only if successful, insert into the DB. If S3 succeeds but the DB insert fails, we have an orphaned file (handled by S3 Lifecycle rules or a cleanup job).
- How do we handle popular pastes?
  - Ans: Use a CDN. Cache the HTML response.
- Can we use a NoSQL DB instead of SQL for metadata?
  - Ans: Yes, DynamoDB is a great fit. Partition Key = short_key. It scales better than MySQL for simple KV lookups.
- How do you calculate the cost?
  - Ans: S3 is ~$0.023/GB per month. 36 TB ≈ $828/month. Database storage would be 5-10x more expensive.
- How do you prevent malicious uploads (e.g., malware or illegal content)?
  - Ans: Run an async virus scanner (e.g., ClamAV) via Lambda/background worker when an object is uploaded to S3. Implement a reporting system for users.
- What if an S3 bucket is accidentally deleted?
  - Ans: Enable S3 Object Versioning and Cross-Region Replication (CRR) for disaster recovery.
- How to support extremely large pastes (100MB+)?
  - Ans: Use S3 Multipart Upload directly from the client (via Presigned URLs) to bypass our API gateway limits and avoid blocking server threads.
13. Whiteboard Summary (4 Quadrant)
1. P - Core Requirements
- Paste (Write): 12 QPS
- View (Read): 60 QPS
- Data: 10KB - 10MB blobs
- Split Architecture
2. A - Key Architecture
- Metadata: SQL/DynamoDB (Fast)
- Content: S3 Object Store (Cheap)
- Optimization: Presigned URLs
- Caching: CDN + Redis
3. D - Data Models
- Meta Table: `key`, `user_id`, `s3_path`
- S3 Key: `random_prefix/key`
- Cleanup: Passive (TTL) + Active (Cron)
4. S - Scale & Bottlenecks
- Bandwidth: Use Presigned URLs.
- Latency: Use CDN for reads.
- DB Load: Shard by `short_key`.
- Cost: Tiered Storage (S3 Glacier).