Design a Pastebin

1. What is Pastebin?

Pastebin is a web service where users can upload snippets of text (usually code) and share them via a link. It is widely used by developers to share logs, config files, or code fragments.

Real-World Examples:

  • Pastebin.com: The original.
  • GitHub Gist: Version-controlled snippets.
  • Hastebin: Minimalist version.

[!TIP] Interview Insight: This problem is very similar to “Design a URL Shortener” but with one major difference: Data Size. URL Shortener stores tiny strings (< 1KB). Pastebin stores potentially large text blobs (up to 10MB). This changes the storage layer completely.


2. Requirements & Goals

Functional Requirements

  1. Paste: Users can upload a block of text and get a unique URL.
  2. View: Users can view the paste by accessing the URL.
  3. Expiration: Pastes should expire after a set time (or never).
  4. Custom URL: Optional custom alias.

Non-Functional Requirements

  1. Reliability: Pastes must not be lost (Durability).
  2. Latency: Fast upload and download.
  3. Traffic: Read-heavy, but writes can be significant if used for logging pipelines.

Extended Requirements

  1. Analytics: Track how many times a paste is viewed.
  2. Edit: Allow users to edit pastes (requires account system).
  3. Privacy: Password-protected pastes or “Burn on Read”.

3. Capacity Estimation

Traffic Estimates

  • 1 Million new pastes per day.
  • Write QPS: $1,000,000 / 86400 \approx 12$ writes/sec.
  • Read QPS (5:1 ratio): $\approx 60$ reads/sec.
  • Peak: Assume 5x traffic spikes (e.g., during a major outage where everyone shares logs).

Storage Estimates

  • Average paste size: 10 KB. Max size: 10 MB.
  • Daily Storage: $1M \times 10 \text{ KB} = 10 \text{ GB} / \text{day}$.
  • 10-Year Storage: $10 \text{ GB} \times 365 \times 10 \approx 36 \text{ TB}$.

[!WARNING] 36 TB is significant. We cannot store this all in a single MySQL database instance efficiently. We need a Blob Store.
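
These back-of-the-envelope numbers are easy to sanity-check with a few lines of Python:

```python
# Sanity-check of the traffic and storage estimates above.
writes_per_day = 1_000_000
write_qps = writes_per_day / 86_400        # ~12 writes/sec
read_qps = 5 * write_qps                   # 5:1 read:write ratio -> ~60 reads/sec
peak_write_qps = 5 * write_qps             # assumed 5x spike -> ~60 writes/sec

avg_paste_kb = 10
daily_storage_gb = writes_per_day * avg_paste_kb / 1_000_000   # ~10 GB/day
ten_year_storage_tb = daily_storage_gb * 365 * 10 / 1_000      # ~36.5 TB

print(f"write QPS ~{write_qps:.0f}, read QPS ~{read_qps:.0f}")
print(f"daily ~{daily_storage_gb:.0f} GB, 10-year ~{ten_year_storage_tb:.1f} TB")
```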


4. System APIs

Create Paste

POST /api/v1/paste

Request Body:

{
  "content": "print('hello world')...", // Only for small pastes
  "expire_minutes": 60,
  "is_private": false
}

Response:

{
  "url": "https://paste.bin/x9K2",
  "key": "x9K2",
  "uploadUrl": "https://s3.aws.com/..." // For large pastes
}

Get Paste

GET /api/v1/paste/{key}

Response: Returns the raw text or an HTML view.
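
A quick client-side sketch of both calls, assuming the endpoints above and the Python requests library (the base URL is illustrative):

```python
import requests

BASE = "https://paste.bin/api/v1"  # illustrative base URL

# Create a small paste (content sent inline in the request body).
resp = requests.post(f"{BASE}/paste", json={
    "content": "print('hello world')",
    "expire_minutes": 60,
    "is_private": False,
})
paste = resp.json()
print(paste["url"])   # e.g. https://paste.bin/x9K2

# Fetch the raw text back by key.
raw = requests.get(f"{BASE}/paste/{paste['key']}")
print(raw.text)
```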


5. Database Design: The Split Strategy

This is the most critical part of the design. We must Separate Metadata from Data.

The Wrong Way: Everything in SQL

| ID | Content (TEXT/BLOB) | Created_At |
|:---|:--------------------|:-----------|
| 1  | [10MB Data…]        | 2023-01-01 |

Why this fails (Buffer Pool Pollution): Databases like MySQL/PostgreSQL use a memory cache called the Buffer Pool to keep frequently accessed data (hot pages) in RAM.

  • If you store large BLOBs (10MB) in the same table, reading a row drags that huge blob into memory.
  • This pushes out hundreds of other small, useful pages (like indexes or user data).
  • Result: Your Cache Hit Ratio drops, Disk I/O spikes, and the DB slows to a crawl.

The Right Way: Metadata + Object Store

We split the data into two stores:

  1. Metadata Database (SQL - labeled “Metadata DB”): Stores lightweight info (User, Expiry, Path).
    • Size: ~100 bytes per row.
    • Speed: Extremely fast lookups.
  2. Object Store (S3 - labeled “Object Store”): Stores the actual content as a file.
    • Size: 1KB to 10MB.
    • Speed: High throughput, cheaper cost.

Metadata SQL Schema:

CREATE TABLE pastes (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    short_key VARCHAR(7) NOT NULL UNIQUE, -- The Base62 key
    user_id BIGINT,                       -- Foreign Key to Users
    s3_path VARCHAR(255) NOT NULL,        -- Pointer to S3 bucket (e.g., /2023/10/abc.txt)
    expiration TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_user_id ON pastes(user_id); -- Fast lookups for "My Pastes"
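
The short_key column holds a Base62 key. A minimal sketch of Base62 encoding applied to a counter handed out by the KGS (the alphabet ordering here is an arbitrary choice):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n: int) -> str:
    """Turn a non-negative integer (e.g. a KGS counter) into a short Base62 key."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

# 62^7 is roughly 3.5 trillion keys, which fits the VARCHAR(7) column above.
print(base62_encode(123_456_789))  # -> "8m0Kx"
```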

6. High-Level Design

The core architecture relies on splitting metadata (SQL) from payload (Object Store).

System Architecture: Pastebin (Write Path vs. Read Path | Metadata Split)

  • Client / User → Load Balancer → Paste Service (coordinates uploads, splits the data, handles reads).
  • KGS (Key Generation Service): hands out pre-generated keys stored in Redis.
  • Metadata DB (SQL): stores the short key and the S3 path.
  • Object Store (S3/GCS blob): stores the content text; scales to PB+.
  • Redis Cache (LRU): keeps hot pastes in memory.
  • Write flow: get key from KGS → save metadata → save blob. Read flow: check cache → look up S3 path → fetch blob.

7. Component Design: Presigned URLs

In the design above, the App Server is a bottleneck because it must process the 10MB payload (ingress) and upload it to S3 (egress). This burns CPU and bandwidth.

Optimization: Presigned URLs (Direct Upload)

We can let the Client talk directly to S3:

  1. Client requests an “Upload Token” from the App Server.
  2. App Server generates a Presigned URL (a temporary, secure link to S3) and gives it to the Client.
    • PUT https://my-bucket.s3.amazonaws.com/abc.txt?signature=...
  3. Client uploads the 10MB file directly to S3 using that URL.
  4. The App Server is bypassed for the heavy lifting.

Trade-off: Slightly more complex client logic, but massive savings on App Server bandwidth.
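
A sketch of step 2 using boto3; the bucket name and expiry are placeholder values:

```python
import boto3

s3 = boto3.client("s3")

def create_upload_url(object_key: str) -> str:
    """Return a temporary URL the client can PUT the paste body to directly."""
    return s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": "pastes-bucket", "Key": object_key},
        ExpiresIn=600,  # link expires after 10 minutes
    )

# The client then uploads straight to S3, bypassing the app server:
#   curl -X PUT --data-binary @big.log "<presigned url>"
```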


8. Data Partitioning & Sharding

As we scale to billions of pastes, the Metadata DB (SQL) needs to be sharded.

Sharding Key

We use the short_key as the sharding key.

  • Strategy: Hash Sharding (hash(short_key) % N).
  • Why?: Ensures even distribution of metadata across database nodes.

Object Store Sharding

S3 handles sharding automatically. However, to optimize S3 performance (S3 partitions objects by key prefix), we should add a random prefix to our keys.

  • Bad: pastes/2023/10/01/abc.txt (Sequential dates create hotspots).
  • Good: pastes/a1b2/abc.txt (Hash prefix spreads load).
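
A sketch of both ideas; the shard count and prefix length are assumptions. A stable hash (hashlib rather than Python's per-process-salted built-in hash()) is used so every app server maps a key to the same shard:

```python
import hashlib

NUM_SHARDS = 8  # assumed number of metadata DB shards

def shard_for(short_key: str) -> int:
    """Hash sharding: hash(short_key) % N, stable across processes."""
    digest = hashlib.md5(short_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def s3_key_for(short_key: str) -> str:
    """Derive a random-looking prefix so load spreads evenly across S3 partitions."""
    prefix = hashlib.md5(short_key.encode()).hexdigest()[:4]
    return f"pastes/{prefix}/{short_key}.txt"

print(shard_for("abc"), s3_key_for("abc"))   # -> 2 pastes/9001/abc.txt
```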

9. Reliability, Caching, & Load Balancing

Content Delivery Network (CDN)

For popular pastes (e.g., a viral config file), S3 latency might be too high.

  • Solution: Put a CDN (CloudFront/Cloudflare) in front of S3.
  • Flow: Client -> CDN -> S3.
  • Cache Headers: Set Cache-Control: public, max-age=86400 for public pastes.
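
For illustration, the app server (or the S3 object metadata) would attach a header along these lines; private pastes should opt out of edge caching:

```python
def cache_headers(is_private: bool) -> dict:
    """Cache-Control headers that tell the CDN whether it may cache this paste."""
    if is_private:
        return {"Cache-Control": "private, no-store"}
    return {"Cache-Control": "public, max-age=86400"}  # cache public pastes for 1 day
```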

S3 Durability

S3 provides “11 9s” of durability ($99.999999999\%$). It achieves this by replicating data across at least 3 Availability Zones (AZs) automatically. We don’t need to manually replicate blobs.


10. Interactive Decision Visualizer: The Split Write Path

Visualize how the system handles small metadata vs. large content: the Client sends the request (REQ) to the App Server; the App Server writes the blob content (BLOB) to the Object Store and the lightweight metadata (META: user, expiry) to the SQL Metadata DB, then returns the response (RES) to the Client.

11. System Walkthrough (Dry Run)

Scenario: Creating a Paste (Write)

  1. Client Request: Client sends a 5MB log file to POST /paste.
  2. Auth & Quota: App Server checks if user is logged in and hasn’t exceeded daily limits.
  3. KGS: App Server fetches a unique key abc from Redis KGS.
  4. Split Write:
    • App Server uploads the 5MB file to the S3 bucket pastes-bucket at key abc.
    • Once the upload succeeds, App Server inserts the metadata (User ID, Expiration, S3 path) into the SQL DB. Writing S3 first avoids “ghost pastes” (see the Interview Gauntlet).
  5. Response: App Server returns https://pastebin.com/abc.
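
A minimal sketch of the split write above, ordered S3-first to match the failure-handling advice in the Interview Gauntlet below; the db, s3, and kgs clients (and the s3_key_for helper from the Section 8 sketch) are assumed:

```python
from datetime import datetime, timedelta, timezone

def create_paste(user_id: int, content: bytes, expire_minutes: int, db, s3, kgs) -> str:
    """Split write: blob to the object store first, then metadata to SQL."""
    short_key = kgs.next_key()          # e.g. "abc" from the Redis-backed KGS (assumed API)
    s3_path = s3_key_for(short_key)     # hash-prefixed key (see the Section 8 sketch)
    expiration = datetime.now(timezone.utc) + timedelta(minutes=expire_minutes)

    # 1. Upload the blob first. If this fails, abort: no "ghost" metadata gets written.
    s3.put_object(Bucket="pastes-bucket", Key=s3_path, Body=content)

    # 2. Only then record the metadata. If this insert fails, the orphaned S3 object
    #    is reclaimed later by an S3 lifecycle rule or a cleanup job.
    db.execute(
        "INSERT INTO pastes (short_key, user_id, s3_path, expiration) VALUES (%s, %s, %s, %s)",
        (short_key, user_id, s3_path, expiration),
    )
    return f"https://pastebin.com/{short_key}"
```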

Scenario: Viewing a Paste (Read)

  1. Client Request: User visits https://pastebin.com/abc.
  2. CDN Check: Request hits CloudFront. If cached, return immediately (~10 ms).
  3. App Server: If CDN miss, request hits App Server.
  4. Metadata Lookup: App Server queries SQL: SELECT s3_path FROM pastes WHERE short_key = 'abc'.
  5. Content Fetch: App Server streams the file from S3 (or generates a signed URL for the client to fetch).
  6. Render: App Server wraps the content in HTML (syntax highlighting) and returns it.
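
The read steps above as a sketch; db, s3, and cache are assumed clients (db.fetch_one is a hypothetical helper, cache.get/set mirrors redis-py):

```python
def get_paste(short_key: str, db, s3, cache) -> bytes:
    """Read path: Redis cache first, then SQL metadata lookup, then S3 fetch."""
    cached = cache.get(f"paste:{short_key}")
    if cached is not None:
        return cached                                   # hot paste served from Redis

    row = db.fetch_one("SELECT s3_path FROM pastes WHERE short_key = %s", (short_key,))
    if row is None:
        raise KeyError("paste not found or expired")

    obj = s3.get_object(Bucket="pastes-bucket", Key=row["s3_path"])
    content = obj["Body"].read()
    cache.set(f"paste:{short_key}", content, ex=3600)   # keep hot pastes for an hour
    return content
```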

12. Interview Gauntlet

  1. Why not store the text in the Database?
    • Ans: Buffer Pool Pollution. Large blobs evict hot pages (indexes) from RAM, destroying performance.
  2. How do you handle “Burn on Read” (Delete after view)?
    • Ans: Add a burn_after_read flag. On read, check the flag; if set, return the data and immediately trigger an async job to delete the paste from the DB and S3 (see the sketch after this list).
  3. How do you encrypt the data?
    • Ans: Encrypt at rest (S3 Server-Side Encryption) and in transit (TLS). For End-to-End encryption, the Client must encrypt before uploading, and the Server never sees the key (e.g., key is in the URL hash fragment #key which is not sent to server).
  4. What if the S3 upload fails but DB insert succeeds?
    • Ans: We end up with a “Ghost Paste” (Metadata exists, but content is missing).
    • Fix: Upload to S3 first. Only if successful, insert into DB. If S3 succeeds but DB fails, we have an orphaned file (handled by S3 Lifecycle rules or a cleanup job).
  5. How do we handle popular pastes?
    • Ans: Use a CDN. Cache the HTML response.
  6. Can we use a NoSQL DB instead of SQL for metadata?
    • Ans: Yes, DynamoDB is a great fit. Partition Key = short_key. It scales better than MySQL for simple KV lookups.
  7. How do you calculate the cost?
    • Ans: S3 costs ~$0.023 per GB-month, so 36 TB ≈ $828/month. Database storage would be 5-10x more expensive.
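
A minimal sketch of the “Burn on Read” answer (question 2); it assumes a burn_after_read column on the pastes table, and db, s3, and task_queue are assumed interfaces (the background worker supplies its own db/s3 clients):

```python
def read_burn_on_read(short_key: str, db, s3, task_queue) -> bytes:
    """Serve the paste once, then delete it asynchronously."""
    row = db.fetch_one(
        "SELECT s3_path, burn_after_read FROM pastes WHERE short_key = %s",
        (short_key,),
    )
    if row is None:
        raise KeyError("paste not found, expired, or already burned")

    content = s3.get_object(Bucket="pastes-bucket", Key=row["s3_path"])["Body"].read()
    if row["burn_after_read"]:
        # Hand cleanup to a background worker so the read path stays fast.
        task_queue.enqueue("delete_paste", short_key, row["s3_path"])
    return content

def delete_paste(short_key: str, s3_path: str, db, s3) -> None:
    """Background worker job: remove both the metadata row and the blob."""
    db.execute("DELETE FROM pastes WHERE short_key = %s", (short_key,))
    s3.delete_object(Bucket="pastes-bucket", Key=s3_path)
```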

13. Whiteboard Summary

1. Core Requirements

  • Paste (Write): 12 QPS
  • View (Read): 60 QPS
  • Data: 10KB - 10MB blobs
  • Split Architecture

2. Key Architecture

  • Metadata: SQL/DynamoDB (Fast)
  • Content: S3 Object Store (Cheap)
  • Optimization: Presigned URLs
  • Caching: CDN + Redis

3. Data Models

  • Meta Table: `key`, `user_id`, `s3_path`
  • S3 Key: `random_prefix/key`
  • Cleanup: Passive (TTL) + Active (Cron)

4. Bottlenecks & Fixes

  • Bandwidth: Use Presigned URLs.
  • Latency: Use CDN for reads.
  • DB Load: Shard by `short_key`.
  • Cost: Tiered Storage (S3 Glacier).