Back of Envelope Estimation

[!IMPORTANT] Listen closely. You’re in a Google interview, and the question is simple: “How much storage does YouTube need per day?” If you freeze, you’ve already lost. Is it Terabytes? Petabytes? The engineer next to you calmly states: “500 hours of video uploaded per minute… 1 GB per hour of video… that’s about 720 TB per day.” They got the offer. You didn’t. The difference? They knew how to estimate under pressure.

Back-of-the-envelope estimation isn’t about getting the exact answer — it’s about proving to me and the rest of the staff team that you can think quantitatively and make rational architecture decisions based on physical constraints.

In this lesson, you will master:

  1. The Latency Hierarchy: Memorizing the key numbers from L1 Cache (0.5 ns) to a global network round trip (150 ms) — a span of roughly 9 orders of magnitude, a 300,000,000x difference.
  2. The QPS Shortcut: The “magic rule” that 1 Million requests/day ≈ 12 QPS, and how to derive it in 5 seconds.
  3. The 3-Step Estimation Framework: Assumptions → Math → Architectural Impact — the same framework used by Jeff Dean at Google.

1. Don’t Guess, Estimate.

You are in an interview. The interviewer asks: “Design Instagram. By the way, how much storage do we need for 5 years?” Do you panic? Do you guess “100 Terabytes”? No. You pull out a napkin (or the whiteboard) and do Back of the Envelope Estimation.

Google’s Jeff Dean (the architect behind MapReduce, BigTable, Spanner) famously said that every engineer should know the “latency numbers” by heart. Why? Because designing a system without knowing the numbers is like building a bridge without knowing how much a truck weighs.


2. The Estimation Workflow

A successful estimate isn’t about being 100% accurate; it’s about being within the right Order of Magnitude.

  • Step 1 (Assumptions): DAU, requests per user, read/write ratio.
  • Step 2 (Math): round everything to powers of 10 or 2.
  • Step 3 (Impact): "This means we need Sharding / Caching."

[!NOTE] Hardware Reality: In Step 3, if your math shows a database requires 20,000 IOPS but a standard HDD only provides 100 IOPS, you have just proved that sharding or SSD-upgrading is mandatory. This is why we estimate.
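The IOPS check in the note above can be sketched in a few lines. This is a minimal example using the lesson's numbers (20,000 required IOPS, ~100 IOPS per HDD); the 10,000-IOPS SSD figure is an added assumption for comparison.

```python
import math

def shards_needed(required_iops: int, iops_per_node: int) -> int:
    """Minimum number of nodes (shards) to serve the required IOPS."""
    return math.ceil(required_iops / iops_per_node)

print(shards_needed(20_000, 100))     # HDDs needed if we shard on spinning disks -> 200
print(shards_needed(20_000, 10_000))  # nodes needed on SSDs (~10k IOPS, an assumption) -> 2
```

Two lines of arithmetic already tell you whether the answer is "upgrade to SSD" or "shard across hundreds of machines."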


3. Latency Numbers Every Programmer Should Know

You don’t need to memorize exact nanoseconds, but you MUST know the orders of magnitude.

| Operation | Time (Approx) | Human Equivalent (scaled: 1 ns = 1 s) |
|---|---|---|
| L1 Cache | 0.5 ns | Heartbeat (0.5 s) |
| Main Memory (RAM) | 100 ns | Brushing Teeth (~1.5 min) |
| SSD Random Read | 150 μs | Weekend Trip (~1.7 days) |
| Round Trip (Same Data Center) | 500 μs | Week-long Vacation (~6 days) |
| Disk Seek | 10 ms | Semester in College (~4 months) |
| Packet CA → NL | 150 ms | ~5 Years |
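The "human time" column follows mechanically from the 1 ns → 1 s scaling. Here is a small sketch that reproduces it from the table's latency figures:

```python
# Scale machine latencies to "human time" using the table's rule: 1 ns -> 1 s.
LATENCIES_NS = {
    "L1 cache": 0.5,
    "RAM": 100,
    "SSD random read": 150_000,        # 150 us
    "Same-DC round trip": 500_000,     # 500 us
    "Disk seek": 10_000_000,           # 10 ms
    "Packet CA -> NL": 150_000_000,    # 150 ms
}

def human_time(ns: float) -> str:
    seconds = ns  # 1 ns of machine time = 1 s of human time
    for unit, size in [("years", 365 * 86_400), ("days", 86_400),
                       ("minutes", 60)]:
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"

for op, ns in LATENCIES_NS.items():
    print(f"{op}: {human_time(ns)}")
```

Run it once and the gap becomes visceral: a cache hit is a heartbeat, a disk seek is a semester.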

A. Raw Hardware Throughput (The “Speed Limits”)

Knowing how long a request takes is only half the battle. You must also know how much data can move per second.

| Medium | Throughput (Approx) | Context |
|---|---|---|
| L1 Cache | ~1 TB/s | CPU internal speed. |
| Main Memory (RAM) | ~100 GB/s | DDR5 territory. |
| NVMe SSD | ~7 GB/s | Modern cloud drives (e.g., AWS io2). |
| 10Gbps Network | 1.25 GB/s | High-performance service backbone. |
| SATA SSD | 500 MB/s | Legacy cloud storage (e.g., AWS gp2). |
| Spinning Disk (HDD) | 150 MB/s | Bulk storage only. |

[!TIP] Staff Insight: If your math says you need to move 50GB/s between two services, you can’t use the network (1.25 GB/s limit). You either need to combine the services or use shared memory.
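The tip's feasibility check is just a comparison against the table's speed limits. A minimal sketch, using the approximate throughput figures above:

```python
# Sanity-check a planned data flow against the hardware "speed limits" table.
LIMITS_GB_PER_S = {"10Gbps network": 1.25, "NVMe SSD": 7, "RAM": 100}

def feasible_media(required_gb_per_s: float) -> list[str]:
    """Which media can sustain the required sustained throughput?"""
    return [m for m, limit in LIMITS_GB_PER_S.items() if limit >= required_gb_per_s]

print(feasible_media(50))   # only RAM qualifies -> co-locate services / shared memory
print(feasible_media(1.0))  # comfortably fits on a 10Gbps link
```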

Interactive: The Latency Time Machine

Drag the slider to see how “Computer Time” translates to “Human Time”. If you can’t avoid the disk, you are already “months” late.

[!TIP] Try it yourself: Move the slider to compare nanoseconds (CPU) vs milliseconds (Network) in human terms.


4. Interactive: The Dynamic Capacity & Cost Planner

In an interview, you need to calculate QPS and Storage fast. Bonus points if you can estimate the AWS Bill.

[!TIP] Try it yourself: Change the “DAU” or “Writes Per User” to see how quickly storage needs and costs explode.

System Capacity Calculator

A. Little’s Law: The Concurrency Formula

High-performance systems engineers use Little’s Law to determine how many concurrent “items” (threads, connections, or buffers) the system needs.

L = λ × W
  • L: Average number of items in the system (Concurrency).
  • λ (Lambda): Average arrival rate (Throughput/QPS).
  • W: Average time an item spends in the system (Latency).

Example: If your API handles 1,000 QPS and each request takes 200 ms (0.2 s), then on average L = 1,000 × 0.2 = 200 requests are in flight at any moment — so you need roughly 200 concurrent worker threads/connections.
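The formula is a one-liner in code. This sketch applies Little's Law to the example above, plus a second (hypothetical) slower endpoint to show how latency inflates the required pool size:

```python
def concurrency(qps: float, latency_s: float) -> float:
    """Little's Law: L = lambda * W — average requests in flight."""
    return qps * latency_s

print(concurrency(1_000, 0.2))  # the lesson's example -> 200.0 workers/connections
print(concurrency(1_000, 2.0))  # same QPS, 2 s latency -> 2000.0 (why slow calls hurt)
```

Note the second line: cutting latency 10× shrinks the required connection pool 10× at the same throughput.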

B. The Hidden AWS Bill: Egress Costs

When estimating costs, juniors count CPUs and Storage. Staff Engineers count Egress.

  • Storage (S3): ~$23 per Terabyte stored per month.
  • Egress (outbound transfer): ~$90 per Terabyte transferred.

War Story: The $100,000 Egress Mistake

At a fast-growing startup, an engineer designed a log-aggregation system that sent uncompressed debug logs across AWS availability zones to a centralized bucket. They correctly estimated the storage needed (about 1PB per month) and calculated a manageable $23,000/month S3 bill. However, they completely forgot to estimate the cross-AZ data transfer (egress) cost. When the bill arrived, the company was charged over $100,000 just for the network transfer. A quick back-of-the-envelope estimation on throughput cost could have caught this instantly, prompting them to compress or filter logs locally before sending.

[!WARNING] Trap: Data transfer is often 4x to 10x more expensive than storage. If you design a system that “moves” 1PB of data to users per month, your storage bill is only $23,000, but your network bill is over $90,000. Always estimate egress!
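The trap above is easy to quantify. A minimal cost sketch using the lesson's approximate rates ($23/TB-month stored, ~$90/TB transferred):

```python
STORAGE_PER_TB_MONTH = 23  # approximate S3 storage rate
EGRESS_PER_TB = 90         # approximate internet egress rate

def monthly_bill(stored_tb: float, egress_tb: float) -> tuple[float, float]:
    """Return (storage cost, egress cost) in dollars for one month."""
    return stored_tb * STORAGE_PER_TB_MONTH, egress_tb * EGRESS_PER_TB

storage, egress = monthly_bill(stored_tb=1_000, egress_tb=1_000)  # 1 PB each
print(f"storage ${storage:,.0f}/mo, egress ${egress:,.0f}/mo")
# Egress dwarfs storage -> compress or filter before moving data.
```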


5. The Power of 2 (The Magic Numbers)

In System Design, we approximate everything to powers of 2. It simplifies math significantly.

[!TIP] Pro Tip: Memorize that 2^10 ≈ 10^3 (1,024 ≈ 1,000). This is the key to converting between binary and decimal.

| Power | Approximation | Unit |
|---|---|---|
| 2^10 | 1 Thousand (10^3) | 1 KB |
| 2^20 | 1 Million (10^6) | 1 MB |
| 2^30 | 1 Billion (10^9) | 1 GB |
| 2^40 | 1 Trillion (10^12) | 1 TB |
| 2^50 | 1 Quadrillion (10^15) | 1 PB |
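How good is the 2^10 ≈ 10^3 approximation? A quick sketch that checks the error at each rung of the table:

```python
# Compare each binary power against its round decimal approximation.
for power, unit in [(10, "KB"), (20, "MB"), (30, "GB"), (40, "TB"), (50, "PB")]:
    exact = 2 ** power
    approx = 10 ** (3 * power // 10)  # 10^3, 10^6, ... per the table
    error = (exact - approx) / approx * 100
    print(f"2^{power} = {exact:,} ≈ {approx:,} (1 {unit}), off by {error:.1f}%")
```

The error starts at 2.4% for KB and only reaches ~12.6% at PB scale — well within "order of magnitude" tolerance, which is why the shortcut is safe for interview math.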

Interactive: The Storage Converter

How big is 1 Petabyte really? Enter a value to find out.

[!TIP] Try it yourself: Type a number and select a unit to see what it equals in real-world analogies.

Reference sizes used by the converter: Tweets (~1 KB each), Photos (~2 MB), HD Movies (~4 GB), Human Brains (~2.5 PB, a popular rough estimate).

6. Common Mistakes to Avoid

Even senior engineers make these errors under pressure.

6.1 Bits (b) vs Bytes (B)

  • Network bandwidth is usually in Bits (Gbps).
  • Storage is in Bytes (GB).
  • Example: If you have a 1 Gbps connection, you can download 1 GB in 8 seconds, not 1 second.
  • 1 Gigabit = 125 Megabytes.
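The divide-by-8 rule is worth encoding once so it becomes reflexive. A minimal sketch of the conversion in the example above:

```python
def download_seconds(size_gb: float, link_gbps: float) -> float:
    """Time to move size_gb gigabytes over a link_gbps link (bits = 8 x bytes,
    ignoring protocol overhead)."""
    return size_gb * 8 / link_gbps

print(download_seconds(1, 1))     # 1 GB over 1 Gbps -> 8.0 seconds, not 1
print(download_seconds(100, 10))  # 100 GB over 10 Gbps -> 80.0 seconds
```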

6.2 QPS vs Concurrent Users

  • Concurrent Users: Number of people on the site right now.
  • QPS: Number of requests hitting the server per second.
  • The Trap: 1 million concurrent users ≠ 1 million QPS. If a user clicks once every 10 seconds, that’s only 100k QPS.
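The trap in the bullet above is just a division by think time. A tiny sketch:

```python
def qps_from_concurrent(users: int, seconds_between_actions: float) -> float:
    """Convert concurrent users to QPS by dividing by the think time per action."""
    return users / seconds_between_actions

print(qps_from_concurrent(1_000_000, 10))  # 1M users, one click per 10 s -> 100,000 QPS
```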

7. Walkthrough: Estimating Twitter Storage

Let’s apply this to a real interview question: “Estimate the storage for Twitter for 5 years.”

Step 1: Traffic Assumptions

  • DAU: 300 Million.
  • Tweets/Day: 2 per user.
  • Total Tweets: 300M × 2 = 600M tweets/day.

Step 2: Size Assumptions

  • Tweet ID: 8 Bytes.
  • User ID: 8 Bytes.
  • Text (140 chars): ~300 Bytes (incl. encoding).
  • Media: 10% of tweets have photos (500 KB each).

Step 3: The Calculation

  1. Text Storage: 600M × 300 Bytes = 180 GB / Day.
  2. Media Storage: 60M × 500 KB = 30 TB / Day.
  3. 5-Year Total (Media): 30 TB × 365 × 5 = 54,750 TB ≈ 55 PB.

[!IMPORTANT] Conclusion: ~55 Petabytes! This immediately tells you that you cannot use a single SQL database. You need a Distributed File System (like HDFS) for media and Sharded Databases for the text.
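The whole walkthrough fits in a dozen lines of code, which is a good way to double-check your whiteboard arithmetic. All constants below are the lesson's assumptions:

```python
# Twitter storage estimate, using the walkthrough's assumptions.
DAU = 300_000_000
TWEETS_PER_USER = 2
TEXT_BYTES = 300            # per tweet, incl. encoding/metadata
MEDIA_RATIO = 0.10          # 10% of tweets carry a photo
MEDIA_BYTES = 500 * 1_000   # 500 KB per photo

tweets_per_day = DAU * TWEETS_PER_USER                                # 600M
text_per_day_gb = tweets_per_day * TEXT_BYTES / 1e9                   # ~180 GB
media_per_day_tb = tweets_per_day * MEDIA_RATIO * MEDIA_BYTES / 1e12  # ~30 TB
media_5yr_pb = media_per_day_tb * 365 * 5 / 1_000                     # ~55 PB

print(f"{text_per_day_gb:.0f} GB text/day, {media_per_day_tb:.0f} TB media/day, "
      f"{media_5yr_pb:.0f} PB media over 5 years")
```

Notice that text is a rounding error next to media — a single assumption (photo size × photo rate) dominates the whole estimate.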


8. Common Formulas & “Magic Rules”

The QPS Shortcut

  • 1 Million requests per day ≈ 12 QPS.
  • 10 Million requests per day ≈ 120 QPS.
  • 100 Million requests per day ≈ 1200 QPS.
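The shortcut falls straight out of dividing by 86,400 seconds per day. A minimal sketch that also applies the lesson's peak-traffic rule of thumb (peak ≈ 5× average):

```python
SECONDS_PER_DAY = 86_400

def avg_qps(requests_per_day: float) -> float:
    """Average QPS: requests per day divided by seconds per day."""
    return requests_per_day / SECONDS_PER_DAY

def peak_qps(requests_per_day: float, peak_factor: float = 5) -> float:
    """Peak QPS using the rule of thumb peak = 5x average."""
    return avg_qps(requests_per_day) * peak_factor

print(round(avg_qps(1_000_000)))   # 1M req/day -> ~12 QPS average
print(round(peak_qps(1_000_000)))  # -> ~58 QPS at peak
```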

Memory vs. Disk

  • If your active dataset fits in RAM, your system will be 1000× faster.
  • Standard rule: 20% of data is “hot” (accessed 80% of the time). Cache that 20%.

9. Summary Checklist

  • Know the orders of magnitude for L1, RAM, SSD, and Network. Mnemonic: “L-R-S-N” (“Lazy Rabbits Seek Naps”), each tier roughly 100–1,000× slower than the last.
  • Use Powers of 2 for easy mental math. Key anchor: 2^10 ≈ 10^3.
  • The QPS Shortcut: 1M req/day ÷ 86,400 sec ≈ 12 QPS. Memorize 86,400 (seconds in a day).
  • Always design for Peak QPS (Average × 5).
  • Verify if the total dataset fits in RAM to maximize performance. The 80/20 Rule: cache the hot 20%.
  • Translate technical storage (PB) into business costs ($$$) to impress interviewers.

Staff Engineer Tip: The “86,400 Shortcut”. The single most useful number in system design is 86,400 — the number of seconds in a day. Every estimation starts here. In your interview, say it out loud: “There are roughly 100,000 seconds in a day — let me round to 10^5 for simplicity.” This instantly converts DAU into QPS: “10 million DAU × 2 actions = 20M requests/day ÷ 10^5 = 200 QPS.” Interviewers love this because it shows you can reason about scale without a calculator. Practice this conversion until it’s muscle memory.