## 1. What is a Web Crawler?
A Web Crawler (or Spider) is a bot that systematically browses the World Wide Web, typically for Web Indexing (e.g., Google Search). It starts with a list of "Seed URLs", downloads each page, extracts its links, and adds the new links to a queue.
### Use Cases
* **Search Engine Indexing**: Google, Bing.
* **Web Archiving**: Wayback Machine.
* **Web Mining**: Financial analysis, copyright checks, LLM training data.
**Try it yourself**: Open the robots.txt of any major site (e.g., google.com/robots.txt) and look at the `Disallow` rules. These are the rules the crawler must respect.
## 2. Requirements & Goals
### Functional Requirements
* **Crawl**: Fetch content from a given URL.
* **Extract**: Parse the HTML and find new links (`<a href>`).
* **Store**: Save the content for the Indexer.
### Non-Functional Requirements
* **Scalability**: The web is huge (billions of pages).
* **Politeness**: Do not DDoS the target servers. Respect robots.txt.
## 3. Capacity Estimation
### Traffic
* Target: 1 billion pages per month ≈ 385 pages/sec.
* This seems low, but fetching a page is slow (DNS + network + parsing). We need massive concurrency (thousands of threads).
### Storage
* Avg page size: 100 KB (HTML only).
* Storage = 1B × 100 KB = 100 TB / month.
* In 5 years: ~6 PB. (We need BigTable / HBase.)
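The arithmetic above can be sanity-checked in a few lines of Python (assuming decimal units and a 30-day month):

```python
# Back-of-the-envelope check of the capacity numbers above.
PAGES_PER_MONTH = 1_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 3600                 # 2,592,000 s

qps = PAGES_PER_MONTH / SECONDS_PER_MONTH          # ~386 pages/sec on average
tb_per_month = PAGES_PER_MONTH * 100_000 / 1e12    # 100 KB per page -> 100 TB
pb_in_5_years = tb_per_month * 12 * 5 / 1000       # 60 months -> 6 PB

print(qps, tb_per_month, pb_in_5_years)
```

Note that ~386 pages/sec is the *average*; real crawls are bursty, which is another reason thousands of concurrent fetchers are needed.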
## 4. System APIs
The Crawler is an internal system; its components communicate via RPC (gRPC).
### 4.1 Frontier Service (Task Queue)
```protobuf
service Frontier {
  // Workers ask for a batch of URLs to crawl
  rpc GetCrawlTasks(WorkerID) returns (stream CrawlTask);

  // Workers report found links
  rpc ReportLinks(ReportRequest) returns (Ack);
}

message CrawlTask {
  string url = 1;
  string domain = 2;
  int32 depth = 3;
}
```
### 4.2 HTML Fetcher Service
`POST /fetch`
```json
{
  "url": "https://example.com",
  "timeout": 5000,
  "follow_redirects": true
}
```
1. **Seed URLs**: Manually curated list (cnn.com, wikipedia.org).
2. **URL Frontier**: Distributes URLs to workers (Prioritizer & Politeness).
3. **HTML Downloader**: Fetches pages using **Async DNS**.
4. **Content Parser**: Validates HTML, extracts links.
5. **Duplicate Eliminator**: Checks Bloom Filter and SimHash (Content Dedup).
6. **Storage**: Saves content to BigTable/HBase.
7. **Loop**: New links go back to Frontier.
---
## 7. Component Design (Deep Dive)
### The URL Frontier (Priority vs Politeness)
Visualized in the **URL Frontier** zone of the diagram. This is the most complex part. It has two main buffers:
1. **Front Queue (Prioritizer)**:
* Prioritize URLs based on PageRank or change frequency.
* High Priority Q, Medium Q, Low Q.
2. **Back Queue (Politeness Router)**:
* **The Rule**: Do not hit `wikipedia.org` more than once per second.
* We map `wikipedia.org` to `Queue #5`.
* A "Queue Router" ensures that `Queue #5` is only consumed by one worker thread, enforcing the delay.
* This implements a **Rate Limiter**. See [Module 10: Simple Services (Rate Limiter)](/system-design/10-simple-services/03-api-rate-limiter/).
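The domain-to-queue mapping and per-queue delay can be sketched in a few lines. This is a minimal in-memory illustration (the class name, queue count, and 1-second delay are assumptions for the example, not the production design):

```python
import time
from collections import deque

class PolitenessRouter:
    """Back-queue sketch: each domain hashes to exactly one queue, and each
    queue enforces a minimum delay between fetches to the same domain."""

    def __init__(self, num_queues=64, delay_seconds=1.0):
        self.queues = [deque() for _ in range(num_queues)]
        self.next_allowed = [0.0] * num_queues   # earliest fetch time per queue
        self.delay = delay_seconds

    def queue_for(self, domain):
        # All URLs of a domain land in the same back queue.
        return hash(domain) % len(self.queues)

    def enqueue(self, domain, url):
        self.queues[self.queue_for(domain)].append(url)

    def poll(self, queue_id, now=None):
        """Return a URL only if the queue's politeness delay has elapsed."""
        now = time.monotonic() if now is None else now
        q = self.queues[queue_id]
        if not q or now < self.next_allowed[queue_id]:
            return None
        self.next_allowed[queue_id] = now + self.delay
        return q.popleft()
```

Because one queue is consumed by one worker thread, the delay check alone guarantees the "once per second per domain" rule.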
### Robot Exclusion Standard (Ethics)
Before crawling `example.com`, we fetch `example.com/robots.txt`.
* **Rules**: `User-agent: * Disallow: /private/`.
* **Legal**: robots.txt is not law, but ignoring it gets your IPs banned and can lead to lawsuits.
* **Implementation**: We cache the `robots.txt` rules for every domain in a KV store (Redis) with a TTL.
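The cached check might look like the following sketch, using Python's standard `urllib.robotparser`. A plain dict stands in for Redis, and the downloader is injected so the example stays offline (the class name and `fetch_fn` parameter are assumptions for illustration):

```python
import time
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Per-domain robots.txt cache with TTL, as described above."""

    def __init__(self, fetch_fn, ttl_seconds=3600):
        self.fetch_fn = fetch_fn          # domain -> robots.txt body
        self.ttl = ttl_seconds
        self.cache = {}                   # domain -> (expires_at, parser)

    def allowed(self, domain, path, agent="MyCrawler"):
        now = time.monotonic()
        entry = self.cache.get(domain)
        if entry is None or entry[0] < now:
            # Cache miss or expired: fetch and parse robots.txt again.
            parser = RobotFileParser()
            parser.parse(self.fetch_fn(domain).splitlines())
            entry = (now + self.ttl, parser)
            self.cache[domain] = entry
        return entry[1].can_fetch(agent, f"https://{domain}{path}")
```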
### Bloom Filters (Mathematical Magic)
A Bloom Filter is a bit array of size m with k hash functions.
* **Add Item**: Hash it k times, set bits to 1.
* **Check Item**: Hash it k times. If *all* bits are 1, it *might* exist. If *any* bit is 0, it *definitely* does not exist.
* **False Positive**: We might think we visited a page when we haven't. This is acceptable for a crawler (we just miss one update).
* **Space Savings**: Tracks billions of URLs in about 1 GB of RAM (a plain hash set would need ~100 GB).
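The add/check mechanics above fit in a few lines. A minimal sketch; the bit-array size and the salted-SHA-256 scheme for deriving the k hash functions are illustrative choices, not a production design:

```python
import hashlib

class BloomFilter:
    """m-bit array with k hash functions, as described above."""

    def __init__(self, m_bits=1_000_000, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item):
        # Derive k independent positions by salting the hash input.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # All k bits set -> "probably seen"; any bit clear -> definitely not.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Note the asymmetry: `might_contain` can return a false positive but never a false negative, which is exactly the trade-off the crawler accepts.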
### DNS Resolver Caching (Why Custom?)
Visualized as **DNS Resolver** in the diagram.
DNS lookups take 10ms - 200ms. If we crawl 100 pages from `nytimes.com`, we don't want 100 DNS requests.
* **The OS Problem**: Standard libc `getaddrinfo` is **Blocking**. If you have 1000 threads, and 500 are waiting on DNS, your CPU is idle.
* **The Solution**: Build a custom **Async DNS Resolver** over UDP. We fire a UDP packet, register a callback, and move on. No thread blocking.
* **Caching**: We cache `hostname → IP` in memory.
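The caching layer can be sketched as below. The resolver coroutine is injected so the example stays offline; in production it would wrap an async DNS client (e.g., aiodns), and the `CachingResolver` name is an assumption for illustration:

```python
import asyncio
import time

class CachingResolver:
    """In-memory hostname -> IP cache with TTL in front of an async resolver."""

    def __init__(self, resolve_fn, ttl_seconds=300):
        self.resolve_fn = resolve_fn     # async: hostname -> IP string
        self.ttl = ttl_seconds
        self.cache = {}                  # hostname -> (expires_at, ip)
        self.misses = 0

    async def resolve(self, hostname):
        now = time.monotonic()
        entry = self.cache.get(hostname)
        if entry and entry[0] > now:
            return entry[1]              # cache hit: no DNS round-trip
        self.misses += 1
        ip = await self.resolve_fn(hostname)
        self.cache[hostname] = (now + self.ttl, ip)
        return ip
```

With this in place, crawling 100 pages from `nytimes.com` costs one DNS round-trip instead of 100.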
### Content Deduplication (SimHash)
How do we know if two pages are 99% similar (Near-Duplicate)?
* **Exact Match (MD5)**: Useless for near-duplicates. Changing one character ("the" → "teh") changes the entire hash.
* **SimHash Algorithm**:
1. **Tokenize**: Break text into words/features.
2. **Hash**: Hash each word into 64-bits.
3. **Vector Sum**: Start with a vector `V = [0,0...0]`. For each bit in the word hash: if 1, add weight to `V[i]`; if 0, subtract weight.
4. **Fingerprint**: If `V[i] > 0`, set final bit to `1`. Else `0`.
* **Hamming Distance**: If the SimHash of Doc A and Doc B differ by only 2 or 3 bits, they are near-duplicates. We discard the new one.
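The four steps map directly to code. A minimal sketch using uniform token weight 1 (real systems often weight features, e.g., by frequency):

```python
import hashlib
import re

def simhash(text, bits=64):
    """64-bit SimHash following the four steps above."""
    v = [0] * bits
    for token in re.findall(r"\w+", text.lower()):                  # 1. tokenize
        h = int.from_bytes(
            hashlib.sha256(token.encode()).digest()[:8], "big")     # 2. hash
        for i in range(bits):                                       # 3. vector sum
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)             # 4. fingerprint

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Near-duplicate texts mostly share tokens, so their bit-wise vector sums (and hence fingerprints) agree in almost every position; unrelated texts differ in roughly half the bits.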
---
## 8. Requirements Traceability
| Requirement | Design Decision | Justification |
|---|---|---|
| Politeness | Politeness Queue / Router | Ensures we never exceed QPS limits for a target domain. |
| Throughput | Async DNS Resolver | Prevents thread blocking during DNS lookups, increasing concurrency. |
| Scalability | Bloom Filter | Tracks billions of visited URLs in RAM to prevent cycles. |
| Storage | BigTable / HBase | NoSQL wide-column store handles massive write throughput and PBs of data. |
| Freshness | Priority Queue (PageRank) | Ensures important pages (high PageRank) are crawled more frequently. |
---
## 9. Observability & Metrics
We need to ensure the crawler isn't stuck or banned.
### Key Metrics
* **Crawl Rate**: Pages/sec per worker.
* **DNS Latency**: Avg time to resolve IP. Spike indicates DNS throttling.
* **Error Rate**: 4xx/5xx responses. High 403 means we are IP banned (Rotate Proxy).
* **Queue Depth**: If Front Queue grows but Back Queue is empty, the Router is slow.
### Trap Detection
> [!NOTE]
> **War Story: The Spider Trap Loop**
> In the early days, a major search engine crawler encountered dynamically generated calendar links (where "Next Month" produced a new unique URL infinitely). This created an infinite loop, wasting massive bandwidth and storage. They solved this by implementing strict path-depth limits (`max_depth = 15`) and detecting repeating path segments to dynamically prune the crawl graph.
* **Path Depth**: Alert if `url_depth > 20`.
* **Domain Concentration**: Alert if 50% of traffic is hitting one domain (Spider Trap).
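The depth and repeated-segment checks can be combined into one guard before a URL enters the frontier. A heuristic sketch; the thresholds and function name are illustrative:

```python
from urllib.parse import urlparse

def looks_like_trap(url, max_depth=15, max_repeats=3):
    """Reject URLs whose path is too deep or repeats a segment many times
    (e.g. the infinite-calendar pattern /cal/next/cal/next/...)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True                     # path-depth limit
    for seg in set(segments):
        if segments.count(seg) >= max_repeats:
            return True                 # repeating path segment
    return False
```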
---
## 10. Deployment Strategy
### Rolling Updates
We have thousands of worker nodes.
1. **Drain**: Stop assigning new tasks to Worker A.
2. **Wait**: Wait for active crawls to finish (graceful shutdown).
3. **Update**: Deploy new binary.
4. **Resume**: Re-register with Frontier.
### Resiliency
If a worker crashes:
* The Frontier detects heartbeat failure.
* The URLs assigned to that worker are re-queued to another worker.
---
## 11. Interview Gauntlet
### Rapid Fire Questions
1. **How do you handle 'robots.txt'?** We cache it for every domain. Before crawling, we check the rules. We respect `User-agent: * Disallow: /admin`.
2. **What if the page renders with JavaScript (React/Angular)?** A standard HTTP GET returns near-empty HTML. We need a "Headless Browser" (Chrome/Puppeteer) service to render the DOM. This is roughly 10× slower and more expensive.
3. **How do you detect infinite loops?** Bloom Filters for visited URLs and Max Depth limit.
4. **Why not use a standard Hash Set instead of Bloom Filter?** 1 Billion URLs * 100 bytes = 100 GB RAM. Bloom Filter uses ~1 GB.
5. **How do you search SimHash fingerprints efficiently?** We cannot scan all rows. We split the 64-bit hash into four 16-bit blocks, each indexed in its own table, and look for exact block matches to narrow the candidates (Pigeonhole Principle).
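The pigeonhole search from question 5 can be sketched as follows (class and method names are assumptions for illustration). The key property: if two 64-bit fingerprints differ in at most 3 bits, those 3 bits can touch at most 3 of the 4 blocks, so at least one 16-bit block matches exactly.

```python
from collections import defaultdict

class SimHashIndex:
    """Index fingerprints by their four 16-bit blocks; only candidates
    sharing at least one block are compared bit by bit."""

    def __init__(self, max_distance=3):
        self.max_distance = max_distance
        self.tables = [defaultdict(set) for _ in range(4)]  # block -> fingerprints

    @staticmethod
    def _blocks(fp):
        return [(fp >> (16 * i)) & 0xFFFF for i in range(4)]

    def add(self, fp):
        for i, block in enumerate(self._blocks(fp)):
            self.tables[i][block].add(fp)

    def find_near(self, fp):
        # Gather candidates that share any block, then verify Hamming distance.
        candidates = set()
        for i, block in enumerate(self._blocks(fp)):
            candidates |= self.tables[i].get(block, set())
        return [c for c in candidates
                if bin(c ^ fp).count("1") <= self.max_distance]
```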
---
## 12. Interactive Decision Visualizer: The Bloom Filter
This simulator demonstrates how a **Bloom Filter** saves memory.
1. **Add URL**: Hashes the URL to 3 positions and flips bits to 1 (Green).
2. **Check URL**: Hashes the URL. If **ALL** corresponding bits are 1, it says "Probably Present".
3. **Collision**: Try adding "google.com" then checking "yahoo.com". If their bits overlap, you might get a False Positive (rare with large arrays).
## 13. Summary (Whiteboard)
**Scale & Capacity**
* 1 Billion Pages/Month
* Politeness is the limiter (not CPU)
* Async DNS is critical

**Core Algorithms**
* Bloom Filter: Deduplication (RAM efficient)
* SimHash: Content Similarity
* Politeness Router: Queue Mapping

**Storage**
* BigTable/HBase: Wide Column Store
* Row Key: Reverse URL (`com.google.www`)
* Col: Content, Metadata

**Reliability**
* Spider Traps: Max Depth check
* Robots.txt: Respect rules
* Dynamic Content: Headless Browser