Design a Web Crawler (Googlebot)
1. What is a Web Crawler?
A Web Crawler (or Spider) is a bot that systematically browses the World Wide Web, typically for the purpose of Web Indexing (Google Search). It starts with a list of “Seed URLs”, downloads the page, extracts links, and adds them to the queue.
Use Cases
- Search Engine Indexing: Google, Bing.
- Web Archiving: Wayback Machine.
- Web Mining: Financial analysis, Copyright checks, LLM Training Data.
Try it yourself: Open the `robots.txt` of any major site (e.g., `google.com/robots.txt`). See the `Disallow` rules. This is what the crawler must respect.
2. Requirements & Goals
Functional Requirements
- Crawl: Fetch content from a given URL.
- Extract: Parse HTML and find new links (`<a href>`).
- Store: Save the content for the Indexer.
Non-Functional Requirements
- Scalability: The web is huge (billions of pages).
- Politeness: DO NOT DDoS the target servers. Respect `robots.txt`.
- Robustness: Handle malformed HTML, 404s, infinite redirect loops, and spider traps.
- Extensibility: Support new content types (PDF, Images).
3. Capacity Estimation
Let’s design for 1 Billion Pages per Month.
Throughput (QPS)
- 1 Billion / 30 days / 24 hours / 3600 sec ≈ 400 Pages/sec.
- Peak traffic: 2× = 800 QPS.
- This seems low, but fetching a page is slow (DNS + Network + Parsing). We need massive concurrency (thousands of threads).
Storage
- Avg page size: 100 KB (HTML only).
- Storage = 1B × 100 KB = 100 TB / month.
- In 5 years: 6 PB. (We need BigTable / HBase).
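A quick sanity check of these numbers. This is a minimal back-of-the-envelope script; the 100 KB average page size and the 2x peak factor are the assumptions stated above:

```python
# Back-of-the-envelope check for the capacity numbers above.
PAGES_PER_MONTH = 1_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 3600        # ~2.6M seconds

avg_qps = PAGES_PER_MONTH / SECONDS_PER_MONTH
peak_qps = 2 * avg_qps                    # assumed 2x peak factor

AVG_PAGE_KB = 100                         # assumed average size, HTML only
monthly_tb = PAGES_PER_MONTH * AVG_PAGE_KB / 1e9    # KB -> TB
five_year_pb = monthly_tb * 12 * 5 / 1000           # TB -> PB

print(f"avg ~{avg_qps:.0f} pages/sec, peak ~{peak_qps:.0f} pages/sec")    # ~386 / ~772
print(f"~{monthly_tb:.0f} TB/month, ~{five_year_pb:.0f} PB over 5 years")  # 100 TB, 6 PB
```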
4. System APIs
The Crawler is an internal system, but the components communicate via RPC (gRPC).
4.1 Frontier Service (Task Queue)
service Frontier {
// Workers ask for a batch of URLs to crawl
rpc GetCrawlTasks(WorkerID) returns (stream CrawlTask);
// Workers report found links
rpc ReportLinks(ReportRequest) returns (Ack);
}
message CrawlTask {
string url = 1;
string domain = 2;
int32 depth = 3;
}
4.2 HTML Fetcher Service
POST /fetch
{
"url": "https://example.com",
"timeout": 5000,
"follow_redirects": true
}
Response:
{
"status": 200,
"content_type": "text/html",
"body_hash": "abc12345",
"content": "<html>...</html>"
}
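A hedged sketch of how a fetcher worker could produce a response in that shape, using the third-party `requests` library and `hashlib`; the truncated `body_hash` simply mirrors the example above, and error handling is omitted:

```python
import hashlib
import requests  # third-party: pip install requests

def fetch(url: str, timeout_ms: int = 5000, follow_redirects: bool = True) -> dict:
    """Fetch one page and return a dict shaped like the /fetch response above."""
    resp = requests.get(url, timeout=timeout_ms / 1000, allow_redirects=follow_redirects)
    body = resp.content
    return {
        "status": resp.status_code,
        "content_type": resp.headers.get("Content-Type", ""),
        "body_hash": hashlib.sha256(body).hexdigest()[:8],   # truncated to mirror the example
        "content": body.decode(resp.encoding or "utf-8", errors="replace"),
    }

# fetch("https://example.com")["status"]  # -> 200
```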
5. Database Design
1. URL Frontier (Queue System)
Stores URLs to be visited. Needs priority and politeness logic.
- Redis / Kafka: Fast, in-memory queues.
- Architecture: Split into “Front Queues” (Priority) and “Back Queues” (Politeness).
2. Content Storage (NoSQL)
Stores the raw HTML.
- BigTable / HBase / Cassandra: Ideal for high write throughput and large data. See Module 15: Deep Dive Data.
- Schema: `RowKey: URL_Hash`, `Col: HTML_Content`, `Col: Metadata`, `Col: SimHash`.
3. Duplicate Detection
We cannot run `SELECT * FROM visited WHERE url = ?` against billions of rows.
- Bloom Filter: Probabilistic data structure to check “Have we seen this URL?”.
6. High-Level Design
Figure: Web Crawler Pipeline Architecture. The URL Frontier prioritizes by PageRank/relevance and keeps one politeness queue per host; distributed workers perform the robots.txt check, link extraction, and SimHash de-dup; storage is keyed by URL hash with a metadata column.
- Seed URLs: Manually curated list (cnn.com, wikipedia.org).
- URL Frontier: Distributes URLs to workers (Prioritizer & Politeness).
- HTML Downloader: Fetches pages using Async DNS.
- Content Parser: Validates HTML, extracts links.
- Duplicate Eliminator: Checks Bloom Filter and SimHash (Content Dedup).
- Storage: Saves content to BigTable/HBase.
- Loop: New links go back to Frontier.
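The whole pipeline can be sketched as a single worker loop. This is illustrative only; `frontier`, `fetch`, `extract_links`, `bloom`, `simhash_index`, and `store` stand in for the components above and are not real APIs:

```python
def crawl_loop(frontier, fetch, extract_links, bloom, simhash_index, store):
    """One worker's view: Frontier -> Fetch -> Parse -> Dedup -> Store -> back to Frontier."""
    while True:
        task = frontier.get_task()                    # URL Frontier (priority + politeness)
        page = fetch(task.url)                        # HTML Downloader
        if page["status"] != 200:
            continue                                  # robustness: skip errors for this sketch
        if simhash_index.is_near_duplicate(page["content"]):
            continue                                  # content dedup (SimHash)
        store.save(task.url, page)                    # BigTable / HBase
        for link in extract_links(page["content"]):   # Content Parser
            if link not in bloom:                     # URL dedup (Bloom Filter)
                bloom.add(link)
                frontier.report_link(link, depth=task.depth + 1)   # loop back to Frontier
```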
7. Component Design (Deep Dive)
The URL Frontier (Priority vs Politeness)
Visualized in the URL Frontier zone of the diagram. This is the most complex part. It has two main buffers:
- Front Queue (Prioritizer):
- Prioritize URLs based on PageRank or change frequency.
- High Priority Q, Medium Q, Low Q.
- Back Queue (Politeness Router):
- The Rule: Do not hit `wikipedia.org` more than once per second.
- We map `wikipedia.org` to `Queue #5`.
- A “Queue Router” ensures that `Queue #5` is only consumed by one worker thread, enforcing the delay (a minimal sketch follows below).
- This implements a Rate Limiter. See Module 10: Simple Services (Rate Limiter).
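A minimal sketch of the Back Queue mapping and delay in Python. The queue count and the 1-second delay are illustrative; a real Frontier would also persist the queues and keep an explicit per-host mapping table:

```python
import time
from collections import deque
from urllib.parse import urlparse
from typing import Optional

NUM_BACK_QUEUES = 8           # assumption: small pool of back queues for this sketch
POLITENESS_DELAY_S = 1.0      # "do not hit a host more than once per second"

back_queues = [deque() for _ in range(NUM_BACK_QUEUES)]
next_allowed = [0.0] * NUM_BACK_QUEUES    # earliest time each queue may be read again

def route(url: str) -> None:
    """Queue Router: all URLs of a given host always land in the same back queue."""
    host = urlparse(url).netloc
    back_queues[hash(host) % NUM_BACK_QUEUES].append(url)

def pop_polite(queue_id: int) -> Optional[str]:
    """Consume one URL from a back queue, enforcing the per-queue delay.

    If several hosts hash to the same queue they share the delay: still polite,
    just less parallel than the one-queue-per-host design described above.
    """
    now = time.monotonic()
    if not back_queues[queue_id] or now < next_allowed[queue_id]:
        return None               # caller should try another queue or sleep briefly
    next_allowed[queue_id] = now + POLITENESS_DELAY_S
    return back_queues[queue_id].popleft()
```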
Bloom Filters (Mathematical Magic)
A Bloom Filter is a bit array of size m with k hash functions.
- Add Item: Hash it k times, set bits to 1.
- Check Item: Hash it k times. If all bits are 1, it might exist. If any bit is 0, it definitely does not exist.
- False Positive: We might think we visited a page when we haven’t. This is acceptable for a crawler (we just miss one update).
- Space Savings: Stores billions of URLs in RAM.
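A minimal, self-contained Bloom filter sketch. The bit-array size and the number of hash functions here are illustrative; in practice both are derived from the target false-positive rate:

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int = 8 * 1024 * 1024, k: int = 7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)            # 8M bits = 1 MB of RAM

    def _positions(self, item: str):
        # Derive k bit positions from two independent 64-bit hashes (double hashing).
        h = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # All bits set -> "might exist" (possible false positive); any bit 0 -> definitely new.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

visited = BloomFilter()
visited.add("https://example.com/a")
print("https://example.com/a" in visited)   # True  (might exist)
print("https://example.com/b" in visited)   # False (definitely not visited)
```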
DNS Resolver Caching
Visualized as DNS Resolver in the diagram.
DNS lookups take 10ms - 200ms. If we crawl 100 pages from nytimes.com, we don’t want 100 DNS requests.
- Problem: Standard OS DNS (`getaddrinfo`) is synchronous and blocks the thread. With thousands of threads, this becomes a major bottleneck.
- Solution: We build a custom Asynchronous DNS Resolver (using UDP directly) and maintain a custom DNS Cache (Map: `hostname -> IP`) inside the crawler.
- We refresh it periodically (TTL).
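A hedged sketch of the in-process cache. For brevity it uses `asyncio`'s built-in `loop.getaddrinfo` (which offloads the blocking call to a thread pool) instead of speaking raw UDP; the TTL value is an assumption:

```python
import asyncio
import socket
import time

DNS_TTL_S = 300                                  # assumption: refresh entries every 5 minutes
_dns_cache: dict[str, tuple[str, float]] = {}    # hostname -> (ip, expires_at)

async def resolve(hostname: str) -> str:
    """Return an IP for hostname, serving from the local cache while it is fresh."""
    now = time.monotonic()
    cached = _dns_cache.get(hostname)
    if cached and cached[1] > now:
        return cached[0]
    loop = asyncio.get_running_loop()
    # Does not block the event loop: the lookup runs in the default executor.
    infos = await loop.getaddrinfo(hostname, 80, type=socket.SOCK_STREAM)
    ip = infos[0][4][0]
    _dns_cache[hostname] = (ip, now + DNS_TTL_S)
    return ip

# asyncio.run(resolve("example.com"))
```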
Content Deduplication (SimHash)
What if two different URLs serve identical (or near-identical) content?
- Exact Match: MD5(Content). Fails if one byte differs (e.g., timestamp).
- Near-Duplicate (SimHash): A fingerprinting algorithm where similar documents have similar hashes (small Hamming Distance). We use this to discard duplicate content.
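A minimal 64-bit SimHash sketch over whitespace tokens. A production version would weight tokens (e.g. by TF-IDF) and use the banded index from the Interview Gauntlet below to find near matches without scanning every row:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Similar documents get fingerprints with a small Hamming distance."""
    vector = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

doc1 = "markets rally as rates hold steady, analysts say, updated 10:01"
doc2 = "markets rally as rates hold steady, analysts say, updated 10:02"
print(hamming_distance(simhash(doc1), simhash(doc2)))   # small distance -> near-duplicate
```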
8. Requirements Traceability
| Requirement | Design Decision | Justification |
|---|---|---|
| Politeness | Politeness Queue / Router | Ensures we never exceed QPS limits for a target domain. |
| Throughput | Async DNS Resolver | Prevents thread blocking during DNS lookups, increasing concurrency. |
| Scalability | Bloom Filter | Tracks billions of visited URLs in RAM to prevent cycles. |
| Storage | BigTable / HBase | NoSQL Wide Column Store handles massive write throughput and PB of data. |
| Freshness | Priority Queue (PageRank) | Ensures important pages (high PageRank) are crawled more frequently. |
9. Observability & Metrics
We need to ensure the crawler isn’t stuck or banned.
Key Metrics
- Crawl Rate: Pages/sec per worker.
- DNS Latency: Avg time to resolve IP. Spike indicates DNS throttling.
- Error Rate: 4xx/5xx responses. A high 403 rate means we are IP-banned (rotate proxies).
- Queue Depth: If Front Queue grows but Back Queue is empty, the Router is slow.
Trap Detection
- Path Depth: Alert if `url_depth > 20`.
- Domain Concentration: Alert if 50% of traffic is hitting one domain (Spider Trap).
10. Deployment Strategy
Rolling Updates
We have thousands of worker nodes.
- Drain: Stop assigning new tasks to Worker A.
- Wait: Wait for active crawls to finish (graceful shutdown).
- Update: Deploy new binary.
- Resume: Re-register with Frontier.
Resiliency
If a worker crashes:
- The Frontier detects heartbeat failure.
- The URLs assigned to that worker are re-queued to another worker.
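A minimal sketch of the lease model that makes this work: the Frontier hands out tasks with an expiry, heartbeats extend the leases, and a periodic reaper re-queues anything whose lease lapsed. The timeout value and data structures are assumptions, not the Frontier's real API:

```python
import time

LEASE_TIMEOUT_S = 120                           # assumption: finish or heartbeat within 2 minutes

pending: list[str] = []                         # URLs waiting to be assigned
leases: dict[str, tuple[str, float]] = {}       # url -> (worker_id, lease_expiry)

def assign(url: str, worker_id: str) -> None:
    leases[url] = (worker_id, time.monotonic() + LEASE_TIMEOUT_S)

def heartbeat(worker_id: str) -> None:
    """Extend the leases of everything this worker is still crawling."""
    now = time.monotonic()
    for url, (owner, _) in list(leases.items()):
        if owner == worker_id:
            leases[url] = (owner, now + LEASE_TIMEOUT_S)

def reap_expired() -> None:
    """Runs periodically in the Frontier: re-queue tasks whose worker went silent."""
    now = time.monotonic()
    for url, (_, expiry) in list(leases.items()):
        if expiry < now:
            del leases[url]
            pending.append(url)                 # another worker will pick it up
```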
11. Interview Gauntlet
Rapid Fire Questions
- How do you handle `robots.txt`? We cache it for every domain. Before crawling, we check its rules. We respect directives like `User-agent: *` / `Disallow: /admin` (see the sketch after this list).
- What if the page renders with JavaScript (React/Angular)? A standard HTTP GET returns a mostly empty HTML shell. We need a “Headless Browser” (Chrome/Puppeteer) service to render the DOM. This is roughly 10x slower and more expensive.
- How do you detect infinite loops? Bloom Filters for visited URLs and Max Depth limit.
- Why not use a standard Hash Set instead of Bloom Filter? 1 Billion URLs * 100 bytes = 100 GB RAM. Bloom Filter uses ~1 GB.
- How do you update ‘SimHash’ efficiently? We cannot scan all rows. We split the 64-bit hash into four 16-bit segments and index each segment in its own table; by the Pigeonhole Principle, any fingerprint within Hamming distance 3 must match at least one segment exactly, so we only compare candidates from those smaller tables.
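For the `robots.txt` answer above, a minimal sketch using Python's standard-library parser; the per-domain cache dict and the user-agent string are illustrative:

```python
from urllib import robotparser
from urllib.parse import urlparse

_robots_cache: dict[str, robotparser.RobotFileParser] = {}   # one parsed file per origin

def allowed(url: str, user_agent: str = "MyCrawlerBot") -> bool:
    """Check robots.txt before fetching, caching the parsed rules per domain."""
    parts = urlparse(url)
    origin = f"{parts.scheme}://{parts.netloc}"
    rp = _robots_cache.get(origin)
    if rp is None:
        rp = robotparser.RobotFileParser(origin + "/robots.txt")
        rp.read()                                # fetch and parse the file once
        _robots_cache[origin] = rp
    return rp.can_fetch(user_agent, url)

# allowed("https://example.com/admin")  # False if "Disallow: /admin" applies to our agent
```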
12. Interactive Decision Visualizer: Frontier & Deduplication
See how the URL Frontier processes incoming links.
- Bloom Filter: Checks if URL was already visited (Red flash).
- Router: Sorts URLs into queues based on Domain.
- Politeness: Ensures we wait before hitting the same domain again (Yellow wait state).
13. Summary (Whiteboard)
Scale & Capacity
- 1 Billion Pages/Month
- Politeness is the limiter (not CPU)
- Async DNS is critical
Core Algorithms
- Bloom Filter: Deduplication (RAM efficient)
- SimHash: Content Similarity
- Politeness Router: Queue Mapping
Storage
- BigTable/HBase: Wide Column Store
- Row Key: Reverse URL (`com.google.www`)
- Col: Content, Metadata
Reliability
- Spider Traps: Max Depth check
- Robots.txt: Respect rules
- Dynamic Content: Headless Browser