Design a Web Crawler

[!IMPORTANT] In this lesson, you will master:

  1. What is a Web Crawler?: Building intuition behind crawlers and their use cases.
  2. Requirements & Goals: Defining functional and non-functional requirements.
  3. Capacity Estimation: Back-of-envelope throughput and storage numbers.

1. What is a Web Crawler?

A Web Crawler (or Spider) is a bot that systematically browses the World Wide Web, typically for the purpose of Web Indexing (Google Search). It starts with a list of “Seed URLs”, downloads the page, extracts links, and adds them to the queue.

Use Cases

  • Search Engine Indexing: Google, Bing.
  • Web Archiving: Wayback Machine.
  • Web Mining: Financial analysis, Copyright checks, LLM Training Data.

Try it yourself: Open robots.txt of any major site (e.g., google.com/robots.txt). See the Disallow rules. This is what the crawler must respect.
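To experiment with the same rules programmatically, here is a minimal sketch using Python's standard `urllib.robotparser` (the rules snippet below is a made-up example, parsed locally with no network call):

```python
from urllib.robotparser import RobotFileParser

# A tiny, made-up robots.txt snippet parsed locally (no network needed).
rules = """
User-agent: *
Disallow: /private/
"""
parser = RobotFileParser()
parser.parse(rules.splitlines())

# A polite crawler checks every URL against the rules before fetching.
print(parser.can_fetch("MyCrawler", "https://example.com/index.html"))  # True
print(parser.can_fetch("MyCrawler", "https://example.com/private/x"))   # False
```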


2. Requirements & Goals

Functional Requirements

  • Crawl: Fetch content from a given URL.
  • Extract: Parse HTML and find new links (<a href>).
  • Store: Save the content for the Indexer.

Non-Functional Requirements

  • Scalability: The web is huge (billions of pages), so crawling must be distributed.
  • Politeness: DO NOT DDoS the target servers. Respect robots.txt.
  • Robustness: Handle malformed HTML, 404s, infinite redirect loops, and spider traps.
  • Extensibility: Support new content types (PDF, Images).

3. Capacity Estimation

Let’s design for 1 Billion Pages per Month.

Throughput (QPS)

  • 1 Billion / 30 days / 24 hours / 3600 sec ≈ 400 Pages/sec.
  • Peak traffic: 2× = 800 QPS.
  • This seems low, but fetching a page is slow (DNS + Network + Parsing). We need massive concurrency (thousands of threads).

Storage

  • Avg page size: 100 KB (HTML only).
  • Storage = 1B × 100 KB = 100 TB / month.
  • In 5 years: 6 PB. (We need BigTable / HBase).
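The arithmetic above can be verified in a few lines:

```python
# Back-of-envelope numbers for 1 billion pages/month.
pages_per_month = 1_000_000_000
seconds_per_month = 30 * 24 * 3600            # 2,592,000

qps = pages_per_month / seconds_per_month     # ~386, rounded up to ~400
peak_qps = 2 * qps                            # 2x headroom => ~800 QPS

avg_page_kb = 100
monthly_tb = pages_per_month * avg_page_kb / 1e9    # KB -> TB
five_year_pb = monthly_tb * 12 * 5 / 1000           # TB -> PB

print(round(qps), monthly_tb, five_year_pb)   # 386 100.0 6.0
```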

4. System APIs

The Crawler is an internal system, but the components communicate via RPC (gRPC).

4.1 Frontier Service (Task Queue)

```protobuf
service Frontier {
  // Workers ask for a batch of URLs to crawl
  rpc GetCrawlTasks(WorkerID) returns (stream CrawlTask);

  // Workers report newly discovered links
  rpc ReportLinks(ReportRequest) returns (Ack);
}

message CrawlTask {
  string url = 1;
  string domain = 2;
  int32 depth = 3;
}
```

4.2 HTML Fetcher Service

```http
POST /fetch
{
  "url": "https://example.com",
  "timeout": 5000,
  "follow_redirects": true
}
```

Response:

```json
{
  "status": 200,
  "content_type": "text/html",
  "body_hash": "abc12345",
  "content": "<html>...</html>"
}
```
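A minimal sketch of what the Fetcher does internally, using Python's standard `urllib` (the helper names are illustrative assumptions, not part of the API above):

```python
import hashlib
import urllib.request

def build_response(status: int, content_type: str, body: bytes) -> dict:
    """Package a fetched page in the shape of the Fetcher response above."""
    return {
        "status": status,
        "content_type": content_type,
        # A short content hash lets the dedup stage compare pages cheaply.
        "body_hash": hashlib.sha256(body).hexdigest()[:8],
        "content": body.decode("utf-8", errors="replace"),
    }

def fetch(url: str, timeout_ms: int = 5000) -> dict:
    # urllib follows redirects by default, matching follow_redirects=true.
    with urllib.request.urlopen(url, timeout=timeout_ms / 1000) as resp:
        return build_response(
            resp.status, resp.headers.get("Content-Type", ""), resp.read()
        )

print(build_response(200, "text/html", b"<html>...</html>")["status"])  # 200
```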

5. Database Design

1. URL Frontier (Queue System)

Stores URLs to be visited. Needs priority and politeness logic.

  • Redis / Kafka: Fast, in-memory queues.
  • Architecture: Split into “Front Queues” (Priority) and “Back Queues” (Politeness).
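The "Back Queue" politeness mapping can be sketched as a toy in-memory version (class and method names are illustrative assumptions; production frontiers persist queues in Redis/Kafka):

```python
import hashlib
from collections import defaultdict, deque
from typing import Optional

class BackQueueRouter:
    """Politeness: all URLs of one host map to one back queue, with a min delay."""

    def __init__(self, num_queues: int = 8, delay_s: float = 1.0):
        self.queues = [deque() for _ in range(num_queues)]
        self.delay_s = delay_s
        self.next_allowed = defaultdict(float)  # queue index -> earliest fetch time

    def queue_for(self, host: str) -> int:
        # A stable hash keeps every URL of one host in the same queue.
        return int(hashlib.md5(host.encode()).hexdigest(), 16) % len(self.queues)

    def add(self, host: str, url: str) -> None:
        self.queues[self.queue_for(host)].append(url)

    def poll(self, i: int, now: float) -> Optional[str]:
        """Dequeue from queue i only if its politeness delay has elapsed."""
        if self.queues[i] and now >= self.next_allowed[i]:
            self.next_allowed[i] = now + self.delay_s
            return self.queues[i].popleft()
        return None
```

Because one queue is consumed by one worker at a time, the per-host delay is enforced without any global coordination.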

2. Content Storage (NoSQL)

Stores the raw HTML.

  • BigTable / HBase / Cassandra: Ideal for high write throughput and large data. See Module 15: Deep Dive Data.
  • Schema: RowKey: URL_Hash, Col: HTML_Content, Col: Metadata, Col: SimHash.

3. Duplicate Detection

We cannot run `SELECT * FROM visited WHERE url = ?` for billions of rows.

  • Bloom Filter: Probabilistic data structure to check “Have we seen this URL?”.
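A minimal Bloom filter sketch (the parameters `m` and `k` are illustrative; real deployments size them from the expected URL count and target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit array."""

    def __init__(self, m: int = 1_000_000, k: int = 3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, url: str):
        # Derive k independent positions by salting the hash input.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, url: str) -> None:
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, url: str) -> bool:
        # Any zero bit: definitely never added. All ones: probably added.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))
```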

6. High-Level Design

Web Crawler Pipeline Architecture.

*Figure: System Architecture: Web Crawler (Spider). The URL Frontier zone feeds Seed URLs through Front Queues (F1..Fn) via a Prioritizer (PageRank, relevance), then into Back Queues (B1..Bn) via a Politeness Router (one queue per host). The fetch flow runs through an async DNS Resolver with a custom cache, the HTML Downloader (HTTP client, robots.txt check, distributed workers), and the Content Parser (validation, link extraction, SimHash de-dup). A Bloom Filter performs the URL-seen test, and BigTable/HBase stores content (row: URL hash; columns: HTML content, metadata). Extracted links flow back into the Frontier as a feedback loop.*
1. **Seed URLs**: Manually curated list (cnn.com, wikipedia.org).
2. **URL Frontier**: Distributes URLs to workers (Prioritizer & Politeness).
3. **HTML Downloader**: Fetches pages using **Async DNS**.
4. **Content Parser**: Validates HTML, extracts links.
5. **Duplicate Eliminator**: Checks Bloom Filter and SimHash (Content Dedup).
6. **Storage**: Saves content to BigTable/HBase.
7. **Loop**: New links go back to the Frontier.

---

## 7. Component Design (Deep Dive)

### The URL Frontier (Priority vs Politeness)

Visualized in the **URL Frontier** zone of the diagram. This is the most complex part. It has two main buffers:

1. **Front Queue (Prioritizer)**:
   * Prioritizes URLs based on PageRank or change frequency.
   * High Priority Q, Medium Q, Low Q.
2. **Back Queue (Politeness Router)**:
   * **The Rule**: Do not hit `wikipedia.org` more than once per second.
   * We map `wikipedia.org` to `Queue #5`.
   * A "Queue Router" ensures that `Queue #5` is only consumed by one worker thread, enforcing the delay.
   * This implements a **Rate Limiter**. See [Module 10: Simple Services (Rate Limiter)](/system-design/10-simple-services/03-api-rate-limiter/).

### Robot Exclusion Standard (Ethics)

Before crawling `example.com`, we fetch `example.com/robots.txt`.

* **Rules**: `User-agent: *` followed by `Disallow: /private/`.
* **Legal**: While not a law, ignoring it gets your IPs banned and invites lawsuits.
* **Implementation**: We cache the `robots.txt` for every domain in a KV store (Redis) with a TTL.

### Bloom Filters (Mathematical Magic)

A Bloom Filter is a bit array of size m with k hash functions.

* **Add Item**: Hash it k times, set those bits to 1.
* **Check Item**: Hash it k times. If *all* bits are 1, it *might* exist. If *any* bit is 0, it *definitely* does not exist.
* **False Positive**: We might think we visited a page when we haven't. This is acceptable for a crawler (we just miss one update).
* **Space Savings**: Stores billions of URLs in RAM.

### DNS Resolver Caching (Why Custom?)

Visualized as **DNS Resolver** in the diagram. DNS lookups take 10 ms to 200 ms. If we crawl 100 pages from `nytimes.com`, we don't want 100 DNS requests.

* **The OS Problem**: Standard libc `getaddrinfo` is **blocking**. If you have 1000 threads and 500 are waiting on DNS, your CPU sits idle.
* **The Solution**: Build a custom **Async DNS Resolver** over UDP. We fire a UDP packet, register a callback, and move on. No thread blocks.
* **Caching**: We cache `hostname → IP` in memory.

### Content Deduplication (SimHash)

How do we know if two pages are 99% similar (near-duplicates)?

* **Exact Match (MD5)**: Useless. Changing one character ("the" → "teh") changes the entire hash.
* **SimHash Algorithm**:
  1. **Tokenize**: Break text into words/features.
  2. **Hash**: Hash each word into 64 bits.
  3. **Vector Sum**: Start with a vector `V = [0,0...0]`. For each bit in the word hash: if 1, add the word's weight to `V[i]`; if 0, subtract it.
  4. **Fingerprint**: If `V[i] > 0`, set final bit i to `1`; else `0`.
* **Hamming Distance**: If the SimHashes of Doc A and Doc B differ by only 2 or 3 bits, they are near-duplicates. We discard the new one.

---

## 8. Requirements Traceability

| Requirement | Design Decision | Justification |
| --- | --- | --- |
| Politeness | Politeness Queue / Router | Ensures we never exceed QPS limits for a target domain. |
| Throughput | Async DNS Resolver | Prevents thread blocking during DNS lookups, increasing concurrency. |
| Scalability | Bloom Filter | Tracks billions of visited URLs in RAM to prevent cycles. |
| Storage | BigTable / HBase | NoSQL wide-column store handles massive write throughput and PB of data. |
| Freshness | Priority Queue (PageRank) | Ensures important pages (high PageRank) are crawled more frequently. |
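The SimHash + Hamming-distance dedup described in the deep dive can be sketched as a toy version (word-level features with uniform weights; real systems weight features, e.g. by TF-IDF):

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Toy SimHash: tokenize, hash each word, vector-sum, threshold to bits."""
    v = [0] * bits
    for word in text.lower().split():
        h = int.from_bytes(hashlib.sha256(word.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1   # uniform weight of 1 per word
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"   # near-duplicate
doc_c = "completely different content about distributed storage engines"

near = hamming(simhash(doc_a), simhash(doc_b))
far = hamming(simhash(doc_a), simhash(doc_c))
print(near, far)   # near-duplicates differ in far fewer bits than unrelated docs
```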

---

## 9. Observability & Metrics

We need to ensure the crawler isn't stuck or banned.

### Key Metrics

* **Crawl Rate**: Pages/sec per worker.
* **DNS Latency**: Average time to resolve an IP. A spike indicates DNS throttling.
* **Error Rate**: 4xx/5xx responses. A high 403 rate means we are IP banned (rotate proxies).
* **Queue Depth**: If the Front Queue grows but the Back Queue is empty, the Router is slow.

### Trap Detection

> [!NOTE]
> **War Story: The Spider Trap Loop**
> In the early days, a major search engine crawler encountered dynamically generated calendar links (where "Next Month" produced a new unique URL infinitely). This created an infinite loop, wasting massive bandwidth and storage. They solved it by implementing strict path-depth limits (`max_depth = 15`) and detecting repeating path segments to dynamically prune the crawl graph.

* **Path Depth**: Alert if `url_depth > 20`.
* **Domain Concentration**: Alert if 50% of traffic is hitting one domain (spider trap).

---

## 10. Deployment Strategy

### Rolling Updates

We have thousands of worker nodes.

1. **Drain**: Stop assigning new tasks to Worker A.
2. **Wait**: Wait for active crawls to finish (graceful shutdown).
3. **Update**: Deploy the new binary.
4. **Resume**: Re-register with the Frontier.

### Resiliency

If a worker crashes:

* The Frontier detects the heartbeat failure.
* The URLs assigned to that worker are re-queued to another worker.

---

## 11. Interview Gauntlet

### Rapid Fire Questions

1. **How do you handle robots.txt?** We cache it for every domain. Before crawling, we check the rules. We respect `User-agent: *` / `Disallow: /admin`.
2. **What if the page renders with JavaScript (React/Angular)?** A plain HTTP GET returns near-empty HTML. We need a "Headless Browser" (Chrome/Puppeteer) service to render the DOM. This is roughly 10x slower and more expensive.
3. **How do you detect infinite loops?** Bloom Filters for visited URLs and a Max Depth limit.
4. **Why not use a standard Hash Set instead of a Bloom Filter?** 1 billion URLs × 100 bytes = 100 GB of RAM. A Bloom Filter uses ~1 GB.
5. **How do you query SimHash efficiently?** We cannot scan all rows. We split the 64-bit hash into 4 tables keyed on 16-bit bands and search for exact matches in these smaller tables (Pigeonhole Principle).

---

## 12. Interactive Decision Visualizer: The Bloom Filter

This simulator demonstrates how a **Bloom Filter** saves memory.

1. **Add URL**: Hashes the URL to 3 positions and flips those bits to 1 (green).
2. **Check URL**: Hashes the URL. If **ALL** corresponding bits are 1, it reports "Probably Present".
3. **Collision**: Try adding "google.com" then checking "yahoo.com". If their bits overlap, you might get a false positive (rare with large arrays).
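The 4-table SimHash lookup from the rapid-fire answers can be sketched as follows. By the pigeonhole principle, two 64-bit fingerprints that differ in at most 3 bits must agree exactly on at least one of their four 16-bit bands, so exact-match band lookups find every near-duplicate candidate:

```python
from collections import defaultdict

def bands(fp: int):
    """Split a 64-bit fingerprint into four (position, 16-bit value) bands."""
    return [(i, (fp >> (16 * i)) & 0xFFFF) for i in range(4)]

index = defaultdict(set)   # one exact-match entry per band

def add(fp: int) -> None:
    for key in bands(fp):
        index[key].add(fp)

def candidates(fp: int) -> set:
    """Every stored fingerprint sharing at least one band: a superset of all
    fingerprints within Hamming distance 3 of fp (pigeonhole principle)."""
    found = set()
    for key in bands(fp):
        found |= index[key]
    return found

stored = 0x1234_5678_9ABC_DEF0
add(stored)
query = stored ^ 0b101              # flip 2 bits, both inside the lowest band
print(stored in candidates(query))  # True: the other three bands still match
```

Candidates are then verified with a full Hamming-distance check; the bands only prune the search space.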
## 13. Summary (Whiteboard)

Scale & Capacity

  • 1 Billion Pages/Month
  • Politeness is the limiter (not CPU)
  • Async DNS is critical

Core Algorithms

  • Bloom Filter: Deduplication (RAM efficient)
  • SimHash: Content Similarity
  • Politeness Router: Queue Mapping

Storage

  • BigTable/HBase: Wide Column Store
  • Row Key: Reverse URL (`com.google.www`)
  • Col: Content, Metadata

Reliability

  • Spider Traps: Max Depth check
  • Robots.txt: Respect rules
  • Dynamic Content: Headless Browser