Design a Web Crawler

[!IMPORTANT] In this lesson, you will master:

  1. What is a Web Crawler?: Building intuition behind crawlers and their use cases.
  2. Requirements & Goals: Defining functional and non-functional requirements.
  3. Capacity Estimation: Back-of-envelope throughput and storage numbers.

1. What is a Web Crawler?

A Web Crawler (or Spider) is a bot that systematically browses the World Wide Web, typically for the purpose of Web Indexing (Google Search). It starts with a list of “Seed URLs”, downloads the page, extracts links, and adds them to the queue.

Use Cases

  • Search Engine Indexing: Google, Bing.
  • Web Archiving: Wayback Machine.
  • Web Mining: Financial analysis, Copyright checks, LLM Training Data.

Try it yourself: Open robots.txt of any major site (e.g., google.com/robots.txt). See the Disallow rules. This is what the crawler must respect.
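To experiment with the same rules programmatically, here is a minimal sketch using Python's standard `urllib.robotparser` (the rules snippet below is a made-up example, parsed locally with no network call):

```python
from urllib.robotparser import RobotFileParser

# A tiny, made-up robots.txt snippet parsed locally (no network needed).
rules = """
User-agent: *
Disallow: /private/
"""
parser = RobotFileParser()
parser.parse(rules.splitlines())

# A polite crawler checks every URL against the rules before fetching.
print(parser.can_fetch("MyCrawler", "https://example.com/index.html"))  # True
print(parser.can_fetch("MyCrawler", "https://example.com/private/x"))   # False
```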


2. Requirements & Goals

Functional Requirements

  • Crawl: Fetch content from a given URL.
  • Extract: Parse HTML and find new links (<a href>).
  • Store: Save the content for the Indexer.

Non-Functional Requirements

  • Scalability: The web is huge (billions of pages), so crawling must be distributed.
  • Politeness: DO NOT DDoS the target servers. Respect robots.txt.
  • Robustness: Handle malformed HTML, 404s, infinite redirect loops, and spider traps.
  • Extensibility: Support new content types (PDF, Images).

3. Capacity Estimation

Let’s design for 1 Billion Pages per Month.

Throughput (QPS)

  • 1 Billion / 30 days / 24 hours / 3600 sec ≈ 400 Pages/sec.
  • Peak traffic: 2× = 800 QPS.
  • This seems low, but fetching a page is slow (DNS + Network + Parsing). We need massive concurrency (thousands of threads).

Storage

  • Avg page size: 100 KB (HTML only).
  • Storage = 1B × 100 KB = 100 TB / month.
  • In 5 years: 6 PB. (We need BigTable / HBase).
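The arithmetic above can be verified in a few lines:

```python
# Back-of-envelope numbers for 1 billion pages/month.
pages_per_month = 1_000_000_000
seconds_per_month = 30 * 24 * 3600            # 2,592,000

qps = pages_per_month / seconds_per_month     # ~386, rounded up to ~400
peak_qps = 2 * qps                            # 2x headroom => ~800 QPS

avg_page_kb = 100
monthly_tb = pages_per_month * avg_page_kb / 1e9    # KB -> TB
five_year_pb = monthly_tb * 12 * 5 / 1000           # TB -> PB

print(round(qps), monthly_tb, five_year_pb)   # 386 100.0 6.0
```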

4. System APIs

The Crawler is an internal system, but the components communicate via RPC (gRPC).

4.1 Frontier Service (Task Queue)

```protobuf
service Frontier {
  // Workers ask for a batch of URLs to crawl
  rpc GetCrawlTasks(WorkerID) returns (stream CrawlTask);

  // Workers report newly discovered links
  rpc ReportLinks(ReportRequest) returns (Ack);
}

message CrawlTask {
  string url = 1;
  string domain = 2;
  int32 depth = 3;
}
```

4.2 HTML Fetcher Service

```http
POST /fetch
{
  "url": "https://example.com",
  "timeout": 5000,
  "follow_redirects": true
}
```

Response:

```json
{
  "status": 200,
  "content_type": "text/html",
  "body_hash": "abc12345",
  "content": "<html>...</html>"
}
```
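A minimal sketch of what the Fetcher does internally, using Python's standard `urllib` (the helper names are illustrative assumptions, not part of the API above):

```python
import hashlib
import urllib.request

def build_response(status: int, content_type: str, body: bytes) -> dict:
    """Package a fetched page in the shape of the Fetcher response above."""
    return {
        "status": status,
        "content_type": content_type,
        # A short content hash lets the dedup stage compare pages cheaply.
        "body_hash": hashlib.sha256(body).hexdigest()[:8],
        "content": body.decode("utf-8", errors="replace"),
    }

def fetch(url: str, timeout_ms: int = 5000) -> dict:
    # urllib follows redirects by default, matching follow_redirects=true.
    with urllib.request.urlopen(url, timeout=timeout_ms / 1000) as resp:
        return build_response(
            resp.status, resp.headers.get("Content-Type", ""), resp.read()
        )

print(build_response(200, "text/html", b"<html>...</html>")["status"])  # 200
```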

5. Database Design

1. URL Frontier (Queue System)

Stores URLs to be visited. Needs priority and politeness logic.

  • Redis / Kafka: Fast, in-memory queues.
  • Architecture: Split into “Front Queues” (Priority) and “Back Queues” (Politeness).
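The "Back Queue" politeness mapping can be sketched as a toy in-memory version (class and method names are illustrative assumptions; production frontiers persist queues in Redis/Kafka):

```python
import hashlib
from collections import defaultdict, deque
from typing import Optional

class BackQueueRouter:
    """Politeness: all URLs of one host map to one back queue, with a min delay."""

    def __init__(self, num_queues: int = 8, delay_s: float = 1.0):
        self.queues = [deque() for _ in range(num_queues)]
        self.delay_s = delay_s
        self.next_allowed = defaultdict(float)  # queue index -> earliest fetch time

    def queue_for(self, host: str) -> int:
        # A stable hash keeps every URL of one host in the same queue.
        return int(hashlib.md5(host.encode()).hexdigest(), 16) % len(self.queues)

    def add(self, host: str, url: str) -> None:
        self.queues[self.queue_for(host)].append(url)

    def poll(self, i: int, now: float) -> Optional[str]:
        """Dequeue from queue i only if its politeness delay has elapsed."""
        if self.queues[i] and now >= self.next_allowed[i]:
            self.next_allowed[i] = now + self.delay_s
            return self.queues[i].popleft()
        return None
```

Because one queue is consumed by one worker at a time, the per-host delay is enforced without any global coordination.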

2. Content Storage (NoSQL)

Stores the raw HTML.

  • BigTable / HBase / Cassandra: Ideal for high write throughput and large data. See Module 15: Deep Dive Data.
  • Schema: RowKey: URL_Hash, Col: HTML_Content, Col: Metadata, Col: SimHash.

3. Duplicate Detection

We cannot run `SELECT * FROM visited WHERE url = ?` for billions of rows.

  • Bloom Filter: Probabilistic data structure to check “Have we seen this URL?”.
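A minimal Bloom filter sketch (the parameters `m` and `k` are illustrative; real deployments size them from the expected URL count and target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit array."""

    def __init__(self, m: int = 1_000_000, k: int = 3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, url: str):
        # Derive k independent positions by salting the hash input.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, url: str) -> None:
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, url: str) -> bool:
        # Any zero bit: definitely never added. All ones: probably added.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))
```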

6. High-Level Design

Web Crawler Pipeline Architecture.

*Figure: System Architecture: Web Crawler (Spider). The URL Frontier zone feeds Seed URLs through Front Queues (F1..Fn) via a Prioritizer (PageRank, relevance), then into Back Queues (B1..Bn) via a Politeness Router (one queue per host). The fetch flow runs through an async DNS Resolver with a custom cache, the HTML Downloader (HTTP client, robots.txt check, distributed workers), and the Content Parser (validation, link extraction, SimHash de-dup). A Bloom Filter performs the URL-seen test, and BigTable/HBase stores content (row: URL hash; columns: HTML content, metadata). Extracted links flow back into the Frontier as a feedback loop.*
1. **Seed URLs**: Manually curated list (cnn.com, wikipedia.org).
2. **URL Frontier**: Distributes URLs to workers (Prioritizer & Politeness).
3. **HTML Downloader**: Fetches pages using **Async DNS**.
4. **Content Parser**: Validates HTML, extracts links.
5. **Duplicate Eliminator**: Checks Bloom Filter and SimHash (Content Dedup).
6. **Storage**: Saves content to BigTable/HBase.
7. **Loop**: New links go back to the Frontier.

---

## 7. Component Design (Deep Dive)

### The URL Frontier (Priority vs Politeness)

Visualized in the **URL Frontier** zone of the diagram. This is the most complex part. It has two main buffers:

1. **Front Queue (Prioritizer)**:
   * Prioritizes URLs based on PageRank or change frequency.
   * High Priority Q, Medium Q, Low Q.
2. **Back Queue (Politeness Router)**:
   * **The Rule**: Do not hit `wikipedia.org` more than once per second.
   * We map `wikipedia.org` to `Queue #5`.
   * A "Queue Router" ensures that `Queue #5` is only consumed by one worker thread, enforcing the delay.
   * This implements a **Rate Limiter**. See [Module 10: Simple Services (Rate Limiter)](/system-design/10-simple-services/03-api-rate-limiter/).

### Robot Exclusion Standard (Ethics)

Before crawling `example.com`, we fetch `example.com/robots.txt`.

* **Rules**: `User-agent: *` followed by `Disallow: /private/`.
* **Legal**: While not a law, ignoring it gets your IPs banned and invites lawsuits.
* **Implementation**: We cache the `robots.txt` for every domain in a KV store (Redis) with a TTL.

### Bloom Filters (Mathematical Magic)

A Bloom Filter is a bit array of size m with k hash functions.

* **Add Item**: Hash it k times, set those bits to 1.
* **Check Item**: Hash it k times. If *all* bits are 1, it *might* exist. If *any* bit is 0, it *definitely* does not exist.
* **False Positive**: We might think we visited a page when we haven't. This is acceptable for a crawler (we just miss one update).
* **Space Savings**: Stores billions of URLs in RAM.

### DNS Resolver Caching (Why Custom?)

Visualized as **DNS Resolver** in the diagram. DNS lookups take 10 ms to 200 ms. If we crawl 100 pages from `nytimes.com`, we don't want 100 DNS requests.

* **The OS Problem**: Standard libc `getaddrinfo` is **blocking**. If you have 1000 threads and 500 are waiting on DNS, your CPU sits idle.
* **The Solution**: Build a custom **Async DNS Resolver** over UDP. We fire a UDP packet, register a callback, and move on. No thread blocks.
* **Caching**: We cache `hostname → IP` in memory.

### Content Deduplication (SimHash)

How do we know if two pages are 99% similar (near-duplicates)?

* **Exact Match (MD5)**: Useless. Changing one character ("the" → "teh") changes the entire hash.
* **SimHash Algorithm**:
  1. **Tokenize**: Break text into words/features.
  2. **Hash**: Hash each word into 64 bits.
  3. **Vector Sum**: Start with a vector `V = [0,0...0]`. For each bit in the word hash: if 1, add the word's weight to `V[i]`; if 0, subtract it.
  4. **Fingerprint**: If `V[i] > 0`, set final bit i to `1`; else `0`.
* **Hamming Distance**: If the SimHashes of Doc A and Doc B differ by only 2 or 3 bits, they are near-duplicates. We discard the new one.

---

## 8. Requirements Traceability

| Requirement | Design Decision | Justification |
| --- | --- | --- |
| Politeness | Politeness Queue / Router | Ensures we never exceed QPS limits for a target domain. |
| Throughput | Async DNS Resolver | Prevents thread blocking during DNS lookups, increasing concurrency. |
| Scalability | Bloom Filter | Tracks billions of visited URLs in RAM to prevent cycles. |
| Storage | BigTable / HBase | NoSQL wide-column store handles massive write throughput and PB of data. |
| Freshness | Priority Queue (PageRank) | Ensures important pages (high PageRank) are crawled more frequently. |
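The SimHash + Hamming-distance dedup described in the deep dive can be sketched as a toy version (word-level features with uniform weights; real systems weight features, e.g. by TF-IDF):

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Toy SimHash: tokenize, hash each word, vector-sum, threshold to bits."""
    v = [0] * bits
    for word in text.lower().split():
        h = int.from_bytes(hashlib.sha256(word.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1   # uniform weight of 1 per word
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"   # near-duplicate
doc_c = "completely different content about distributed storage engines"

near = hamming(simhash(doc_a), simhash(doc_b))
far = hamming(simhash(doc_a), simhash(doc_c))
print(near, far)   # near-duplicates differ in far fewer bits than unrelated docs
```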

---

## 9. Observability & Metrics

We need to ensure the crawler isn't stuck or banned.

### Key Metrics

* **Crawl Rate**: Pages/sec per worker.
* **DNS Latency**: Average time to resolve an IP. A spike indicates DNS throttling.
* **Error Rate**: 4xx/5xx responses. A high 403 rate means we are IP banned (rotate proxies).
* **Queue Depth**: If the Front Queue grows but the Back Queue is empty, the Router is slow.

### Trap Detection

> [!NOTE]
> **War Story: The Spider Trap Loop**
> In the early days, a major search engine crawler encountered dynamically generated calendar links (where "Next Month" produced a new unique URL infinitely). This created an infinite loop, wasting massive bandwidth and storage. They solved it by implementing strict path-depth limits (`max_depth = 15`) and detecting repeating path segments to dynamically prune the crawl graph.

* **Path Depth**: Alert if `url_depth > 20`.
* **Domain Concentration**: Alert if 50% of traffic is hitting one domain (spider trap).

---

## 10. Deployment Strategy

### Rolling Updates

We have thousands of worker nodes.

1. **Drain**: Stop assigning new tasks to Worker A.
2. **Wait**: Wait for active crawls to finish (graceful shutdown).
3. **Update**: Deploy the new binary.
4. **Resume**: Re-register with the Frontier.

### Resiliency

If a worker crashes:

* The Frontier detects the heartbeat failure.
* The URLs assigned to that worker are re-queued to another worker.

---

## 11. Interview Gauntlet

### Rapid Fire Questions

1. **How do you handle robots.txt?** We cache it for every domain. Before crawling, we check the rules. We respect `User-agent: *` / `Disallow: /admin`.
2. **What if the page renders with JavaScript (React/Angular)?** A plain HTTP GET returns near-empty HTML. We need a "Headless Browser" (Chrome/Puppeteer) service to render the DOM. This is roughly 10x slower and more expensive.
3. **How do you detect infinite loops?** Bloom Filters for visited URLs and a Max Depth limit.
4. **Why not use a standard Hash Set instead of a Bloom Filter?** 1 billion URLs × 100 bytes = 100 GB of RAM. A Bloom Filter uses ~1 GB.
5. **How do you query SimHash efficiently?** We cannot scan all rows. We split the 64-bit hash into 4 tables keyed on 16-bit bands and search for exact matches in these smaller tables (Pigeonhole Principle).

---

## 12. Interactive Decision Visualizer: The Bloom Filter

This simulator demonstrates how a **Bloom Filter** saves memory.

1. **Add URL**: Hashes the URL to 3 positions and flips those bits to 1 (green).
2. **Check URL**: Hashes the URL. If **ALL** corresponding bits are 1, it reports "Probably Present".
3. **Collision**: Try adding "google.com" then checking "yahoo.com". If their bits overlap, you might get a false positive (rare with large arrays).
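The 4-table SimHash lookup from the rapid-fire answers can be sketched as follows. By the pigeonhole principle, two 64-bit fingerprints that differ in at most 3 bits must agree exactly on at least one of their four 16-bit bands, so exact-match band lookups find every near-duplicate candidate:

```python
from collections import defaultdict

def bands(fp: int):
    """Split a 64-bit fingerprint into four (position, 16-bit value) bands."""
    return [(i, (fp >> (16 * i)) & 0xFFFF) for i in range(4)]

index = defaultdict(set)   # one exact-match entry per band

def add(fp: int) -> None:
    for key in bands(fp):
        index[key].add(fp)

def candidates(fp: int) -> set:
    """Every stored fingerprint sharing at least one band: a superset of all
    fingerprints within Hamming distance 3 of fp (pigeonhole principle)."""
    found = set()
    for key in bands(fp):
        found |= index[key]
    return found

stored = 0x1234_5678_9ABC_DEF0
add(stored)
query = stored ^ 0b101              # flip 2 bits, both inside the lowest band
print(stored in candidates(query))  # True: the other three bands still match
```

Candidates are then verified with a full Hamming-distance check; the bands only prune the search space.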
## 13. Summary (Whiteboard)

Scale & Capacity

  • 1 Billion Pages/Month
  • Politeness is the limiter (not CPU)
  • Async DNS is critical

Core Algorithms

  • Bloom Filter: Deduplication (RAM efficient)
  • SimHash: Content Similarity
  • Politeness Router: Queue Mapping

Storage

  • BigTable/HBase: Wide Column Store
  • Row Key: Reverse URL (`com.google.www`)
  • Col: Content, Metadata

Reliability

  • Spider Traps: Max Depth check
  • Robots.txt: Respect rules
  • Dynamic Content: Headless Browser