Web Crawler
A web-scale crawler ingests billions of pages from the open web, respects every site's robots policy, deduplicates URLs and content, renders JavaScript when required, and persists results to a downstream index. The dominant constraints are politeness (do not hammer one host), frontier management (which URL to fetch next), dedup at scale (Bloom filters across distributed shards), and headless rendering cost (a browser render costs roughly 100× a plain HTTP fetch).
Architecture
Capacity Estimation
| Metric | Value | Notes |
|---|---|---|
| Pages to crawl | ~50 B | open web indexable |
| Re-crawl cycle | 2–30 days | by importance |
| Fetch rate | ~50 K pages/s | steady-state |
| Avg page size | ~80 KB | text + minimal assets |
| Bandwidth | ~32 Gbps | 50 K × 80 KB |
| Storage / pass | ~4 PB | 50 B × 80 KB, before WARC compression |
| URL dedup memory | ~120 GB | Bloom filter, 10^11 URLs, 1% FPR (~9.6 bits per URL) |
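The bandwidth and storage rows follow directly from the fetch rate, page count, and page size. A short script makes the arithmetic checkable; the inputs are the table's own figures, decimal units (1 KB = 1000 bytes) are an assumption, and the storage number here is raw bytes, so compression would reduce it.

```python
# Back-of-envelope check of the capacity table; inputs are the table's figures.
PAGES_TOTAL = 50e9        # pages to crawl
FETCH_RATE = 50_000       # pages per second, steady state
AVG_PAGE_BYTES = 80e3     # 80 KB, assuming decimal units

bandwidth_gbps = FETCH_RATE * AVG_PAGE_BYTES * 8 / 1e9
storage_pb = PAGES_TOTAL * AVG_PAGE_BYTES / 1e15      # raw bytes, uncompressed
full_pass_days = PAGES_TOTAL / FETCH_RATE / 86_400    # time to touch every page once

print(f"bandwidth: {bandwidth_gbps:.0f} Gbps")
print(f"storage per pass: {storage_pb:.0f} PB (raw)")
print(f"one full pass: {full_pass_days:.1f} days")
```

At this fetch rate one full pass takes about 11.6 days, which sits inside the 2–30 day re-crawl window.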
Politeness
The first rule: do not melt other people's sites. The crawler must:
- Fetch and respect /robots.txt per host. Cache it for 24 h. Honor Disallow, Crawl-delay, and Sitemap directives.
- Limit concurrency per host: one request in flight per host IP is the safe default; where Crawl-delay is set, send at most one request per Crawl-delay seconds.
- Set a unique User-Agent with a contact URL so site owners can reach you.
- Honor 429/503 with Retry-After, back off exponentially per host on errors, and stop immediately on persistent 5xx (their site is dying).
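The robots.txt rule above can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration, not a full client: the user-agent name `ExampleCrawler`, the 1-second default delay, and the sample file are assumptions, and fetching plus 24 h caching of the file are left out.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler"   # hypothetical; a real UA string should carry a contact URL

def load_rules(robots_txt):
    """Parse an already-fetched robots.txt body (fetching and caching omitted)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def fetch_decision(parser, url, default_delay=1.0):
    """Return (allowed, seconds to wait between requests to this host)."""
    allowed = parser.can_fetch(USER_AGENT, url)
    delay = parser.crawl_delay(USER_AGENT)          # None when no Crawl-delay applies
    return allowed, float(delay) if delay is not None else default_delay

SAMPLE = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
"""

rules = load_rules(SAMPLE)
print(fetch_decision(rules, "https://example.com/index.html"))
print(fetch_decision(rules, "https://example.com/private/x"))
print(rules.site_maps())    # sitemap URLs listed in the file (Python 3.8+)
```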
Politeness violations get you IP-banned, then ASN-banned, then a public shaming. Major search-engine crawlers operate from rotating but well-known IP ranges with reverse-DNS verification.
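The Retry-After and backoff rule from the list above can be sketched as a small per-host delay function. The base delay, the one-hour cap, the full-jitter strategy, and the threshold of five consecutive 5xx responses are all illustrative assumptions; only the delta-seconds form of Retry-After is handled here, not the HTTP-date form.

```python
import random

def next_delay(attempt, retry_after=None, base=1.0, cap=3600.0, jitter=random.random):
    """Seconds to wait before retrying a host after `attempt` consecutive failures.

    A server-supplied Retry-After (in seconds) wins; otherwise use capped
    exponential backoff with full jitter.
    """
    if retry_after is not None:
        return max(float(retry_after), 0.0)
    ceiling = min(cap, base * (2 ** attempt))
    return ceiling * jitter()

MAX_CONSECUTIVE_5XX = 5   # assumed threshold for "persistent" failure

def should_stop(consecutive_5xx):
    """True once a host has failed often enough that the crawler should leave it alone."""
    return consecutive_5xx >= MAX_CONSECUTIVE_5XX
```

Jitter spreads retries out so that many workers backing off from the same host do not return in lockstep.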
URL Frontier
The frontier holds all URLs awaiting fetch, sorted by priority and constrained by politeness. The classic Mercator design (Heydon & Najork) uses two-tier queues:
- Priority queue (front-end): F front queues, one per priority level (1..F). The producer assigns each URL to a queue using an importance heuristic (PageRank, freshness target, depth).
- Politeness queue (back-end): B back queues, each dedicated to a single host. The fetcher pulls a URL from a back queue and processes it; the host is unlocked again only after the polite delay.
- Heap of next-fire times: pick the back queue whose host's polite delay has elapsed.
Sizing: front queues 1–10, back queues 100K–1M (one per active host). Persist the frontier to RocksDB / Postgres so a crash does not lose progress.
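A toy, in-memory version of the two-tier design may make the moving parts concrete. It is a sketch under simplifying assumptions, not Mercator itself: the class name `Frontier` is hypothetical, front queues are drained wholesale in priority order instead of refilling back queues on demand, the delay is one fixed value for all hosts, and nothing is persisted.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit

class Frontier:
    """Front queues by priority, one FIFO back queue per host, heap of next-fire times."""

    def __init__(self, num_priorities=3, delay=1.0):
        self.front = [deque() for _ in range(num_priorities)]  # index 0 = highest priority
        self.back = {}       # host -> deque of URLs waiting for that host
        self.ready = []      # min-heap of (earliest allowed fetch time, host)
        self.next_ok = {}    # host -> earliest time the next fetch is polite
        self.delay = delay

    def add(self, url, priority):
        self.front[priority].append(url)

    def _refill(self):
        # Move URLs from front queues (highest priority first) into per-host back queues.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlsplit(url).hostname or ""
                if host not in self.back:
                    self.back[host] = deque()
                    heapq.heappush(self.ready, (self.next_ok.get(host, 0.0), host))
                self.back[host].append(url)

    def next_url(self, now):
        """Return a URL whose host may be fetched at time `now`, or None."""
        self._refill()
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.back[host].popleft()
        self.next_ok[host] = now + self.delay
        if self.back[host]:
            heapq.heappush(self.ready, (self.next_ok[host], host))
        else:
            del self.back[host]   # host re-enters the heap when new URLs arrive
        return url
```

Keeping `next_ok` separate from the back queues means a host that drains and later reappears still waits out its delay.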
Distributed Dedup
"Have I seen this URL?" must be O(1) and survive across worker restarts. Three layers:
- Bloom filter (in-memory): O(1) and space-efficient, with a false-positive rate under 1%. A false positive reports an unseen URL as already seen, so the cost is an occasional skipped URL, not a re-fetch. Size it for the URL space you want to remember (10^10 URLs × ~10 bits each = ~12 GB per filter). Shard by host hash so each crawler holds its own slice.
- Persistent KV (RocksDB / Cassandra): durable record of (url_hash, last_crawled, etag) for re-crawl decisions.
- Content dedup: even after URL dedup, two URLs can serve identical content (canonical, mobile, tracked). Fingerprint the normalized response body (SimHash) and skip near-duplicates.
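The Bloom filter sizing can be checked with the standard formulas m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below applies them and shows a minimal filter; the class is illustrative, and the choice of BLAKE2b with double hashing is an assumption, not something the design above prescribes.

```python
import hashlib
import math

def bloom_parameters(n_items, fpr):
    """Bit count m and hash count k for n items at the target false-positive rate."""
    m = math.ceil(-n_items * math.log(fpr) / (math.log(2) ** 2))
    k = max(1, round(m / n_items * math.log(2)))
    return m, k

class BloomFilter:
    def __init__(self, n_items, fpr):
        self.m, self.k = bloom_parameters(n_items, fpr)
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit indexes from two 64-bit halves of one digest.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item))

m, k = bloom_parameters(10**10, 0.01)
print(f"{m / 8 / 1e9:.1f} GB, {k} hash functions")
```

For 10^10 URLs at 1% this comes to roughly 12 GB and 7 hash functions, about 9.6 bits per URL.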
Routing across the cluster: hash URL by host so all URLs of one host land on one node. This colocates the politeness queue with its dedup state and avoids cross-shard chatter.
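A minimal sketch of host-based routing follows. The function name `shard_for` and the use of SHA-1 are assumptions; the one firm requirement is a hash that is stable across processes, which rules out Python's built-in `hash()` because it is randomized per run.

```python
import hashlib
from urllib.parse import urlsplit

def shard_for(url, num_shards):
    """Map a URL to a shard using only its host, so one host always lands on one node."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo reassigns most hosts whenever `num_shards` changes; consistent hashing is one way to limit that movement if the cluster resizes often.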
Headless Rendering
Many modern pages are JavaScript apps whose raw HTML is nearly empty. Render these via a headless browser farm (Chromium / Playwright / Puppeteer):
- Page-load timeout 10–30 s, abort on hang.
- Block third-party trackers, ads, and fonts: you want content, not the marketing payload. This saves roughly 70% of render bandwidth.
- Run multiple isolated profiles per host with rotating proxies; expect anti-bot defenses (Cloudflare, Akamai bot manager) to challenge.
- Cost: a headless render is ~100× the CPU and memory of an HTTP fetch. Use only when raw HTML is empty or marked SPA.
Triage: fetch over HTTP first; if the response looks JavaScript-dependent (low text-to-markup ratio, few or no extractable links), promote the URL to the render queue. Most pages do not need rendering, and fetching that majority raw is the difference between a $1M and a $100M crawler.
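The triage step can be sketched with the standard-library HTML parser. The thresholds (200 words, 3 links) are the heuristic quoted in the FAQ; combining them with "or" and ignoring text inside script and style tags are assumptions of this sketch, and the names `needs_render` and `_Extractor` are hypothetical.

```python
from html.parser import HTMLParser

class _Extractor(HTMLParser):
    """Counts visible words and <a href> links, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.words = 0
        self.links = 0
        self._skip = 0   # nesting depth inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a" and any(name == "href" and value for name, value in attrs):
            self.links += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words += len(data.split())

def needs_render(html, min_words=200, min_links=3):
    """True when the raw HTML looks too thin to index without running JavaScript."""
    parser = _Extractor()
    parser.feed(html)
    parser.close()
    return parser.words < min_words or parser.links < min_links
```

A short but legitimate static page would also trip this check, so a production triage would likely add signals such as framework markers before paying for a render.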
The Common Crawl Approach
Common Crawl publishes a roughly monthly snapshot of the public web, a few billion pages and hundreds of terabytes per crawl, in WARC format (Web ARChive: a concatenation of HTTP request/response records). Their pipeline is instructive:
- Apache Nutch generates the frontier and dispatches fetches.
- Output is grouped into ~1 GB WARC files and stored in S3, with hosting sponsored through the AWS Open Data program.
- An index (Parquet on S3 + Athena) maps URL prefix and timestamp to byte ranges in the WARC, so consumers can pull just the records they need.
- Re-crawl strategy: tier sites by traffic and expected freshness; re-visit news daily, long-tail every few months.
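The byte-range lookup described in the list above works because each record in these WARC files is compressed as its own gzip member, so a slice can be decompressed in isolation. The sketch below demonstrates that on a synthetic two-record archive built in memory; `offset` and `length` stand in for the values an index entry would supply, and no network request is made.

```python
import gzip

def range_header(offset, length):
    """HTTP Range header selecting exactly one record (the end byte is inclusive)."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}

def extract_record(warc_bytes, offset, length):
    """Decompress a single record, assuming it is an independent gzip member."""
    return gzip.decompress(warc_bytes[offset:offset + length])

# Synthetic stand-in for a WARC file: two independently gzipped records.
first = gzip.compress(b"WARC/1.0\r\nWARC-Type: request\r\n\r\n")
second = gzip.compress(b"WARC/1.0\r\nWARC-Type: response\r\n\r\n<html>hi</html>")
archive = first + second

print(range_header(len(first), len(second)))
print(extract_record(archive, len(first), len(second)))
```

Against a real archive the same header would go on a GET for the WARC file's URL, and the response body would be passed to `gzip.decompress`.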
For your own crawler: you almost certainly want to start with Common Crawl rather than re-fetch the open web. Then crawl deltas and the niche your business cares about.
URL Canonicalization
Before dedup, normalize: lowercase scheme and host, drop fragment (#anchor), sort query params, drop tracking params (utm_*, fbclid), strip default ports, resolve relative URLs against the base. Without this, HTTPS://Example.com:443/path?b=2&a=1#x and https://example.com/path?a=1&b=2 are treated as different URLs and you crawl both.
Honor <link rel="canonical"> tags — the page tells you its preferred URL. Treat duplicates as references to the canonical, not as separate pages.
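The normalization steps in this section can be sketched with `urllib.parse`. This is a simplified illustration: userinfo and IPv6 literals are not handled, query values are re-encoded by `urlencode`, and the tracking-parameter list contains only the two examples named above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, urljoin

DEFAULT_PORTS = {"http": 80, "https": 443}
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid"}

def canonicalize(url, base=None):
    """Normalize a URL for dedup: case, default port, fragment, query order, trackers."""
    if base:
        url = urljoin(base, url)                 # resolve relative URLs against the base
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the scheme's default.
    netloc = host if port is None or port == DEFAULT_PORTS.get(scheme) else f"{host}:{port}"
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS and not key.startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(sorted(params))
    return urlunsplit((scheme, netloc, parts.path, query, ""))   # empty fragment drops #anchor

print(canonicalize("HTTPS://Example.com:443/path?b=2&a=1#x"))
```

Both example URLs from the paragraph above reduce to the same string, so the dedup layer sees them as one.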
Failure Modes
- Spider trap: a site generates infinite URLs (for example a calendar with /day/{n}). Defenses: per-host URL count cap, depth limit, regex pattern detection, tarpit blacklist.
- Politeness violation cascade: a misconfiguration sends 1000 RPS to one host. Enforce a hard, monitored rate limit upstream of the fetcher.
- Bloom filter saturation: FPR climbs once you insert more URLs than the filter was sized for. Periodically swap to a fresh, larger filter rebuilt from the persistent KV store, or use a scalable Bloom filter that grows by adding stages.
- WARC corruption: partial writes on crash. Write atomically (temp file + rename), and on recovery truncate to the last complete record using the recorded offsets.
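The spider-trap defenses listed above can be combined into one admission check in front of the frontier. The sketch below covers the per-host cap, the depth limit, and a crude repeated-path-segment test standing in for pattern detection; the class name `TrapGuard` and all three default limits are illustrative assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit

class TrapGuard:
    """Rejects URLs that look like spider-trap output before they reach the frontier."""

    def __init__(self, max_urls_per_host=100_000, max_depth=16, max_repeated_segment=3):
        self.max_urls_per_host = max_urls_per_host
        self.max_depth = max_depth
        self.max_repeated_segment = max_repeated_segment
        self.per_host = Counter()   # host -> number of URLs admitted so far

    def admit(self, url, depth):
        parts = urlsplit(url)
        host = parts.hostname or ""
        if depth > self.max_depth:
            return False
        if self.per_host[host] >= self.max_urls_per_host:
            return False
        # A path such as /a/b/a/b/a/ repeating one segment often suggests a link loop.
        segments = [segment for segment in parts.path.split("/") if segment]
        if segments and max(Counter(segments).values()) >= self.max_repeated_segment:
            return False
        self.per_host[host] += 1
        return True
```

An endless /day/{n} calendar never repeats a segment, so it is the per-host cap that eventually stops it; the segment test targets looping relative links.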
FAQ
Should I render every page?
No — 95%+ of pages have content in the HTML. Render only when the HTML is empty or links are missing (heuristic: less than 200 words extracted, fewer than 3 outgoing links).
How do you prioritize re-crawls?
Use a PageRank-weighted recency model: high-rank pages re-crawl daily, long-tail pages every 30 days. Boost priority on detected change (RSS, sitemap lastmod, content hash diff).
What about HTTPS and certificate issues?
Validate certs in production; record cert errors as a soft signal (often a real error, sometimes site neglect). Do not bypass; a "skip cert check" mode is how you get phished.
Storage: WARC vs custom format?
WARC is the standard archive format, readable by most web-archive and search tooling and supported by Common Crawl tooling. Use it unless you have a strong reason not to; custom formats become a tax on your future self.