Web Crawler — Politeness, frontiers, dedup, headless rendering

Architecture

Capacity Estimation

Metric	Value	Notes
Pages to crawl	~50 B	open web indexable
Re-crawl cycle	2–30 days	by importance
Fetch rate	~50 K pages/s	steady-state
Avg page size	~80 KB	text + minimal assets
Bandwidth	~32 Gbps	50 K × 80 KB
Storage / pass	~4 PB	50 B × 80 KB uncompressed; less once gzipped into WARC
URL dedup memory	~120 GB	Bloom filter, 10¹¹ URLs, 1% FPR (~9.6 bits per URL)
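The derived rows can be rechecked from the inputs. The sketch below is only back-of-envelope arithmetic: it assumes decimal units (1 KB = 1000 bytes) and the standard optimal Bloom filter sizing formula; the constant and function names are illustrative, not part of any crawler.

```python
import math

# Inputs copied from the table above.
PAGES_PER_PASS = 50e9      # ~50 B pages
FETCH_RATE = 50_000        # pages per second
AVG_PAGE_BYTES = 80e3      # ~80 KB, decimal units (assumption)
DEDUP_URLS = 1e11          # URLs the Bloom filter must remember
TARGET_FPR = 0.01          # 1% false-positive rate


def bloom_bits_per_item(fpr: float) -> float:
    """Bits per element for an optimally configured Bloom filter."""
    return -math.log(fpr) / (math.log(2) ** 2)


bandwidth_gbps = FETCH_RATE * AVG_PAGE_BYTES * 8 / 1e9
storage_pb = PAGES_PER_PASS * AVG_PAGE_BYTES / 1e15   # before compression
pass_days = PAGES_PER_PASS / FETCH_RATE / 86_400      # one full pass at steady state
bloom_gb = DEDUP_URLS * bloom_bits_per_item(TARGET_FPR) / 8 / 1e9

print(f"bandwidth     ~{bandwidth_gbps:.0f} Gbps")
print(f"storage/pass  ~{storage_pb:.0f} PB uncompressed")
print(f"full pass     ~{pass_days:.1f} days")
print(f"bloom filter  ~{bloom_gb:.0f} GB")
```

With these inputs the sketch gives about 32 Gbps, 4 PB uncompressed, 11.6 days for a full pass, and roughly 120 GB for the filter.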

Politeness

The first rule: do not melt other people's sites. The crawler must:

Fetch and respect /robots.txt per host. Cache for 24 h. Honor Disallow, Crawl-delay, and Sitemap directives.
Limit concurrency per host — one request in flight per host IP is the safe default; when robots.txt sets a Crawl-delay, send at most one request per Crawl-delay seconds.
Set a unique User-Agent with a contact URL so site owners can reach you.
Honor 429/503 responses and their Retry-After header, back off exponentially per host on errors, and stop immediately on persistent 5xx (their site is dying).
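The robots.txt rule above can be sketched with Python's standard urllib.robotparser. The robots.txt body, the example.com URLs, and the ExampleCrawler user-agent token are made up for illustration; a real crawler would fetch the file over HTTP and cache the parsed rules per host.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, as it might be served by example.com.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
"""

USER_AGENT = "ExampleCrawler"  # hypothetical crawler token


def load_rules(robots_body: str) -> RobotFileParser:
    """Parse an already-fetched robots.txt body into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
allowed = rules.can_fetch(USER_AGENT, "https://example.com/index.html")
blocked = rules.can_fetch(USER_AGENT, "https://example.com/private/report")
delay = rules.crawl_delay(USER_AGENT)   # seconds between requests, or None
sitemaps = rules.site_maps()            # list of sitemap URLs, or None (Python 3.8+)

print(allowed, blocked, delay, sitemaps)
```

The parser only answers questions; enforcing the delay and the 24 h cache lifetime is still the scheduler's job.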

Politeness violations get you IP-banned, then ASN-banned, then a public shaming. Major search-engine crawlers operate from rotating but well-known IP ranges with reverse-DNS verification.

URL Frontier

The frontier holds all URLs awaiting fetch, sorted by priority and constrained by politeness. The classic Mercator design (Heydon & Najork) uses two-tier queues:

Priority queue (front-end) — F front queues, one per priority level (1..F). The producer assigns each URL to a queue using an importance heuristic (PageRank, freshness target, depth).
Politeness queue (back-end) — B back queues, each dedicated to a single host. The fetcher pulls a URL from a back queue and processes it; the host is unlocked again only after the polite delay.
Heap of next-fire times — pick the back queue whose host's polite delay has elapsed.

Sizing: front queues 1–10, back queues 100K–1M (one per active host). Persist the frontier to RocksDB / Postgres so a crash does not lose progress.
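A minimal in-memory sketch of the two-tier idea, assuming a single process and a fixed polite delay for every host. The Frontier class and its method names are hypothetical, and the refill step is deliberately simplified: it drains all front queues in priority order, whereas Mercator pulls from them with a priority bias. Persistence is omitted.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """Toy Mercator-style frontier: priority front queues, per-host back queues."""

    def __init__(self, num_priorities=3, polite_delay=1.0):
        self.front = [deque() for _ in range(num_priorities)]  # index 0 = highest priority
        self.back = {}          # host -> deque of URLs waiting for that host
        self.ready = []         # heap of (next_fire_time, host)
        self.next_allowed = {}  # host -> earliest time of the next fetch
        self.polite_delay = polite_delay

    def add(self, url, priority):
        self.front[priority].append(url)

    def _refill(self):
        # Move URLs into per-host back queues, highest priority first.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlsplit(url).hostname or ""
                if host not in self.back:
                    self.back[host] = deque()
                    fire_at = self.next_allowed.get(host, 0.0)
                    heapq.heappush(self.ready, (fire_at, host))
                self.back[host].append(url)

    def next_url(self, now):
        """Return a URL whose host may be fetched at time `now`, else None."""
        self._refill()
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.back[host].popleft()
        self.next_allowed[host] = now + self.polite_delay
        if self.back[host]:
            heapq.heappush(self.ready, (self.next_allowed[host], host))
        else:
            del self.back[host]  # host goes idle; its delay is still remembered
        return url


frontier = Frontier(polite_delay=2.0)
frontier.add("https://a.example/1", priority=1)
frontier.add("https://a.example/2", priority=0)
frontier.add("https://b.example/1", priority=2)
order = [frontier.next_url(now=t) for t in (0.0, 0.0, 1.0, 2.0)]
print(order)
```

In the usage lines, the second URL for a.example is withheld at t=1.0 because that host's delay has not elapsed, even though nothing else is waiting.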

Distributed Dedup

"Have I seen this URL?" must be O(1) and survive across worker restarts. Three layers:

Bloom filter (in-memory) — O(1), space-efficient, <1% false positive rate. A false positive reports an unseen URL as already seen, so the cost is an occasional URL that is skipped rather than fetched; there are no false negatives. Sized for the URL space you want to remember (10¹⁰ URLs × ~10 bits each = ~12 GB per filter). Sharded by host hash so each crawler has its slice.
Persistent KV (RocksDB / Cassandra) — durable record of (url_hash, last_crawled, etag) for re-crawl decisions.
Content dedup — even after URL dedup, two URLs can serve identical content (canonical, mobile, tracked). Hash the normalized response body (SimHash) and skip near-duplicates.
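The in-memory layer can be sketched as a small Bloom filter. This is an illustration under assumptions, not a production structure: it derives its k bit positions from one SHA-256 digest by double hashing, keeps the bit array in a bytearray, and the class and method names are invented here.

```python
import hashlib
import math


class BloomFilter:
    """Toy Bloom filter: false positives possible, false negatives not."""

    def __init__(self, capacity, fpr=0.01):
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = math.ceil(-capacity * math.log(fpr) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)
        self.count = 0  # number of add() calls (duplicates are counted too)

    def _positions(self, url):
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos >> 3] |= 1 << (pos & 7)
        self.count += 1

    def __contains__(self, url):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(url))

    def expected_fpr(self):
        """Theoretical false-positive rate at the current fill level."""
        fill = 1 - math.exp(-self.num_hashes * self.count / self.num_bits)
        return fill ** self.num_hashes


seen = BloomFilter(capacity=1000, fpr=0.01)
for i in range(1000):
    seen.add(f"https://example.com/page/{i}")

print("https://example.com/page/42" in seen)   # always True once added
print(f"expected FPR at capacity: {seen.expected_fpr():.4f}")
```

The expected_fpr method is also a cheap health check: once the filter holds more URLs than it was sized for, the estimate climbs quickly.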

Routing across the cluster: hash URL by host so all URLs of one host land on one node. This colocates the politeness queue with its dedup state and avoids cross-shard chatter.
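A sketch of that routing rule, assuming a fixed node count and simple modulo placement; node_for_url is a hypothetical helper. It uses a stable digest rather than Python's built-in hash(), which is salted per process and would route the same host differently on different workers.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url: str, num_nodes: int) -> int:
    """Map a URL to a node index using only its host."""
    host = urlsplit(url).hostname or ""   # hostname is already lowercased
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


node_a = node_for_url("https://example.com/a?x=1", num_nodes=16)
node_b = node_for_url("http://EXAMPLE.com/b/c", num_nodes=16)
print(node_a, node_b)   # same node: both URLs share a host
```

Modulo placement moves most hosts when the node count changes; consistent hashing is the usual way to limit that reshuffling.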

Headless Rendering

Modern web pages are JavaScript apps; raw HTML is often empty. Render via a headless browser farm (Chromium / Playwright / Puppeteer):

Page-load timeout 10–30 s, abort on hang.
Block third-party trackers, ads, fonts — you want content, not the marketing payload. Saves on the order of 70% of bandwidth.
Run multiple isolated profiles per host with rotating proxies; expect anti-bot defenses (Cloudflare, Akamai bot manager) to challenge.
Cost: a headless render is ~100× the CPU and memory of an HTTP fetch. Use only when raw HTML is empty or marked SPA.

Triage: HTTP fetch first; if the response body looks JavaScript-driven (low text-to-markup ratio, few or no extracted links), promote it to the render queue. Most pages do not need rendering; fetching the 95%+ that work raw, at roughly 1/100 the cost of a render, is the difference between a $1M and a $100M crawler.
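A sketch of the triage check using only the standard library HTML parser. The thresholds (200 words, 3 links) are the ones given in the FAQ; combining them with "or" is an assumption made here, and needs_render is a hypothetical name.

```python
from html.parser import HTMLParser


class _Extractor(HTMLParser):
    """Counts visible words and <a href> links, ignoring script/style bodies."""

    def __init__(self):
        super().__init__()
        self.words = 0
        self.links = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a" and any(name == "href" and value for name, value in attrs):
            self.links += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words += len(data.split())


def needs_render(html: str, min_words: int = 200, min_links: int = 3) -> bool:
    """True when the raw HTML looks too thin to index without rendering."""
    parser = _Extractor()
    parser.feed(html)
    parser.close()
    return parser.words < min_words or parser.links < min_links


spa_shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
print(needs_render(spa_shell))   # an empty app shell is promoted to the render queue
```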

The Common Crawl Approach

Common Crawl publishes a new snapshot of the public web roughly every month or two: a few billion pages and hundreds of terabytes per crawl, petabytes cumulatively, in WARC format (Web ARChive: a concatenation of HTTP request/response records). Their pipeline is instructive:

Apache Nutch generates the frontier and dispatches fetches.
Output is grouped into ~1 GB WARC files stored in S3 (hosting sponsored through the AWS Open Data program).
An index (Parquet on S3 + Athena) maps URL prefix and timestamp to byte ranges in the WARC, so consumers can pull just the records they need.
Re-crawl strategy: tier sites by traffic and expected freshness; re-visit news daily, long-tail every few months.

For your own crawler: you almost certainly want to start with Common Crawl rather than re-fetch the open web. Then crawl deltas and the niche your business cares about.

URL Canonicalization

Before dedup, normalize: lowercase scheme and host, drop fragment (#anchor), sort query params, drop tracking params (utm_*, fbclid), strip default ports, resolve relative URLs against the base. Without this, HTTPS://Example.com:443/path?b=2&a=1#x and https://example.com/path?a=1&b=2 are treated as different URLs and you crawl both.
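The normalization steps can be sketched with urllib.parse. The canonicalize function and the tracking-parameter lists are illustrative assumptions; real crawlers keep a longer list. This sketch also drops any user:password part of the URL and re-encodes query values, which is a choice made here rather than a requirement.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
TRACKING_PREFIXES = ("utm_",)      # illustrative, not exhaustive
TRACKING_PARAMS = {"fbclid"}


def canonicalize(url: str, base: str = "") -> str:
    """Return a normalized form of `url`, resolved against `base` if given."""
    if base:
        url = urljoin(base, url)
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""          # lowercased by urlsplit
    if ":" in host:                      # IPv6 literal needs its brackets back
        host = f"[{host}]"
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS and not key.startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(sorted(params))
    # Empty fragment component drops the #anchor.
    return urlunsplit((scheme, netloc, parts.path or "/", query, ""))


first = canonicalize("HTTPS://Example.com:443/path?b=2&a=1#x")
second = canonicalize("https://example.com/path?a=1&b=2")
print(first)
print(first == second)
```

Sorting query parameters is safe for most sites but can change meaning where a server depends on parameter order, so treat it as a per-site policy if that matters.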

Honor <link rel="canonical"> tags — the page tells you its preferred URL. Treat duplicates as references to the canonical, not as separate pages.

Failure Modes

Spider trap — a site generates infinite URLs (calendar with /day/{n}). Defenses: per-host URL count cap, depth limit, regex pattern detection, tarpit blacklist.
Politeness violation cascade — a misconfiguration sends 1000 RPS to one host. Defense: a hard per-host rate limit upstream of the fetcher, with monitoring and alerts when it trips.
Bloom filter saturation — FPR climbs as you cram more URLs in than sized for. Periodically swap to a fresh, larger filter (or use a scalable Bloom filter, which grows by chaining additional filters; a counting Bloom filter adds deletion, not resizing).
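WARC corruption — partial writes on crash. Write each file atomically (temp file + rename). If a file is still truncated, recover by keeping everything up to the last complete record; this works because each record carries its own length and, in gzipped WARC, is usually its own gzip member.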
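The temp file + rename pattern from the last item can be sketched as below. write_atomically is a hypothetical helper and the file name is made up; the sketch assumes the temporary file lives in the same directory as the target, since a rename is only atomic within one filesystem.

```python
import os
import tempfile


def write_atomically(path: str, data: bytes) -> None:
    """Write `data` so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())      # push bytes to disk before the rename
        os.replace(tmp_path, path)      # atomic swap on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)         # never leave a half-written temp file
        raise


out_dir = tempfile.mkdtemp()
target = os.path.join(out_dir, "segment-00001.warc.gz")   # hypothetical name
write_atomically(target, b"first version")
write_atomically(target, b"second version")
print(os.listdir(out_dir))
```

A crash before os.replace leaves only a stray .tmp file, which a startup sweep can delete; the published file is never partially written.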

FAQ

Should I render every page?

No — 95%+ of pages have content in the HTML. Render only when the HTML is empty or links are missing (heuristic: fewer than 200 words extracted, fewer than 3 outgoing links).

How do you prioritize re-crawls?

Use a PageRank-weighted recency model: high-rank pages are re-crawled daily, long-tail pages every 30 days. Boost priority on detected change (RSS, sitemap lastmod, content hash diff).
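One way to turn that into a number, as a sketch only: interpolate geometrically between the daily and 30-day endpoints by a normalized rank, and shorten the interval when a change signal fires. The function name, the [0, 1] rank scale, and the halving on change are all assumptions for illustration, not a described production policy.

```python
def recrawl_interval_days(rank: float, changed_recently: bool = False,
                          min_days: float = 1.0, max_days: float = 30.0) -> float:
    """Days until the next crawl; rank is in [0, 1], 1 = most important."""
    rank = min(1.0, max(0.0, rank))
    # Geometric interpolation: rank 1 -> min_days, rank 0 -> max_days.
    interval = max_days * (min_days / max_days) ** rank
    if changed_recently:
        # Assumed boost: halve the wait, but never go below the daily floor.
        interval = max(min_days, interval / 2)
    return interval


for rank in (1.0, 0.5, 0.0):
    print(rank, round(recrawl_interval_days(rank), 1))
```

A mid-rank page lands at about 5.5 days under this curve, so most of the crawl budget still goes to the top of the ranking.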

What about HTTPS and certificate issues?

Validate certs in production; record cert errors as a soft signal (often a real error, sometimes site neglect). Do not bypass; a "skip cert check" mode is how you get phished.

Storage: WARC vs custom format?

WARC is the standard (ISO 28500), parseable by most archive and search tooling and supported by Common Crawl tooling. Use it unless you have a strong reason; custom formats become a tax on your future self.