Architecture

  • Seed list
  • URL Frontier: priority + politeness queue per host
  • Fetcher pool: async HTTP + headless browser
  • Robots cache: robots.txt, crawl-delay
  • URL dedup: Bloom filter / RocksDB
  • Content store: WARC files in S3
  • Parser: extract links, canonicalize
  • Index pipeline (downstream)

Capacity Estimation

Metric              Value            Notes
Pages to crawl      ~50 B            open web, indexable
Re-crawl cycle      2–30 days        by importance
Fetch rate          ~50 K pages/s    steady-state
Avg page size       ~80 KB           text + minimal assets
Bandwidth           ~32 Gbps         50 K pages/s × 80 KB × 8 bits
Storage / pass      ~4 PB            WARC; 50 B × 80 KB, before compression
URL dedup memory    ~120 GB          Bloom filter, 10^11 URLs, 1% FPR (~9.6 bits per URL)
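
The estimates above can be checked with a few lines of arithmetic. This is a minimal sketch using the table's inputs (50 B pages, 50 K pages/s, 80 KB per page, decimal units) and the standard optimal Bloom filter sizing formula; the variable and function names are illustrative only.

```python
import math

PAGES = 50e9          # pages per crawl pass
FETCH_RATE = 50_000   # pages fetched per second
PAGE_BYTES = 80e3     # average page size, 80 KB (decimal)

# Sustained bandwidth in gigabits per second.
bandwidth_gbps = FETCH_RATE * PAGE_BYTES * 8 / 1e9

# Days needed to fetch every page once at the steady-state rate.
pass_days = PAGES / FETCH_RATE / 86_400

# Raw (uncompressed) bytes per pass, in petabytes.
raw_storage_pb = PAGES * PAGE_BYTES / 1e15


def bloom_bits_per_item(fpr: float) -> float:
    """Optimal Bloom filter size per item: -ln(p) / (ln 2)^2 bits."""
    return -math.log(fpr) / (math.log(2) ** 2)


def bloom_gb(n_items: float, fpr: float) -> float:
    """Total Bloom filter size in gigabytes for n_items at the given FPR."""
    return n_items * bloom_bits_per_item(fpr) / 8 / 1e9


print(f"bandwidth: {bandwidth_gbps:.0f} Gbps")
print(f"one pass:  {pass_days:.1f} days")
print(f"storage:   {raw_storage_pb:.0f} PB raw")
print(f"bloom:     {bloom_gb(1e11, 0.01):.0f} GB for 1e11 URLs at 1% FPR")
```

At 1% FPR the formula gives about 9.6 bits per URL, so remembering 10^11 URLs takes roughly 120 GB. A full pass at 50 K pages/s takes a little under 12 days, which sits inside the 2–30 day re-crawl window.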

Politeness

The first rule: do not melt other people's sites. The crawler must:

  • Fetch and respect /robots.txt per host. Cache for 24 h. Honor Disallow, Crawl-delay, and Sitemap directives.
  • Limit concurrency per host: one request in flight per host IP is the safe default. When robots.txt sets Crawl-delay, send at most one request per Crawl-delay seconds.
  • Set a unique User-Agent with a contact URL so site owners can reach you.
  • Honor 429/503 with Retry-After, exponential backoff per host on errors, and immediate stop on persistent 5xx (their site is dying).
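
As a rough illustration of the rules above, the sketch below keeps one policy object per host. It uses Python's standard urllib.robotparser to evaluate Disallow and Crawl-delay, and a simple "next allowed time" to space out requests and honor Retry-After. The robots.txt body, the User-Agent string, and the HostPolicy class are hypothetical; fetching and the 24 h cache are left out.

```python
import urllib.robotparser

# Hypothetical robots.txt body, as it might be fetched from a host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical crawler identity with a contact URL.
USER_AGENT = "ExampleCrawler/1.0 (+https://example.com/bot)"


class HostPolicy:
    """Per-host politeness state: robots rules plus request spacing."""

    def __init__(self, robots_txt: str, default_delay: float = 1.0):
        self.parser = urllib.robotparser.RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        delay = self.parser.crawl_delay(USER_AGENT)
        # Fall back to a default spacing when no Crawl-delay is given.
        self.delay = float(delay) if delay is not None else default_delay
        self.next_allowed = 0.0

    def allowed(self, url: str) -> bool:
        """True if robots.txt permits fetching this URL."""
        return self.parser.can_fetch(USER_AGENT, url)

    def reserve(self, now: float) -> float:
        """Book the next request slot and return its start time."""
        start = max(now, self.next_allowed)
        self.next_allowed = start + self.delay
        return start

    def defer(self, now: float, retry_after: float) -> None:
        """Push the next slot out after a 429/503 with Retry-After."""
        self.next_allowed = max(self.next_allowed, now + retry_after)


policy = HostPolicy(ROBOTS_TXT)
print(policy.allowed("https://example.com/private/report"))  # False
print(policy.reserve(now=100.0))  # 100.0
print(policy.reserve(now=101.0))  # 105.0
```

Timestamps are passed in explicitly so the logic is easy to test; a real fetcher would pass a monotonic clock reading and sleep until the reserved slot.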

Politeness violations get you IP-banned, then ASN-banned, then a public shaming. Major search-engine crawlers operate from rotating but well-known IP ranges with reverse-DNS verification.

URL Frontier

The frontier holds all URLs awaiting fetch, sorted by priority and constrained by politeness. The classic Mercator design (Heydon & Najork) uses two-tier queues:

  • Priority queue (front-end): F front queues, one per priority level (1..F). A prioritizer assigns each URL to a queue using an importance heuristic (PageRank, freshness target, depth).
  • Politeness queue (back-end): B back queues, each dedicated to a single host. The fetcher pulls a URL from a back queue and fetches it; the host is unlocked again only after the polite delay has elapsed.
  • Heap of next-fire times — pick the back queue whose host's polite delay has elapsed.

Sizing: front queues 1–10, back queues 100K–1M (one per active host). Persist the frontier to RocksDB / Postgres so a crash does not lose progress.
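
The two-tier structure can be sketched in memory as follows. This is a simplified illustration, not the Mercator implementation: it drains all front queues into back queues eagerly (Mercator refills a back queue only when it empties), uses one fixed delay for every host, and measures the delay from dispatch rather than from fetch completion. The Frontier class and its method names are assumptions made for the example.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """Front queues by priority, back queues by host, heap of fire times."""

    def __init__(self, num_priorities: int = 3, delay: float = 1.0):
        self.front = [deque() for _ in range(num_priorities)]  # 0 = highest
        self.back = {}       # host -> deque of URLs awaiting fetch
        self.heap = []       # (next_fire_time, host)
        self.ready_at = {}   # host -> earliest time of its next request
        self.delay = delay

    def add(self, url: str, priority: int) -> None:
        self.front[priority].append(url)

    def _refill(self, now: float) -> None:
        # Move URLs into per-host back queues, highest priority first.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlsplit(url).hostname
                if host not in self.back:
                    self.back[host] = deque()
                    fire = max(now, self.ready_at.get(host, 0.0))
                    heapq.heappush(self.heap, (fire, host))
                self.back[host].append(url)

    def next_url(self, now: float):
        """Return a URL whose host may be contacted now, else None."""
        self._refill(now)
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        queue = self.back[host]
        url = queue.popleft()
        self.ready_at[host] = now + self.delay
        if queue:
            heapq.heappush(self.heap, (self.ready_at[host], host))
        else:
            del self.back[host]
        return url


frontier = Frontier(num_priorities=2, delay=2.0)
frontier.add("https://a.example/1", priority=1)
frontier.add("https://a.example/2", priority=0)
frontier.add("https://b.example/1", priority=1)
print(frontier.next_url(now=0.0))  # https://a.example/2
print(frontier.next_url(now=0.0))  # https://b.example/1
print(frontier.next_url(now=1.0))  # None, both hosts are still cooling down
```

The heap makes "which host is ready next" an O(log B) operation, which matters once B reaches hundreds of thousands of hosts.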

Distributed Dedup

"Have I seen this URL?" must be O(1) and survive across worker restarts. Three layers:

  • Bloom filter (in-memory): O(1), space-efficient, <1% false positive rate. A false positive means a URL that was never fetched is reported as seen, so the cost is an occasionally skipped page, not a wasted re-fetch. Size it for the URL space you want to remember (10^10 URLs × ~10 bits each = ~12 GB per filter). Shard by host hash so each crawler holds its own slice.
  • Persistent KV (RocksDB / Cassandra) — durable record of (url_hash, last_crawled, etag) for re-crawl decisions.
  • Content dedup — even after URL dedup, two URLs can serve identical content (canonical, mobile, tracked). Hash the normalized response body (SimHash) and skip near-duplicates.
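
For the first layer, a Bloom filter fits in a few dozen lines. The sketch below sizes the bit array from a target capacity and false positive rate, and derives its k bit positions from one SHA-256 digest via double hashing. The BloomFilter class is illustrative; a production crawler would more likely use a tuned native implementation.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter for 'have I seen this URL?' checks."""

    def __init__(self, capacity: int, fpr: float = 0.01):
        # Optimal bit count m = -n ln(p) / (ln 2)^2, hash count k = (m/n) ln 2.
        self.num_bits = max(8, math.ceil(-capacity * math.log(fpr) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: position_i = h1 + i * h2 (mod m).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


seen = BloomFilter(capacity=1000, fpr=0.01)
seen.add("https://example.com/a")
print("https://example.com/a" in seen)  # True: added items are always found
```

A Bloom filter never reports an added URL as missing; it can only err by reporting an unseen URL as present, which is why the persistent KV layer remains the source of truth for re-crawl decisions.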

Routing across the cluster: hash URL by host so all URLs of one host land on one node. This colocates the politeness queue with its dedup state and avoids cross-shard chatter.
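
A minimal sketch of host-based routing, assuming a fixed number of shards and a hypothetical shard_for helper. A stable hash (here SHA-1 from hashlib) is used because Python's built-in hash() for strings varies between processes.

```python
import hashlib
from urllib.parse import urlsplit


def shard_for(url: str, num_shards: int) -> int:
    """Map a URL to a shard using only its host, so a host never splits."""
    host = urlsplit(url).hostname or ""  # urlsplit lowercases the hostname
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


a = shard_for("https://example.com/page/1", 16)
b = shard_for("https://EXAMPLE.com/other?x=1", 16)
print(a == b)  # True: same host, same shard
```

Plain modulo hashing reshuffles most hosts when the shard count changes; consistent hashing is a common way to limit that movement if the cluster is resized often.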

Headless Rendering

Modern web pages are JavaScript apps; raw HTML is often empty. Render via a headless browser farm (Chromium / Playwright / Puppeteer):

  • Page-load timeout 10–30 s, abort on hang.
  • Block third-party trackers, ads, fonts: you want content, not the marketing payload. This can cut render bandwidth substantially (on the order of 70%).
  • Run multiple isolated profiles per host with rotating proxies; expect anti-bot defenses (Cloudflare, Akamai bot manager) to challenge.
  • Cost: a headless render is ~100× the CPU and memory of an HTTP fetch. Use only when raw HTML is empty or marked SPA.

Triage: HTTP fetch first. If the response looks like a JavaScript shell (low text-to-markup ratio, few or no extracted links), promote it to the render queue. Most pages do not need rendering; fetching the 95%+ that work as raw HTML without a browser is the difference between a $1M and a $100M crawler.
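
The triage step can be approximated with the standard-library HTML parser. This sketch counts visible words and outgoing links in the raw HTML and flags the page for rendering when either count is low. The thresholds (200 words, 3 links) and the needs_render name are assumptions for illustration and would need tuning against real traffic.

```python
from html.parser import HTMLParser


class _TextAndLinks(HTMLParser):
    """Counts visible words and <a href> links, ignoring script/style."""

    def __init__(self):
        super().__init__()
        self.words = 0
        self.links = 0
        self._skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a" and any(k == "href" and v for k, v in attrs):
            self.links += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words += len(data.split())


def needs_render(html: str, min_words: int = 200, min_links: int = 3) -> bool:
    """True if the raw HTML looks too empty to index without a browser."""
    parser = _TextAndLinks()
    parser.feed(html)
    parser.close()
    return parser.words < min_words or parser.links < min_links


spa_shell = '<html><body><div id="root"></div><script>var app = 1;</script></body></html>'
print(needs_render(spa_shell))  # True: no text, no links
```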

The Common Crawl Approach

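Common Crawl publishes a fresh snapshot of the public web roughly every month or two in WARC format (Web ARChive: a concatenation of HTTP request/response records). A single crawl covers a few billion pages and runs to hundreds of terabytes uncompressed; the cumulative archive spans petabytes. Their pipeline is instructive: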

  • Apache Nutch generates the frontier and dispatches fetches.
  • Output is grouped into WARC files of roughly 1 GB each, stored in S3 (hosting is sponsored through the AWS Open Data program).
  • An index (Parquet on S3 + Athena) maps URL prefix and timestamp to byte ranges in the WARC, so consumers can pull just the records they need.
  • Re-crawl strategy: tier sites by traffic and expected freshness; re-visit news daily, long-tail every few months.
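
The byte-range lookup works because a .warc.gz file is conventionally a concatenation of independently gzipped records, so any record can be decompressed given only its offset and length. The sketch below demonstrates the idea on an in-memory stand-in rather than real Common Crawl data; the record payloads, URLs, and helper names are made up for illustration.

```python
import gzip


def make_archive(records: dict) -> tuple:
    """Concatenate payloads as separate gzip members; return (blob, index)."""
    blob = bytearray()
    index = {}  # url -> (offset, length) of its compressed member
    for url, payload in records.items():
        member = gzip.compress(payload)
        index[url] = (len(blob), len(member))
        blob += member
    return bytes(blob), index


def range_header(offset: int, length: int) -> str:
    """HTTP Range header value for one record (end byte is inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"


def read_record(blob: bytes, offset: int, length: int) -> bytes:
    """Decompress a single record without touching the rest of the file."""
    return gzip.decompress(blob[offset:offset + length])


blob, index = make_archive({
    "https://example.com/a": b"record A body",
    "https://example.com/b": b"record B body",
})
offset, length = index["https://example.com/b"]
print(range_header(offset, length))
print(read_record(blob, offset, length))  # b'record B body'
```

Against a remote object store, the same offset and length would go into a ranged GET, so a consumer downloads only the few kilobytes it needs from a file of about a gigabyte.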

For your own crawler: you almost certainly want to start with Common Crawl rather than re-fetch the open web. Then crawl deltas and the niche your business cares about.

URL Canonicalization

Before dedup, normalize: lowercase scheme and host, drop fragment (#anchor), sort query params, drop tracking params (utm_*, fbclid), strip default ports, resolve relative URLs against the base. Without this, HTTPS://Example.com:443/path?b=2&a=1#x and https://example.com/path?a=1&b=2 are different URLs and you crawl both.
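
The normalization steps listed above map directly onto urllib.parse. This is a minimal sketch: the tracking-parameter list is limited to the two examples named here, and userinfo and IPv6 literal hosts are not handled. The canonicalize function and its constants are illustrative names.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
TRACKING_PREFIXES = ("utm_",)   # assumption: extend for your own traffic
TRACKING_PARAMS = {"fbclid"}


def canonicalize(url: str, base: str = None) -> str:
    """Normalize a URL so equivalent spellings compare equal."""
    if base:
        url = urljoin(base, url)            # resolve relative URLs
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()   # drops userinfo, lowercases host
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host                       # strip default port
    else:
        netloc = f"{host}:{port}"
    params = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
        and not k.lower().startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(sorted(params))       # stable parameter order
    return urlunsplit((scheme, netloc, parts.path, query, ""))  # no fragment


print(canonicalize("HTTPS://Example.com:443/path?b=2&a=1#x"))
# https://example.com/path?a=1&b=2
```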

Honor <link rel="canonical"> tags — the page tells you its preferred URL. Treat duplicates as references to the canonical, not as separate pages.

Failure Modes

  • Spider trap — a site generates infinite URLs (calendar with /day/{n}). Defenses: per-host URL count cap, depth limit, regex pattern detection, tarpit blacklist.
  • Politeness violation cascade — a misconfiguration sends 1000 RPS to one host. Hard rate limit upstream of the fetcher, monitored.
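  • Bloom filter saturation: FPR climbs as you insert more URLs than the filter was sized for. Periodically rebuild into a fresh, larger filter, or use a scalable Bloom filter that chains on additional filters as it fills. (A counting Bloom filter adds deletion, not resizing.)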
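  • WARC corruption: a crash mid-write leaves a truncated file. WARC records carry their own length (and in .warc.gz files each record is usually its own gzip member), so a reader can salvage every complete record before the truncation point. Avoid the problem in the first place by writing atomically (temp file + rename).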
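
The temp-file-plus-rename pattern for local files looks roughly like this. It is a sketch assuming a POSIX-style filesystem where a rename within one directory is atomic; atomic_write is a hypothetical helper name.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem)
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())    # force bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic swap into place
    except BaseException:
        # On any failure, remove the partial temp file and re-raise.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "segment-00001.warc.gz")
    atomic_write(target, b"example bytes")
    print(os.listdir(d))  # ['segment-00001.warc.gz']
```

A crash before os.replace leaves only a stray .tmp file, which a startup sweep can delete; the final path never holds a partial write.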

FAQ

Should I render every page?

No — 95%+ of pages have content in the HTML. Render only when the HTML is empty or links are missing (heuristic: less than 200 words extracted, fewer than 3 outgoing links).

How do you prioritize re-crawls?

Use a PageRank-weighted recency model: high-rank pages are re-crawled daily, long-tail pages every 30 days. Shorten the interval when a change is detected (RSS, sitemap lastmod, content hash diff).
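
One way to turn that policy into a number is sketched below. The linear interpolation between 1 and 30 days and the "halve the interval on change" boost are assumptions chosen for illustration, not a formula from any particular crawler; rank is assumed to be normalized to the range 0 to 1.

```python
def recrawl_interval_days(rank: float, changed_last_visit: bool,
                          min_days: float = 1.0, max_days: float = 30.0) -> float:
    """Days until the next visit: high rank and recent change both shorten it."""
    rank = min(max(rank, 0.0), 1.0)  # clamp to [0, 1]
    # rank 1.0 -> min_days, rank 0.0 -> max_days, linear in between.
    interval = max_days - rank * (max_days - min_days)
    if changed_last_visit:
        # Hypothetical boost: revisit twice as soon, never below the floor.
        interval = max(min_days, interval / 2)
    return interval


print(recrawl_interval_days(1.0, changed_last_visit=False))  # 1.0
print(recrawl_interval_days(0.0, changed_last_visit=False))  # 30.0
print(recrawl_interval_days(0.0, changed_last_visit=True))   # 15.0
```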

What about HTTPS and certificate issues?

Validate certs in production; record cert errors as a soft signal (often a real error, sometimes site neglect). Do not bypass; a "skip cert check" mode is how you get phished.

Storage: WARC vs custom format?

WARC is the standard archive format (ISO 28500), readable by most web-archive tooling and used by Common Crawl. Use it unless you have a strong reason not to; custom formats become a tax on your future self.