Web Crawler System Design: Politeness Is a Sharding Problem

November 26, 2025

A web crawler has five moving parts. A URL frontier holds the queue of URLs to fetch, usually with per-host priority so the crawl can be polite. A fetcher pool talks HTTP to the world with throttling and retry logic. A parser extracts links and content from the fetched bytes. A dedup layer canonicalizes URLs and consults a bloom filter so the same page is not fetched twice. Storage is split: raw HTML goes to blob storage like S3 so you can re-parse without re-crawling, and parsed content goes to a search index.

The decoupling between fetcher and parser is worth its weight in operability. Fetchers are IO-bound, parsers are CPU-bound, and they fail in different ways. A message queue between them lets each scale independently and recover independently. Drop the parser, the fetcher keeps stuffing the queue. Drop the fetcher, the parser drains what is queued.

The metadata store is the brain. It tracks what has been seen, when it was last fetched, the crawl depth, retry counts, and the per-host crawl budget. Without it, the crawler endlessly re-fetches the same pages and pummels hosts that asked it not to.

Politeness is the part everyone underestimates. robots.txt, crawl delay, per-host rate limits, exponential backoff on 429s, identifying your bot with a real User-Agent and a contact URL. Skip these and you do not have a crawler, you have a denial-of-service tool that crawls.

The distributed shape is where it gets subtle. With one crawler node, per-host rate limiting is trivial: keep a token bucket per host in memory. With N crawler nodes, you have to decide who owns the rate limit. If you shard by URL hash, every node sees a roughly even slice of every host. Each node enforces its 1 RPS budget for example.com. The actual traffic to example.com is N RPS.

This is the failure I watched. A team hashed URLs across thirty-two crawler nodes for load balance. Each node respected a one-request-per-second-per-host limit. Their target sites saw thirty-two requests per second from the crawler's IP range. One host's WAF flagged the pattern and blocked the entire CIDR. Half the crawler's targets stopped responding overnight. Engineers spent a week unwinding the IP block and convincing site operators they were not malicious.

The fix was to shard by domain instead of by URL. Consistent hashing on the host portion routes every URL for example.com to the same crawler node. That node owns the per-host rate limit, the per-host crawl-delay timer, the per-host backoff state. Politeness becomes local state on the shard owner, not a global agreement that nobody is enforcing.

The lesson generalizes. When a rule is per-key, the shard key must be the key the rule is about.

Key takeaway

Hash the URL and you spread one host across every crawler node, each enforcing its own rate limit. Hash the domain and a single shard owns each host, so the rate limit means what you think it means.

Originally posted on LinkedIn. View original.