Design a web crawler that can crawl 1 billion pages per day
by solstice8452
The web crawler question is a Google classic and I was well prepared for it.
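Back of the envelope: 1B pages/day over 86,400 seconds = ~11,600 pages/second. Average page size 500KB, so about 5.8 GB/s (roughly 46 Gbit/s) of sustained download bandwidth. That is far more than one machine can handle, so it needs a distributed architecture with hundreds of crawler nodes.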
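The arithmetic is easy to sanity-check in a few lines. This is only a sketch: decimal units are assumed (1 KB = 1000 bytes), and the per-node throughput of 50 pages/second is a made-up figure used to show how "hundreds of nodes" could fall out of the numbers.

```python
import math

PAGES_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60           # 86,400
AVG_PAGE_BYTES = 500 * 1000              # 500 KB, decimal units assumed

pages_per_second = PAGES_PER_DAY / SECONDS_PER_DAY
gigabytes_per_second = pages_per_second * AVG_PAGE_BYTES / 1e9
gigabits_per_second = gigabytes_per_second * 8
terabytes_per_day = PAGES_PER_DAY * AVG_PAGE_BYTES / 1e12

# Hypothetical per-node throughput, not a number from the interview.
PAGES_PER_NODE_PER_SECOND = 50
nodes_needed = math.ceil(pages_per_second / PAGES_PER_NODE_PER_SECOND)

print(f"{pages_per_second:,.0f} pages/s")
print(f"{gigabytes_per_second:.2f} GB/s ({gigabits_per_second:.1f} Gbit/s)")
print(f"{terabytes_per_day:,.0f} TB/day of raw downloads")
print(f"{nodes_needed} nodes at {PAGES_PER_NODE_PER_SECOND} pages/s each")
```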
Key components: URL frontier (priority queue of URLs to crawl), fetcher (HTTP client pool), parser (extract links and content), URL deduplicator, politeness enforcer (robots.txt, rate limiting per domain).
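For the robots.txt part of the politeness enforcer, Python's standard library already has a parser. A small sketch, with a made-up robots.txt body and a hypothetical user agent name (`example-crawler`); a real crawler would fetch and cache the file per host instead of hardcoding it.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-crawler"  # hypothetical crawler name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# True/False: may this user agent fetch the URL?
allowed = parser.can_fetch(USER_AGENT, "https://example.com/index.html")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data")

# Crawl-delay in seconds for this user agent, or None if not specified.
delay = parser.crawl_delay(USER_AGENT)

print(allowed, blocked, delay)
```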
The URL frontier design was the most interesting part. Used a combination of front queues (priority-based) and back queues (one per domain for politeness). This ensures high-priority pages get crawled first while respecting rate limits.
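A single-process sketch of the front/back queue idea, not necessarily what went on the whiteboard. Assumptions: three priority levels, a fixed per-host delay, and the caller passes in the current time. The class and method names are mine. One simplification to note: front queues are drained into back queues eagerly, so priority only orders URLs that are waiting at the same moment.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    def __init__(self, num_priorities=3, delay_seconds=1.0):
        # Front queues: index 0 is the highest priority.
        self.front = [deque() for _ in range(num_priorities)]
        # Back queues: one FIFO per host.
        self.back = {}
        # Min-heap of (earliest_allowed_fetch_time, host) for hosts with work.
        self.ready = []
        # Remembers the politeness deadline even after a host queue empties.
        self.next_allowed = {}
        self.delay = delay_seconds

    def add(self, url, priority):
        self.front[priority].append(url)

    def _refill(self, now):
        # Move URLs into per-host back queues, highest priority first.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlsplit(url).hostname or ""
                if host not in self.back:
                    self.back[host] = deque()
                    when = max(now, self.next_allowed.get(host, now))
                    heapq.heappush(self.ready, (when, host))
                self.back[host].append(url)

    def next_url(self, now):
        """Return a URL that may be fetched at time `now`, or None."""
        self._refill(now)
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        queue = self.back[host]
        url = queue.popleft()
        self.next_allowed[host] = now + self.delay
        if queue:
            heapq.heappush(self.ready, (now + self.delay, host))
        else:
            del self.back[host]
        return url


frontier = Frontier(delay_seconds=1.0)
frontier.add("https://a.example/low", 2)
frontier.add("https://a.example/high", 0)
frontier.add("https://b.example/x", 1)

first = frontier.next_url(0.0)    # high-priority URL on a.example
second = frontier.next_url(0.0)   # b.example is a different host, so no wait
third = frontier.next_url(0.5)    # a.example is still cooling down
fourth = frontier.next_url(1.0)   # delay elapsed, low-priority URL goes out
```

In a distributed setup the back queues would be sharded by host across nodes so that each host's rate limit is enforced in exactly one place.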
For deduplication, discussed using a Bloom filter for quick URL lookups with a backing store for false positive resolution. Content deduplication uses SimHash to detect near-duplicate pages.
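Both ideas fit in a short sketch. Assumptions: a Python `set` stands in for the backing store, SHA-256 with double hashing supplies the Bloom filter positions, and SimHash uses whitespace tokens with equal weights. The names `BloomFilter`, `UrlSeen`, `simhash`, and `hamming` are mine, and the sizes are toy values.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


class UrlSeen:
    """Bloom filter in front of an exact store (a set here, as a stand-in)."""

    def __init__(self, bloom, store):
        self.bloom = bloom
        self.store = store

    def check_and_add(self, url):
        # A Bloom "no" is definitive; only a "maybe" needs the exact store.
        if self.bloom.might_contain(url) and url in self.store:
            return True
        self.bloom.add(url)
        self.store.add(url)
        return False


def simhash(text, bits=64):
    # Each token votes +1/-1 on every bit; the sign of the total sets the bit.
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.sha256(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming(a, b):
    # Number of differing bits; a small distance suggests near-duplicates.
    return bin(a ^ b).count("1")


seen = UrlSeen(BloomFilter(num_bits=1 << 16, num_hashes=4), set())
print(seen.check_and_add("https://example.com/a"))  # new URL
print(seen.check_and_add("https://example.com/a"))  # already seen
print(hamming(simhash("the quick brown fox"), simhash("the quick brown dog")))
```

Near-duplicate detection then comes down to picking a Hamming distance threshold on the fingerprints. The right threshold depends on the corpus, so I would treat any specific number as something to tune.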
The DNS resolver was a detail I brought up proactively. At ~11K requests/second, DNS lookups can become a bottleneck, so I proposed a local DNS cache with TTL-based expiration.
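A minimal sketch of such a cache. One caveat: `socket.gethostbyname` does not expose the record's real TTL, so a fixed TTL is assumed here. A production crawler would more likely use a resolver library that returns TTLs, or run a caching resolver on each node. The `DnsCache` name and its parameters are mine.

```python
import socket
import time


class DnsCache:
    def __init__(self, ttl_seconds=300.0, resolver=socket.gethostbyname,
                 clock=time.monotonic):
        self.ttl = ttl_seconds        # fixed TTL, an assumption (see above)
        self.resolver = resolver      # injectable so it can be tested offline
        self.clock = clock
        self.entries = {}             # host -> (address, expires_at)

    def resolve(self, host):
        now = self.clock()
        entry = self.entries.get(host)
        if entry is not None and entry[1] > now:
            return entry[0]           # cache hit, still fresh
        address = self.resolver(host) # miss or expired: do the real lookup
        self.entries[host] = (address, now + self.ttl)
        return address


# Offline demo with a fake resolver and a fake clock.
lookups = []
fake_now = [0.0]

def fake_resolver(host):
    lookups.append(host)
    return "192.0.2.1"                # documentation-range address

cache = DnsCache(ttl_seconds=60.0, resolver=fake_resolver,
                 clock=lambda: fake_now[0])
cache.resolve("example.com")          # miss: calls the resolver
cache.resolve("example.com")          # hit: served from the cache
fake_now[0] = 61.0
cache.resolve("example.com")          # expired: calls the resolver again
print(len(lookups))
```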
The follow-up was about handling JavaScript-rendered pages. I discussed a headless browser pool (Puppeteer/Playwright) for the subset of pages that require JS execution, with heuristics to identify which pages need it, since rendering every page in a browser would be far too expensive at this scale.
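One possible heuristic, purely as an illustration: if the raw HTML contains scripts but almost no visible text, the content is probably rendered client-side. The thresholds and the `needs_js_rendering` name are assumptions of mine, not something settled in the interview.

```python
from html.parser import HTMLParser

_HIDDEN_TAGS = ("script", "style", "noscript")


class _TextStats(HTMLParser):
    """Counts <script> tags and visible text characters in raw HTML."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.text_chars = 0
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in _HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in _HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Only count text outside script/style/noscript.
        if self._hidden_depth == 0:
            self.text_chars += len(data.strip())


def needs_js_rendering(html, min_text_chars=200, min_scripts=1):
    # Thresholds are guesses; they would need tuning on real crawl data.
    stats = _TextStats()
    stats.feed(html)
    stats.close()
    return stats.script_count >= min_scripts and stats.text_chars < min_text_chars


spa_page = ('<html><body><div id="root"></div>'
            '<script src="app.js"></script></body></html>')
static_page = ('<html><body><p>' + 'word ' * 100 + '</p>'
               '<script>var x = 1;</script></body></html>')

print(needs_js_rendering(spa_page), needs_js_rendering(static_page))
```

Pages flagged this way would go to the headless browser pool and everything else stays on the cheap HTTP fetch path.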
The politeness discussion went well. Beyond robots.txt, I talked about adaptive rate limiting based on server response times and HTTP 429/503 responses.
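A sketch of what per-host adaptive delay could look like. The policy is an assumption of mine: double the delay on 429/503 (honoring Retry-After if it is larger), otherwise drift halfway toward a target proportional to the observed response time. The constants are illustrative.

```python
class AdaptiveDelay:
    """Tracks the delay between requests to a single host."""

    def __init__(self, min_delay=1.0, max_delay=60.0, latency_factor=10.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.latency_factor = latency_factor  # delay target = factor * latency
        self.delay = min_delay

    def record(self, status, response_seconds, retry_after=None):
        """Update the delay from one response and return the new value."""
        if status in (429, 503):
            # Server is pushing back: back off multiplicatively.
            backoff = self.delay * 2
            if retry_after is not None:
                backoff = max(backoff, retry_after)
            self.delay = min(self.max_delay, backoff)
        else:
            # Slow responses raise the target; fast ones let the delay recover.
            target = max(self.min_delay, self.latency_factor * response_seconds)
            self.delay = min(self.max_delay, (self.delay + target) / 2)
        return self.delay


limiter = AdaptiveDelay()
history = [
    limiter.record(200, 0.05),                  # healthy: stays at the minimum
    limiter.record(429, 0.05),                  # throttled: delay doubles
    limiter.record(503, 0.05, retry_after=30),  # Retry-After wins
    limiter.record(200, 0.10),                  # recovering: halfway back
]
print(history)
```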