Design A Web Crawler - System Design

Requirements

Functional Requirements:

Accept a list of URLs to initiate crawling.
Download and store static HTML pages from the provided URLs.
Parse pages to find new URLs and add them to the crawl queue.

Non-Functional Requirements:

scalability: the crawler targets billions of web pages, so crawling must be efficient and parallelized, and the system must be able to scale out under high demand and traffic spikes
robustness: the web is full of malicious pages (phishing pages, pages with malicious links), so the crawler must handle them too, especially spider traps (long URLs with deep directory structures: spiderwebexample.com/foo/bar/foo/bar/foo... )
politeness: the crawler shouldn't send too many requests to a website within a short time window
extensibility: this also implies abstraction. we need to be able to extend the functionality in case we want to crawl PDFs, images, etc. in the future
fault tolerance: we save the crawl state and data so that after a failure we can restart the crawler by loading the saved state and data. the system must tolerate worker/node failures, network failures, and partial system outages; with consistent hashing, the work of a failing node is delegated to another working node
data validation: data is validated before it is processed, which prevents system errors
exception handling: exceptions should be handled gracefully without crashing the system
durable storage: crawled data must be stored for 5 years
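The consistent hashing mentioned under fault tolerance can be sketched as below. This is a minimal illustration, not the design's actual implementation: the HashRing class, the virtual node count, and the use of MD5 for ring positions are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable hash so every process computes the same ring positions.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        # Dropping a failed node hands its hosts to the next node on the ring.
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def owner(self, host):
        if not self._ring:
            raise RuntimeError("no crawler nodes available")
        idx = bisect.bisect_left(self._ring, (_hash(host),))
        return self._ring[idx % len(self._ring)][1]
```

When a node is removed, only the hosts it owned move to other nodes; every other host keeps its owner, so most in-progress work is undisturbed.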

API Design

POST /seed/urls - body:{urls:JSON} - called to add the seed URLs that start the crawl. urls holds the URLs as comma-separated strings

POST /url/limit - body:{excludedDomains?:JSON, depthLimit:number, urlLengthLimit:number} - called by admins to set excluded (blacklisted) domains, a depth limit so the crawl doesn't go too deep, and a URL length limit to help prevent spider traps
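As a rough sketch of how the limits from /url/limit could be applied to a candidate URL: the CrawlLimits class, the is_allowed function, and the default values are hypothetical. It also assumes that depth means the number of path segments in the URL, which is one possible reading of depthLimit.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class CrawlLimits:
    # Field names mirror the /url/limit body; the defaults are made up.
    depth_limit: int = 10
    url_length_limit: int = 2000
    excluded_domains: set = field(default_factory=set)


def is_allowed(url: str, limits: CrawlLimits) -> bool:
    if len(url) > limits.url_length_limit:
        return False
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not host:
        return False
    # Match the excluded domain itself and any of its subdomains.
    if any(host == d or host.endswith("." + d) for d in limits.excluded_domains):
        return False
    # Assumption: depth is the number of non-empty path segments.
    depth = len([seg for seg in parts.path.split("/") if seg])
    return depth <= limits.depth_limit
```

With a depth limit of 3, a trap-style URL such as https://ok.example/foo/bar/foo/bar would be rejected while https://ok.example/a/b would pass.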

GET /metrics - returns the metrics for admins

High-Level Design

seed URLs are the predefined URLs needed to start crawling.

a URL first reaches the URL frontier, whose prioritizer pushes URLs to multiple front queues. each front queue holds URLs of one priority level; these queues are responsible for prioritization (e.g. home pages are prioritized over unofficial blog pages).

afterwards, the front queue selector consumes URLs from the front queues, picking a higher-priority queue with higher probability. it forwards each URL to the back queue router, which looks up a mapping table from host to back queue; this ensures that each back queue contains URLs from a single host only.
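A biased front queue selector could look roughly like the sketch below, which uses a weighted random draw over the non-empty queues. The FrontQueueSelector class and the weight values are illustrative assumptions, not part of the original design.

```python
import random
from collections import deque


class FrontQueueSelector:
    def __init__(self, weights, rng=None):
        # weights: priority name -> relative chance of being picked
        self.weights = dict(weights)
        self.queues = {name: deque() for name in self.weights}
        self.rng = rng or random.Random()

    def push(self, priority, url):
        self.queues[priority].append(url)

    def pop(self):
        # Only non-empty queues take part in the draw.
        names = [n for n, q in self.queues.items() if q]
        if not names:
            return None
        draw_weights = [self.weights[n] for n in names]
        chosen = self.rng.choices(names, weights=draw_weights, k=1)[0]
        return self.queues[chosen].popleft()
```

Because the draw is random rather than strict, low-priority queues still get served occasionally instead of starving.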

the URLs then land in the back queues; each back queue contains URLs from the same host, and this is what the politeness logic relies on.

the back queue selector picks the queue to consume from. worker threads (HTML Downloaders) consume from the back queues and keep only one request in flight per host at a time to ensure politeness. here we can also enforce a delay between consecutive fetches from the same host (mapped to a back queue).

this entire logic enforces crawling only one URL at a time from the same host.
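One way to sketch the per-host delay is a heap keyed by the earliest time each host may be fetched again. BackQueueSelector, its methods, and the delay value are hypothetical; the caller passes in the current time, and the delay is measured from the moment a URL is handed out, which is a simplification.

```python
import heapq
from collections import deque


class BackQueueSelector:
    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.queues = {}   # host -> deque of URLs
        self.ready = []    # heap of (earliest next fetch time, host)
        self.next_ok = {}  # host -> time before which it must not be fetched

    def push(self, host, url, now):
        if host not in self.queues:
            self.queues[host] = deque()
            start = max(now, self.next_ok.get(host, now))
            heapq.heappush(self.ready, (start, host))
        self.queues[host].append(url)

    def pop(self, now):
        """Return (host, url) for a host whose delay has elapsed, else None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.queues[host].popleft()
        self.next_ok[host] = now + self.delay
        if self.queues[host]:
            # The host is not eligible again until the delay has passed.
            heapq.heappush(self.ready, (self.next_ok[host], host))
        else:
            del self.queues[host]
        return host, url
```

Since a host appears on the heap at most once, two workers can never be handed URLs from the same host at the same moment.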

each HTML downloader first resolves the domain's IP through DNS and caches the result. on later fetches it looks in the cache first and only makes a DNS request on a miss.
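The cache-first lookup might be sketched as follows. The DnsCache class and the TTL value are assumptions; the default resolver is the standard library's socket.gethostbyname, and both the resolver and the clock are injectable so the logic can be exercised without network access.

```python
import socket
import time


class DnsCache:
    def __init__(self, ttl_seconds=300.0, resolver=socket.gethostbyname,
                 clock=time.monotonic):
        self.ttl = ttl_seconds
        self.resolver = resolver
        self.clock = clock
        self._entries = {}  # host -> (ip, expiry time)

    def resolve(self, host):
        now = self.clock()
        cached = self._entries.get(host)
        if cached and cached[1] > now:
            return cached[0]
        ip = self.resolver(host)  # cache miss or expired entry: ask DNS
        self._entries[host] = (ip, now + self.ttl)
        return ip
```

An expiry time keeps stale IPs from being used forever; how long entries should live is a tuning choice the design does not specify.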

the downloaded page then reaches the Content Loader, which is responsible for checking that the same content wasn't already processed under a different URL. to do this, it consults the content DB, which stores the hashes of the pages' content, and compares the current page's hash with the existing ones. if a match is found, we stop processing the page, as this content was already handled. if there is no match, we save the content and its hash to the content DB and pass the page to the link extractor, which extracts all the links within the page. the URL filter then drops links that match domains/links blacklisted by the admins or that exceed the configured length limit. links that pass all the filters reach the URL finder, which checks whether the URL was already processed, looking first in the cache and then, on a miss, in the URL DB. if the URL is new, it is pushed to the frontier for processing.
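The dedup, link extraction, and already-seen steps can be sketched together as below. ContentStore is an in-memory stand-in for the content DB, a plain set stands in for the URL cache and URL DB, and SHA-256 is an assumed choice of hash; the URL filter step is left out to keep the example short.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class ContentStore:
    """In-memory stand-in for the content DB (hypothetical interface)."""

    def __init__(self):
        self._pages = {}  # content hash -> (url, html)

    def save_if_new(self, url, html):
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in self._pages:
            return False  # same content already stored under another URL
        self._pages[digest] = (url, html)
        return True


class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def process_page(url, html, store, seen_urls):
    """Return new URLs for the frontier, or [] when the content is a duplicate."""
    if not store.save_if_new(url, html):
        return []
    parser = LinkExtractor(url)
    parser.feed(html)
    fresh = []
    for link in parser.links:
        if link not in seen_urls:
            seen_urls.add(link)
            fresh.append(link)
    return fresh
```

An exact hash only catches byte-identical pages; near-duplicate detection would need a different technique and is outside this sketch.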

the prioritizer checks multiple signals and computes a score based on them. the signals are:

page type - home pages have higher priority than blog posts or articles
url slashes - pages with fewer / have higher priority (less nested)
domain - known trusted domains, such as wikipedia or news websites, have higher priority
freshness
location
backlink count (pages with more inbound links)
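A scoring function over these signals might look like the sketch below. The weights, the UrlSignals fields, and the TRUSTED_DOMAINS set are all invented for illustration, and the location signal is omitted because the notes do not say how it is measured.

```python
import math
from dataclasses import dataclass
from urllib.parse import urlsplit

TRUSTED_DOMAINS = {"wikipedia.org"}  # illustrative allow-list


@dataclass
class UrlSignals:
    url: str
    backlinks: int = 0
    days_since_update: float = 365.0


def priority_score(s: UrlSignals) -> float:
    parts = urlsplit(s.url)
    host = (parts.hostname or "").lower()
    depth = len([seg for seg in parts.path.split("/") if seg])

    score = 0.0
    if depth == 0:
        score += 3.0                               # home page bonus
    score += 2.0 / (1 + depth)                     # fewer slashes, higher score
    if any(host == d or host.endswith("." + d) for d in TRUSTED_DOMAINS):
        score += 2.0                               # trusted domain bonus
    score += 1.0 / (1 + s.days_since_update / 30)  # fresher pages score higher
    score += math.log10(1 + s.backlinks)           # diminishing return on backlinks
    return score
```

The resulting number would then be bucketed into a front queue, for example by fixed score thresholds.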

the HTML downloader implements retries with exponential backoff. if the download fails with a 4xx code, it adds the URL back to the processing queue with a delay; if it fails with a 5xx code, it saves the URL in the DB and marks it accordingly (as failed or corrupted). if the max retries are exceeded, it adds the URL to the DLQ (dead-letter queue), where it is handled manually.
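The backoff schedule and the hand-off to the DLQ can be sketched as follows. The base delay, the cap, and the retry budget are assumed values, and the function names are hypothetical.

```python
def backoff_delay(attempt, base=1.0, cap=300.0):
    """Seconds to wait before retry number `attempt` (0-based), doubling each time."""
    return min(cap, base * 2 ** attempt)


def schedule_retry(url, attempt, max_retries=5):
    # Past the retry budget the URL goes to the dead-letter queue.
    if attempt >= max_retries:
        return ("dlq", url, None)
    return ("retry", url, backoff_delay(attempt))
```

With these defaults the delays run 1, 2, 4, 8, 16 seconds before the URL is handed to the DLQ; the cap keeps the delay bounded if a larger retry budget is configured.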

Under high load, the front queue selector adjusts consumption ratios dynamically. Normally it might consume 60% from the high-priority queue, 30% from medium, and 10% from low. Under load, it shifts to 80/15/5 to ensure important URLs are crawled first.
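The ratio switch could be sketched like this. How "high load" is detected is not specified above, so the sketch assumes it means the frontier being at least 80% full; the function names and that threshold are hypothetical.

```python
# Shares are whole percentages to keep the arithmetic exact.
NORMAL_RATIOS = {"high": 60, "medium": 30, "low": 10}
LOADED_RATIOS = {"high": 80, "medium": 15, "low": 5}


def consumption_ratios(queue_fill, threshold=0.8):
    """queue_fill: fraction of frontier capacity in use, between 0 and 1."""
    return LOADED_RATIOS if queue_fill >= threshold else NORMAL_RATIOS


def plan_batch(batch_size, ratios):
    # Whole slots per queue; slots lost to rounding go to the top queue.
    plan = {name: batch_size * share // 100 for name, share in ratios.items()}
    plan["high"] += batch_size - sum(plan.values())
    return plan
```

For a batch of 100 URLs this yields 60/30/10 slots normally and 80/15/5 once the threshold is crossed.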

each component is decoupled with queues and backpressure mechanisms: queue depth monitoring - when a queue reaches 80% capacity, the upstream producers slow down; pull-based consumption - workers pull from queues at their own pace; circuit breaker - if the content DB or DNS slows down, we pause fetching to avoid cascading failures.
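A minimal sketch of the circuit breaker and the 80% queue depth check follows. The CircuitBreaker class, its failure threshold, and the cool-down period are assumptions; it counts failures reported by the caller, and a slow call would have to be reported as a failure (for example on timeout) for it to trip.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial call through (half-open state).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def should_throttle(queue_depth, capacity, high_water=0.8):
    # Producers slow down once the queue is at or above the high-water mark.
    return queue_depth >= capacity * high_water
```

A worker would call allow() before touching the content DB or DNS and pause fetching while it returns False.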

the system handles partial failures: if DNS stops working, we use cached IPs from the DNS cache and temporarily skip the failing domains; if the content DB is down, we temporarily use the data stored on disk and in memory; if some workers crash, the remaining ones continue working independently; if an entire domain queue is failing, all the others continue to work.

Detailed Component Design

the URLs reach the Prioritizer, which computes the priority and maps each URL to its priority queue. the prioritization process takes into account the following:

the number of inbound links - the higher the number, the higher the priority;

how nested the URL is (checks the number of / chars) - a less nested URL is preferred;

trusted domain - wikipedia and news websites are preferred;

freshness;

page type - home pages are preferred.

based on the resulting priority score, the URLs are mapped to front queues. the front queue selector, which is biased towards higher-priority queues, picks from them and forwards each URL to the back queue router, which checks a mapping table to know which back queue to push the URL to. the back queues are used for politeness: the back queue selector chooses one queue and forwards one URL at a time from a domain to the worker thread. the worker thread can also add a small delay so the crawled domain isn't hit impolitely.
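The mapping table used by the back queue router might be sketched as below. BackQueueRouter, the fixed number of back queues, and the policy of returning None when every queue is taken are assumptions made for the example.

```python
from collections import deque
from urllib.parse import urlsplit


class BackQueueRouter:
    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]
        self.host_to_queue = {}              # the mapping table
        self.free = list(range(num_queues))  # queues with no host assigned

    def route(self, url):
        host = (urlsplit(url).hostname or "").lower()
        if host not in self.host_to_queue:
            if not self.free:
                return None  # all back queues taken; the URL waits upstream
            self.host_to_queue[host] = self.free.pop(0)
        idx = self.host_to_queue[host]
        self.queues[idx].append(url)
        return idx

    def release_if_empty(self, host):
        # Called once a host's queue drains so another host can reuse it.
        idx = self.host_to_queue.get(host)
        if idx is not None and not self.queues[idx]:
            del self.host_to_queue[host]
            self.free.append(idx)
```

Every URL of a given host lands in the same back queue, which is what lets the back queue selector enforce one fetch at a time per host.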

Redis failure recovery uses a tiered fallback with automatic failover