POST /seed/urls - body:{urls:JSON} - called to add the seed urls that start the crawl. the urls are passed as comma-separated strings
POST /url/limit - body:{excludedDomains?:JSON, depthLimit:number, urlLengthLimit:number} - called by admins to set the excluded/blacklisted domains, a depth limit so the crawl doesn't go too deep, and a url length limit to help avoid spider traps (endlessly generated links)
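A minimal sketch of how the limits set through POST /url/limit could be applied to a candidate URL. The names CrawlLimits and passes_limits are hypothetical, and treating subdomains of an excluded domain as excluded is an assumption, not something the design states.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class CrawlLimits:
    # Mirrors the body of POST /url/limit (hypothetical in-memory form).
    depth_limit: int
    url_length_limit: int
    excluded_domains: set = field(default_factory=set)


def passes_limits(url: str, depth: int, limits: CrawlLimits) -> bool:
    """Return True if the URL may be crawled under the admin-set limits."""
    if len(url) > limits.url_length_limit:
        return False  # very long URLs are a common sign of generated links
    if depth > limits.depth_limit:
        return False
    host = (urlsplit(url).hostname or "").lower()
    # Assumption: excluding a domain also excludes its subdomains.
    for domain in limits.excluded_domains:
        if host == domain or host.endswith("." + domain):
            return False
    return True
```

The check is cheap, so it could run before any cache or db lookup for the URL.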
GET /metrics - returns the metrics for admins
seed URLs are the predefined urls needed to start crawling;
a URL reaches the URL frontier, whose prioritizer pushes it to one of multiple front queues. each front queue holds URLs of the same priority level. these queues are responsible for prioritization (eg home pages are prioritized over unofficial blog pages).
afterwards, URLs from the front queues are consumed by the front queue selector, which is more likely to consume from a queue with a higher priority. it forwards the url to the back queue router, which checks a mapping table that maps each host to its back queue; this ensures that each back queue contains urls from a single host only.
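The mapping table used by the back queue router can be sketched as below. BackQueueRouter is a hypothetical name, and opening a new queue for every unseen host is an assumption; a real deployment would likely cap and reuse queues.

```python
from collections import deque
from urllib.parse import urlsplit


class BackQueueRouter:
    """Routes each URL to the back queue that owns its host."""

    def __init__(self) -> None:
        self.host_to_queue = {}   # the mapping table: host -> back queue index
        self.back_queues = []     # one deque of URLs per host

    def route(self, url: str) -> int:
        host = (urlsplit(url).hostname or "").lower()
        queue_id = self.host_to_queue.get(host)
        if queue_id is None:
            # First URL seen for this host: open a new back queue for it.
            queue_id = len(self.back_queues)
            self.back_queues.append(deque())
            self.host_to_queue[host] = queue_id
        self.back_queues[queue_id].append(url)
        return queue_id
```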
the URLs then reach the back queues; each back queue contains urls from a single host; they are used for politeness.
the back queue selector chooses the queue to consume from next. worker threads (HTML Downloaders) consume from the back queues; they send only one request to each host at a time to ensure politeness. here we can also enforce a delay between consecutive requests to the same host (mapped to a back queue).
this entire logic enforces crawling only one url at a time from the same host.
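One way to sketch the politeness rule (one in-flight request per host, plus a delay between requests) is a min-heap of hosts ordered by the earliest time they may be fetched again. PoliteSelector and its methods are hypothetical names, and time is passed in explicitly so the logic is easy to follow.

```python
import heapq
from collections import deque
from typing import Optional, Tuple


class PoliteSelector:
    """Hands out one URL per host at a time, at least `delay` seconds apart."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self.queues = {}          # host -> deque of pending URLs (the back queues)
        self.ready_at = []        # min-heap of (earliest fetch time, host)
        self.next_allowed = {}    # host -> earliest time of its next fetch
        self.in_flight = set()    # hosts currently being downloaded
        self.scheduled = set()    # hosts currently in the heap

    def add(self, host: str, url: str, now: float) -> None:
        self.queues.setdefault(host, deque()).append(url)
        self._schedule(host, now)

    def _schedule(self, host: str, now: float) -> None:
        if host in self.in_flight or host in self.scheduled or not self.queues.get(host):
            return
        when = max(now, self.next_allowed.get(host, now))
        heapq.heappush(self.ready_at, (when, host))
        self.scheduled.add(host)

    def next_url(self, now: float) -> Optional[Tuple[str, str]]:
        """Return (host, url) for a host that may be fetched now, else None."""
        if not self.ready_at or self.ready_at[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready_at)
        self.scheduled.discard(host)
        self.in_flight.add(host)
        return host, self.queues[host].popleft()

    def done(self, host: str, now: float) -> None:
        """Called by the worker when the download finishes."""
        self.in_flight.discard(host)
        self.next_allowed[host] = now + self.delay
        self._schedule(host, now)
```

A worker would call next_url, download the page, then call done, which is what makes the host eligible again after the delay.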
each HTML downloader first needs to call DNS to fetch the IP of the domain, and it caches the result. when fetching again, it looks in the cache first and makes a DNS request only if the entry is not there.
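The cache-first lookup could look like the sketch below. DnsCache is a hypothetical name, the 300 second TTL is an assumption (the design does not state one), and the resolver is injectable so the cache can be exercised without network access.

```python
import socket
import time
from typing import Callable


class DnsCache:
    """Caches host -> IP lookups so repeat fetches skip the DNS round trip."""

    def __init__(self, resolve: Callable[[str], str] = socket.gethostbyname,
                 ttl_seconds: float = 300.0) -> None:
        self.resolve = resolve          # defaults to a blocking IPv4 lookup
        self.ttl_seconds = ttl_seconds  # assumed value, not from the design
        self._entries = {}              # host -> (ip, expiry time)

    def lookup(self, host: str) -> str:
        now = time.monotonic()
        entry = self._entries.get(host)
        if entry is not None and entry[1] > now:
            return entry[0]             # cache hit
        ip = self.resolve(host)         # miss or expired entry: ask DNS
        self._entries[host] = (ip, now + self.ttl_seconds)
        return ip
```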
the downloaded page then reaches Content Loader, which is responsible for checking that the same content wasn't already processed under a different URL. to do this, it checks the content DB, which stores the hashes of the pages' content, and compares the current page hash with the existing ones. if a hash matches, we stop crawling this page because we already processed this content. if there is no match, we save the content and its hash to content DB.
the page then reaches link extractor, which extracts all the links within the page. URL filter then checks each link against the blacklisted domains/links set up by the admins and against the length limit.
a link that passes all the filters reaches URL finder, which checks whether the URL was already processed, looking first in the cache and, on a miss, in the URL db. if it wasn't processed before, it is pushed to the frontier for processing.
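The two duplicate checks in this step can be sketched with in-memory stand-ins for the content DB and the URL db. ContentStore and SeenUrls are hypothetical names, and SHA-256 is an assumed choice of hash. An exact hash like this only catches byte-identical pages.

```python
import hashlib


class ContentStore:
    """In-memory stand-in for the content DB, keyed by content hash."""

    def __init__(self) -> None:
        self._by_hash = {}

    def save_if_new(self, content: bytes) -> bool:
        """Store the page and return True, or return False if it was seen before."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._by_hash:
            return False
        self._by_hash[digest] = content
        return True


class SeenUrls:
    """Cache-first check for already processed URLs, backed by a stand-in URL db."""

    def __init__(self) -> None:
        self.cache = set()
        self.db = set()

    def mark_if_new(self, url: str) -> bool:
        """Return True (and record the URL) only the first time it is seen."""
        if url in self.cache:
            return False
        if url in self.db:
            self.cache.add(url)   # warm the cache for the next lookup
            return False
        self.db.add(url)
        self.cache.add(url)
        return True
```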
the prioritizer checks multiple signals and computes a score based on them. the signals are: number of inbound links, url nesting, trusted domain, freshness and page type.
the html downloader implements retries with exponential backoff. if the download fails with a 5xx code (a server-side error that is often temporary), it adds the url back to the processing queue with a delay; if it fails with a 4xx code (a client error that a retry usually will not fix), it saves the url in the db and marks it accordingly (as failed or corrupted). if the max retries are exceeded, it adds the url to the dlq, where it is handled manually.
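A minimal sketch of the retry decision. The base delay, the cap, and the function names are assumptions, and which status codes count as retryable is left to the caller through the hypothetical is_retryable predicate.

```python
from typing import Callable, Tuple


def backoff_delay(retries_done: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2^retries_done seconds, capped at `cap`."""
    return min(cap, base * 2 ** retries_done)


def next_action(status: int, retries_done: int, max_retries: int,
                is_retryable: Callable[[int], bool]) -> Tuple[str, float]:
    """Decide what to do with a failed download.

    Returns ("retry", delay_seconds), ("mark_failed", 0.0) or ("dlq", 0.0).
    """
    if not is_retryable(status):
        return ("mark_failed", 0.0)   # save in the db, marked as failed
    if retries_done >= max_retries:
        return ("dlq", 0.0)           # retries exhausted: hand over for manual handling
    return ("retry", backoff_delay(retries_done))
```

Adding random jitter to the delay is a common refinement so that many failed URLs do not retry at the same moment.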
Under high load, the front queue selector adjusts consumption ratios dynamically. Normally it might consume 60% from high-priority queue, 30% medium, 10% low. Under load, it shifts to 80/15/5 to ensure important URLs are crawled first.
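The biased selection can be sketched as a weighted random pick using the ratios above. The queue names and pick_front_queue are hypothetical, and how "under load" is detected is left to the caller.

```python
import random

NORMAL_WEIGHTS = {"high": 60, "medium": 30, "low": 10}
LOADED_WEIGHTS = {"high": 80, "medium": 15, "low": 5}


def pick_front_queue(non_empty: list, under_load: bool, rng: random.Random) -> str:
    """Pick one of the non-empty front queues, biased toward higher priority."""
    weights = LOADED_WEIGHTS if under_load else NORMAL_WEIGHTS
    # random.choices normalizes the weights, so skipping empty queues is safe.
    return rng.choices(non_empty, weights=[weights[q] for q in non_empty], k=1)[0]
```

Because the pick is random rather than strict, low-priority queues are still drained eventually instead of starving.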
the components are decoupled with queues, and backpressure is applied in three ways: queue depth monitoring - when a queue reaches 80% capacity, the upstream producers slow down; pull-based consumption - workers pull from queues at their own pace; circuit breaker - if content db or dns slows down, we pause fetching to avoid cascading failures.
the system handles partial failures: if dns stops working, we use the cached IPs from the DNS cache and temporarily skip the failing domains; if content db is down, we temporarily use the data stored on disk and in memory; if some workers crash, the remaining ones continue working independently; if an entire domain queue is failing, all the others continue to work
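A minimal sketch of two of these mechanisms: the 80% queue depth check and a simple circuit breaker. The names are hypothetical, the failure count and cooldown are assumptions, and a slow call to content db or dns would be reported by the caller as a failure.

```python
import queue
from typing import Optional


def should_slow_down(q: queue.Queue, threshold: float = 0.8) -> bool:
    """True once a bounded queue is at or above `threshold` of its capacity."""
    return q.maxsize > 0 and q.qsize() >= threshold * q.maxsize


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, for `cooldown` seconds."""

    def __init__(self, max_failures: int, cooldown: float) -> None:
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        """Return True if fetching may proceed at time `now`."""
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown:
            # Cooldown elapsed: close the breaker and let calls through again.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, ok: bool, now: float) -> None:
        """Report the outcome of a call to the protected dependency."""
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = now
```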
the urls reach the Prioritizer, which computes the priority and maps each url to its priority queue. the prioritization process takes into account the following:
the number of inbound links - the higher the number, the higher the priority;
how nested the url is (checks number of / chars) - a less nested url is preferred;
trusted domain - wikipedia and news websites are preferred;
freshness;
page type - home pages are preferred.
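The signals above could be combined into a single score as sketched below. The weights, the thresholds, the cap of 1000 inbound links, and measuring freshness as days since the last update are all assumptions made for illustration; nesting is counted here as non-empty path segments rather than raw / characters.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class UrlSignals:
    url: str
    inbound_links: int
    trusted_domain: bool
    days_since_update: float   # assumed freshness measure


def priority_score(s: UrlSignals) -> float:
    """Combine the signals into a score in [0, 1]; weights are illustrative only."""
    path = urlsplit(s.url).path
    nesting = len([seg for seg in path.split("/") if seg])
    links = min(s.inbound_links, 1000) / 1000   # more inbound links -> higher
    shallow = 1 / (1 + nesting)                 # less nested -> higher
    trusted = 1.0 if s.trusted_domain else 0.0
    fresh = 1 / (1 + s.days_since_update)       # more recent -> higher (assumption)
    home = 1.0 if nesting == 0 else 0.0         # home page bonus
    return 0.3 * links + 0.2 * shallow + 0.2 * trusted + 0.15 * fresh + 0.15 * home


def front_queue_for(score: float) -> str:
    """Map a score to a front queue; the thresholds are assumptions."""
    if score >= 0.6:
        return "high"
    if score >= 0.3:
        return "medium"
    return "low"
```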
based on the resulting priority score, the urls are mapped to front queues. the front queue selector, which is biased towards higher-priority queues, picks from them and forwards each url to the back queue router, which checks a mapping table to know which back queue to push the URL to. back queues are used for politeness; the back queue selector chooses one queue and forwards one URL at a time from a domain to the worker thread. the worker thread can also add a small delay so the crawled domain isn't hit impolitely
Redis failure recovery uses a tiered fallback with automatic failover