Design A Web Crawler - System Design

System requirements

Functional:

Given a set of seed URLs, crawl the web starting from these URLs.
For every URL fetched, extract the content of the page and any further URLs it links to, and store the content for later processing.

Non-Functional:

Politeness - Adhere to robots.txt and do not crawl the pages it disallows.
Scalability - Handle crawling of up to 10B pages.
Storage - Store the data from the 10B pages crawled.
Fault Tolerance - The system should handle failures and resume from the failed state, without having to crawl everything again.

Capacity estimation

Assuming we crawl about 10B web pages and each page is approximately 2 MB, the system needs to handle roughly 10B * 2 MB = 20,000 TB, which is about 20 petabytes of data.

Let's assume that each web page has about 10 media files, and that every media file is about 1 GB. We would then need roughly

10B * 10 * 1 GB = 100B GB, which is about 100,000 petabytes (100 exabytes) of data.
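
As a quick sanity check of the estimate, the arithmetic can be written out in a few lines. This is only a sketch: it uses decimal units (1 PB = 10^15 bytes) and the same assumptions as the text (10B pages, 2 MB per page, 10 media files of 1 GB per page). It gives 20 PB for the pages and 100,000 PB (100 EB) for the media.

```python
# Back-of-the-envelope storage estimate in decimal units.
PAGES = 10 * 10**9        # 10B pages
PAGE_SIZE = 2 * 10**6     # 2 MB per page (assumption from the text)
MEDIA_PER_PAGE = 10       # media files per page (assumption from the text)
MEDIA_SIZE = 1 * 10**9    # 1 GB per media file (assumption from the text)

PB = 10**15               # bytes in one petabyte


def page_storage_pb() -> float:
    """Storage for the page content alone, in petabytes."""
    return PAGES * PAGE_SIZE / PB


def media_storage_pb() -> float:
    """Storage for all media files, in petabytes."""
    return PAGES * MEDIA_PER_PAGE * MEDIA_SIZE / PB
```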

API design

1) /addurl - adds URLs to the frontier queue.

2) /fetchUrl - resolves the URL's host through a DNS lookup and fetches the web page and the data associated with it.

3) /extractTextContent - extracts the text and static content from the web page.

4) /extractMediaContent - extracts the media content from the web page.

5) /putURLinParser - adds the fetched URL to the parser queue.

Database design

Different kinds of database to suit different needs

Relational database - an RDBMS stores the URLs and the URL metadata.
Blob storage - S3 stores the large media files.
NoSQL database - MongoDB stores the text data, which is mostly static content and semi-structured data extracted from HTML and CSS.
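
To make the relational part concrete, here is a minimal sketch of what the URL metadata table might look like, using an in-memory SQLite database as a stand-in for the RDBMS. The table name, column names, and the helper functions are hypothetical; they are chosen to cover what the rest of the design needs (content hash, blob location, depth, last crawled time).

```python
import sqlite3

# Hypothetical schema for the URL metadata store.
SCHEMA = """
CREATE TABLE IF NOT EXISTS url_metadata (
    url             TEXT PRIMARY KEY,
    domain          TEXT NOT NULL,
    content_hash    TEXT,
    blob_key        TEXT,
    depth           INTEGER NOT NULL DEFAULT 0,
    last_crawled_at REAL
);
CREATE INDEX IF NOT EXISTS idx_url_metadata_hash
    ON url_metadata (content_hash);
"""


def open_metadata_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the metadata DB and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_crawl(conn, url, domain, content_hash, blob_key, depth, crawled_at):
    """Insert or overwrite the metadata row for a crawled URL."""
    conn.execute(
        "INSERT OR REPLACE INTO url_metadata VALUES (?, ?, ?, ?, ?, ?)",
        (url, domain, content_hash, blob_key, depth, crawled_at),
    )
    conn.commit()
```

The index on content_hash is there because the design looks pages up by their hash when checking for duplicates.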

High-level design

Frontier Queue - The initial queue, holding the list of seed URLs to crawl from.
URL Fetcher - The component that does the actual crawling: it resolves the URL's host through a DNS lookup and fetches the web page.
Parser Queue - A queue of fetched URLs waiting to be processed and parsed.
Media File Extractor - Extracts the media files from the web page and stores them in S3.
Text File Extractor - Parses the web page for the static text content and stores it in MongoDB.
Databases - A mix of RDBMS, blob storage and NoSQL databases to store the parsed content.

Request flows

The seed URLs are placed in the frontier queue.
The URL fetcher picks up a URL and resolves its hostname through the DNS resolver.
It looks up the robots.txt file for the domain and fetches the web page if it is eligible to be fetched.
The URL fetcher puts the raw HTML in blob storage and the URL metadata in the metadata DB.
The fetched URL is put in the parser queue to be picked up by a parser worker.
The parser worker has two components. The URL and text extractor extracts the text and puts it in the MongoDB text store, and puts the extracted URLs back in the frontier queue.
The media file extractor extracts the media files and puts them in the S3 media storage.
Before the data is put in the metadata DB, the page content is hashed and checked against the metadata DB to make sure it is not already present.
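
The step where the parser worker extracts URLs and feeds them back to the frontier queue could look roughly like the sketch below. It uses only the Python standard library; the class and function names are hypothetical, and a production parser would likely need to handle more tags and malformed markup.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(base_url: str, html: str) -> list[str]:
    """Return the links found in html, resolved against base_url."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

The returned links are what the worker would put back in the frontier queue.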

Detailed component design

The Frontier Queue - This could be a Kafka-based queue, but we need a retry mechanism to handle failed fetches. With Kafka, we would add a failure topic holding the failed URLs, from which the URL fetcher retries the fetch with an exponential backoff. SQS makes this simpler: a received message stays hidden for a visibility timeout (30 seconds by default), and the consumer can extend that timeout on each failed attempt to get an exponential backoff, for example 30 seconds, 2 minutes, 5 minutes, up to 15 minutes. These limits are configurable. We need to cap the backoff, and move the message to the dead-letter queue if it still fails after x retries.
Robots.txt File - We need to adhere to the politeness requirement. Robots.txt tells web crawlers which pages they are not allowed to crawl. We download the robots.txt file for each domain that we crawl and store it in the metadata DB. Before crawling a URL, the crawler checks whether the URL is disallowed by robots.txt. If it is, the crawler drops the URL; otherwise the URL can be crawled. The crawler also checks the Crawl-delay directive: if the delay since the last crawl of that domain has not yet elapsed, it puts the URL back in the queue. If it has elapsed, it crawls the web page and updates the last crawled time of the domain.
A robots.txt file looks like this:

User-agent: *
Disallow: /private
Crawl-delay: 10

Crawl-delay specifies the time in seconds to wait between requests to that domain. We need to respect this by spacing out the URLs we enqueue for the domain in the frontier queue, so that the required gap separates any two consecutive requests.
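
The drop / requeue / fetch decision described above can be sketched with Python's standard urllib.robotparser module. The user agent string and the crawl_decision helper are hypothetical; in the real system the robots.txt text and the last crawled time would come from the metadata DB.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical crawler name


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_decision(rules, url, last_crawled_at, now=None) -> str:
    """Return 'drop', 'requeue' or 'fetch' for a URL.

    last_crawled_at is the time of the previous request to the
    same domain, or None if the domain was never crawled.
    """
    if not rules.can_fetch(USER_AGENT, url):
        return "drop"
    delay = rules.crawl_delay(USER_AGENT) or 0
    now = time.time() if now is None else now
    if last_crawled_at is not None and now - last_crawled_at < delay:
        return "requeue"
    return "fetch"
```

On a "fetch" result the caller would crawl the page and then update the domain's last crawled time.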

Trade offs/Tech choices

Choosing S3 for large media files. Media files are large and belong in blob storage.

We could also choose S3 for the text, but a NoSQL store such as MongoDB lets us query the semi-structured text, and eventual consistency is acceptable for this data.

Choosing SQS queues instead of Kafka for all the queues in the system, to take advantage of the built-in visibility timeout and dead-letter queue support, which make retries with exponential backoff simple to implement.
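
A capped exponential backoff for retries could be computed as in the sketch below. The base delay, cap, and maximum receive count are illustrative assumptions, not values from the design. With SQS, a consumer would typically read the message's ApproximateReceiveCount attribute and apply the result through ChangeMessageVisibility, while the queue's redrive policy moves the message to the dead-letter queue.

```python
from typing import Optional

BASE_DELAY_SECONDS = 30    # first retry delay (assumed)
MAX_DELAY_SECONDS = 900    # 15 minute cap (assumed)
MAX_RECEIVES = 8           # hypothetical dead-letter threshold


def next_visibility_timeout(receive_count: int) -> Optional[int]:
    """Delay before the next retry, or None to dead-letter.

    receive_count is how many times the message has been
    received so far, starting at 1 for the first attempt.
    """
    if receive_count >= MAX_RECEIVES:
        return None
    attempt = max(receive_count, 1)
    # Double the delay on every attempt, up to the cap.
    delay = BASE_DELAY_SECONDS * 2 ** (attempt - 1)
    return min(delay, MAX_DELAY_SECONDS)
```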

Failure scenarios/bottlenecks

If a crawler fails before it finishes processing the URLs it picked up, those URLs remain in the queue until another crawler picks them up. Since we are using an SQS queue, a message is hidden from other crawlers for the visibility timeout once a crawler receives it. If the crawler fails before it finishes fetching the URL, the message becomes visible again after the visibility timeout expires, and another crawler can pick it up. Once the URL has been fetched and stored in S3 for further processing, the crawler deletes the message from the queue so that it can no longer be picked up.
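
The receive / hide / delete cycle can be illustrated with a small in-memory stand-in for the queue. This is not the SQS API; the class and method names are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
import itertools


class VisibilityQueue:
    """In-memory model of a queue with a visibility timeout."""

    def __init__(self, visibility_timeout: float = 30.0) -> None:
        self.visibility_timeout = visibility_timeout
        self._messages = {}            # id -> (body, visible_at)
        self._ids = itertools.count(1)

    def send(self, body: str) -> int:
        message_id = next(self._ids)
        self._messages[message_id] = (body, 0.0)
        return message_id

    def receive(self, now: float):
        """Return (id, body) of a visible message and hide it."""
        for message_id, (body, visible_at) in self._messages.items():
            if visible_at <= now:
                hidden_until = now + self.visibility_timeout
                self._messages[message_id] = (body, hidden_until)
                return message_id, body
        return None

    def delete(self, message_id: int) -> None:
        """Called by the crawler after the page is safely stored."""
        self._messages.pop(message_id, None)
```

A crawler that crashes simply never calls delete, so the message reappears once the timeout has passed.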

Crawling URLs that point to the same website - https://www.abc.com and https://abc.com point to the same website. To avoid processing the same page twice, we hash the content of each crawled page and store the hash as a column in the metadata DB. When the duplicate URL is fetched, we look up its content hash in the DB. If the hash already exists, we store nothing from that page and discard it from the working queue.
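
A minimal sketch of the content-hash check follows. SHA-256 is an assumption (the design does not name a hash function), and the in-memory set stands in for the hash column in the metadata DB.

```python
import hashlib


def content_hash(body: bytes) -> str:
    """Hex digest identifying the page content."""
    return hashlib.sha256(body).hexdigest()


class SeenContent:
    """Tracks which page contents have already been stored."""

    def __init__(self) -> None:
        self._hashes = set()

    def is_duplicate(self, body: bytes) -> bool:
        """True if this exact content was seen before.

        New content is recorded as a side effect.
        """
        digest = content_hash(body)
        if digest in self._hashes:
            return True
        self._hashes.add(digest)
        return False
```

An exact hash only catches byte-identical pages; pages that differ in a timestamp or an ad slot would still count as new.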

Looking out for spider traps - these are usually built to send the crawler into an infinite loop. To handle them, we keep a depth for every domain and increment it every time we crawl a page linked from a fetched page of that domain. Once the threshold depth is reached, we stop following links on that domain.
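
One possible reading of the depth limit is sketched below: each queued URL carries a depth, which grows by one when a link stays on the same host and resets to zero when it leads to a different host. The threshold value and the function names are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit

MAX_DEPTH = 3  # hypothetical threshold


def next_depth(parent_url: str, parent_depth: int, link: str) -> int:
    """Depth of a link found on parent_url."""
    same_host = urlsplit(parent_url).hostname == urlsplit(link).hostname
    return parent_depth + 1 if same_host else 0


def enqueue_links(frontier: deque, parent_url: str, parent_depth: int, links) -> int:
    """Queue links under the depth limit; return how many were queued."""
    accepted = 0
    for link in links:
        depth = next_depth(parent_url, parent_depth, link)
        if depth > MAX_DEPTH:
            continue  # too deep on this host, likely a trap
        frontier.append((link, depth))
        accepted += 1
    return accepted
```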

DNS lookups can be a bottleneck if we are using third-party DNS resolvers. Caching DNS lookups on the crawler reduces the number of requests made to the DNS providers.
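
A crawler-side DNS cache could be as simple as the sketch below. The fixed TTL is an assumption (a real cache might honor the TTL of each DNS record), and the resolver and clock are injectable so the class can be exercised without network access.

```python
import socket
import time


class DnsCache:
    """Caches hostname lookups for a fixed time to live."""

    def __init__(self, ttl_seconds=300.0,
                 resolver=socket.gethostbyname,
                 clock=time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._resolver = resolver
        self._clock = clock
        self._entries = {}  # hostname -> (address, expires_at)

    def resolve(self, hostname: str) -> str:
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached is not None and cached[1] > now:
            return cached[0]
        # Cache miss or expired entry: ask the real resolver.
        address = self._resolver(hostname)
        self._entries[hostname] = (address, now + self._ttl)
        return address
```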

Future improvements

We can run multiple crawl servers that are geographically distributed; servers closer to the hosts will be more efficient at fetching the URLs.

We can implement a URL scheduler that re-schedules URLs that have already been fetched, to check for changes in their content and update the database on our side.
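
One way such a scheduler might work is a priority queue ordered by the time each URL is next due. The sketch below assumes a single fixed recrawl interval for all URLs, which is a simplification; the class name and interval are hypothetical.

```python
import heapq


class RecrawlScheduler:
    """Returns already-crawled URLs when they are due again."""

    def __init__(self, interval_seconds: float = 86400.0) -> None:
        self.interval = interval_seconds
        self._heap = []  # entries are (due_at, url)

    def mark_crawled(self, url: str, crawled_at: float) -> None:
        """Schedule the next visit one interval after this crawl."""
        heapq.heappush(self._heap, (crawled_at + self.interval, url))

    def due(self, now: float) -> list:
        """Pop and return every URL whose due time has passed."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[1])
        return ready
```

The URLs returned by due would be pushed back to the frontier queue, and their stored content hash compared with the fresh one to detect changes.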