Considering we are crawling about 10B web pages and each web page is approximately 2 MB, the system should be able to handle roughly 10B * 2 MB = 20,000 TB of data, which is about 20 petabytes.
Let's assume that each web page has about 10 media files, and the size of every media file is about 1 GB, so we would need roughly
10B * 10 * 1 GB = 100B GB, which is about 100,000 petabytes (100 exabytes) of data.
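The estimate can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 PB = 10^15 bytes) and takes the page count, page size, media count and media size straight from the assumptions above.

```python
# Back-of-the-envelope storage estimate, decimal units (assumption).
MB, GB, PB, EB = 10**6, 10**9, 10**15, 10**18

pages = 10 * 10**9        # ~10B web pages
page_size = 2 * MB        # ~2 MB per page
media_per_page = 10       # ~10 media files per page
media_size = 1 * GB       # ~1 GB per media file

page_bytes = pages * page_size
media_bytes = pages * media_per_page * media_size

print(f"pages: {page_bytes / PB:.0f} PB")   # pages: 20 PB
print(f"media: {media_bytes / EB:.0f} EB")  # media: 100 EB
```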
1) /addurl - adds URLs to the frontier queue.
2) /fetchUrl - performs the DNS lookup for the URL and fetches the web page and the data associated with it.
3) /extractTextContent - extracts the text and static content from the web page.
4) /extractMediaContent - extracts media content from the web page.
5) /putURLinParser - adds the fetched URL to the parsing queue.
Different kinds of databases to suit different needs
User-agent: *
Disallow: /private
Crawl-delay: 10
Crawl-delay specifies the time, in seconds, to wait between requests to that domain. We need to respect this: further URLs for the domain are enqueued in the frontier queue in such a way that at least this much time passes between two consecutive requests.
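As a rough illustration, Python's standard-library urllib.robotparser can read these rules. The user-agent name "example-crawler" and the helper next_fetch_time are hypothetical, and the sketch parses the rules from a string instead of downloading robots.txt.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

agent = "example-crawler"  # hypothetical user-agent name
allowed = parser.can_fetch(agent, "https://www.abc.com/public/page")
blocked = parser.can_fetch(agent, "https://www.abc.com/private/data")
delay = parser.crawl_delay(agent)  # seconds, or None if not specified


def next_fetch_time(last_fetch: float, delay_seconds) -> float:
    """Earliest time the next request to the same domain may be sent."""
    return last_fetch + (delay_seconds or 0)


print(allowed, blocked, delay)
print(next_fetch_time(100.0, delay))
```

The frontier queue would hold back a URL for this domain until the time returned by next_fetch_time has passed.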
Choosing S3 for large media files. Media files will be huge and need to be stored in blob storage.
We could also choose S3 for text, but DynamoDB is an alternative that provides eventually consistent reads for the text data.
Choosing SQS queues instead of Kafka for all the queues in the system, to take advantage of the retry handling SQS provides out of the box (visibility timeouts and dead-letter queues), on top of which exponential backoff can be applied.
If a crawler fails before processing all the URLs it took from the queue, those URLs remain in the queue until another crawler picks them up. Since we are using an SQS queue, SQS uses the visibility timeout to hide a message from other crawlers once one crawler has received it. If that crawler fails before it finishes fetching the URL, the message becomes visible again when the visibility timeout expires, and another crawler can pick it up. Once the URL has been fetched and its content stored in S3 for further processing, the crawler deletes the message from the queue so that it can no longer be picked up.
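The receive, hide, delete cycle can be sketched with a small in-memory queue. This is not the SQS API; the class VisibilityQueue and its methods are hypothetical stand-ins that only mimic the visibility-timeout behaviour described above.

```python
import time
import uuid
from typing import Optional, Tuple


class VisibilityQueue:
    """In-memory stand-in for a queue with a visibility timeout."""

    def __init__(self, visibility_timeout: float):
        self.visibility_timeout = visibility_timeout
        # message id -> (body, time at which the message is visible again)
        self._messages = {}

    def send(self, body: str) -> str:
        msg_id = str(uuid.uuid4())
        self._messages[msg_id] = (body, 0.0)
        return msg_id

    def receive(self, now: Optional[float] = None) -> Optional[Tuple[str, str]]:
        """Return one visible message and hide it for the timeout period."""
        now = time.monotonic() if now is None else now
        for msg_id, (body, visible_at) in self._messages.items():
            if visible_at <= now:
                self._messages[msg_id] = (body, now + self.visibility_timeout)
                return msg_id, body
        return None

    def delete(self, msg_id: str) -> None:
        """Called by a crawler once the page is fetched and stored."""
        self._messages.pop(msg_id, None)


queue = VisibilityQueue(visibility_timeout=30)
queue.send("https://www.abc.com/")

first = queue.receive(now=0)    # crawler A takes the URL
hidden = queue.receive(now=10)  # crawler B sees nothing: message is hidden
retry = queue.receive(now=31)   # crawler A crashed; message is visible again
queue.delete(retry[0])          # crawler B finished, so the message is removed
```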
Crawling URLs that point to the same website - https://www.abc.com and https://abc.com point to the same website. To avoid storing the same page twice, we can hash the content of the crawled page and store the hash as a column in the metadata DB. After fetching a page, we look for its hash in the DB. If it already exists, we don't store anything from that page, and we discard the URL from the working queue.
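A minimal sketch of the hash check, assuming SHA-256 as the hash function and using an in-memory set in place of the hash column in the metadata DB. The function names are hypothetical.

```python
import hashlib

# Stand-in for the hash column in the metadata DB (assumption).
seen_hashes = set()


def content_hash(body: bytes) -> str:
    """Hash of the page content, independent of the URL it came from."""
    return hashlib.sha256(body).hexdigest()


def store_if_new(url: str, body: bytes) -> bool:
    """Return True if the page is new and was recorded, False if duplicate."""
    digest = content_hash(body)
    if digest in seen_hashes:
        return False  # same content already stored under another URL
    seen_hashes.add(digest)
    return True


page = b"<html><body>Welcome to abc</body></html>"
print(store_if_new("https://www.abc.com", page))  # True
print(store_if_new("https://abc.com", page))      # False
```

An exact hash only catches byte-identical pages; pages that differ by a timestamp or an ad slot would need a similarity-based fingerprint instead.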
Looking out for spider traps - these are usually built to send the crawler into an infinite loop. To guard against them, we track a crawl depth for every domain. We increment the depth every time we follow a link from a fetched page to another page on that domain. Once the threshold depth has been reached, we stop crawling further links for that domain.
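One way to sketch the depth limit is a breadth-first crawl that carries a depth counter with each URL. The threshold MAX_DEPTH, the function crawl, and the link-extraction callback get_links are hypothetical names, and the trap below is a toy generator of endless links.

```python
from collections import deque

MAX_DEPTH = 3  # hypothetical threshold


def crawl(seed, get_links, max_depth=MAX_DEPTH):
    """Breadth-first crawl that stops following links past max_depth."""
    visited = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # threshold reached: do not follow links from this page
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


def trap(url):
    """Toy spider trap: every page links to one more, deeper page."""
    return [url + "/next"]


pages = crawl("https://abc.com", trap)
print(len(pages))  # 4: the seed plus three levels of links
```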
DNS lookups can become a bottleneck if we are using third-party DNS resolvers. Caching DNS results on the crawler reduces the number of requests made to the DNS providers.
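A minimal sketch of such a cache, assuming a fixed TTL per entry. The class DnsCache is hypothetical; the resolver defaults to socket.gethostbyname, but the example injects a fake resolver and clock so that it runs without network access.

```python
import socket
import time


class DnsCache:
    """Caches hostname -> IP address results for ttl_seconds."""

    def __init__(self, ttl_seconds=300.0, resolver=socket.gethostbyname,
                 clock=time.monotonic):
        self._ttl = ttl_seconds
        self._resolver = resolver
        self._clock = clock
        self._entries = {}  # host -> (address, expiry time)

    def resolve(self, host: str) -> str:
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit: no request to the DNS provider
        address = self._resolver(host)
        self._entries[host] = (address, now + self._ttl)
        return address


# Usage with a fake resolver and clock (both are illustrative assumptions).
lookups = []
fake_now = [0.0]


def fake_resolver(host):
    lookups.append(host)
    return "203.0.113.7"  # documentation-range address


cache = DnsCache(ttl_seconds=60, resolver=fake_resolver,
                 clock=lambda: fake_now[0])
cache.resolve("abc.com")
cache.resolve("abc.com")  # served from the cache
fake_now[0] = 61.0
cache.resolve("abc.com")  # entry expired, resolver is called again
print(len(lookups))       # 2
```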
We can run multiple geographically distributed crawl servers; the servers closer to the target hosts will be more efficient at fetching the URLs.
Implementing a URL scheduler that periodically re-queues URLs that have already been fetched, to check for any changes in data/content and update the database on our side.
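The scheduler could be sketched as a priority queue ordered by the time each URL is next due. The class RecrawlScheduler and the one-hour and one-day intervals are hypothetical; a real scheduler would likely adapt the interval to how often a page changes.

```python
import heapq


class RecrawlScheduler:
    """Keeps already-fetched URLs ordered by their next due time."""

    def __init__(self):
        self._heap = []  # entries are (due_at, url)

    def schedule(self, url: str, due_at: float) -> None:
        heapq.heappush(self._heap, (due_at, url))

    def due(self, now: float) -> list:
        """Pop and return every URL whose due time has passed."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[1])
        return ready


scheduler = RecrawlScheduler()
scheduler.schedule("https://abc.com/news", due_at=3600)    # recheck in 1 hour
scheduler.schedule("https://abc.com/about", due_at=86400)  # recheck in 1 day

ready_now = scheduler.due(now=4000)
print(ready_now)  # ['https://abc.com/news']
```

URLs returned by due would be pushed back onto the frontier queue and rescheduled after they are fetched again.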