Design A Web Crawler - System Design

System requirements

Functional:

Web crawling: initiate HTTP requests to fetch web pages.
Support different protocols (HTTP/HTTPS).
Support various content types.
Store the crawled data.
Crawl periodically on a schedule.
Respect each website's robots policy: follow the rules specified in robots.txt, including crawl-delay and disallow directives.
Prevent infinite loops.
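
As a rough illustration of the robots.txt requirement, the sketch below uses Python's standard urllib.robotparser to check a disallow rule and read a crawl-delay. The robots.txt body, the example.com URLs, and the ExampleCrawler user agent are made up for the example; a real crawler would fetch robots.txt from each host and cache it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body (assumption); normally fetched from the host.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

USER_AGENT = "ExampleCrawler"  # hypothetical crawler name (assumption)

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url: str) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


def crawl_delay_seconds() -> float:
    """Return the crawl-delay for this user agent, or 0.0 if none is set."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else 0.0


print(is_allowed("https://example.com/public/page"))
print(is_allowed("https://example.com/private/data"))
print(crawl_delay_seconds())
```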

Non-Functional:

Performance
Scalability
Robustness: handle errors and keep operating instead of getting stuck.
Throttling: make sure the crawler does not overwhelm (DDoS) website servers.

Capacity estimation

The system crawls 1 billion web pages per month.
Each web page has 10 media files on average.
The average media file size is 1 MB.
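
The numbers above translate into a crawl rate and a monthly media volume. The arithmetic below is a back-of-the-envelope sketch that assumes a 30-day month and decimal units (1 PB = 10^9 MB); HTML page size is left out because the estimate does not state it.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: 30-day month

pages_per_month = 1_000_000_000
media_files_per_page = 10
media_file_size_mb = 1

# Average crawl rate needed to sustain the monthly volume.
pages_per_second = pages_per_month / SECONDS_PER_MONTH

# Media downloaded per month, in MB and PB (assumption: 1 PB = 10**9 MB).
media_mb_per_month = pages_per_month * media_files_per_page * media_file_size_mb
media_pb_per_month = media_mb_per_month / 10**9

print(f"{pages_per_second:.0f} pages/second")             # 386 pages/second
print(f"{media_pb_per_month:.0f} PB of media per month")  # 10 PB of media per month
```

So the crawler needs to sustain roughly 400 pages per second on average and handle about 10 PB of media per month under these assumptions.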

API design

RESTful API

/api/addURL

/api/getNextURL

/api/crawlPage

/api/getCrawledData

/api/removeURL

/api/getMediaForURL

Database design

URL table: urlId, crawlId

CrawledData table: urlId, crawlId, metadata, cssContent, mediaId, htmlContent

Media table: mediaId, crawlId, mediaUrl, mediaType
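
One possible way to express these tables, sketched with SQLite from the Python standard library. Only the column names come from the design above; the column types and primary keys are assumptions, and a production crawler would more likely use a distributed store.

```python
import sqlite3

# Column names follow the tables above; types and keys are assumptions.
SCHEMA = """
CREATE TABLE URL (
    urlId   INTEGER PRIMARY KEY,
    crawlId INTEGER
);
CREATE TABLE CrawledData (
    urlId       INTEGER,
    crawlId     INTEGER,
    metadata    TEXT,
    cssContent  TEXT,
    mediaId     INTEGER,
    htmlContent TEXT
);
CREATE TABLE Media (
    mediaId   INTEGER PRIMARY KEY,
    crawlId   INTEGER,
    mediaUrl  TEXT,
    mediaType TEXT
);
"""

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.executescript(SCHEMA)

# Example: record one media file found during crawl 7 (made-up values).
conn.execute(
    "INSERT INTO Media VALUES (?, ?, ?, ?)",
    (1, 7, "https://example.com/logo.png", "image/png"),
)
print(conn.execute("SELECT mediaUrl, mediaType FROM Media WHERE crawlId = 7").fetchall())
```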

High-level design

Crawler controller.
URL queue: manages the queue of URLs to be crawled.
URL deduplication.

Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...

Detailed component design

Optimization Strategies:

Crawl Depth Control: Implementing crawl depth control allows the system to limit how many levels deep it follows links, preventing the crawler from going too deep into a website's hierarchy.
URL Deduplication: Ensuring that duplicate URLs are not crawled again helps avoid redundant crawling, optimizing the efficiency of the web crawler system.
Dynamic Throttling: Implementing dynamic rate limiting based on the server's response times helps prevent overloading servers and promotes fair and respectful crawling.
Distributing Requests: Requests to the same web server should be distributed across different crawler servers, with a gap between consecutive requests to the same server so that the target server is not flooded.
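
Crawl depth control and URL deduplication can be combined in a single traversal loop. The sketch below is a minimal in-memory version: normalize, crawl, get_links, and the LINKS graph are hypothetical names for illustration, and the "seen" set stands in for whatever shared store (for example a Bloom filter or a key-value store) a distributed crawler would use.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Canonicalize a URL so trivially different spellings compare equal."""
    url, _fragment = urldefrag(url)  # drop "#section" fragments
    parts = urlsplit(url)
    path = parts.path or "/"         # treat an empty path as "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def crawl(seed: str, get_links, max_depth: int) -> list:
    """Breadth-first crawl that skips seen URLs and stops at max_depth."""
    start = normalize(seed)
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in get_links(url):
            link = normalize(link)
            if link not in seen:  # deduplication
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Made-up link graph standing in for real page downloads.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://EXAMPLE.com/a#top", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/", "https://example.com/c"],
    "https://example.com/c": ["https://example.com/d"],
}

order = crawl("https://example.com", lambda u: LINKS.get(u, []), max_depth=2)
print(order)
```

With max_depth=2 the page /d is never fetched, and the cycle between / and /a does not cause a loop because both URLs are already in the seen set.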

Rate limiting strategies

Generally, a web crawler should avoid sending too many requests to the same hosting server within a short period. Sending too many requests is considered “impolite” and may even be treated as a denial-of-service (DoS) attack. For example, without any constraint, the crawler can send thousands of requests every second to the same website. This can overwhelm the web servers.
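
One way to keep the request rate in check is to adapt the per-host delay to how the server is responding, in the spirit of the dynamic throttling strategy listed earlier. The function below is only a heuristic sketch: next_delay, the thresholds, and the back-off factors are all assumptions, not a standard algorithm.

```python
def next_delay(
    current_delay: float,
    response_seconds: float,
    status_code: int,
    target_seconds: float = 0.5,  # assumed "healthy" response time
    min_delay: float = 1.0,       # assumed lower bound between requests
    max_delay: float = 60.0,      # assumed upper bound between requests
) -> float:
    """Return the delay to wait before the next request to the same host."""
    overloaded = status_code in (429, 503) or response_seconds > target_seconds
    if overloaded:
        new_delay = current_delay * 2.0  # back off quickly when the server struggles
    else:
        new_delay = current_delay * 0.9  # speed up slowly when it is healthy
    return max(min_delay, min(max_delay, new_delay))


print(next_delay(2.0, response_seconds=2.0, status_code=200))   # slow response: delay grows
print(next_delay(40.0, response_seconds=0.1, status_code=503))  # overloaded: capped at max_delay
```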

The general idea of enforcing politeness is to download one page at a time from the same host. A delay can be added between two download tasks. The politeness constraint is implemented by maintaining a mapping from website hostnames to download (worker) threads. Each downloader thread has a separate FIFO queue and only downloads URLs obtained from that queue.
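
The host-to-worker mapping described above can be sketched as follows. PoliteFrontier and its methods are hypothetical names; the current time is passed in explicitly so the example runs without real downloads or sleeping.

```python
import zlib
from collections import deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """Routes every URL of a host to one worker queue and enforces a delay."""

    def __init__(self, num_workers: int, delay_seconds: float):
        self.queues = [deque() for _ in range(num_workers)]  # one FIFO queue per worker
        self.delay_seconds = delay_seconds
        self.next_allowed = [0.0] * num_workers  # earliest next download time per worker

    def worker_for(self, url: str) -> int:
        host = urlsplit(url).hostname or ""
        # crc32 is stable across processes, unlike the built-in hash() for str.
        return zlib.crc32(host.encode("utf-8")) % len(self.queues)

    def add(self, url: str) -> None:
        self.queues[self.worker_for(url)].append(url)

    def next_url(self, worker: int, now: float):
        """Return the worker's next URL, or None if it is idle or must wait."""
        if not self.queues[worker] or now < self.next_allowed[worker]:
            return None
        self.next_allowed[worker] = now + self.delay_seconds
        return self.queues[worker].popleft()


frontier = PoliteFrontier(num_workers=4, delay_seconds=1.0)
frontier.add("https://example.com/a")
frontier.add("https://example.com/b")
worker = frontier.worker_for("https://example.com/a")
print(frontier.next_url(worker, now=0.0))  # https://example.com/a
print(frontier.next_url(worker, now=0.5))  # None: still inside the delay window
print(frontier.next_url(worker, now=1.0))  # https://example.com/b
```

Because a host always maps to the same worker and each worker downloads one page at a time, a host is never hit more often than once per delay. The trade-off in this simplified version is that unrelated hosts sharing a worker also wait for each other.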

How to maintain Freshness?

Web pages are constantly being added, deleted, and edited. A web crawler must periodically recrawl downloaded pages to keep its data set fresh. Recrawling all the URLs is time-consuming and resource-intensive. A few strategies to optimize freshness are listed as follows:

Recrawl based on web pages’ update history.
Prioritize URLs and recrawl important pages first and more frequently.
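
Recrawling based on update history could, for instance, shorten the recrawl interval of pages that changed since the last visit and lengthen it for pages that did not. The halving/doubling rule and the bounds below are assumptions chosen for illustration, not a prescribed policy.

```python
MIN_INTERVAL_HOURS = 1.0         # assumed lower bound
MAX_INTERVAL_HOURS = 24.0 * 30   # assumed upper bound (about one month)


def next_recrawl_interval(current_hours: float, changed: bool) -> float:
    """Halve the interval if the page changed, double it otherwise, within bounds."""
    factor = 0.5 if changed else 2.0
    return min(MAX_INTERVAL_HOURS, max(MIN_INTERVAL_HOURS, current_hours * factor))


print(next_recrawl_interval(24.0, changed=True))   # 12.0
print(next_recrawl_interval(24.0, changed=False))  # 48.0
```

A frequently edited page converges toward the minimum interval, while a static page drifts toward the maximum, so crawl capacity is spent where content actually changes.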

Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

Failure scenarios/bottlenecks

BFS and problems with BFS

BFS is commonly used by web crawlers and is implemented by a first-in-first-out (FIFO) queue. In a FIFO queue, URLs are dequeued in the order they are enqueued. However, this implementation has two problems:

Most links from the same web page point back to the same host. For example, most of the links on wikipedia.com are internal links, which keeps the crawler busy processing URLs from the same host (wikipedia.com). When the crawler tries to download web pages in parallel, the Wikipedia servers are flooded with requests. This is considered “impolite”.

Standard BFS does not consider the priority of a URL. The web is large and not every page has the same level of quality and importance. Therefore, we may want to prioritize URLs according to their page ranks, web traffic, update frequency, etc.
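
To address the priority problem, the plain FIFO queue can be replaced by a priority queue. The sketch below uses Python's heapq; PriorityFrontier is a hypothetical name, and the priority scores are assumed to come from some external ranking signal (page rank, traffic, update frequency) that is not modeled here.

```python
import heapq
import itertools


class PriorityFrontier:
    """URL frontier that returns the highest-priority URL first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO among equal priorities

    def push(self, url: str, priority: float) -> None:
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


frontier = PriorityFrontier()
frontier.push("https://example.com/rarely-visited", priority=0.2)
frontier.push("https://example.com/home", priority=0.9)
frontier.push("https://example.com/news", priority=0.9)

while frontier:
    print(frontier.pop())
# https://example.com/home
# https://example.com/news
# https://example.com/rarely-visited
```

A prioritized frontier like this only decides which URL comes next; it still needs per-host queues or delays on top to solve the politeness problem.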

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?