Design A Web Crawler - System Design

System requirements

9:00 -

Functional:

User gives the web crawler a list of URLs to start crawling from.
The purpose of this web crawler is to download and store static pages.
For now, we will focus on HTML pages. We will not store images or videos.

We do not know how many web sites the Web contains. Instead of defining the goal as "crawl the entire Web", we can define the goal as "crawl as many web sites in N days".

Non-Functional:

Reliability - fault tolerant
Scalability - horizontally scale the number of crawlers
Data durability - once we crawl the data, don't lose them.

Capacity estimation

Each web site, on average, has 100 pages.
Each page, on average, is 10KB HTML file.
We have 1000 crawler servers
We have 10 crawler processes running on each server.
We expect 1 crawler process to complete 1 page in 10 seconds.
That means, for each process, 6 * 60 * 24 * 7 = 60K pages per 7 days.
60K (pages) * 10 (processes) * 1000 (servers) = 600M pages in 7 days
600M * 10KB = 6TB of data
625 TB in 2 years.

API design

No API here. User gives a list of starting URLs to the system, perhaps as configuration.

Database design

These are the main things we need to store:

(1) HTML pages we crawled and downloaded. As discussed earlier, we expect this to be 625TB in 2 years. We need buffer for growth and backup, so this can easily be more than 1PB. Blob store and Cassandra come to mind as highly scalable data store. Between them, Blob Store is simpler, cheaper, and suitable for storing unstructured data such as a web page. Cassandra would be better if we write small pieces of data quickly (in a real time), but this use case does not have real time write requirement. This use case also requires rewriting the data. Web pages change often, so we may want to rewrite the data when a crawler re-visits the page after some days. Rewriting is more efficient in a blob store, compared to Cassandra (which is optimized for appending).

(2) Next URLs pages to crawl. This is essentially a queue for Breadth First Search. Unlike (1), this is transient data. We push URLs to the queue, and we pop URLs from the queue. Size would be much smaller. However, it is important that this is backed up. If we lose this data, we'd have to restart from parsing HTML pages, which would be costly. A message broker with persistent storage, such as Kafka, is a good fit.

(3) Metadata of visited URLs. The timestamp of when we visited the URL recently. If the visit was successful or not. This can be stored in RDB.

(4) Because we are traversing a uni-directional graph, it seems useful to store this graph, so that we can make graph queries (e.g. which pages can you go from this page? Which pages lead to this page under N hops?). A graph DB would be a good fit for this data.

High-level design

See the diagram for high-level design.

Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...

Fetcher starts by picking up a new URL in the Next URLs list.
Fetcher goes to the web site on the URL (1)
Fetcher downloads the HTML file, stores it in Blob Store, and write the arrival of the page in Message Queue. (2)
Parser pulls a message from the Message Queue.
Parser reads the associated HTML file from Blob Store.
Parser parses the HTML file. It pushes the next URLs to the Next URLs queue.

Detailed component design

Scalability

[Mid-level deep dive topic.]

Data stores, i.e., Blob Store, Message Queue, and Metadata Store, should all be partitioned for scalability. All the storages are keyed by URL. Therefore, a hash of URL would provide a good partitioning key.

Both Fetchers and Parsers are stateless services. We can run multiple copies of these services for scalability. Both Fetchers and Parsers dictate their own pace of working. When they have capacity, they pick up new tasks from the queue (Fetcher from Next URLs queue; Parser from downloaded text queue).

Trade offs/Tech choices

Traversal Algorithm

[Junior-level deep dive topic.]

Web Crawling is a graph traversal process. Each web page represents a node. Each hyper link represents an edge.

There are several graph traversal algorithm, the most common ones being Breadth First Search vs Depth First Search. Let's discuss these two algorithms.

If crawlers traverse the Web using BFS, crawlers would reach many different web pages, spanning across many different web servers, in a relatively short amount of time.

If crawlers use DFS, crawlers would explore deeper into a specific chain of URLs.

BFS makes more sense, as most crawlers' purpose is to discover and collect texts from wide variety of web pages.

It may be possible to use some priority to guide traversal. For example, prioritizing pages with higher PageRank score might improve the quality of texts we can gather.

Failure scenarios/bottlenecks

Let's look at some failure scenarios and mitigations one by one.

[Mid-level deep dive topic.]

[Fault Tolerance is an important aspect of any system design solution. The design usually calls for many commodity machines connected by commodity network (e.g. Ethernet). Commodity machines and networks have higher failure rate than special purpose hardware. Therefore, it is important to make sure the entire system keeps working, even if some nodes and networks fail. You can demonstrate how deeply you think about fault tolerance by pointing out potential, likely failure scenarios and mitigations.]

Fetcher fails to download from web site

It would be very common for a Fetcher to fail to download HTML files from the web site. The site may be down (temporarily or permanently), it may be too slow (causing a timeout), it may have an expired or invalid certificate, and do on.

Fetcher should retry, after some time. The website may recover after some time (e.g. becomes less busy, or a site operator fixes it). We can use a message queue for the retry. We can create a new message in a different topic for URLs to be retried. Have Fetcher pull from this queue. This message should include the number of times we have already retried (so that we can give up after N times), and the next retry time, using exponential backoff.

Fetcher Loops

Fetcher should not keep going to the same URLs over and over, as that would be unproductive. There is a risk, for example, two web pages link to each other.

We can use the Metadata Store to see if some Fetcher has visited the URL, and if yes, when. Fetcher should not visit a URL if it has been visited within N days. We can make this number configurable.

Parser fails to parse a web page

We expect Parser to fail to parse a web page occasionally. Web pages may contain ill-formatted HTML text, unsupported character code, or it may simply be too large.

It is less likely such pages would be fixed after time. Therefore, we would store (a) we could not parse this URL, and (b) the reason / error messages in the Metadata Store and an error log.

System operator can check the stored errors offline, and analyze if Parser should be enhanced to be able to parse such pages, or simply ignore these pages.

Blob Store cannot catch up to writes

If Fetcher is writing too rapidly for Blob Store, Fetcher may have to wait for Blob Store, decreasing the whole system's performance. Fetcher may even time out before write can complete.

We should implement monitoring and alerting on Blob Store's performance to detect such issues.

We can adjust Blob Store's configuration if this happens. One advantage we have is, the speed of writes can be controlled by the number of Fetcher processes. (Unlike a system where write speed is dictated by external sources, e.g., clients uploading huge files rapidly.). We can control the speed of writes by adjusting Fetchers (e.g. number of active Fetcher servers), and the write throughput (e.g. number of partitions and servers that make up Blob Store).

We could add one more Message Broker between Fetcher and Blob Store to absorb varying write speed by Fetchers. But this would complicate the architecture. We would prefer the approach mentioned above.

Future improvements

Including images and videos would be interesting. It would drastically increase the necessary scale and provide new challenges.

System requirements

9:00 -

Functional:

User gives the web crawler a list of URLs to start crawling from.
The purpose of this web crawler is to download and store static pages.
For now, we will focus on HTML pages. We will not store images or videos.

We do not know how many web sites the Web contains. Instead of defining the goal as "crawl the entire Web", we can define the goal as "crawl as many web sites in N days".

Non-Functional:

Reliability - fault tolerant
Scalability - horizontally scale the number of crawlers
Data durability - once we crawl the data, don't lose them.

Capacity estimation

Each web site, on average, has 100 pages.
Each page, on average, is 10KB HTML file.
We have 1000 crawler servers
We have 10 crawler processes running on each server.
We expect 1 crawler process to complete 1 page in 10 seconds.
That means, for each process, 6 * 60 * 24 * 7 = 60K pages per 7 days.
60K (pages) * 10 (processes) * 1000 (servers) = 600M pages in 7 days
600M * 10KB = 6TB of data
625 TB in 2 years.

API design

No API here. User gives a list of starting URLs to the system, perhaps as configuration.

Database design

These are the main things we need to store:

(3) Metadata of visited URLs. The timestamp of when we visited the URL recently. If the visit was successful or not. This can be stored in RDB.

High-level design

See the diagram for high-level design.

Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...

Fetcher starts by picking up a new URL in the Next URLs list.
Fetcher goes to the web site on the URL (1)
Fetcher downloads the HTML file, stores it in Blob Store, and write the arrival of the page in Message Queue. (2)
Parser pulls a message from the Message Queue.
Parser reads the associated HTML file from Blob Store.
Parser parses the HTML file. It pushes the next URLs to the Next URLs queue.

Detailed component design

Scalability

[Mid-level deep dive topic.]

Trade offs/Tech choices

Traversal Algorithm

[Junior-level deep dive topic.]

Web Crawling is a graph traversal process. Each web page represents a node. Each hyper link represents an edge.

There are several graph traversal algorithm, the most common ones being Breadth First Search vs Depth First Search. Let's discuss these two algorithms.

If crawlers traverse the Web using BFS, crawlers would reach many different web pages, spanning across many different web servers, in a relatively short amount of time.

If crawlers use DFS, crawlers would explore deeper into a specific chain of URLs.

BFS makes more sense, as most crawlers' purpose is to discover and collect texts from wide variety of web pages.

It may be possible to use some priority to guide traversal. For example, prioritizing pages with higher PageRank score might improve the quality of texts we can gather.

Failure scenarios/bottlenecks

Let's look at some failure scenarios and mitigations one by one.

[Mid-level deep dive topic.]

Fetcher fails to download from web site

Fetcher Loops

Fetcher should not keep going to the same URLs over and over, as that would be unproductive. There is a risk, for example, two web pages link to each other.

We can use the Metadata Store to see if some Fetcher has visited the URL, and if yes, when. Fetcher should not visit a URL if it has been visited within N days. We can make this number configurable.

Parser fails to parse a web page

We expect Parser to fail to parse a web page occasionally. Web pages may contain ill-formatted HTML text, unsupported character code, or it may simply be too large.

It is less likely such pages would be fixed after time. Therefore, we would store (a) we could not parse this URL, and (b) the reason / error messages in the Metadata Store and an error log.

System operator can check the stored errors offline, and analyze if Parser should be enhanced to be able to parse such pages, or simply ignore these pages.

Blob Store cannot catch up to writes

If Fetcher is writing too rapidly for Blob Store, Fetcher may have to wait for Blob Store, decreasing the whole system's performance. Fetcher may even time out before write can complete.

We should implement monitoring and alerting on Blob Store's performance to detect such issues.

Future improvements

Including images and videos would be interesting. It would drastically increase the necessary scale and provide new challenges.