9:00 -
We do not know how many web sites the Web contains. Instead of defining the goal as "crawl the entire Web", we can define the goal as "crawl as many web sites in N days".
No API here. User gives a list of starting URLs to the system, perhaps as configuration.
These are the main things we need to store:
(1) HTML pages we crawled and downloaded. As discussed earlier, we expect this to be 625TB in 2 years. We need buffer for growth and backup, so this can easily be more than 1PB. Blob store and Cassandra come to mind as highly scalable data store. Between them, Blob Store is simpler, cheaper, and suitable for storing unstructured data such as a web page. Cassandra would be better if we write small pieces of data quickly (in a real time), but this use case does not have real time write requirement. This use case also requires rewriting the data. Web pages change often, so we may want to rewrite the data when a crawler re-visits the page after some days. Rewriting is more efficient in a blob store, compared to Cassandra (which is optimized for appending).
(2) Next URLs pages to crawl. This is essentially a queue for Breadth First Search. Unlike (1), this is transient data. We push URLs to the queue, and we pop URLs from the queue. Size would be much smaller. However, it is important that this is backed up. If we lose this data, we'd have to restart from parsing HTML pages, which would be costly. A message broker with persistent storage, such as Kafka, is a good fit.
(3) Metadata of visited URLs. The timestamp of when we visited the URL recently. If the visit was successful or not. This can be stored in RDB.
(4) Because we are traversing a uni-directional graph, it seems useful to store this graph, so that we can make graph queries (e.g. which pages can you go from this page? Which pages lead to this page under N hops?). A graph DB would be a good fit for this data.
See the diagram for high-level design.
Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...
[Mid-level deep dive topic.]
Data stores, i.e., Blob Store, Message Queue, and Metadata Store, should all be partitioned for scalability. All the storages are keyed by URL. Therefore, a hash of URL would provide a good partitioning key.
Both Fetchers and Parsers are stateless services. We can run multiple copies of these services for scalability. Both Fetchers and Parsers dictate their own pace of working. When they have capacity, they pick up new tasks from the queue (Fetcher from Next URLs queue; Parser from downloaded text queue).
[Junior-level deep dive topic.]
Web Crawling is a graph traversal process. Each web page represents a node. Each hyper link represents an edge.
There are several graph traversal algorithm, the most common ones being Breadth First Search vs Depth First Search. Let's discuss these two algorithms.
If crawlers traverse the Web using BFS, crawlers would reach many different web pages, spanning across many different web servers, in a relatively short amount of time.
If crawlers use DFS, crawlers would explore deeper into a specific chain of URLs.
BFS makes more sense, as most crawlers' purpose is to discover and collect texts from wide variety of web pages.
It may be possible to use some priority to guide traversal. For example, prioritizing pages with higher PageRank score might improve the quality of texts we can gather.
Let's look at some failure scenarios and mitigations one by one.
[Mid-level deep dive topic.]
[Fault Tolerance is an important aspect of any system design solution. The design usually calls for many commodity machines connected by commodity network (e.g. Ethernet). Commodity machines and networks have higher failure rate than special purpose hardware. Therefore, it is important to make sure the entire system keeps working, even if some nodes and networks fail. You can demonstrate how deeply you think about fault tolerance by pointing out potential, likely failure scenarios and mitigations.]
It would be very common for a Fetcher to fail to download HTML files from the web site. The site may be down (temporarily or permanently), it may be too slow (causing a timeout), it may have an expired or invalid certificate, and do on.
Fetcher should retry, after some time. The website may recover after some time (e.g. becomes less busy, or a site operator fixes it). We can use a message queue for the retry. We can create a new message in a different topic for URLs to be retried. Have Fetcher pull from this queue. This message should include the number of times we have already retried (so that we can give up after N times), and the next retry time, using exponential backoff.
Fetcher should not keep going to the same URLs over and over, as that would be unproductive. There is a risk, for example, two web pages link to each other.
We can use the Metadata Store to see if some Fetcher has visited the URL, and if yes, when. Fetcher should not visit a URL if it has been visited within N days. We can make this number configurable.
We expect Parser to fail to parse a web page occasionally. Web pages may contain ill-formatted HTML text, unsupported character code, or it may simply be too large.
It is less likely such pages would be fixed after time. Therefore, we would store (a) we could not parse this URL, and (b) the reason / error messages in the Metadata Store and an error log.
System operator can check the stored errors offline, and analyze if Parser should be enhanced to be able to parse such pages, or simply ignore these pages.
If Fetcher is writing too rapidly for Blob Store, Fetcher may have to wait for Blob Store, decreasing the whole system's performance. Fetcher may even time out before write can complete.
We should implement monitoring and alerting on Blob Store's performance to detect such issues.
We can adjust Blob Store's configuration if this happens. One advantage we have is, the speed of writes can be controlled by the number of Fetcher processes. (Unlike a system where write speed is dictated by external sources, e.g., clients uploading huge files rapidly.). We can control the speed of writes by adjusting Fetchers (e.g. number of active Fetcher servers), and the write throughput (e.g. number of partitions and servers that make up Blob Store).
We could add one more Message Broker between Fetcher and Blob Store to absorb varying write speed by Fetchers. But this would complicate the architecture. We would prefer the approach mentioned above.
Including images and videos would be interesting. It would drastically increase the necessary scale and provide new challenges.