Design A Web Crawler - System Design

System requirements

Functional:

visit web page
download web page content
store downloaded contents
extract links from downloaded contents
dedupe already visited web pages
repeat steps 1-5

Non-Functional:

do not stress a single site too much
do not download content that a site's robots.txt disallows
visit web pages by popularity and/or importance

Capacity estimation

Assume we download 100 million pages in a week. That is about 165 pages per second, which we round up to 170. Say each page is 1 MB on average.

Read and write QPS = 170

Write / Read throughput = 170 * 1MB = 170 MB / second

Data = 100 million * 1 MB = 100 TB per week

Say we do web archiving for 10 yrs. This is 100 TB * 52 weeks * 10 yrs = 52 PB
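The arithmetic above can be checked with a few lines of Python. This is only a back-of-envelope sketch; the constant names are made up for illustration, and decimal units (1 TB = 1,000,000 MB) are assumed.

```python
# Back-of-envelope capacity estimate using the figures assumed in this section.
PAGES_PER_WEEK = 100_000_000
PAGE_SIZE_MB = 1                      # assumed average page size
SECONDS_PER_WEEK = 7 * 24 * 60 * 60   # 604,800

pages_per_second = PAGES_PER_WEEK / SECONDS_PER_WEEK
throughput_mb_per_s = pages_per_second * PAGE_SIZE_MB
weekly_tb = PAGES_PER_WEEK * PAGE_SIZE_MB / 1_000_000   # MB -> TB
ten_year_pb = weekly_tb * 52 * 10 / 1_000               # TB -> PB

print(f"QPS        ~ {pages_per_second:.0f} pages/s")
print(f"Throughput ~ {throughput_mb_per_s:.0f} MB/s")
print(f"Weekly     = {weekly_tb:.0f} TB")
print(f"10 years   = {ten_year_pb:.0f} PB")
```

The exact rate comes out slightly below 170 pages per second, so the rounded figure used above is a mildly conservative estimate.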

API design

Define what APIs are expected from the system...

Database design

The data is too large to fit into a single relational database (RDB), and both read and write throughput are high, so we use NoSQL.

We should separate metadata (texts) and media (blobs). Texts may need to be searchable and to appear in snapshots, and we also need to extract links from them. A document DB such as MongoDB may be a good option. Because the data scale is large, we need partitioning, which also helps indexes fit into memory for fast lookups and writes.

Blobs are accessed less frequently (in our case, only when a user retrieves an archive), so we can choose an object store such as AWS S3. S3 scales out easily and supports tiering, which lets us move old content to cold storage.

High-level design

See high level diagram

Request flows

Query web rank services to get website ranking
Get a list of seed urls
Start with highest ranked urls, send requests to job queues
For each queue, take the first url and check whether it was visited before.
If visited, discard and continue to next url
If not visited, download content
As an async job, store metadata and blobs to storage
Extract links from the downloaded content
Put links back to the urls pool
Repeat from step 3
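The flow above can be sketched as a single in-memory loop. This is a simplified illustration, not the distributed design: `crawl`, `fetch`, `store`, and `LinkExtractor` are hypothetical names, the priority heap stands in for the ranked job queues, and storing is done inline rather than as an async job. Newly found links are simply given a rank one lower than their parent, which is an assumption of this sketch.

```python
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, store, max_pages=100):
    """seeds: iterable of (rank, url); a lower rank is crawled first.
    fetch(url) returns HTML text and store(url, html) persists it.
    Both callables are supplied by the caller (hypothetical interfaces)."""
    frontier = list(seeds)
    heapq.heapify(frontier)
    visited = set()
    while frontier and len(visited) < max_pages:
        rank, url = heapq.heappop(frontier)
        if url in visited:            # dedupe: already visited, discard
            continue
        visited.add(url)
        html = fetch(url)             # download content
        store(url, html)              # the full design does this asynchronously
        parser = LinkExtractor()
        parser.feed(html)             # extract links
        for href in parser.links:
            link = urljoin(url, href)
            if link not in visited:   # put links back into the url pool
                heapq.heappush(frontier, (rank + 1, link))
    return visited


# Demo with a fake three-page web instead of real HTTP requests.
PAGES = {
    "http://a.example/": '<a href="/x">x</a><a href="http://b.example/">b</a>',
    "http://a.example/x": '<a href="/">home</a>',
    "http://b.example/": "",
}
stored = {}
visited = crawl([(0, "http://a.example/")], lambda u: PAGES.get(u, ""), stored.__setitem__)
print(sorted(visited))
```

Passing `fetch` and `store` in as callables keeps the loop testable without network access; a real downloader and storage client would be swapped in behind the same two calls.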

Detailed component design

We use queues to make operations asynchronous, so that critical tasks are not blocked. The job queues keep link extraction, scheduling, and downloading from blocking each other. The data queues do the same for link extraction and content storing.
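As a rough illustration of this decoupling, the sketch below wires a downloader to two consumers through in-process queues. It assumes Python's standard `queue` and `threading` modules stand in for a real message broker; `run_pipeline` and the worker names are hypothetical, and the "download" is a placeholder string rather than an HTTP request.

```python
import queue
import threading

STOP = None  # sentinel that tells a worker to exit


def run_pipeline(urls):
    job_q = queue.Queue()      # job queue: urls waiting to be downloaded
    store_q = queue.Queue()    # data queue: pages waiting to be stored
    extract_q = queue.Queue()  # data queue: pages waiting for link extraction
    stored, extracted = [], []

    def downloader():
        while (url := job_q.get()) is not STOP:
            page = (url, f"<html>content of {url}</html>")  # placeholder fetch
            store_q.put(page)      # hand off and move on; never wait for consumers
            extract_q.put(page)
        store_q.put(STOP)
        extract_q.put(STOP)

    def drain(q, sink):
        while (item := q.get()) is not STOP:
            sink.append(item[0])   # a real worker would store or parse the page

    workers = [
        threading.Thread(target=downloader),
        threading.Thread(target=drain, args=(store_q, stored)),
        threading.Thread(target=drain, args=(extract_q, extracted)),
    ]
    for w in workers:
        w.start()
    for url in urls:
        job_q.put(url)
    job_q.put(STOP)
    for w in workers:
        w.join()
    return stored, extracted


stored, extracted = run_pipeline(["http://a.example/", "http://b.example/"])
print(stored, extracted)
```

Because each stage only talks to a queue, a slow storage write delays the storing worker alone; the downloader and the link extractor keep going.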

We get page rankings first, so that we can crawl the most popular sites first. Within each site, we could crawl depth-first (DFS), but this concentrates traffic on a single site. The scheduler should therefore apply a rate-limiting algorithm on a per-site basis. If a site has already been visited many times, new links from it should be penalized and thus ranked lower.
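One possible shape for such a scheduler is sketched below. It is an assumption-laden illustration: `PoliteScheduler`, `min_delay`, and `penalty` are invented names, the penalty is a simple linear function of prior visits to the domain, and the current time is passed in explicitly so the behaviour is easy to test.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlsplit


class PoliteScheduler:
    def __init__(self, min_delay=1.0, penalty=0.1):
        self.min_delay = min_delay           # assumed seconds between hits to one domain
        self.penalty = penalty               # assumed score penalty per prior visit
        self.heap = []                       # (score, seq, url); lowest score first
        self.visits = defaultdict(int)       # domain -> number of downloads so far
        self.next_ok = defaultdict(float)    # domain -> earliest allowed time
        self.seq = 0                         # tie-breaker that keeps insertion order

    def add(self, url, rank):
        domain = urlsplit(url).netloc
        # Links from heavily visited domains get a worse (higher) score.
        score = rank + self.penalty * self.visits[domain]
        heapq.heappush(self.heap, (score, self.seq, url))
        self.seq += 1

    def next_url(self, now):
        """Return the best url whose domain is not rate limited at `now`, else None."""
        deferred, chosen = [], None
        while self.heap:
            item = heapq.heappop(self.heap)
            domain = urlsplit(item[2]).netloc
            if now >= self.next_ok[domain]:
                chosen = item[2]
                self.visits[domain] += 1
                self.next_ok[domain] = now + self.min_delay
                break
            deferred.append(item)            # domain is cooling down, try the next url
        for item in deferred:
            heapq.heappush(self.heap, item)
        return chosen


s = PoliteScheduler(min_delay=1.0, penalty=0.5)
s.add("http://a.example/1", rank=0)
s.add("http://a.example/2", rank=0)
s.add("http://b.example/1", rank=1)
print(s.next_url(now=0.0))  # a.example is first by rank
print(s.next_url(now=0.0))  # a.example is cooling down, so b.example is served
```

In a live crawler `now` would typically come from `time.monotonic()`. Scanning past rate-limited entries is linear in the worst case, so a production scheduler would more likely keep one queue per domain.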

On politeness, in addition to rate limiting, we should also respect robots.txt. Paths that a site disallows shall not be added to our jobs.
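A minimal sketch of this check, using `urllib.robotparser` from Python's standard library. The robots.txt content, the `example-crawler` user agent, and the `build_filter` helper are all made up for illustration; a real crawler would fetch and cache each site's robots.txt instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for one site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_filter(robots_text, agent="example-crawler"):
    """Return (allowed, crawl_delay) for one site's robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())

    def allowed(url):
        return parser.can_fetch(agent, url)

    return allowed, parser.crawl_delay(agent)


allowed, delay = build_filter(ROBOTS_TXT)
candidates = ["http://a.example/index.html", "http://a.example/private/report"]
jobs = [url for url in candidates if allowed(url)]   # only permitted urls become jobs
print(jobs, delay)
```

Filtering before enqueueing keeps disallowed urls out of the job queues entirely, and the crawl delay, when a site declares one, can feed the per-site rate limit.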

For the DB, we keep replicas to prevent data loss. For partitioning, we can shard by site domain, because users typically retrieve archives on a per-domain basis.

Trade offs/Tech choices

See previous sections.

Failure scenarios/bottlenecks

Our scheduling service may become a bottleneck because of the sheer number of links and the computational cost of page ranking and rate-limit penalization. We can run many scheduler replicas, each handling a subset of domains. Caveats:

Schedulers need to know which domains the other schedulers are responsible for, so that they can forward jobs that don't belong to them.
We could use consistent hashing on domain names. But note that some domains have far more content than others. We may need additional site statistics to distribute traffic more evenly.
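A minimal sketch of consistent hashing over domain names follows. `HashRing`, the scheduler names, and the choice of 100 virtual nodes per scheduler are illustrative assumptions. The sketch does not address the uneven-domain caveat; giving a scheduler more or fewer virtual nodes based on site statistics would be one way to approach it.

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each scheduler is placed on the ring at `vnodes` pseudo-random points.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def owner(self, domain):
        """Scheduler at the first ring point clockwise from the domain's hash."""
        idx = bisect.bisect(self.keys, self._hash(domain)) % len(self.ring)
        return self.ring[idx][1]


ring = HashRing(["scheduler-1", "scheduler-2", "scheduler-3"])
for domain in ["a.example", "b.example", "c.example"]:
    print(domain, "->", ring.owner(domain))
```

Every scheduler that builds the ring from the same node list computes the same owner for a domain, so any of them can forward a job without a central lookup. When a scheduler is added, only the domains that land on its new ring points change owner.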

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?