Design A Web Crawler - System Design

System requirements

Functional:

start from a seed url, crawl subsequent urls and store content in database
for duplicate url, only the first time we crawl it. Subsequent visits are treated as duplicates and ignored
respect robots.txt and other rules, only crawl contents are allowed

Non-Functional:

many contents are dynamic. Assume we crawl on a weekly basis
assume we keep one year of history

Capacity estimation

Assume we'll get 1 B pages per crawl schedule. Assume an average page has 20 KB in text. 20% of pages have media, and average media size is 10 MB.

One crawl cycle = 1 B * (1* 20 KB + 20% * 10 MB) = 20 TB text and 2 PB media

one year of history = * 52 = 1 PB text and 100 PB media

20 TB / 86400 / 7 = 33 MB per second.

We need multiple servers to crawl and multiple DB nodes to store data.

API design

int get() to download url html content. Returns http status code

list extract() to extract media and urls from the content. Returns a list of adjacent urls and media

int download() to download media. Returns status code

int write(, ) to write hypertext of the url to database. Returns status code

Database design

Separate media and hypertext. Media requires extra tags to support indexing and searching. They also require thumbnails for different resolutions and frame rates for preview and tiered access, if applicable. Hypertext can be reverse indexed to support searching. Store media in blob DB and hypertext in document DB, e.g., MongoDB.

We keep history of previous crawls. Depend on how frequent we need to visit history, we use tiered storage for current WIP crawl and historical crawls. For instance, older-than-a-month crawls may be moved to cold storage to save cost.

High-level design

See diagram

Request flows

Add seed URL to job queue and start
URL fetcher to fetch the next available URL from the job queue, and checks against dedup cache if it was visited. If yes, skip it and fetch the next URL; if not, add to cache and go download the web page.
URL extracter to extract adjacent URLs in the web page content, and also find all media links. It adds new URLs to the job queue, add hypertext content to the hypertext queue, and add media links to the media queue.
Hypertext worker grabs hypertext content from the hypertext queue, and write to document DB.
Media worker grabs media link from the media queue, and download the media blob.
Media processor gets downloaded media and process thumbnails, frame rates, etc. It writes the medias to blob DB.
Archiving service runs in background to archive older content history to cold storage.

Detailed component design

URL Fetcher, Extracter; Hypertext Writer; Media Downloader, Processor

We need multiple nodes for each of these services. Note that one physical node may perform multiple tasks in parallel because of multi-core, as long as network bandwidth and disk throughput allow. This will improve overall crawling throughput.

URL queue, Hypertext queue, Media queue

As the service nodes are multiple, these queues need to be MPMC (multi producer multi consumer) queues. They don't have to follow strict ordering as it is not important if one url must be fetched before or after another specific url. However, either a lock shall be present or a lockfree design shall be used, to avoid different nodes writing or reading the same slot in the queue.

DB partitions

DB content size is huge, we partition into multiple DB nodes. Such a partition may be based on the base url domain, or checksum of the content, or system-generated uid of the content, etc.

Trade offs/Tech choices

Queues

We may need a leader-follower pattern, in case the main queue fails and we lose data. The follower queue is not accessed by service nodes, but is self replicating the main queue's content.

DB partitions

If we partition by url base domains, certain partition may be excessive in size because the website scope is huge (e.g. wikipedia), and accessing content from the domain will result in a hot node. The pros is that all content from the same domain are stored together, faster and easier for indexing and searching. If we partition by checksum, random distribution of content is achieved and less likely we will have a hot node. However, indexing and searching content from the same domain will require joining results from multiple nodes, which is extra latency and implementation complexity.

Service nodes

Similar pros and cons regarding we have leader-follower patterns in service nodes, such as URL fetcher. In addition, downloading many contents from the same domain may overstress that domain server, causing the domain slow to respond to regular internet traffic or even bring it down. A mitigation is that we may start with a list of seed URLs and let each service node work on different domains.

Failure scenarios/bottlenecks

As mentioned above, we have replicas and leader-follower patterns almost every place in the design. If one node fails, a follower node will be elected as the new leader.

The media downloader and processor may be throughput bottleneck, as it requires a lot of network bandwidth (for downloader) and computing power (processor). This may make the media queue become full. In such case, we could introduce more media service nodes for larger parallelism.

Future improvements

We may want to decide which URLs to download first, depending on site ranking or other factors.

We may want to finish crawling more quickly, in a day or two, in which case we need to horizontally scale our system much more. However, this means we also stress the internet more. Make sure we should find a balance before doing so.

System requirements

Functional:

start from a seed url, crawl subsequent urls and store content in database
for duplicate url, only the first time we crawl it. Subsequent visits are treated as duplicates and ignored
respect robots.txt and other rules, only crawl contents are allowed

Non-Functional:

many contents are dynamic. Assume we crawl on a weekly basis
assume we keep one year of history

Capacity estimation

Assume we'll get 1 B pages per crawl schedule. Assume an average page has 20 KB in text. 20% of pages have media, and average media size is 10 MB.

One crawl cycle = 1 B * (1* 20 KB + 20% * 10 MB) = 20 TB text and 2 PB media

one year of history = * 52 = 1 PB text and 100 PB media

20 TB / 86400 / 7 = 33 MB per second.

We need multiple servers to crawl and multiple DB nodes to store data.

API design

int get() to download url html content. Returns http status code

list extract() to extract media and urls from the content. Returns a list of adjacent urls and media

int download() to download media. Returns status code

int write(, ) to write hypertext of the url to database. Returns status code

Database design

High-level design

See diagram

Request flows

Add seed URL to job queue and start
URL fetcher to fetch the next available URL from the job queue, and checks against dedup cache if it was visited. If yes, skip it and fetch the next URL; if not, add to cache and go download the web page.
URL extracter to extract adjacent URLs in the web page content, and also find all media links. It adds new URLs to the job queue, add hypertext content to the hypertext queue, and add media links to the media queue.
Hypertext worker grabs hypertext content from the hypertext queue, and write to document DB.
Media worker grabs media link from the media queue, and download the media blob.
Media processor gets downloaded media and process thumbnails, frame rates, etc. It writes the medias to blob DB.
Archiving service runs in background to archive older content history to cold storage.

Detailed component design

URL Fetcher, Extracter; Hypertext Writer; Media Downloader, Processor

URL queue, Hypertext queue, Media queue

DB partitions

DB content size is huge, we partition into multiple DB nodes. Such a partition may be based on the base url domain, or checksum of the content, or system-generated uid of the content, etc.

Trade offs/Tech choices

Queues

We may need a leader-follower pattern, in case the main queue fails and we lose data. The follower queue is not accessed by service nodes, but is self replicating the main queue's content.

DB partitions

Service nodes

Failure scenarios/bottlenecks

As mentioned above, we have replicas and leader-follower patterns almost every place in the design. If one node fails, a follower node will be elected as the new leader.

Future improvements

We may want to decide which URLs to download first, depending on site ranking or other factors.