Design A Web Crawler - System Design

System requirements

Functional:

MVP:

Given seed URLs, be able to recursively crawl the corresponding webpages and their contained URLs

Seed URLs can be provided with APIs

Users can specify the depth of extraction

the system will focus on the static contents, like the HTML

Content will be stored up to 5 years

Non-Functional:

scalability: should be able to scale up

performance: the system should be efficient

durability: extracted data should be stored

politeness: the system should respect website's crawl policy

Capacity estimation

1 billion pages crawled per day

1 000 000 000 / 100000 = 10000 urls / sec

Storage

100kb content per page

1 000 000 000 * 100 kb = 100 TB / day

API design

submit_seed_url( seed_urls, is_repeated, max_depth_allowed)

is_repeated is an optional field, if specified, we will treat it as a scheduled job. By default we can crawl such url once per day.

max_depth_allowed is an optional field, it tells system how deep we want to crawl this given url

Database design

crawl task table

task_id

seed_id

scheduled_runtime

crawl metadata table (raw and clean)

url

s3_path

created_timestamp

High-level design

Users/system can submit crawl job by calling the submit_sed_url() API

If it is repeated crawl job, the submission service will create an entry in the crawl task table with scheduled_runtime, and the task scheduler will regularly scan the table and create crawl tasks

If it is an instance crawl job, the submission service will send the request to the preprocessing service

preprocessing service is responsible for dedupe, prioritization, or skip urls based on blocklist

dedupe means we dont want to visit duplicate url within a same time period, dedupe can be based on bloom filter, prioritization means we can assign each task an priority, higher priority task will be executed first. URLs that exceeds the depth will also be filtered or the parser will not create new crawl job for those URLs that already reach the depth

After preprocessing, the preprocessing service will create a crawl job message and send it to a queue service.

The message queue contains information like url, priority, depth

We can have multiple queues for different priorities. Queue with higher priority will be processed with priority

On the other side of queue, we have crawlers that consume those messages and do the actual crawl job

The crawled data will be firstly stored in the raw data storage

The crawler will also create a message in a downstream queue. The message includes url and the path to raw data

On the other side of queue, we have parsers that validate the downloaded content, extract urls, then save the cleaned data in the cleaned data storage.

Parsers will also generate a list of urls from the crawled data and use submit_seed_url() API to submit new crawl jobs

To not overload a webpage, we will record the domain metadata in a separate table. That table will has the info when is the last extraction for a given domain. so when we create the job, we can have control the rate that we want to extract that domain.

This system is stateless, so we can horizontally scale the system by adding more service instances or workers to the system

The queue service can be Google PubSub, when a message is being processed, it will only be consumed by one worker. If the process is succeeded, the consumer will ack the message.

Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...

Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...

Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?