Design A Web Crawler - System Design

System requirements

Functional:

Discover and download content from the global internet

Non-Functional:

Scalability: huge number of websites to be crawled

Reliability: the crawler must tolerate malicious links and noisy content

Capacity estimation

Assume we have 1B pages to crawl each week, each page is about 100KB, and the retention period is 3 years.

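About 1.6k write QPS on average (1B pages / 604,800 seconds per week), with a peak of about 3k QPS
About 100TB each week, which is about 5.2PB each year and about 15.6PB over the 3-year retention period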
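
The estimate can be sanity-checked with a few lines of arithmetic. This is a back-of-envelope sketch: decimal units (1KB = 1,000 bytes), 52 weeks per year, and a peak of roughly 2x the average are assumptions made here, not figures from the requirements.

```python
# Back-of-envelope capacity check under the stated assumptions.
PAGES_PER_WEEK = 1_000_000_000
PAGE_SIZE_BYTES = 100 * 1_000      # 100KB per page, decimal units (assumption)
RETENTION_YEARS = 3
PEAK_FACTOR = 2                    # assumption: peak traffic is about 2x average

SECONDS_PER_WEEK = 7 * 24 * 3600   # 604,800
WEEKS_PER_YEAR = 52

avg_qps = PAGES_PER_WEEK / SECONDS_PER_WEEK
peak_qps = avg_qps * PEAK_FACTOR
bytes_per_week = PAGES_PER_WEEK * PAGE_SIZE_BYTES
pb_per_year = bytes_per_week * WEEKS_PER_YEAR / 1e15
pb_retained = pb_per_year * RETENTION_YEARS

print(f"average write QPS: {avg_qps:,.0f}")
print(f"peak write QPS:    {peak_qps:,.0f}")
print(f"storage per week:  {bytes_per_week / 1e12:,.0f} TB")
print(f"storage per year:  {pb_per_year:.1f} PB")
print(f"storage retained:  {pb_retained:.1f} PB over {RETENTION_YEARS} years")
```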

API design

For this problem, API design isn't critical

Database design


High-level design

We mainly rely on BFS to discover websites across the internet

Request flows

URL seeds are added to the URL frontier
The URL frontier is a BFS queue that holds all the candidate URLs
The URL downloader downloads the HTML content for the URLs handed out by the URL frontier
The content parser is responsible for parsing the content
The content de-dup/store stores the content if it is new
The URL parser extracts the new URLs from the new content
The URL de-dup/store stores a URL if it is new
New URLs are added back to the frontier
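
The flow above can be sketched as a single-process loop. This is a minimal illustration only: `FAKE_WEB`, `HREF_RE`, and `crawl` are hypothetical names, the "web" is an in-memory dict standing in for real HTTP downloads, and a regex stands in for a real HTML parser.

```python
from collections import deque
import hashlib
import re

# Hypothetical stand-in for the web: URL -> HTML. A real downloader would issue HTTP requests.
FAKE_WEB = {
    "http://a.example/": '<a href="http://a.example/1">1</a><a href="http://b.example/">b</a>',
    "http://a.example/1": '<a href="http://a.example/">home</a>',
    "http://b.example/": '<a href="http://a.example/1">1</a>',
    "http://c.example/": '<a href="http://a.example/1">1</a>',  # same content as b.example
}

HREF_RE = re.compile(r'href="([^"]+)"')  # simplistic link extraction, for illustration only


def crawl(seeds):
    frontier = deque(seeds)   # URL frontier: a FIFO queue gives BFS order
    seen_urls = set(seeds)    # URL de-dup store
    seen_content = set()      # content de-dup store (hashes of page bodies)
    stored = {}               # content store: URL -> HTML
    while frontier:
        url = frontier.popleft()
        html = FAKE_WEB.get(url)              # downloader
        if html is None:
            continue                          # unreachable page: skip it
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_content:            # content de-dup
            continue
        seen_content.add(digest)
        stored[url] = html
        for link in HREF_RE.findall(html):    # URL parser
            if link not in seen_urls:         # URL de-dup
                seen_urls.add(link)
                frontier.append(link)         # back into the frontier
    return stored


pages = crawl(["http://a.example/", "http://c.example/"])
print(list(pages))
```

In this toy data, `http://b.example/` is downloaded but not stored, because its content is identical to the already stored `http://c.example/`.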

Detailed component design

URL frontier

Politeness

We need to avoid being mistaken for a DDoS attack by the host. All URLs from one host are mapped to the same queue (N --> 1), and each queue is assigned to one worker thread for downloading. The worker thread tracks whether there is an ongoing request to that host. If there is, it pushes the URL to the end of the queue.
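
A minimal sketch of this host-to-queue mapping, assuming a fixed number of queues and a stable hash of the host name. `PoliteFrontier` and its methods are hypothetical names for illustration, and the worker threads themselves are left out.

```python
from collections import deque
from urllib.parse import urlsplit
import zlib


def host_of(url):
    return urlsplit(url).hostname or ""


class PoliteFrontier:
    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]
        self.in_flight = set()  # hosts that currently have an ongoing request

    def _index(self, host):
        # crc32 is stable across runs, unlike the built-in hash() for strings
        return zlib.crc32(host.encode("utf-8")) % len(self.queues)

    def add(self, url):
        # every URL of a given host lands in the same queue
        self.queues[self._index(host_of(url))].append(url)

    def next_url(self, queue_id):
        """Return the next URL whose host is idle, or None if every host is busy."""
        queue = self.queues[queue_id]
        for _ in range(len(queue)):
            url = queue.popleft()
            if host_of(url) in self.in_flight:
                queue.append(url)  # host busy: push the URL to the end of the queue
                continue
            self.in_flight.add(host_of(url))
            return url
        return None

    def done(self, url):
        # called by the worker when the download for this URL has finished
        self.in_flight.discard(host_of(url))
```

A worker that owns queue `i` would call `next_url(i)`, download the page, and then call `done(url)`. A production crawler would typically also enforce a minimum delay between requests to the same host, which this sketch does not do.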

Priority

Some websites are more important than others, so the URL frontier queues should prioritize URLs from those websites
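
One way to sketch prioritization is a heap keyed on a per-URL score. How the score is computed is not specified in this design; the numeric priorities below are placeholders, and `PriorityFrontier` is a hypothetical name.

```python
import heapq
import itertools


class PriorityFrontier:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO order within one priority

    def add(self, url, priority):
        # heapq is a min-heap, so the priority is negated to pop the highest first
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


frontier = PriorityFrontier()
frontier.add("http://blog.example/", priority=1)
frontier.add("http://news.example/", priority=10)
frontier.add("http://news.example/world", priority=10)
order = [frontier.pop() for _ in range(len(frontier))]
print(order)
```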

HTML Downloader

Worker thread pools

The main bottleneck is network I/O, so we combine coroutines with multi-threading to make full use of the network bandwidth.
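
A rough sketch of combining the two: each worker thread runs its own asyncio event loop, and each loop keeps a bounded number of downloads in flight. The `fetch` coroutine here is a stub that only sleeps; a real implementation would use an async HTTP client. All function names and the thread and concurrency limits are assumptions chosen for illustration.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fetch(url):
    # Stub download: simulates network latency instead of issuing a real request.
    await asyncio.sleep(0.05)
    return url, f"<html>{url}</html>"


async def fetch_batch(urls, max_in_flight):
    sem = asyncio.Semaphore(max_in_flight)  # caps concurrent downloads per event loop

    async def bounded(url):
        async with sem:
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))


def run_batch(urls, max_in_flight):
    # Each worker thread creates and runs its own event loop.
    return asyncio.run(fetch_batch(urls, max_in_flight))


def download_all(urls, num_threads=2, max_in_flight=10):
    batches = [urls[i::num_threads] for i in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = pool.map(lambda batch: run_batch(batch, max_in_flight), batches)
        return dict(pair for batch in results for pair in batch)


urls = [f"http://h{i}.example/" for i in range(20)]
downloaded = download_all(urls)
print(len(downloaded), "pages downloaded")
```

Because the coroutines spend their time waiting on I/O, the 20 simulated downloads overlap instead of running one after another.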

Redundant Content de-dup

We calculate a hash value of each page's content and skip pages whose hash has already been seen.
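
A minimal sketch of hash-based de-dup, assuming SHA-256 over the raw page text. The choice of hash function and the `ContentStore` class are illustrative, not part of the original design.

```python
import hashlib


class ContentStore:
    def __init__(self):
        self._by_hash = {}  # digest -> (url, html)

    def add(self, url, html):
        """Store the page unless identical content is already stored. Returns True if stored."""
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in self._by_hash:
            return False
        self._by_hash[digest] = (url, html)
        return True

    def __len__(self):
        return len(self._by_hash)


store = ContentStore()
first = store.add("http://a.example/", "<html>hello</html>")
mirror = store.add("http://mirror.example/", "<html>hello</html>")
other = store.add("http://a.example/2", "<html>world</html>")
print(first, mirror, other, len(store))
```

An exact hash only catches byte-identical pages. Pages that differ by a timestamp or an ad block would hash differently and need a separate near-duplicate technique.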

Trade offs/Tech choices

BFS vs DFS

We prefer BFS here because BFS handles politeness well by evenly distributing traffic across multiple hosts, while DFS can cause traffic spikes on a single host
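
The difference can be illustrated on a contrived link graph in which each page links only to the next page on the same host. The hosts, depth, and helper names below are made up for the illustration; real link graphs are messier, so BFS order alone does not replace the per-host queues described under politeness.

```python
from collections import deque
from urllib.parse import urlsplit

HOSTS = ["a.example", "b.example", "c.example"]
DEPTH = 4  # pages 0..3 on each host


def links(url):
    # Toy link structure: page n links only to page n + 1 on the same host.
    parts = urlsplit(url)
    n = int(parts.path.lstrip("/"))
    return [f"http://{parts.hostname}/{n + 1}"] if n + 1 < DEPTH else []


def crawl_order(seeds, bfs):
    frontier = deque(seeds)
    seen = set(seeds)
    order = []
    while frontier:
        # BFS takes from the front of the frontier, DFS from the back
        url = frontier.popleft() if bfs else frontier.pop()
        order.append(url)
        for link in links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


def longest_same_host_run(order):
    best = run = 0
    prev = None
    for url in order:
        host = urlsplit(url).hostname
        run = run + 1 if host == prev else 1
        best = max(best, run)
        prev = host
    return best


seeds = [f"http://{h}/0" for h in HOSTS]
bfs_run = longest_same_host_run(crawl_order(seeds, bfs=True))
dfs_run = longest_same_host_run(crawl_order(seeds, bfs=False))
print("BFS longest same-host run:", bfs_run)
print("DFS longest same-host run:", dfs_run)
```

On this graph, BFS rotates through the three hosts, while DFS drains one host completely before moving to the next.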

Failure scenarios/bottlenecks

Future improvements

Geo locality: download tasks can be assigned to the HTML downloaders that are geographically close to the target server.
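
A small sketch of how such an assignment might look, assuming the target server's approximate coordinates are already known (for example from an IP geolocation lookup, which is not shown). The downloader regions and coordinates are hypothetical.

```python
import math

# Hypothetical downloader sites: region name -> (latitude, longitude) in degrees.
DOWNLOADERS = {
    "us-east": (39.0, -77.5),
    "eu-west": (53.3, -6.3),
    "ap-northeast": (35.7, 139.7),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_downloader(target_location):
    # Pick the downloader region with the smallest distance to the target server.
    return min(DOWNLOADERS, key=lambda name: haversine_km(DOWNLOADERS[name], target_location))


print(nearest_downloader((48.9, 2.35)))     # a server near Paris
print(nearest_downloader((37.8, -122.4)))   # a server near San Francisco
```

Geographic distance is only a proxy for network latency, so a real scheduler might prefer measured round-trip times when they are available.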