Discover and download content from the global internet
Scalability: huge number of websites to be crawled
Reliability: some links are malicious and some content is noise
Assume we need to crawl 1B pages each week, each page is about 100 KB, and the retention period is 3 years.
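A quick back-of-envelope calculation shows what these assumptions imply for throughput and storage. This is only a rough sketch: it uses decimal units (1 KB = 1000 bytes), approximates 3 years as 156 weeks, and ignores compression, replication, and metadata. The variable names are illustrative.

```python
# Back-of-envelope capacity estimate from the stated assumptions.
PAGES_PER_WEEK = 1_000_000_000      # 1B pages per week
PAGE_SIZE_BYTES = 100 * 1000        # 100 KB per page (decimal units, an assumption)
RETENTION_WEEKS = 3 * 52            # 3 years, approximated as 156 weeks
SECONDS_PER_WEEK = 7 * 24 * 3600

pages_per_second = PAGES_PER_WEEK / SECONDS_PER_WEEK
download_mb_per_second = pages_per_second * PAGE_SIZE_BYTES / 1e6
storage_per_week_tb = PAGES_PER_WEEK * PAGE_SIZE_BYTES / 1e12
total_storage_pb = storage_per_week_tb * RETENTION_WEEKS / 1000

print(f"crawl rate:       {pages_per_second:,.0f} pages/s")
print(f"download rate:    {download_mb_per_second:,.0f} MB/s")
print(f"storage per week: {storage_per_week_tb:,.0f} TB")
print(f"total storage:    {total_storage_pb:,.1f} PB")
```

Under these assumptions the crawler has to sustain roughly 1,650 pages per second and keep on the order of 15 PB of raw page data.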
For this problem, API design isn't critical
We mainly rely on BFS to discover websites globally
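The BFS discovery loop can be sketched as follows. This is a minimal in-memory illustration, not a production crawler: `bfs_crawl`, `get_links`, and the `TOY_WEB` link graph are hypothetical names, and a real system would replace `get_links` with download plus link extraction and back the frontier with a distributed queue.

```python
from collections import deque

def bfs_crawl(seed_urls, get_links, max_pages=1000):
    """Visit pages breadth-first, starting from seed_urls.

    get_links is a hypothetical callable: url -> iterable of outgoing URLs.
    """
    frontier = deque(seed_urls)   # FIFO queue gives breadth-first order
    seen = set(seed_urls)         # URLs already queued, to avoid revisiting
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# Toy link graph standing in for the web (illustrative data only).
TOY_WEB = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["d.com"],
    "c.com": ["d.com", "a.com"],
    "d.com": [],
}
crawl_order = bfs_crawl(["a.com"], lambda url: TOY_WEB.get(url, []))
print(crawl_order)
```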
Politeness
We need to avoid being treated as a DDoS attack by the host. We map all URLs from one host into one queue (N --> 1). Each queue is assigned to one worker thread for downloading. The worker thread tracks whether there is an ongoing request to that host. If there is, it pushes the URL to the end of the queue.
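One way to sketch the host-to-queue mapping and the "push back if the host is busy" rule is shown below. The names `NUM_QUEUES`, `queue_index`, `enqueue`, `next_url`, and `busy_hosts` are assumptions made for illustration, and hashing the hostname is just one possible way to get a stable N --> 1 mapping.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit

NUM_QUEUES = 4  # assumed value; one worker thread per queue

def queue_index(url, num_queues=NUM_QUEUES):
    """Map every URL of the same host to the same queue (N --> 1)."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_queues

queues = [deque() for _ in range(NUM_QUEUES)]

def enqueue(url):
    queues[queue_index(url)].append(url)

def next_url(queue, busy_hosts):
    """Return the next URL whose host has no request in flight.

    URLs of busy hosts are pushed to the end of the queue. Returns None
    if every queued URL belongs to a busy host.
    """
    for _ in range(len(queue)):
        url = queue.popleft()
        if urlsplit(url).hostname in busy_hosts:
            queue.append(url)
        else:
            return url
    return None
```

A worker would add a host to `busy_hosts` before issuing a request and remove it when the response arrives, so each host sees at most one request at a time from this crawler.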
Priority
Some websites are more important than others, so we should prioritize them over the rest in the URL frontier queues
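A prioritized frontier could be sketched with a binary heap. The `PriorityFrontier` class and the convention that a lower number means higher priority are assumptions for illustration; how the priority itself is computed (for example from site importance or update frequency) is left open here.

```python
import heapq
import itertools

class PriorityFrontier:
    """URL frontier that always hands out the most important URL first."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so URLs with equal priority come out in insertion order.
        self._counter = itertools.count()

    def push(self, url, priority):
        # Lower number = higher priority (an assumed convention).
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self):
        _priority, _order, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)

frontier = PriorityFrontier()
frontier.push("https://small-blog.example/post", priority=5)
frontier.push("https://news.example/", priority=1)
frontier.push("https://news.example/world", priority=1)
popped = [frontier.pop() for _ in range(len(frontier))]
print(popped)
```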
Worker thread pools
The main bottleneck is network IO, so we need coroutines plus multithreading to make full use of the network bandwidth.
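The coroutine side of this can be sketched with `asyncio`. To keep the example runnable without network access, `fetch` is a stand-in that only sleeps to simulate latency; `download_all` and `max_concurrency` are illustrative names. In a fuller design each worker thread could run its own event loop like this one.

```python
import asyncio

async def fetch(url):
    """Stand-in for a real HTTP request; sleeps to simulate network latency."""
    await asyncio.sleep(0.01)
    return url, f"<html>content of {url}</html>"

async def download_all(urls, max_concurrency=100):
    """Download many URLs concurrently, with a cap on in-flight requests."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(url) for url in urls))

urls = [f"https://example.com/{i}" for i in range(20)]
pages = asyncio.run(download_all(urls, max_concurrency=5))
print(len(pages), "pages downloaded")
```

Because the coroutines spend most of their time waiting on IO, a single thread can keep many requests in flight; the semaphore keeps that number bounded.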
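We will calculate a hash value for each page (for example of its URL or its content) and skip anything whose hash has already been seen, to de-duplicate.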
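A minimal sketch of hash-based de-duplication is shown below, assuming the hash is taken over the page content with SHA-256 and the seen set fits in memory. Both are assumptions: at the scale discussed here the set would more likely live in a distributed store or a Bloom filter, and `is_duplicate` is a hypothetical helper name.

```python
import hashlib

seen_digests = set()  # in-memory for illustration only

def is_duplicate(content: bytes) -> bool:
    """Return True if identical content was seen before; otherwise record it."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False
```

Note that an exact hash only catches byte-identical pages; near-duplicates would need a different technique.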
BFS vs DFS
We prefer BFS here because it handles politeness well by distributing traffic evenly across multiple hosts, while DFS can cause traffic spikes on a single host
Geo locality: download tasks can be assigned to the HTML downloaders that are geographically close to the target server.