Design A Web Crawler - System Design

System requirements

Functional:

feed initial seed URLs
Crawl the web recursively
Respect robots.txt and politeness of a domain
Store fetched web pages
the system should detect web pages and url duplicates

Non-Functional:

Scalability - the system should be able to scrape billions of pages
fault tolerence - the system should continue to operate even if some components fail
Durable - the system should not lose saved web pages or extracted URLs
politeness - the system should adhere to politeness rules of a domain

5 billion pages a month

page size 100KB

compressed page size 30KB

Storage = 5* 10^9 * 100 * 10^3 = 500TB

Compressed Storage = 5* 10^9 * 0.3.*100 * 10^3 = 150TB

Bandwidth = 500TB/(30*24*60*60) = 1.6Gbps

API design

Feed seed

POST api/v1/seed

{

urls: strings[]

}

Pause crawling

POST api/v1/pause

Get URL crawl status

GET /api/v1/{url}/status

Database design

NOSQL DynamoDB for URLs

Seen table

{

host: string

url: string

url_hash: string

content_ref: string

content_hash: string

status: enum (PENDING, STARTED, COMPLETED)

}

SQL DB for metadata ()

metadata table

{

domain: string:

robots.txt: string

last_crawl_time: timestamp

}

Frontier DB {

url: string

status: enum (NEW, SENT)

priority: int

next_crawl_time: timestamp

}

URL Queue

{

url: string

}

Blob storage for web pages

High-level design

Scheduler Service is responsible for setting next urls that needs to be fetched by priority politeness and other constraints Then, it will push those urls to the Frontier Queue

Worker Nodes will fetch URLs from the Frontier Queue and then fetch those web pages from the internet

DNS server will store DNS domains locally within the system so it won't be needed to be fetched multitple times

The Extractor Service extracts url outlinks from web pages. It checks for duplicates and if it's a url that wasn't seen before then it saves it in seen url db and scheduler queue

Request flows

Initially, the user will feed seed urls using the /api/v1/seed api. The scheduler will pick up those and other extracted urls from the Frontier DB and push them to the Frontier Queue when they are needed to be feched. The scheduler will look at the robots.txt and priority, modification frequency, and politeness constraints and according to that will decide when to push to the Frontier Queue.

Then, the worker nodes will fetch URLs from the Frontier Queue and request for the host ip address from the dns server, fetch the web pages from the internet, and save it the blob storage and the extractor queue.

Then, the extractor service will fetch it from the queue the extract from the web page all of its url outlinks. Then it will check for duplicates and if the urls that are not duplicates the exctractor will save them in the seen url db and scheduler queue.

Then again the scheduler will read from the scheduler queue and save it and the Frontier DB

Detailed component design

Scheduler

The schduler will read urls from the frontier db and decide according to the url's priority, domain, and politeness when to push the url to the frontier queue. The frontier queue is sharded based on domain and so is the worker node. Thus, each domain will be processed by one worker nodes. That makes it easy to keep politeness constraints

Worker Node

The worker nodes will be sharded according to domain so it will be easy to keep the politeness constraints. The worker node will request the host ip address from the system's dns server to save dns retrieval time. Then, it will fetch the HTML page and save it in the blob storage. The frontier queue and worker nodes are sharded by domain, so they are easy to scale. The blob storage component also provides scalable and durable with replicable storage. In case a worker node fails, we can easily spin up another worker node and connect it to the failed node's frontier queue. Also, for any failure to fetch an html page, the url will remain in the queue until it's acknoledged.

The Extractor will parse the web pages and extract url outlinks. Then, it check for duplicates using the url's hash. It will also normalize the url. The extractor service can be scaled if the average cpu exceeds a certain amount. the new nodes will fetch the remaining web pages from the extractor queue.

Trade offs/Tech choices

Centralized vs Distributed Frontier: The frontier queue can easily store millions of urls on one node, but then it will become a bottleneck and also a single point of failure.

Sharding by Domain vs Uniform Distribution: sharding the frontier queue and worker nodes by domain will support forcing politeness constraints on domains, but will be less efficient than uniformly distributing the urls accross servers, which will enable server with low cpu utilization to fetch more URLs.

Failure scenarios/bottlenecks

Worker Node Crash: is a worker crashes in the middle of fetching a url, the url will be visible after some time in the frontier queue because it wasn't acknoledged. Then, a new worker node can pick it up and process it.

If any of the queues in the system fail, then a slave replica will be promoted to master and server the services.

Trap links: the service should have a counter of how many requests it made to a domain, and if it's over a threshold, then the service should throttle itself. Also, it should look for known link traps while processing a web page.

Future improvements

Some web pages are single page apps which requires javascript processing in order to get the web page's content. This can be a security concern, so it will need to be processed in a isolated protected environment.

System requirements

Functional:

feed initial seed URLs
Crawl the web recursively
Respect robots.txt and politeness of a domain
Store fetched web pages
the system should detect web pages and url duplicates

Non-Functional:

Scalability - the system should be able to scrape billions of pages
fault tolerence - the system should continue to operate even if some components fail
Durable - the system should not lose saved web pages or extracted URLs
politeness - the system should adhere to politeness rules of a domain

5 billion pages a month

page size 100KB

compressed page size 30KB

Storage = 5* 10^9 * 100 * 10^3 = 500TB

Compressed Storage = 5* 10^9 * 0.3.*100 * 10^3 = 150TB

Bandwidth = 500TB/(30*24*60*60) = 1.6Gbps

API design

Feed seed

POST api/v1/seed

{

urls: strings[]

}

Pause crawling

POST api/v1/pause

Get URL crawl status

GET /api/v1/{url}/status

Database design

NOSQL DynamoDB for URLs

Seen table

{

host: string

url: string

url_hash: string

content_ref: string

content_hash: string

status: enum (PENDING, STARTED, COMPLETED)

}

SQL DB for metadata ()

metadata table

{

domain: string:

robots.txt: string

last_crawl_time: timestamp

}

Frontier DB {

url: string

status: enum (NEW, SENT)

priority: int

next_crawl_time: timestamp

}

URL Queue

{

url: string

}

Blob storage for web pages

High-level design

Scheduler Service is responsible for setting next urls that needs to be fetched by priority politeness and other constraints Then, it will push those urls to the Frontier Queue

Worker Nodes will fetch URLs from the Frontier Queue and then fetch those web pages from the internet

DNS server will store DNS domains locally within the system so it won't be needed to be fetched multitple times

The Extractor Service extracts url outlinks from web pages. It checks for duplicates and if it's a url that wasn't seen before then it saves it in seen url db and scheduler queue

Request flows

Then again the scheduler will read from the scheduler queue and save it and the Frontier DB

Detailed component design

Scheduler

Worker Node

Trade offs/Tech choices

Centralized vs Distributed Frontier: The frontier queue can easily store millions of urls on one node, but then it will become a bottleneck and also a single point of failure.

Failure scenarios/bottlenecks

If any of the queues in the system fail, then a slave replica will be promoted to master and server the services.