Design A Web Crawler - System Design

System requirements

Functional:

Feed initial seed URLs
Crawl the web recursively
Respect robots.txt and the politeness rules of each domain
Store fetched web pages
Detect duplicate web pages and duplicate URLs

Non-Functional:

Scalability - the system should be able to crawl billions of pages
Fault tolerance - the system should continue to operate even if some components fail
Durability - the system should not lose saved web pages or extracted URLs
Politeness - the system should adhere to the politeness rules of each domain

5 billion pages a month

Page size: 100 KB

Compressed page size: 30 KB

Storage = 5 * 10^9 * 100 * 10^3 bytes = 500 TB per month

Compressed storage = 5 * 10^9 * 0.3 * 100 * 10^3 bytes = 150 TB per month

Bandwidth = 500 TB / (30 * 24 * 60 * 60 s) = about 193 MB/s = about 1.5 Gbps
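
The estimates above can be checked with a few lines of arithmetic. This is only a sketch of the calculation; it assumes decimal units (1 KB = 10^3 bytes) and a 30-day month, and the constant names are made up for illustration.

```python
# Back-of-the-envelope capacity estimate using the figures stated above.
PAGES_PER_MONTH = 5 * 10**9
PAGE_SIZE_BYTES = 100 * 10**3            # 100 KB, decimal units (assumption)
COMPRESSED_PAGE_BYTES = 30 * 10**3       # 30 KB
SECONDS_PER_MONTH = 30 * 24 * 60 * 60    # 30-day month (assumption)

raw_storage_tb = PAGES_PER_MONTH * PAGE_SIZE_BYTES / 10**12
compressed_storage_tb = PAGES_PER_MONTH * COMPRESSED_PAGE_BYTES / 10**12
pages_per_second = PAGES_PER_MONTH / SECONDS_PER_MONTH

# Bytes per month -> bits per second -> gigabits per second.
bandwidth_gbps = PAGES_PER_MONTH * PAGE_SIZE_BYTES * 8 / SECONDS_PER_MONTH / 10**9

print(f"raw storage:        {raw_storage_tb:.0f} TB/month")
print(f"compressed storage: {compressed_storage_tb:.0f} TB/month")
print(f"fetch rate:         {pages_per_second:.0f} pages/s")
print(f"download bandwidth: {bandwidth_gbps:.2f} Gbps")
```

The fetch rate (roughly 1,900 pages per second on average) is not stated in the notes but follows directly from the same inputs.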

API design

Feed seed

POST /api/v1/seed

{

urls: string[]

}

Pause crawling

POST /api/v1/pause

Get URL crawl status

GET /api/v1/{url}/status
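
Because the status endpoint carries a full URL inside the path, the URL presumably has to be percent-encoded so that its own slashes and query string are not read as part of the API path. A minimal sketch, assuming the client encodes the URL as a single path segment (the helper name status_path is hypothetical):

```python
from urllib.parse import quote, unquote


def status_path(url: str) -> str:
    """Build the status path for a crawled URL (hypothetical helper)."""
    # safe="" also encodes "/" so the whole URL stays one path segment.
    return f"/api/v1/{quote(url, safe='')}/status"


path = status_path("https://example.com/a?b=1")
print(path)

# The server side can recover the original URL from the path segment.
encoded_segment = path[len("/api/v1/"):-len("/status")]
print(unquote(encoded_segment))
```

An alternative design is to look up the status by url_hash instead, which avoids encoding issues altogether.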

Database design

NoSQL (DynamoDB) for URLs

Seen table

{

host: string

url: string

url_hash: string

content_ref: string

content_hash: string

status: enum (PENDING, STARTED, COMPLETED)

}

SQL DB for metadata

metadata table

{

domain: string

robots.txt: string

last_crawl_time: timestamp

}

Frontier DB {

url: string

status: enum (NEW, SENT)

priority: int

next_crawl_time: timestamp

}

URL Queue

{

url: string

}

Blob storage for web pages

High-level design

The Scheduler Service decides which URLs should be fetched next, based on priority, politeness, and other constraints, and pushes those URLs to the Frontier Queue.

Worker Nodes pull URLs from the Frontier Queue and fetch the corresponding web pages from the internet.

The DNS server caches DNS lookups locally within the system so that a host does not need to be resolved multiple times.
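
The local DNS cache could look roughly like the following. This is a sketch under assumptions: the class name DnsCache, the fixed TTL, and the injectable resolver and clock are illustrative choices, not part of the design above. A real cache would more likely honor the TTL returned with each DNS record.

```python
import socket
import time
from typing import Callable, Dict, Tuple


class DnsCache:
    """In-memory host -> IP cache with a fixed TTL (illustrative only)."""

    def __init__(
        self,
        resolver: Callable[[str], str] = socket.gethostbyname,
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolver = resolver
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # host -> (ip, expiry)

    def resolve(self, host: str) -> str:
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and cached[1] > now:
            return cached[0]              # cache hit, still fresh
        ip = self._resolver(host)         # cache miss or expired entry
        self._entries[host] = (ip, now + self._ttl)
        return ip
```

Worker nodes would call resolve(host) before each fetch, so repeated fetches from the same host reuse the cached address until the entry expires.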

The Extractor Service extracts outlink URLs from web pages. It checks each URL for duplicates, and if the URL has not been seen before, saves it in the Seen URL DB and pushes it to the scheduler queue.

Request flows

Initially, the user feeds seed URLs through the /api/v1/seed API. The scheduler picks up these and other extracted URLs from the Frontier DB and pushes them to the Frontier Queue when they are due to be fetched. It decides when to push each URL based on the domain's robots.txt, the URL's priority and modification frequency, and the politeness constraints.
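
As an illustration of the robots.txt check, Python's standard library ships a parser that can answer "may I fetch this URL?" and report a Crawl-delay. The sketch below parses a robots.txt body held in memory (for example, the string stored in the metadata table); the user agent name "my-crawler" and the sample rules are made up.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (invented for this example).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body that was already fetched and stored."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
allowed = rules.can_fetch("my-crawler", "https://example.com/index.html")
blocked = rules.can_fetch("my-crawler", "https://example.com/private/data.html")
delay = rules.crawl_delay("my-crawler")  # seconds between requests, or None

print(allowed, blocked, delay)
```

The scheduler could use the crawl delay as a lower bound on the gap between two fetches from the same domain when it computes next_crawl_time.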

Next, the worker nodes pull URLs from the Frontier Queue, request the host IP address from the DNS server, fetch the web pages from the internet, and save them to the blob storage and the extractor queue.

The extractor service then reads each page from the extractor queue and extracts all of its outlink URLs. It checks them for duplicates and saves the URLs that are not duplicates in the Seen URL DB and the scheduler queue.

Finally, the scheduler reads from the scheduler queue and saves the new URLs in the Frontier DB, and the cycle repeats.

Detailed component design

Scheduler

The scheduler reads URLs from the Frontier DB and decides, based on each URL's priority, domain, and politeness constraints, when to push it to the frontier queue. The frontier queue is sharded by domain, and so are the worker nodes. Thus, each domain is processed by exactly one worker node, which makes it easy to enforce politeness constraints.
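
One simple way to map a domain to a shard is to hash the host name and take it modulo the number of shards. The sketch below is an assumption about how this could be done, not the design's prescribed method; it uses the URL's host name as the "domain" and a function name (shard_for) chosen for illustration.

```python
import hashlib
from urllib.parse import urlsplit


def shard_for(url: str, num_shards: int) -> int:
    """Pick a frontier-queue shard for a URL based on its host name."""
    host = (urlsplit(url).hostname or "").lower()
    # Use a stable hash: Python's built-in hash() for strings is randomized
    # per process, so different nodes would disagree on the shard.
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


print(shard_for("https://example.com/a", 16))
print(shard_for("https://example.com/b?x=1", 16))  # same host, same shard
```

Plain modulo hashing remaps most domains when the shard count changes; consistent hashing is a common alternative if shards are added or removed often.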

Worker Node

The worker nodes are sharded by domain, which makes it easy to enforce the politeness constraints. A worker node requests the host IP address from the system's DNS server to save DNS lookup time, then fetches the HTML page and saves it in the blob storage. Because the frontier queue and worker nodes are sharded by domain, they are easy to scale. The blob storage also provides scalable, durable, replicated storage. If a worker node fails, we can spin up another worker node and connect it to the failed node's frontier queue. Also, if fetching an HTML page fails, the URL remains in the queue until it is acknowledged.

The Extractor parses the web pages and extracts outlink URLs. It normalizes each URL (for example, by sorting query parameters and removing trailing slashes), computes its hash, and looks for the same hash in the seen URL cache and DB. The extractor also computes the web page's content hash and looks for a match in the cache and DB in order to find a duplicate web page under a different URL.
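
A minimal sketch of the normalization and duplicate checks follows. The specific normalization rules (lowercase scheme and host, strip trailing slashes, sort query parameters, drop the fragment) and the use of SHA-256 are assumptions for illustration, and the in-memory sets stand in for the seen URL cache and DB.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Reduce equivalent spellings of a URL to one canonical form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()  # simplification: also lowercases userinfo
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((scheme, netloc, path, query, ""))  # fragment dropped


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Stand-ins for the seen URL cache/DB (url_hash and content_hash columns).
seen_url_hashes = set()
seen_content_hashes = set()


def is_new_url(url: str) -> bool:
    """Return True the first time a (normalized) URL is seen."""
    url_hash = sha256_hex(normalize_url(url).encode("utf-8"))
    if url_hash in seen_url_hashes:
        return False
    seen_url_hashes.add(url_hash)
    return True


def is_new_content(body: bytes) -> bool:
    """Return True the first time a page body is seen, whatever its URL."""
    content_hash = sha256_hex(body)
    if content_hash in seen_content_hashes:
        return False
    seen_content_hashes.add(content_hash)
    return True
```

An exact content hash only catches byte-identical pages; near-duplicate detection would need a similarity fingerprint, which is outside what these notes describe.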

The extractor service can be scaled easily since it is stateless. If average CPU utilization exceeds a threshold, new extractor nodes are added to consume the remaining web pages from the extractor queue.

There is a trade-off between politeness and efficiency: the more polite the service is to domains, the less efficiently it fetches web pages from the internet. In my design, I favor politeness, which is why the queues and workers are sharded by domain. Favoring politeness avoids the legal and ethical issues that come with overloading domains. We can balance this with a hybrid approach by negotiating with certain domains so that we can fetch from them more frequently.

The scheduler uses robots.txt, priority, and update frequency to set the next timestamp at which a URL should be crawled. Using this timestamp, the scheduler queries for the URLs that are due to be fetched.
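
The "query URLs that are due" step can be sketched against the Frontier DB schema from the database design. The example uses an in-memory SQLite database purely for illustration; the index, the function name claim_due_urls, and the convention that a larger priority value is more urgent are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE frontier (
        url             TEXT PRIMARY KEY,
        status          TEXT NOT NULL DEFAULT 'NEW',   -- NEW or SENT
        priority        INTEGER NOT NULL,
        next_crawl_time INTEGER NOT NULL               -- Unix seconds
    )
    """
)
conn.execute("CREATE INDEX idx_due ON frontier (status, next_crawl_time)")


def claim_due_urls(conn: sqlite3.Connection, now: int, limit: int = 100) -> list:
    """Select due URLs by priority and mark them SENT."""
    rows = conn.execute(
        "SELECT url FROM frontier "
        "WHERE status = 'NEW' AND next_crawl_time <= ? "
        "ORDER BY priority DESC, next_crawl_time ASC LIMIT ?",
        (now, limit),
    ).fetchall()
    urls = [row[0] for row in rows]
    conn.executemany(
        "UPDATE frontier SET status = 'SENT' WHERE url = ?",
        [(u,) for u in urls],
    )
    conn.commit()
    return urls


# Sample data: two URLs are due at t=100, one is scheduled for later.
conn.executemany(
    "INSERT INTO frontier (url, priority, next_crawl_time) VALUES (?, ?, ?)",
    [
        ("https://a.example/", 1, 50),
        ("https://b.example/", 5, 90),
        ("https://c.example/", 9, 500),
    ],
)
due = claim_due_urls(conn, now=100)
print(due)
```

With several scheduler instances, the select and update would need to happen atomically (for example, inside one transaction with row locking) so that two schedulers do not claim the same URL.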

All the components in my design are scalable: the queues are sharded by domain; the worker nodes are sharded by domain and stateless; the extractors are stateless and communicate via queues, so a failed node is easy to replace; and the Metadata DB and Seen URLs DB are implemented using NoSQL DynamoDB, which scales automatically with load. The Frontier DB will be a SQL DB, but since at any moment in time we should not have more than a few million URLs in it, one master DB and two slave replicas should be enough. We can use a garbage collector to delete URLs that have already been scheduled from the Frontier DB.

Trade offs/Tech choices

Centralized vs Distributed Frontier: The frontier queue can easily store millions of URLs on one node, but that node then becomes a bottleneck and a single point of failure. A single-node queue is less complex, but that comes at the cost of reduced availability in case of crashes and lower read/write bandwidth.

Sharding by Domain vs Uniform Distribution: Sharding the frontier queue and worker nodes by domain makes it possible to enforce politeness constraints per domain, but it is less efficient than distributing the URLs uniformly across servers, which would let servers with low CPU utilization fetch more URLs.

Failure scenarios/bottlenecks

Worker Node Crash: If a worker crashes in the middle of fetching a URL, the URL becomes visible again in the frontier queue after some time because it was never acknowledged. A new worker node can then pick it up and process it.

If any of the queues in the system fails, a slave replica will be promoted to master and serve the services.
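
The acknowledgement behavior described above resembles the visibility-timeout pattern offered by many message queues. The in-memory sketch below only illustrates the idea; the class name AckQueue, the 30-second default, and the injectable clock are assumptions, and a real deployment would rely on the queue product's own mechanism.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, Optional


class AckQueue:
    """Toy queue: a received URL reappears unless it is acked in time."""

    def __init__(
        self,
        visibility_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = visibility_timeout
        self._clock = clock
        self._ready: Deque[str] = deque()
        self._in_flight: Dict[str, float] = {}  # url -> redelivery deadline

    def put(self, url: str) -> None:
        self._ready.append(url)

    def receive(self) -> Optional[str]:
        self._requeue_expired()
        if not self._ready:
            return None
        url = self._ready.popleft()
        self._in_flight[url] = self._clock() + self._timeout
        return url

    def ack(self, url: str) -> None:
        # Called by the worker after the page has been stored successfully.
        self._in_flight.pop(url, None)

    def _requeue_expired(self) -> None:
        now = self._clock()
        expired = [u for u, deadline in self._in_flight.items() if deadline <= now]
        for url in expired:
            del self._in_flight[url]
            self._ready.append(url)  # worker presumably crashed; redeliver
```

Because a slow worker can also miss the deadline, the same URL may be fetched twice, so downstream steps should tolerate duplicate deliveries.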

Trap links: the service should have a counter of how many requests it made to a domain, and if it's over a threshold, then the service should throttle itself. Also, it should look for known link traps while processing a web page.
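
The per-domain request counter could be as simple as the following sketch. The class name DomainBudget and the hard cap are illustrative assumptions; a production crawler would more likely reset or decay the counts over a time window rather than block a host permanently.

```python
from collections import Counter
from urllib.parse import urlsplit


class DomainBudget:
    """Cap the number of requests issued to each host (illustrative)."""

    def __init__(self, max_requests: int) -> None:
        self._max = max_requests
        self._counts: Counter = Counter()

    def allow(self, url: str) -> bool:
        host = urlsplit(url).hostname or ""
        if self._counts[host] >= self._max:
            return False  # threshold reached: throttle this host
        self._counts[host] += 1
        return True
```

A worker would call allow(url) before each fetch and skip or delay the URL when it returns False.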

Future improvements

Some web pages are single-page apps, which require JavaScript processing in order to get the web page's content. This can be a security concern, so such pages will need to be processed in an isolated, protected environment.

The scheduler can be improved by using machine learning to better prioritize urls. We can use the web page content, domain, and robots.txt file to better understand if the website is popular and frequently changing like a news web page.
