page size 100KB
compressed page size 30KB
Storage = 5* 10^9 * 100 * 10^3 = 500TB
Compressed Storage = 5* 10^9 * 0.3.*100 * 10^3 = 150TB
Bandwidth = 500TB/(30*24*60*60) = 1.6Gbps
Feed seed
POST api/v1/seed
{
urls: strings[]
}
Pause crawling
POST api/v1/pause
Get URL crawl status
GET /api/v1/{url}/status
NOSQL DynamoDB for URLs
Seen table
{
host: string
url: string
url_hash: string
content_ref: string
content_hash: string
status: enum (PENDING, STARTED, COMPLETED)
}
SQL DB for metadata ()
metadata table
{
domain: string:
robots.txt: string
last_crawl_time: timestamp
}
Frontier DB {
url: string
status: enum (NEW, SENT)
priority: int
next_crawl_time: timestamp
}
URL Queue
{
url: string
}
Blob storage for web pages
Scheduler Service is responsible for setting next urls that needs to be fetched by priority politeness and other constraints Then, it will push those urls to the Frontier Queue
Worker Nodes will fetch URLs from the Frontier Queue and then fetch those web pages from the internet
DNS server will store DNS domains locally within the system so it won't be needed to be fetched multitple times
The Extractor Service extracts url outlinks from web pages. It checks for duplicates and if it's a url that wasn't seen before then it saves it in seen url db and scheduler queue
Initially, the user will feed seed urls using the /api/v1/seed api. The scheduler will pick up those and other extracted urls from the Frontier DB and push them to the Frontier Queue when they are needed to be feched. The scheduler will look at the robots.txt and priority, modification frequency, and politeness constraints and according to that will decide when to push to the Frontier Queue.
Then, the worker nodes will fetch URLs from the Frontier Queue and request for the host ip address from the dns server, fetch the web pages from the internet, and save it the blob storage and the extractor queue.
Then, the extractor service will fetch it from the queue the extract from the web page all of its url outlinks. Then it will check for duplicates and if the urls that are not duplicates the exctractor will save them in the seen url db and scheduler queue.
Then again the scheduler will read from the scheduler queue and save it and the Frontier DB
Scheduler
The schduler will read urls from the frontier db and decide according to the url's priority, domain, and politeness when to push the url to the frontier queue. The frontier queue is sharded based on domain and so is the worker node. Thus, each domain will be processed by one worker nodes. That makes it easy to keep politeness constraints
Worker Node
The worker nodes will be sharded according to domain so it will be easy to keep the politeness constraints. The worker node will request the host ip address from the system's dns server to save dns retrieval time. Then, it will fetch the HTML page and save it in the blob storage. The frontier queue and worker nodes are sharded by domain, so they are easy to scale. The blob storage component also provides scalable and durable with replicable storage. In case a worker node fails, we can easily spin up another worker node and connect it to the failed node's frontier queue. Also, for any failure to fetch an html page, the url will remain in the queue until it's acknoledged.
The Extractor will parse the web pages and extract url outlinks. Then, it check for duplicates using the url's hash. It will also normalize the url. The extractor service can be scaled if the average cpu exceeds a certain amount. the new nodes will fetch the remaining web pages from the extractor queue.
Centralized vs Distributed Frontier: The frontier queue can easily store millions of urls on one node, but then it will become a bottleneck and also a single point of failure.
Sharding by Domain vs Uniform Distribution: sharding the frontier queue and worker nodes by domain will support forcing politeness constraints on domains, but will be less efficient than uniformly distributing the urls accross servers, which will enable server with low cpu utilization to fetch more URLs.
Worker Node Crash: is a worker crashes in the middle of fetching a url, the url will be visible after some time in the frontier queue because it wasn't acknoledged. Then, a new worker node can pick it up and process it.
If any of the queues in the system fail, then a slave replica will be promoted to master and server the services.
Trap links: the service should have a counter of how many requests it made to a domain, and if it's over a threshold, then the service should throttle itself. Also, it should look for known link traps while processing a web page.
Some web pages are single page apps which requires javascript processing in order to get the web page's content. This can be a security concern, so it will need to be processed in a isolated protected environment.