page size 100KB
compressed page size 30KB
Storage = 5* 10^9 * 100 * 10^3 = 500TB
Compressed Storage = 5* 10^9 * 0.3.*100 * 10^3 = 150TB
Bandwidth = 500TB/(30*24*60*60) = 1.6Gbps
Feed seed
POST api/v1/seed
{
urls: strings[]
}
Pause crawling
POST api/v1/pause
Get URL crawl status
GET /api/v1/{url}/status
NOSQL DynamoDB for URLs
Seen table
{
host: string
url: string
url_hash: string
content_ref: string
content_hash: string
status: enum (PENDING, STARTED, COMPLETED)
}
SQL DB for metadata ()
metadata table
{
domain: string:
robots.txt: string
last_crawl_time: timestamp
}
Frontier DB {
url: string
status: enum (NEW, SENT)
priority: int
next_crawl_time: timestamp
}
URL Queue
{
url: string
}
Blob storage for web pages
Scheduler Service is responsible for setting next urls that needs to be fetched by priority politeness and other constraints Then, it will push those urls to the Frontier Queue
Worker Nodes will fetch URLs from the Frontier Queue and then fetch those web pages from the internet
DNS server will store DNS domains locally within the system so it won't be needed to be fetched multitple times
The Extractor Service extracts url outlinks from web pages. It checks for duplicates and if it's a url that wasn't seen before then it saves it in seen url db and scheduler queue
Initially, the user will feed seed urls using the /api/v1/seed api. The scheduler will pick up those and other extracted urls from the Frontier DB and push them to the Frontier Queue when they are needed to be feched. The scheduler will look at the robots.txt and priority, modification frequency, and politeness constraints and according to that will decide when to push to the Frontier Queue.
Then, the worker nodes will fetch URLs from the Frontier Queue and request for the host ip address from the dns server, fetch the web pages from the internet, and save it the blob storage and the extractor queue.
Then, the extractor service will fetch it from the queue the extract from the web page all of its url outlinks. Then it will check for duplicates and if the urls that are not duplicates the exctractor will save them in the seen url db and scheduler queue.
Then again the scheduler will read from the scheduler queue and save it and the Frontier DB
Scheduler
The schduler will read urls from the frontier db and decide according to the url's priority, domain, and politeness when to push the url to the frontier queue. The frontier queue is sharded based on domain and so is the worker node. Thus, each domain will be processed by one worker nodes. That makes it easy to keep politeness constraints
Worker Node
The worker nodes will be sharded according to domain so it will be easy to keep the politeness constraints. The worker node will request the host ip address from the system's dns server to save dns retrieval time. Then, it will fetch the HTML page and save it in the blob storage. The frontier queue and worker nodes are sharded by domain, so they are easy to scale. The blob storage component also provides scalable and durable with replicable storage. In case a worker node fails, we can easily spin up another worker node and connect it to the failed node's frontier queue. Also, for any failure to fetch an html page, the url will remain in the queue until it's acknoledged.
The Extractor will parse the web pages and extract url outlinks. It will normalize the url (remove sort parmaters and trailing slashes), compute its hash, and look for the same hash in the seen url cache and db. The extractor will also compute the web page's hash and look for match in cache and db to find to a duplicate web page with different url.
The extractor service can be scaled easily since it's stateless. If the average cpu exceeds a certain amount, new extractor nodes will be added to fetch the remaining web pages from the extractor queue.
There is a trade off between politeness and effieciency: the more polite the service is to domains, the less effiecient it is to fetch web pages from the internet. In my design, I favor politeness and this is why the queues and workers are sharded by domain. Favouring politeness is to prevent legal/ethical issues from overloading domains. We can balance that with a hybrid approach by negotiating with certain domains so we can fetch them more frequently.
The scheduler will use the robots.txt, priority, and update frequency, to set the next timestamp for a url to be crawled. Using this timestamp, the scheduler will query urls that needs to be fetched
All the components in my design are scalable: queues are sharded by domain, worker nodes are sharded by domain and stateless, extractors are stateless and communicate via queues so it's replaced in case of a failure, Metadata DB and Seen URLs DB are implemented using NOSQL DynamoDB, which provides auto scaling in accordance to load. The Frontier DB will be a SQL DB but since at a moment in time we should not have more than a few millions urls, one master DB and two slave replicas should be enough. We can use a garbage collecter to delete urls from the Frontier DB that were scheduled.
Centralized vs Distributed Frontier: The frontier queue can easily store millions of urls on one node, but then it will become a bottleneck and also a single point of failure. A single node queue is less complex, but that comes at the cost of reduced availability in case of crashes, and lower read/write bandwidth.
Sharding by Domain vs Uniform Distribution: sharding the frontier queue and worker nodes by domain will support forcing politeness constraints on domains, but will be less efficient than uniformly distributing the urls accross servers, which will enable server with low cpu utilization to fetch more URLs.
Worker Node Crash: is a worker crashes in the middle of fetching a url, the url will be visible after some time in the frontier queue because it wasn't acknoledged. Then, a new worker node can pick it up and process it.
If any of the queues in the system fail, then a slave replica will be promoted to master and server the services.
Trap links: the service should have a counter of how many requests it made to a domain, and if it's over a threshold, then the service should throttle itself. Also, it should look for known link traps while processing a web page.
Some web pages are single page apps which requires javascript processing in order to get the web page's content. This can be a security concern, so it will need to be processed in a isolated protected environment.
The scheduler can be improved by using machine learning to better prioritize urls. We can use the web page content, domain, and robots.txt file to better understand if the website is popular and frequently changing like a news web page.