MVP:
Given seed URLs, be able to recursively crawl the corresponding webpages and their contained URLs
Seed URLs can be provided with APIs
Users can specify the depth of extraction
the system will focus on the static contents, like the HTML
Content will be stored up to 5 years
scalability: should be able to scale up
performance: the system should be efficient
durability: extracted data should be stored
politeness: the system should respect website's crawl policy
1 billion pages crawled per day
1 000 000 000 / 100000 = 10000 urls / sec
Storage
100kb content per page
1 000 000 000 * 100 kb = 100 TB / day
submit_seed_url( seed_urls, is_repeated, max_depth_allowed)
is_repeated is an optional field, if specified, we will treat it as a scheduled job. By default we can crawl such url once per day.
max_depth_allowed is an optional field, it tells system how deep we want to crawl this given url
crawl task table
task_id
seed_id
scheduled_runtime
crawl metadata table (raw and clean)
url
s3_path
created_timestamp
Users/system can submit crawl job by calling the submit_sed_url() API
If it is repeated crawl job, the submission service will create an entry in the crawl task table with scheduled_runtime, and the task scheduler will regularly scan the table and create crawl tasks
If it is an instance crawl job, the submission service will send the request to the preprocessing service
preprocessing service is responsible for dedupe, prioritization, or skip urls based on blocklist
dedupe means we dont want to visit duplicate url within a same time period, dedupe can be based on bloom filter, prioritization means we can assign each task an priority, higher priority task will be executed first. URLs that exceeds the depth will also be filtered or the parser will not create new crawl job for those URLs that already reach the depth
After preprocessing, the preprocessing service will create a crawl job message and send it to a queue service.
The message queue contains information like url, priority, depth
We can have multiple queues for different priorities. Queue with higher priority will be processed with priority
On the other side of queue, we have crawlers that consume those messages and do the actual crawl job
The crawled data will be firstly stored in the raw data storage
The crawler will also create a message in a downstream queue. The message includes url and the path to raw data
On the other side of queue, we have parsers that validate the downloaded content, extract urls, then save the cleaned data in the cleaned data storage.
Parsers will also generate a list of urls from the crawled data and use submit_seed_url() API to submit new crawl jobs
To not overload a webpage, we will record the domain metadata in a separate table. That table will has the info when is the last extraction for a given domain. so when we create the job, we can have control the rate that we want to extract that domain.
This system is stateless, so we can horizontally scale the system by adding more service instances or workers to the system
The queue service can be Google PubSub, when a message is being processed, it will only be consumed by one worker. If the process is succeeded, the consumer will ack the message.
Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...
Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...
Explain any trade offs you have made and why you made certain tech choices...
Try to discuss as many failure scenarios/bottlenecks as possible.
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?