List functional requirements for the system...
Download web pages and save them to a database; for now, only HTML content is in scope
Politeness: follow robots.txt, control concurrency and speed
Avoid crawling duplicate pages
Parse the title, metadata, and other content from each page
Users can specify the entry (seed) URLs
Support dynamically loaded content using Selenium
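The politeness requirement could be sketched as below. This is a minimal illustration, not a definitive implementation: the `PolitenessGate` class, its method names, and the 1-second default delay are assumptions made for the example. Only `urllib.robotparser` and `urllib.parse` come from the Python standard library.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Hypothetical helper: decides whether and when a URL may be fetched."""

    def __init__(self, user_agent, default_delay=1.0):
        self.user_agent = user_agent
        self.default_delay = default_delay  # assumed fallback, in seconds
        self.rules = {}          # host -> parsed robots.txt
        self.next_allowed = {}   # host -> earliest time of the next fetch

    def load_rules(self, host, robots_txt):
        """Parse robots.txt text that was already downloaded for this host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        """Return True only if rules are loaded and permit this URL."""
        parser = self.rules.get(urlsplit(url).netloc)
        # Conservative choice: refuse hosts whose rules are not loaded yet.
        return parser is not None and parser.can_fetch(self.user_agent, url)

    def wait_time(self, url, now):
        """Seconds to wait before this host may be contacted again."""
        host = urlsplit(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Remember the fetch so the next one honors the crawl delay."""
        host = urlsplit(url).netloc
        parser = self.rules.get(host)
        delay = parser.crawl_delay(self.user_agent) if parser else None
        if delay is None:
            delay = self.default_delay
        self.next_allowed[host] = now + delay
```

Passing `now` in explicitly keeps the gate easy to test; a worker would typically pass `time.monotonic()`.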
List non-functional requirements for the system...
Robustness: runs stably and recovers from errors
Efficiency: Optimize the crawler to maximize the pages crawled per unit time. This may involve efficient scheduling of crawl jobs and load balancing across multiple servers
Scalability: scales horizontally to accommodate a growing number of URLs and larger crawls
Durability: Ensure that once data is stored, it is not lost. Implement data replication and backup strategies, particularly for critical data like crawled pages and queue states.
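One way the robustness requirement might be approached is retrying failed fetches with exponential backoff, tracked by a per-job retry counter. The sketch below is only an illustration: the status names ("pending", "finished", "failed"), the retry limit of 3, and the delay parameters are all assumptions, not values taken from the design.

```python
def backoff_delay(retry_num, base=1.0, cap=300.0):
    """Seconds to wait before retry number `retry_num` (0-based), capped."""
    return min(cap, base * (2 ** retry_num))


def next_job_state(retry_num, succeeded, max_retries=3):
    """Return (status, retry_num) for a job after one fetch attempt.

    Status names here are hypothetical placeholders.
    """
    if succeeded:
        return ("finished", retry_num)
    if retry_num < max_retries:
        # Put the job back in the queue and count the failed attempt.
        return ("pending", retry_num + 1)
    # Retry budget exhausted: stop retrying and surface the failure.
    return ("failed", retry_num)
```

Persisting the returned status and counter after every attempt means a restarted worker can pick up where the previous one stopped.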
Estimate the scale of the system you are going to design...
Suppose each crawler thread crawls 0.5 pages/second, and we have 1000 workers with 10 crawler threads each; the average HTML page saved to the database is 10 KB
That is 1000 * 10 * 0.5 = 5000 pages/second. To retain 5 years of data, we need 5000 * 24 * 60 * 60 * 365 * 5 * 10 KB ≈ 7.9 PB in total
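The estimate can be spelled out step by step, using the figures assumed above and decimal units (1 PB = 10^12 KB). The constant and function names are illustrative only.

```python
WORKERS = 1000
THREADS_PER_WORKER = 10
PAGES_PER_THREAD_PER_SEC = 0.5
PAGE_SIZE_KB = 10
RETENTION_YEARS = 5
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # leap days ignored for a rough estimate


def estimate_storage():
    """Return throughput and storage figures for the assumed fleet."""
    pages_per_sec = WORKERS * THREADS_PER_WORKER * PAGES_PER_THREAD_PER_SEC
    pages_per_day = pages_per_sec * SECONDS_PER_DAY
    total_pages = pages_per_day * DAYS_PER_YEAR * RETENTION_YEARS
    return {
        "pages_per_sec": pages_per_sec,
        "pages_per_day": pages_per_day,
        "tb_per_day": pages_per_day * PAGE_SIZE_KB / 1e9,   # KB -> TB
        "total_pages": total_pages,
        "total_pb": total_pages * PAGE_SIZE_KB / 1e12,      # KB -> PB
    }
```

Under these assumptions the fleet writes roughly 4.3 TB per day and stores close to 800 billion pages over five years, which also bears on how large primary keys need to be.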
Define what APIs are expected from the system...
api: /api/crawl
method: POST
data: {"url": "xxx", "depth": 3, "priority": "xxx"}
description: crawl from a given url
return: {"status": "success", "job_id": "xxx"}
api: /api/crawl/{job_id}/status
method: GET
data: job_id (path parameter)
description: get the current status of a crawl job
return: {"status": "xxx", "finished_urls": [xxx], "ongoing_urls": [xxx], "pending_urls": [xxx]}
The APIs also need to support authentication and rate limiting
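Rate limiting for these endpoints could be done in several ways; a token bucket is one common choice, sketched here as an assumption and not as the design's chosen method. The `TokenBucket` class and its parameters are hypothetical, and a real deployment would keep one bucket per API key or client in shared storage.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full so the first burst passes
        self.updated = now

    def allow(self, now, cost=1.0):
        """Return True and consume tokens if the request fits the budget."""
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that has passed, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A request handler would call `allow(time.monotonic())` before accepting a POST to /api/crawl and respond with HTTP 429 when it returns False.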
Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...
I will use MySQL as the database; it needs the tables below
web_pages
id BIGINT AUTO_INCREMENT PRIMARY KEY
content LONGTEXT
metadata JSON
title VARCHAR(255)
url VARCHAR(3000) UNIQUE NOT NULL
created_time DATETIME
update_time DATETIME
jobs
id INT AUTO_INCREMENT PRIMARY KEY
status VARCHAR(20)
priority VARCHAR(20)
url VARCHAR(3000) UNIQUE NOT NULL
retry_num INT UNSIGNED
created_time DATETIME
update_time DATETIME
As we need to store several petabytes of data, a single MySQL instance is not enough, so we could shard the tables by id.
Because we only store HTML, we use LONGTEXT as the content column type.
Alternatively, we could use a NoSQL database such as MongoDB or Cassandra.
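One practical concern with the schema is that a UNIQUE index on a VARCHAR(3000) url column may exceed InnoDB's index key length limit, depending on the character set. A possible workaround, sketched below as an assumption and not as part of the design, is to store a fixed-length hash of the normalized URL, enforce uniqueness on that, and reuse it as the shard key so duplicate checks go to a single shard. The function names and normalization rules are illustrative, and the sketch ignores userinfo and IPv6 literals.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Drop the port when it is the default for the scheme.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment never reaches the server, so it is discarded.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def url_hash(url):
    """64-character hex digest, suitable for a CHAR(64) UNIQUE column."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


def shard_for(url, num_shards):
    """Pick a shard from the hash so lookups by URL hit exactly one shard."""
    return int(url_hash(url)[:16], 16) % num_shards
```

With this approach, the "already crawled?" check becomes a single indexed lookup on the hash column in the shard returned by `shard_for`.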
You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design. If you are unfamiliar with the tool, you can simply describe your design to the chat bot and ask it to generate a starter diagram for you to modify...
Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...
Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...
Explain any trade offs you have made and why you made certain tech choices...
Try to discuss as many failure scenarios/bottlenecks as possible.
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?