Design A Web Crawler - System Design

System requirements

Functional:

Download web pages and save them to the database; for now, only HTML content is in scope

Politeness: obey robots.txt and limit per-site concurrency and request rate

Avoid crawling duplicate pages

Parse the title, metadata, and other content from each page

Users can specify the entry (seed) URLs

Support dynamically loaded content using Selenium
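The politeness requirement above can be sketched as follows. This is a minimal illustration, not the design's actual implementation: the user-agent name `example-crawler`, the `HostThrottle` class, and the 1-second default delay are all assumptions made for the example. It uses Python's standard `urllib.robotparser` on robots.txt text that has already been fetched.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical user-agent name


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that the crawler has already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class HostThrottle:
    """Enforce a minimum delay between two requests to the same host."""

    def __init__(self, default_delay: float = 1.0):
        # default_delay is an assumed fallback when robots.txt has no Crawl-delay
        self.default_delay = default_delay
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, url: str, robots: RobotFileParser, now: float) -> float:
        """Return seconds to wait before fetching url and reserve that slot."""
        host = urlsplit(url).netloc
        delay = robots.crawl_delay(USER_AGENT) or self.default_delay
        ready_at = max(now, self.next_allowed.get(host, 0.0))
        self.next_allowed[host] = ready_at + delay
        return ready_at - now


robots = build_robots("User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n")
allowed = robots.can_fetch(USER_AGENT, "https://example.com/index.html")
blocked = robots.can_fetch(USER_AGENT, "https://example.com/private/a")
```

A worker would call `can_fetch` before enqueueing a URL and sleep for `wait_time(...)` seconds before each request. In a multi-server deployment the `next_allowed` map would have to live in shared storage, which this sketch does not cover.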

Non-Functional:

Robustness: run stably over long periods and recover from errors

Efficiency: Optimize the crawler to maximize the pages crawled per unit time. This may involve efficient scheduling of crawl jobs and load balancing across multiple servers

Scalability: scale horizontally to accommodate a growing number of URLs and larger crawls

Durability: Ensure that once data is stored, it is not lost. Implement data replication and backup strategies, particularly for critical data like crawled pages and queue states.
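As one possible reading of "recover from errors", a failed fetch could be retried with exponential backoff, driven by a per-job retry counter. The sketch below is an assumption, not part of the original design: `MAX_RETRIES`, `BASE_DELAY`, `MAX_DELAY`, and the function name are hypothetical.

```python
import random
from typing import Optional

MAX_RETRIES = 5     # hypothetical: give up after this many failed attempts
BASE_DELAY = 2.0    # hypothetical: seconds to wait before the first retry
MAX_DELAY = 300.0   # hypothetical: upper bound on the backoff delay


def next_retry_delay(retry_num: int, jitter: float = 0.0) -> Optional[float]:
    """Return seconds to wait before the next attempt.

    Returns None when the retry budget is used up, in which case the
    caller should mark the job as failed instead of rescheduling it.
    """
    if retry_num >= MAX_RETRIES:
        return None
    delay = min(MAX_DELAY, BASE_DELAY * (2 ** retry_num))
    # Optional random jitter keeps many workers from retrying in lockstep.
    return delay + random.uniform(0, jitter)
```

Because the delay depends only on the stored retry count, a worker that crashes and restarts can resume the schedule from persisted job state.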

Capacity estimation

Suppose each crawler thread can crawl 0.5 pages per second, we have 1000 workers with 10 crawler threads each, and each HTML page saved to the database is 10 KB on average.

That gives 1000 * 10 * 0.5 = 5000 pages per second, or 5000 * 24 * 60 * 60 * 10 KB = ~4.3 TB per day. To keep 5 years of data, we need ~4.3 TB * 365 * 5 = ~7.9 PB (about 7 PiB) in total.
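The estimate can be checked with a few lines of arithmetic. The constants are the figures stated above; treating 1 KB as 1000 bytes and a year as 365 days are assumptions of this sketch.

```python
WORKERS = 1000
THREADS_PER_WORKER = 10
PAGES_PER_THREAD_PER_SEC = 0.5
PAGE_SIZE_BYTES = 10 * 1000      # 10 KB average, assuming 1 KB = 1000 bytes
SECONDS_PER_DAY = 24 * 60 * 60
RETENTION_DAYS = 5 * 365         # 5 years, ignoring leap days

pages_per_sec = WORKERS * THREADS_PER_WORKER * PAGES_PER_THREAD_PER_SEC
pages_per_day = pages_per_sec * SECONDS_PER_DAY
bytes_per_day = pages_per_day * PAGE_SIZE_BYTES
total_bytes = bytes_per_day * RETENTION_DAYS
total_pages = pages_per_day * RETENTION_DAYS

print(f"{pages_per_sec:,.0f} pages/s")
print(f"{bytes_per_day / 1e12:.2f} TB/day")
print(f"{total_bytes / 1e15:.2f} PB over 5 years ({total_bytes / 2**50:.2f} PiB)")
print(f"{total_pages:,.0f} pages over 5 years")
```

Note that the page count over five years is in the hundreds of billions, which is more than a 32-bit integer key can address.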

API design

api: /api/crawl

method: POST

data: {"url": "xxx", "depth": 3, "priority": "xxx"}

description: start a crawl job from the given entry URL

return: {"status": "success", "job_id": "xxx"}

api: /api/crawl/{job_id}/status

method: GET

data: job_id (path parameter)

description: get the current status of a crawl job

return: {"status": "xxx", "finished_urls": [xxx], "ongoing_urls": [xxx], "pending_urls": [xxx]}

The APIs also need to support authentication and rate limiting.
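To make the request and response shapes concrete, here is a minimal in-memory sketch of the two endpoints. It is framework-free on purpose and leaves out authentication and rate limiting. The class names, the default values for `depth` and `priority`, and the error and not-found responses are assumptions, since the design above does not specify them.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class CrawlJob:
    url: str
    depth: int
    priority: str
    status: str = "pending"
    finished_urls: list = field(default_factory=list)
    ongoing_urls: list = field(default_factory=list)
    pending_urls: list = field(default_factory=list)


class CrawlApi:
    """In-memory stand-in for the two endpoints (no auth, no rate limiting)."""

    def __init__(self):
        self.jobs = {}  # job_id -> CrawlJob

    def post_crawl(self, body: dict) -> dict:
        """Handle POST /api/crawl."""
        url = body.get("url")
        if not isinstance(url, str) or not url.startswith(("http://", "https://")):
            # Error shape is an assumption; the design only shows the success case.
            return {"status": "error", "message": "url must be an http(s) URL"}
        job = CrawlJob(
            url=url,
            depth=int(body.get("depth", 1)),             # assumed default
            priority=str(body.get("priority", "normal")),  # assumed default
        )
        job.pending_urls.append(url)
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = job
        return {"status": "success", "job_id": job_id}

    def get_status(self, job_id: str) -> dict:
        """Handle GET /api/crawl/{job_id}/status."""
        job = self.jobs.get(job_id)
        if job is None:
            return {"status": "not_found"}  # assumed response for unknown ids
        return {
            "status": job.status,
            "finished_urls": list(job.finished_urls),
            "ongoing_urls": list(job.ongoing_urls),
            "pending_urls": list(job.pending_urls),
        }
```

For a large crawl the three URL lists could grow very long, so a real status endpoint would likely return counts or paginated lists instead; that choice is outside what the design above specifies.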

Database design

I will use MySQL as the database. It needs the following tables:

web_pages

id BIGINT AUTO_INCREMENT PRIMARY KEY

content LONGTEXT

metadata JSON

title VARCHAR(255)

url VARCHAR(3000) UNIQUE NOT NULL

created_time DATETIME

update_time DATETIME

jobs

id INT AUTO_INCREMENT PRIMARY KEY

status VARCHAR(20)

priority VARCHAR(20)

url VARCHAR(3000) UNIQUE NOT NULL

retry_num INT UNSIGNED

created_time DATETIME

update_time DATETIME

Since we need to store roughly 7 PB of data, a single MySQL instance is not enough, so we shard the tables by id.

Because we only store HTML, LONGTEXT is sufficient for the content column.

Alternatively, we could use a NoSQL database such as MongoDB or Cassandra.
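One caveat with sharding by an auto-increment id is that the UNIQUE constraint on url can then only be enforced within a single shard, and a unique index on a VARCHAR(3000) column may also exceed InnoDB's index key length limit, depending on the character set. A common alternative, shown here as an assumption rather than as part of the design above, is to normalize each URL, hash it, and use the fixed-length hash both as the deduplication key and as the shard key. `NUM_SHARDS` and the function names are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

NUM_SHARDS = 1024  # hypothetical shard count

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Reduce trivially different spellings of a URL to one canonical form.

    Lowercases scheme and host, drops default ports, userinfo, and the
    fragment, and uses "/" for an empty path. Query strings are kept as-is.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def url_key(url: str) -> str:
    """Fixed-length (64 hex chars) key suitable for a unique index."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


def shard_for(url: str) -> int:
    """Map a URL to a shard so that duplicates always land on the same shard."""
    return int(url_key(url)[:16], 16) % NUM_SHARDS
```

With this scheme, two spellings of the same page such as `HTTP://Example.com:80/a#top` and `http://example.com/a` produce the same key and the same shard, so a per-shard unique index on the key is enough to reject duplicates. The normalization rules here are deliberately conservative; a real crawler would need to decide how to treat query parameter order, trailing slashes, and IPv6 hosts.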

High-level design

Request flows

Detailed component design

Trade offs/Tech choices

Failure scenarios/bottlenecks

Future improvements

