Design A Web Crawler - System Design

System requirements

Functional:

Crawling: The system should be able to discover and visit web pages using algorithms like Depth-First Search (DFS) or Breadth-First Search (BFS).
Content Downloading: The ability to download the content (HTML, images, videos, etc.) from the web pages that are visited.
Data Storage: Store the retrieved data in a structured format for easy access and retrieval later. This could include a database or a data lake.
Politeness/Respecting Robots.txt: The crawler should respect the robots.txt file to only crawl areas of the website that are allowed to be crawled.
Rate Limiting: Implement rate limiting to avoid overwhelming web servers when making requests.
Error Handling and Logging: Keep track of any errors during crawling and provide logs for diagnostics.
Scheduling: Ability to schedule crawl sessions at defined intervals to capture updated content.
Deduplication: Ensure that duplicate content is managed efficiently and only unique content is processed.

Non-Functional:

Scalability: The system should be able to handle an increasing volume of data and traffic without significant degradation in performance. This includes horizontal scaling (adding more machines) and vertical scaling (adding resources to existing machines).
Performance: The crawler should be optimized for speed and responsiveness. This includes the time it takes to fetch pages and process downloaded content.
Reliability: The system should be reliable, ensuring that it can run for long periods without failure and can recover quickly from any issues.
Maintainability: The system should be designed in a way that makes it easy to update or modify components without extensive rework. This includes clear documentation and modular design.
Security: The crawler must handle sensitive information appropriately, must not expose vulnerabilities that malicious users could exploit, and should avoid behavior that gets it blocked by target websites.
Compliance: Ensure that the crawler complies with legal and ethical requirements, including respecting terms of service and privacy laws.
Efficiency: Optimize the use of system resources (CPU, memory, bandwidth) to ensure that the crawler runs efficiently.
Usability: The system should be user-friendly, providing intuitive interfaces for monitoring, configuration, and reporting.

Capacity estimation

Initial estimate: 100 * 10^6 new URLs daily

HTML content: 100 KB per page

Media: 500 KB per image

5 media items per page

100 pages per URL

With these numbers we can estimate QPS and storage requirements:

1) QPS = 100 * 10^6 / (24 * 60 * 60) ≈ 1157 new URLs per second

2) Peak QPS = QPS * 2 ≈ 2300 QPS

3) Total number of pages = 100 * 10^6 * 100 = 10 billion pages per day

4) HTML storage = total number of pages * 100 KB / 1024 / 1024 / 1024 ≈ 931 TB per day

5) Media storage = total number of pages * 5 * 500 KB / 1024 / 1024 / 1024 ≈ 23283 TB per day

6) If one server can hold 1 TB, we need about 932 servers for HTML and about 23284 servers for media for each day of crawled data
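The arithmetic can be checked with a short script. This is a minimal sketch that uses the inputs listed at the top of this section (100 KB of HTML per page, 500 KB per image, 5 images per page, 100 pages per URL) and the 1 TB per server assumption; the constant names are made up for the example.

```python
import math

# Inputs taken from the estimates above.
NEW_URLS_PER_DAY = 100 * 10**6
PAGES_PER_URL = 100
HTML_KB_PER_PAGE = 100
MEDIA_KB_PER_ITEM = 500
MEDIA_PER_PAGE = 5
SERVER_CAPACITY_TB = 1

SECONDS_PER_DAY = 24 * 60 * 60
KB_PER_TB = 1024**3  # KB -> MB -> GB -> TB

qps = NEW_URLS_PER_DAY / SECONDS_PER_DAY
peak_qps = 2 * qps
total_pages = NEW_URLS_PER_DAY * PAGES_PER_URL

html_tb = total_pages * HTML_KB_PER_PAGE / KB_PER_TB
media_tb = total_pages * MEDIA_PER_PAGE * MEDIA_KB_PER_ITEM / KB_PER_TB

html_servers = math.ceil(html_tb / SERVER_CAPACITY_TB)
media_servers = math.ceil(media_tb / SERVER_CAPACITY_TB)

print(f"QPS: {qps:.0f}, peak QPS: {peak_qps:.0f}")
print(f"Pages per day: {total_pages:,}")
print(f"HTML: {html_tb:.0f} TB ({html_servers} servers)")
print(f"Media: {media_tb:.0f} TB ({media_servers} servers)")
```

Changing any input constant recomputes every downstream figure, which makes it easy to test other page sizes or server capacities.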

API design

GET /crawled/html

GET /crawled/media

POST /url

POST /url/batch

POST /url/schedule

POST /url/filter

POST /url/robots

POST /rate

GET /settings

Database design

URL <--Page

URL : +String URL PK FK

URL : +int WorkerId

URL : +DateTime LastUpdated INDEX

Page <--Media

Page : +String URL PK FK

Page : +String Html

Page : +int MediaKey

Page : +DateTime LastUpdated INDEX

Media : +int MediaKey PK FK

Media : +String MediaLink

Media : +DateTime LastUpdated INDEX
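One way to read the entities above is as three relational tables. The sketch below uses an in-memory SQLite database purely for illustration; the snake_case table and column names, the TEXT timestamps, and the direction of the foreign keys (Page referencing URL and Media) are assumptions, since the diagram does not spell them out.

```python
import sqlite3

# Hypothetical relational rendering of the URL, Page and Media entities.
SCHEMA = """
CREATE TABLE url (
    url          TEXT PRIMARY KEY,
    worker_id    INTEGER,
    last_updated TEXT
);
CREATE INDEX idx_url_last_updated ON url (last_updated);

CREATE TABLE media (
    media_key    INTEGER PRIMARY KEY,
    media_link   TEXT,
    last_updated TEXT
);
CREATE INDEX idx_media_last_updated ON media (last_updated);

CREATE TABLE page (
    url          TEXT PRIMARY KEY REFERENCES url (url),
    html         TEXT,
    media_key    INTEGER REFERENCES media (media_key),
    last_updated TEXT
);
CREATE INDEX idx_page_last_updated ON page (last_updated);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the three tables and their LastUpdated indexes."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)

now = "2024-01-01T00:00:00"
conn.execute("INSERT INTO url VALUES (?, ?, ?)", ("https://example.com/", 1, now))
conn.execute("INSERT INTO media VALUES (?, ?, ?)", (1, "media/1.png", now))
conn.execute(
    "INSERT INTO page VALUES (?, ?, ?, ?)",
    ("https://example.com/", "<html></html>", 1, now),
)

tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

The indexes on last_updated support queries such as "which pages have not been refreshed since a given date", which is what the scheduling requirement needs.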

High-level design

1) Rate limiter - limits the number of requests per user and the maximum number of URLs a user can submit, using a leaky bucket or a queue-based algorithm

2) URL frontier - a service that manages the queues feeding the URL downloaders and parsers

3) HTML downloader - downloads the page behind a URL from the internet; it consults the DNS cache and the DNS service to resolve the URL's host to an IP address

4) Content parser - parses web page content and extracts media and links

5) Content filter - filters content based on user rules, such as adult-only content or content banned in a country

6) DuplicateHashSet - a set that records the pages we have already seen, used to reject duplicates

7) HTML storage - a blob or document storage to store web pages

8) URL extractor - extracts URLs from a web page

9) URL duplicate hash set - a set that records the URLs we have already seen, used to reject them so they are not crawled twice

10) URL storage - stores already seen URLs

11) Media storage - a blob storage for media
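The leaky bucket mentioned for the rate limiter (component 1) can be sketched as follows. This is an illustrative, single-process version; the class name, its parameters, and the injectable clock are assumptions made for the example, and a real deployment would keep one bucket per user in shared storage.

```python
import time


class LeakyBucket:
    """Leaky bucket used as a meter: requests fill it, time drains it."""

    def __init__(self, capacity, leak_rate_per_sec, clock=time.monotonic):
        self.capacity = capacity            # maximum burst size
        self.leak_rate = leak_rate_per_sec  # sustained requests per second
        self.clock = clock                  # injectable so it can be tested
        self.level = 0.0
        self.last = clock()

    def allow(self, cost=1):
        """Return True and record the request if it fits in the bucket."""
        now = self.clock()
        # Drain whatever leaked out since the previous call.
        self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + cost > self.capacity:
            return False
        self.level += cost
        return True


# Usage with a fake clock: a burst of 2 is allowed, then 1 request per second.
fake_time = [0.0]
bucket = LeakyBucket(capacity=2, leak_rate_per_sec=1.0, clock=lambda: fake_time[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_time[0] = 1.0
after_one_second = bucket.allow()
```

Passing a cost equal to the number of URLs in a batch would let the same bucket enforce the per-user URL limit.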

Request flows

1) A user submits a set of seed (starting) URLs to crawl

2) The rate limiter checks whether the submitted URLs are within the user's limits

3) If the check succeeds, the rate limiter passes the URLs to the URL frontier

4) The URL frontier computes the priority of each link based on the keywords inside the link, then chooses a queue worker among the HTML downloaders based on hash(link) % workerCount

5) The URL is passed to the HTML downloader worker, which issues a network request to DNS (on a DNS cache miss), downloads the content, parses it with the Content parser, and filters it with the Content filter

6) Parsed content is passed to the DuplicateHashSet, which computes the hash of the content and checks whether we have already seen the page. If the page is new, or it was last downloaded longer ago than a predefined threshold, we pass it further; otherwise it is rejected as a duplicate

7) The page's HTML content is saved in HtmlStorage

8) The page's media content is saved in MediaStorage

9) The page's URLs are extracted and passed to the UrlDuplicateHashSet, which computes the hash of each link and checks whether we have seen the URL. If we have not, the URL is passed further; if we have, the URL is rejected

When extracting new URLs, the system respects robots.txt and skips URLs excluded by the site's Disallow rules

10) The URL is saved in URL storage

11) Newly extracted URLs are passed back to the URL frontier so that crawling continues recursively

12) Notification service notifies client about operation completion
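Steps 4, 6, and 9 and the robots.txt check can be sketched with the standard library. All names below are hypothetical. The sketch assumes that a page counts as a duplicate when the same content was stored more recently than a refresh threshold, and it uses hashlib rather than the built-in hash() because Python randomizes string hashes per process, which would route the same URL to different workers after a restart.

```python
import hashlib
from urllib.robotparser import RobotFileParser


def stable_hash(text):
    """Process-independent 64-bit hash of a string."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_worker(url, worker_count):
    """Step 4: hash(link) % workerCount."""
    return stable_hash(url) % worker_count


class ContentDeduplicator:
    """Step 6: reject content that was stored recently."""

    def __init__(self, refresh_after_seconds):
        self.refresh_after = refresh_after_seconds
        self.seen = {}  # content hash -> time the content was last stored

    def should_store(self, html, now):
        key = hashlib.sha256(html.encode("utf-8")).hexdigest()
        last = self.seen.get(key)
        if last is not None and now - last < self.refresh_after:
            return False  # recent duplicate
        self.seen[key] = now
        return True


class UrlDeduplicator:
    """Step 9: pass each URL on only the first time it is seen."""

    def __init__(self):
        self.seen = set()

    def is_new(self, url):
        key = stable_hash(url)
        if key in self.seen:
            return False
        self.seen.add(key)
        return True


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In-memory sets only work for a single process; sharing the deduplication state across workers needs an external store, and a 64-bit hash trades a small collision risk for memory.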

Detailed component design

1) The URL frontier service is responsible for scheduling the workers that crawl the Internet for pages. It consists of the following:

Queue router: to ensure politeness (not overwhelming a host with too many crawl requests) we maintain a queue for every host
To avoid overwhelming our crawler, we scale the downloading service into multiple workers behind queues. If a queue is full, we wait until workers finish their jobs
To ensure that priority links are scheduled first, we compute a priority code from each link based on its keywords. Before load balancing, we maintain a set of queues ordered by priority, and the links with the highest priority go to the first queues in the list
Every worker has a content parser and content filter built in because both operate on a per-page basis. We perform parsing and filtering map-reduce style (as the map step)
The media and HTML storages are global and shared by all workers, as is the URL duplicate hash set

2) For DNS, every worker has a local DNS cache. On a local cache miss, the worker contacts a global DNS cache shared by all workers. If that also misses, the worker contacts a DNS server

3) When a URL is fully downloaded, the notification service notifies the user about the success
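The frontier described in item 1 combines two ideas: a queue per host for politeness and keyword-based priority. A minimal single-process sketch follows; the keyword table, the default priority, and the fixed per-host delay are assumptions for the example, and the linear scan over hosts would need a smarter index at real scale.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical keyword ranks; a lower number means a more urgent link.
PRIORITY_KEYWORDS = {"news": 0, "blog": 1}
DEFAULT_PRIORITY = 5


def priority_of(url):
    lowered = url.lower()
    ranks = [rank for kw, rank in PRIORITY_KEYWORDS.items() if kw in lowered]
    return min(ranks, default=DEFAULT_PRIORITY)


class UrlFrontier:
    def __init__(self, per_host_delay):
        self.per_host_delay = per_host_delay
        self.host_queues = defaultdict(list)  # host -> heap of (priority, seq, url)
        self.next_allowed = {}                # host -> earliest next fetch time
        self.seq = 0                          # tie-breaker, keeps FIFO order

    def add(self, url):
        host = urlsplit(url).hostname or ""
        heapq.heappush(self.host_queues[host], (priority_of(url), self.seq, url))
        self.seq += 1

    def next_url(self, now):
        """Return the most urgent URL whose host may be contacted now."""
        best_host = None
        for host, heap in self.host_queues.items():
            if not heap or self.next_allowed.get(host, 0) > now:
                continue  # empty queue, or host is still cooling down
            if best_host is None or heap[0] < self.host_queues[best_host][0]:
                best_host = host
        if best_host is None:
            return None
        _, _, url = heapq.heappop(self.host_queues[best_host])
        self.next_allowed[best_host] = now + self.per_host_delay
        return url


frontier = UrlFrontier(per_host_delay=10)
frontier.add("https://a.example/page")
frontier.add("https://a.example/news/1")
frontier.add("https://b.example/page")
first = frontier.next_url(now=0)   # priority link on host a
second = frontier.next_url(now=0)  # host a is cooling down, so host b
third = frontier.next_url(now=0)   # nothing is eligible yet
fourth = frontier.next_url(now=10) # host a may be contacted again
```

Returning None when every host is cooling down matches the "wait until workers finish" behavior: the caller simply retries later.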

Trade offs/Tech choices

We use a queueing strategy to provide politeness toward hosts and to distribute load among workers based on priority.
For media storage we use a blob storage like Amazon S3 and store the metadata in a SQL database because it is structured.
For page data we use a document storage that scales horizontally for writes, like MongoDB.

To notify the user we use a long polling strategy because HTML downloading is a long-running process and we don't need real-time updates.

For the queuing system we can use Apache Kafka with priority computation or our own solution.

Care must be taken when multiple workers access shared resources such as the global DNS cache or the duplicate sets. Because we do not expect much contention, and for ease of implementation, we choose a straightforward locking mechanism for the whole service.

To scale page and media storage we can shard the storage by region or by IP address range.
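The locking choice for shared resources can be illustrated with the two-level DNS cache: a per-worker local cache, then a shared cache guarded by one lock, then the DNS server. The class names are hypothetical, and the lookup function is injected so the sketch can run without network access; socket.gethostbyname is a real standard library call that could serve as the default.

```python
import socket
import threading


class SharedDnsCache:
    """Global cache shared by all workers, protected by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}

    def get(self, host):
        with self._lock:
            return self._entries.get(host)

    def put(self, host, ip):
        with self._lock:
            self._entries[host] = ip


class WorkerResolver:
    """Per-worker resolver: local cache, then shared cache, then DNS."""

    def __init__(self, shared, lookup=socket.gethostbyname):
        self.local = {}       # only this worker touches it, so no lock
        self.shared = shared
        self.lookup = lookup

    def resolve(self, host):
        ip = self.local.get(host)
        if ip is None:
            ip = self.shared.get(host)
            if ip is None:
                # Miss in both caches: ask the DNS server and publish the answer.
                ip = self.lookup(host)
                self.shared.put(host, ip)
            self.local[host] = ip
        return ip


# Usage with a fake lookup that records how often DNS is actually queried.
calls = []
fake_lookup = lambda host: calls.append(host) or "203.0.113.7"
shared = SharedDnsCache()
worker1 = WorkerResolver(shared, fake_lookup)
worker2 = WorkerResolver(shared, fake_lookup)
answers = [worker1.resolve("example.com"), worker2.resolve("example.com")]
```

Two workers that miss at the same moment may both query DNS, which is harmless here; the sketch also omits entry expiry, which a real cache would need to honor DNS TTLs.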

Failure scenarios/bottlenecks

1) If a crawler worker fails in the middle of a download, we can restart the worker with the same URLs

2) In the case of data overflow we can spawn more workers

3) The DFS algorithm is not used because link chains on the internet can go very deep; instead we use BFS

4) To prevent overly deep recursive links, we can limit the URL length, or maintain a set of already seen URLs for every worker and check each new URL against it
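Points 3 and 4 together amount to a breadth-first traversal with a seen set and a URL length cap. A minimal sketch, where get_links stands in for the download-and-extract step and the 2048-character default is an arbitrary assumption:

```python
from collections import deque


def crawl_bfs(seeds, get_links, max_url_length=2048):
    """Visit URLs breadth-first, skipping seen or overly long URLs."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        url = queue.popleft()  # FIFO order is what makes this BFS
        order.append(url)
        for link in get_links(url):
            if len(link) > max_url_length or link in seen:
                continue
            seen.add(link)
            queue.append(link)
    return order


# Usage on a tiny in-memory link graph that contains a cycle (a <-> b)
# and one suspiciously long URL.
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d", "x" * 50],
    "d": [],
}
visited = crawl_bfs(["a"], lambda url: graph.get(url, []), max_url_length=10)
```

Because the queue is explicit, the traversal never recurses, so deep link chains cannot exhaust the call stack the way a recursive DFS could.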

Future improvements

1) Add a more sophisticated filter for page content, for example using ML to filter pages

2) Log the time a worker takes to process each page and feed the data into a monitoring subsystem, to track system performance and to support an adaptive strategy for different hosts