Design A Web Crawler - System Design

System requirements

Functional:

Web Crawling:
The system should be able to initiate HTTP requests to web pages and download their content.
It must support different protocols (HTTP, HTTPS) and handle redirects appropriately.
The system should support handling various content types such as HTML, images, and other media files.
URL Frontier Management:
Maintain a dynamic URL frontier to prioritize and manage the order in which URLs are crawled.
Implement a mechanism to add new URLs to the frontier and remove them once crawled.
Support for scheduling URLs based on predefined rules or priorities.
Robots.txt Compliance:
The system must fetch and interpret the robots.txt file for each domain to ensure compliance with crawling rules.
Respect rules specified in robots.txt, including crawl-delay and disallow directives.
Data Extraction:
Extract relevant information from web pages, including text, images, links, metadata, etc.
Support for parsing and extracting structured data formats such as HTML, XML, or JSON.
Content Deduplication:
Implement a mechanism to identify and eliminate duplicate content to minimize redundant crawling.
Use hash functions or other techniques to compare and identify duplicate content.
Crawl Depth Control:
Provide a mechanism to customize the depth of crawling, limiting how many levels deep the crawler should follow links.
Support for setting depth limits based on domain or URL patterns.
Data Storage:
Store crawled data in a structured format (database, file system, etc.).
Support for efficient retrieval and update operations for the stored data.
Regular Crawling:
Schedule and perform regular crawls to ensure data freshness.
Support for incremental crawling to update only modified or new content.
Prevention of Infinite Loop:
Implement mechanisms to prevent the crawler from getting stuck in an infinite loop. Maintain state information to track visited URLs and avoid revisiting them unnecessarily.
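
The crawl depth and infinite-loop requirements above can be sketched together as a breadth-first traversal that keeps a visited set and stops following links at a depth limit. This is a minimal illustration, not the system's actual crawler: the LINKS dictionary is a hypothetical in-memory link graph standing in for real page fetches, and crawl and get_links are names made up for the example.

```python
from collections import deque
from urllib.parse import urldefrag

# Hypothetical in-memory link graph standing in for real page fetches.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/a#top"],
    "https://example.com/c": ["https://example.com/d"],
}


def crawl(seed, max_depth, get_links):
    """Breadth-first crawl that visits each URL at most once, up to max_depth."""
    visited = {seed}                  # state that prevents revisiting URLs
    frontier = deque([(seed, 0)])     # (url, depth) pairs waiting to be crawled
    order = []
    while frontier:
        url, depth = frontier.popleft()
        order.append(url)
        if depth == max_depth:
            continue                  # do not follow links past the depth limit
        for link in get_links(url):
            link, _fragment = urldefrag(link)   # "#top" refers to the same page
            if link not in visited:
                visited.add(link)
                frontier.append((link, depth + 1))
    return order


crawl_order = crawl("https://example.com/", 2, lambda u: LINKS.get(u, []))
```

With a depth limit of 2, the page that is three links away from the seed is never fetched, and the cycle between the seed and /a does not cause a second visit.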

Non-Functional:

Performance:
The system should be able to crawl and download content from a large number of web pages within a reasonable time frame.
Efficient handling of parallel requests to optimize throughput.
Scalability:
The system must be scalable to handle the specified workload of 1 billion web pages per month.
Reliability:
The crawler should operate reliably under normal and adverse conditions, minimizing downtime.
Robustness:
The system should gracefully handle errors and unexpected situations to ensure continuous operation.
Security:
Ensure secure handling of crawled content and avoid potential security vulnerabilities.
Implement secure communication (HTTPS) when interacting with web servers.
Throttling:
Throttling mechanisms should be effective in preventing the crawler from overwhelming servers while maintaining efficient crawling.
Compliance:
The system must comply with legal and ethical standards for web crawling.
Respect privacy and legal restrictions on crawling certain types of content.
Monitoring and Logging:
Implement comprehensive monitoring and logging to track system performance, errors, and crawling statistics.
Provide tools for administrators to analyze and troubleshoot issues.
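
One possible way to approach the throttling requirement above is to track, per host, the earliest time at which the next request is allowed. The sketch below assumes a fixed minimum delay per host; the HostThrottle class and its method names are hypothetical, and times are passed in explicitly so the logic can be checked without sleeping.

```python
from urllib.parse import urlsplit


class HostThrottle:
    """Tracks the earliest time each host may be requested again."""

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}   # host -> earliest permitted request time

    def wait_time(self, url, now):
        """Return how many seconds the caller should wait before fetching url."""
        host = urlsplit(url).netloc
        return max(0.0, self.next_allowed.get(host, now) - now)

    def record_request(self, url, now):
        """Record that url was requested at time `now`."""
        host = urlsplit(url).netloc
        self.next_allowed[host] = now + self.min_delay


throttle = HostThrottle(min_delay_seconds=2.0)
throttle.record_request("https://example.com/a", now=100.0)
wait_same_host = throttle.wait_time("https://example.com/b", now=100.5)
wait_other_host = throttle.wait_time("https://example.org/", now=100.5)
```

Keying the delay on the host rather than the full URL means two different pages on the same server still share one budget, while unrelated hosts are not slowed down.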

Capacity estimation

Assumptions

The system crawls 1 billion web pages per month
Each web page has 10 media files on average
The average media file size is 1 MB

Storage Required for Web Pages:

Let us assume each web page has 100 KB of data on average, so we need

Storage = Number of Pages * Average Page size * 12 months

Storage = 1 billion * 100 KB * 12

Storage = 1,000,000,000 * 100,000 bytes * 12 = 1.2 * 10^15 bytes = 1.2 PB

So, we would need 1.2 PB storage for 1 year

Storage required for media files

Each web page has 10 media files on average, and each media file is 1 MB, so we need

Storage = Number of Pages * Average media file size * 10 files * 12 months

Storage = 1 billion * 1 MB * 10 * 12

Storage = 1,000,000,000 * 1,000,000 bytes * 10 * 12 = 1.2 * 10^17 bytes = 120 PB

So, we would need 120 PB storage for 1 year
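
The two estimates above can be reproduced with a few lines of arithmetic. The sketch uses decimal units (1 KB = 1,000 bytes, 1 PB = 10^15 bytes), which is what the figures above imply. The average fetch rate at the end is a derived number that is not stated in the assumptions, and it assumes a 30-day month.

```python
PAGES_PER_MONTH = 1_000_000_000
AVG_PAGE_BYTES = 100 * 1000        # 100 KB per page, decimal units
MEDIA_FILES_PER_PAGE = 10
AVG_MEDIA_BYTES = 1_000_000        # 1 MB per media file
MONTHS = 12
PB = 10 ** 15                      # bytes in a petabyte, decimal units

# Storage for one year of crawled pages and media files, in petabytes.
page_storage_pb = PAGES_PER_MONTH * AVG_PAGE_BYTES * MONTHS / PB
media_storage_pb = (
    PAGES_PER_MONTH * MEDIA_FILES_PER_PAGE * AVG_MEDIA_BYTES * MONTHS / PB
)

# Derived figure (assumption: 30-day month): average pages fetched per second.
pages_per_second = PAGES_PER_MONTH / (30 * 24 * 3600)

print(f"pages: {page_storage_pb} PB, media: {media_storage_pb} PB")
print(f"average fetch rate: {pages_per_second:.0f} pages/s")
```

Media files dominate the total by two orders of magnitude, so the choice of media store matters far more for cost than the choice of page store.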

API design

Below is a list of the APIs required for the system. The list may not be exhaustive, but it provides a good starting point.

/api/addUrl:
Description: This API is used to add a new URL to the crawler's queue for processing.
Input: The input will be a JSON object containing the URL to be added.
Output: The output will be a response indicating the success or failure of the operation.
/api/getNextUrl:
Description: This API retrieves the next URL from the crawler's queue for crawling.
Input: No input parameters are required.
Output: The output is the URL to be crawled next or an indication that the queue is empty.
/api/crawlPage:
Description: Initiates the crawling process for a given URL.
Input: JSON object containing the URL and associated crawl parameters.
Output: Response indicating the success or failure of the crawling process.
/api/getCrawledData:
Description: Retrieves the crawled data for a specific URL.
Input: JSON object containing the URL or other identification parameters.
Output: The output is the crawled data, including text, images, links, etc., in a structured format.
/api/setCrawlDepth:
Description: Allows customization of the crawl depth for a specific URL or domain.
Input: JSON object containing the URL or domain and the desired crawl depth.
Output: Response indicating the success or failure of the operation.
/api/updateRobotsTxt:
Description: Updates the internal representation of robots.txt for a given domain.
Input: JSON object containing the domain and the updated robots.txt content.
Output: Response indicating the success or failure of the update.
/api/getCrawlStatus:
Description: Retrieves the current status of the crawling process.
Input: No input parameters are required.
Output: The output is a summary of the crawling status, including the number of pages crawled, errors encountered, etc.
/api/setThrottle:
Description: Allows dynamic adjustment of the crawling speed to prevent overloading servers.
Input: JSON object containing the throttle parameters (e.g., delay between requests).
Output: Response indicating the success or failure of the operation.
/api/checkCrawledStatus:
Description: Checks if a given URL has already been crawled.
Input: JSON object containing the URL.
Output: Response indicating whether the URL has been crawled or not.
/api/removeUrl:
Description: Removes a specific URL from the crawler's queue.
Input: JSON object containing the URL to be removed.
Output: Response indicating the success or failure of the removal operation.
/api/getMediaForUrl:
Description: Retrieves media files (images, videos, etc.) associated with a specific URL.
Input: JSON object containing the URL.
Output: The output is a list of media files with their respective URLs and metadata.
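
To make the queue-related endpoints more concrete, here is a minimal in-memory sketch of /api/addUrl, /api/getNextUrl, /api/removeUrl, and /api/checkCrawledStatus as plain methods. Everything in it is an assumption made for illustration: the CrawlQueueService class, the "url" payload key, and the response field names are hypothetical, and a real service would sit behind an HTTP framework and a distributed queue.

```python
from collections import deque


class CrawlQueueService:
    """In-memory stand-in for the queue-related endpoints listed above."""

    def __init__(self):
        self._queue = deque()    # URLs waiting to be crawled, in FIFO order
        self._queued = set()     # fast membership check for the queue
        self._crawled = set()    # URLs already handed out for crawling

    def add_url(self, payload):                  # /api/addUrl
        url = payload.get("url")
        if not url:
            return {"success": False, "error": "missing url"}
        if url in self._queued or url in self._crawled:
            return {"success": False, "error": "already known"}
        self._queue.append(url)
        self._queued.add(url)
        return {"success": True}

    def get_next_url(self):                      # /api/getNextUrl
        if not self._queue:
            return {"url": None, "empty": True}
        url = self._queue.popleft()
        self._queued.discard(url)
        # Simplification: a URL counts as crawled as soon as it is handed out.
        self._crawled.add(url)
        return {"url": url, "empty": False}

    def remove_url(self, payload):               # /api/removeUrl
        url = payload.get("url")
        if url not in self._queued:
            return {"success": False, "error": "not queued"}
        self._queue.remove(url)
        self._queued.discard(url)
        return {"success": True}

    def check_crawled_status(self, payload):     # /api/checkCrawledStatus
        return {"crawled": payload.get("url") in self._crawled}


service = CrawlQueueService()
first = service.add_url({"url": "https://example.com/"})
duplicate = service.add_url({"url": "https://example.com/"})
next_item = service.get_next_url()
status = service.check_crawled_status({"url": "https://example.com/"})
after_drain = service.get_next_url()
```

Returning an explicit "empty" flag from get_next_url keeps the "queue is empty" case distinguishable from an error, which matches the output described for /api/getNextUrl.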

Database design

For the tables required for this problem, refer to the diagram below.

Database Choice

Relational Database for URL and Metadata:
Example: PostgreSQL or MySQL
CAP Focus: Balanced (Consistency and Availability)
Reasoning: Relational databases typically offer a balanced approach between consistency and availability. They ensure data integrity through ACID properties (Atomicity, Consistency, Isolation, Durability) while providing reasonable availability.
Document-oriented Database for HTML and CSS Content:
Example: MongoDB
CAP Focus: Balanced (Consistency and Availability)
Reasoning: Document-oriented databases such as MongoDB let the application tune read and write settings to trade consistency against availability, which supports a balance between the two. They are suitable for semi-structured data like HTML and CSS content.
Distributed File System for Media Content:
Example: Hadoop HDFS or Amazon S3
CAP Focus: Availability
Reasoning: Distributed file systems are optimized for availability and partition tolerance, making them suitable for storing large media files. In a web crawling scenario, ensuring availability for continuous crawling is critical.

Application of CAP theorem on the web crawler: Focus on Availability and Partition Tolerance:

Reasoning: In the context of a web crawler, availability is crucial to handle a large volume of concurrent requests and ensure continuous crawling even in the presence of network partitions. While consistency is essential, a slight delay in updating crawled data due to eventual consistency is acceptable as long as the system remains available for crawling.
The focus on availability and partition tolerance implies that the system might experience eventual consistency in the data store, meaning that updates to the database may take some time to propagate across the entire system. This trade-off allows the web crawler to continue functioning even in the face of network partitions or temporary database unavailability.

Data Partitioning Strategy:

Hash-Based Partitioning:
Reasoning: Hash-based partitioning is the best strategy for this problem, as it evenly distributes URLs across multiple servers, ensuring a balanced workload and efficient utilization of resources. This approach facilitates horizontal scalability and helps avoid hotspots, providing consistent and predictable performance for the web crawling system.
Algorithm: For URL-based hashing in a web crawler system, a cryptographic hash function such as SHA-256 (Secure Hash Algorithm 256-bit) can be a suitable choice.
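
A minimal sketch of URL-based hash partitioning with SHA-256, as described above. The partition_for function and the choice to use the first 8 bytes of the digest are assumptions made for the example, as are the sample URLs.

```python
import hashlib
from collections import Counter


def partition_for(url, num_partitions):
    """Map a URL to a partition index in [0, num_partitions)."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # The first 8 bytes of the digest are enough for an evenly spread integer.
    return int.from_bytes(digest[:8], "big") % num_partitions


# Hypothetical sample URLs, used only to show how the keys spread out.
sample_urls = [f"https://example.com/page/{i}" for i in range(1000)]
distribution = Counter(partition_for(url, 4) for url in sample_urls)
print(sorted(distribution.items()))
```

Because the mapping depends on num_partitions, changing the number of partitions reassigns most URLs, which is the main limitation of plain modulo-based partitioning.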

Sharding Strategy:

Hash-Based Sharding:
Reasoning: Hash-based sharding is the best strategy for this problem, as it distributes URLs, crawled data, and media files across shards based on a consistent hash function. This approach ensures even distribution, minimizes hotspots, and facilitates horizontal scalability, allowing the web crawler system to efficiently handle a large volume of data with balanced workloads.
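
If "consistent hash function" above is read as consistent hashing, one common way to implement it is a hash ring with virtual nodes. The sketch below is an illustration under that assumption, not necessarily the scheme the design intends; the HashRing class, the shard names, and the number of virtual nodes are all hypothetical.

```python
import bisect
import hashlib


def _hash(key):
    """Stable 64-bit integer hash of a string, derived from SHA-256."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring that places several virtual nodes per shard."""

    def __init__(self, shards, vnodes=100):
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard) for shard in shards for i in range(vnodes)
        )
        self._points = [point for point, _shard in self._ring]

    def shard_for(self, key):
        """Return the shard owning the first ring point after the key's hash."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"https://example.com/page/{i}" for i in range(1000)]
ring3 = HashRing(["shard-0", "shard-1", "shard-2"])
ring4 = HashRing(["shard-0", "shard-1", "shard-2", "shard-3"])

# Keys whose owner changes when a fourth shard is added.
moved = [k for k in keys if ring3.shard_for(k) != ring4.shard_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved after adding shard-3")
```

When a shard is added, only the keys that now fall on the new shard's ring points move; every other key stays where it was, which keeps rebalancing traffic limited.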

Read/Write Separation:

Read/write separation is beneficial here. Retrieving crawled data is read-heavy, so serving reads from replicas while sending writes to the primary lets the system scale horizontally for read-intensive tasks without affecting write operations.

High-level design

Frontend Component:
Responsibility: Provides a user interface for administrators to interact with the web crawler system: initiating crawls from seed URLs, monitoring progress, and configuring crawl parameters.
Technologies: Web framework (e.g., React, Angular), API communication.
Crawler Controller:
Responsibility: Orchestrates the overall crawling process, manages the URL queue, and coordinates communication between components. This can be a scheduled or queue-based job that keeps polling the queue for new URLs.
Technologies: Node.js, Python, or another suitable backend language.
URL Queue:
Responsibility: Manages the queue of URLs to be crawled, enqueues new URLs, and dequeues URLs for processing.
Technologies: Distributed queue system (e.g., RabbitMQ, Apache Kafka).
Crawler Worker Nodes:
Responsibility: Actively crawls web pages, extracts data, and saves information to the database.
Technologies: Node.js, Python, or another suitable language for parallel processing.
URL Deduplication Service:
Responsibility: Detects URLs that have already been queued or crawled so that they are not crawled again.
Robots.txt Checker:
Responsibility: Checks robots.txt files for compliance with crawling rules before initiating the crawl. This will be used to define boundaries and rules for the crawl job.
Data Storage (Relational Database):
Responsibility: Stores URLs, crawled data, and metadata in a structured format for further analysis.
Technologies: PostgreSQL, MySQL.
Media Storage (Distributed File System):
Responsibility: Stores media files (images, videos) associated with crawled pages.
Technologies: Hadoop HDFS, Amazon S3.
Load Balancer:
Responsibility: Distributes incoming requests across multiple Crawler Worker Nodes to balance the workload.
Technologies: Nginx, HAProxy.
Monitoring and Logging:
Responsibility: Captures and logs system events, errors, and performance metrics for monitoring and troubleshooting.
Technologies: ELK Stack (Elasticsearch, Logstash, Kibana), Prometheus.
URL Extractor:
Responsibility: Extracts URLs from web pages to identify additional links for crawling.
Technologies: HTML parsing libraries (e.g., BeautifulSoup, jsoup).
Content Parser:
Responsibility: Parses HTML content to extract relevant information such as text, metadata, and links.
Technologies: HTML parsing libraries, regular expressions.
HTML Downloader:
Responsibility: Downloads HTML content from web pages for further processing.
Technologies: HTTP client libraries (e.g., axios, requests).
DNS Resolver:
Responsibility: Resolves domain names to IP addresses for efficient crawling and content retrieval.
Technologies: DNS resolution libraries, system-level DNS resolution.
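
As an illustration of the Robots.txt Checker component listed above, the sketch below uses Python's standard urllib.robotparser module to evaluate disallow and crawl-delay rules. The robots.txt content and the "example-crawler" user agent are made up for the example, and the rules are parsed from a string rather than fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real checker would fetch this per domain.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether our (hypothetical) user agent may fetch each URL.
allowed_public = parser.can_fetch(
    "example-crawler", "https://example.com/index.html"
)
allowed_private = parser.can_fetch(
    "example-crawler", "https://example.com/private/data.html"
)

# Crawl-delay in seconds for this user agent, or None if not specified.
delay = parser.crawl_delay("example-crawler")

print(allowed_public, allowed_private, delay)
```

In a full crawler, the parsed rules would typically be cached per domain so that robots.txt is not downloaded again for every URL on the same site.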

Request flows

See the sequence diagram below for how the flow is orchestrated when a new URL is crawled.

Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...

Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?