Design A Web Crawler - System Design

System requirements

Functional:

Web Crawling:
The system should be able to initiate HTTP requests to web pages and download their content.
It must support different protocols (HTTP, HTTPS) and handle redirects appropriately.
The system should support handling various content types such as HTML, images, and other media files.
URL Frontier Management:
Maintain a dynamic URL frontier to prioritize and manage the order in which URLs are crawled.
Implement a mechanism to add new URLs to the frontier and remove them once crawled.
Support for scheduling URLs based on predefined rules or priorities.
Robots.txt Compliance:
The system must fetch and interpret the robots.txt file for each domain to ensure compliance with crawling rules.
Respect rules specified in robots.txt, including crawl-delay and disallow directives.
Data Extraction:
Extract relevant information from web pages, including text, images, links, metadata, etc.
Support for parsing and extracting structured data formats such as HTML, XML, or JSON.
Content Deduplication:
Implement a mechanism to identify and eliminate duplicate content to minimize redundant crawling.
Use hash functions or other techniques to compare and identify duplicate content.
Crawl Depth Control:
Provide a mechanism to customize the depth of crawling, limiting how many levels deep the crawler should follow links.
Support for setting depth limits based on domain or URL patterns.
Data Storage:
Store crawled data in a structured format (database, file system, etc.).
Support for efficient retrieval and update operations for the stored data.
Regular Crawling:
Schedule and perform regular crawls to ensure data freshness.
Support for incremental crawling to update only modified or new content.
Prevention of Infinite Loop:
Implement mechanisms to prevent the crawler from getting stuck in an infinite loop. Maintain state information to track visited URLs and avoid revisiting them unnecessarily.

Non-Functional:

Performance:
The system should be able to crawl and download content from a large number of web pages within a reasonable time frame.
Efficient handling of parallel requests to optimize throughput.
Scalability:
The system must be scalable to handle the specified workload of 1 billion web pages per month.
Reliability:
The crawler should operate reliably under normal and adverse conditions, minimizing downtime.
Robustness:
The system should gracefully handle errors and unexpected situations to ensure continuous operation.
Security:
Ensure secure handling of crawled content and avoid potential security vulnerabilities.
Implement secure communication (HTTPS) when interacting with web servers.
Throttling:
Throttling mechanisms should be effective in preventing the crawler from overwhelming servers while maintaining efficient crawling.
Compliance:
The system must comply with legal and ethical standards for web crawling.
Respect privacy and legal restrictions on crawling certain types of content.
Monitoring and Logging:
Implement comprehensive monitoring and logging to track system performance, errors, and crawling statistics.
Provide tools for administrators to analyze and troubleshoot issues.

Capacity estimation

Assumptions

The system crawls 1 billion web pages per month
Each web page has 10 media files on average
An average media file size of 1MB

Storage Required for Web Pages:

Let us assume each web page on average has 100KB of data, so we need

Storage = Number of Pages * Average Page size * 12 months

Storage = 1 billion * 100 KB * 12

Storage = 1,000,000,000 * 1,00,000 * 12 = 1.2 PB

So, we would need 1.2 PB storage for 1 year

Storage required for media files

Since each web page can have 10 media files and each media file is 1 MB.

Storage = Number of Pages * 1 media file size * 10 files * 12 months

Storage = 1 billion * 1 MB * 10 * 12

Storage = 1,000,000,000 * 1,000,000 * 10 * 12 = 120 PB

So, we would need 120 PB storage for 1 year

API design

Below is the list of API's required for the system, although this might not be the exhaustive list, this provides a good starting point.

/api/addUrl:
Description: This API is used to add a new URL to the crawler's queue for processing.
Input: The input will be a JSON object containing the URL to be added.
Output: The output will be a response indicating the success or failure of the operation.
/api/getNextUrl:
Description: This API retrieves the next URL from the crawler's queue for crawling.
Input: No input parameters are required.
Output: The output is the URL to be crawled next or an indication that the queue is empty.
/api/crawlPage:
Description: Initiates the crawling process for a given URL.
Input: JSON object containing the URL and associated crawl parameters.
Output: Response indicating the success or failure of the crawling process.
/api/getCrawledData:
Description: Retrieves the crawled data for a specific URL.
Input: JSON object containing the URL or other identification parameters.
Output: The output is the crawled data, including text, images, links, etc., in a structured format.
/api/setCrawlDepth:
Description: Allows customization of the crawl depth for a specific URL or domain.
Input: JSON object containing the URL or domain and the desired crawl depth.
Output: Response indicating the success or failure of the operation.
/api/updateRobotsTxt:
Description: Updates the internal representation of robots.txt for a given domain.
Input: JSON object containing the domain and the updated robots.txt content.
Output: Response indicating the success or failure of the update.
/api/getCrawlStatus:
Description: Retrieves the current status of the crawling process.
Input: No input parameters are required.
Output: The output is a summary of the crawling status, including the number of pages crawled, errors encountered, etc.
/api/setThrottle:
Description: Allows dynamic adjustment of the crawling speed to prevent overloading servers.
Input: JSON object containing the throttle parameters (e.g., delay between requests).
Output: Response indicating the success or failure of the operation.
/api/checkCrawledStatus:
Description: Checks if a given URL has already been crawled.
Input: JSON object containing the URL.
Output: Response indicating whether the URL has been crawled or not.
/api/removeUrl:
Description: Removes a specific URL from the crawler's queue.
Input: JSON object containing the URL to be removed.
Output: Response indicating the success or failure of the removal operation.
/api/getMediaForUrl:
Description: Retrieves media files (images, videos, etc.) associated with a specific URL.
Input: JSON object containing the URL.
Output: The output is a list of media files with their respective URLs and metadata.

Database design

For tables required for this problem, refer the below diagram.

Database Choice

Relational Database for URL and Metadata:
Example: PostgreSQL or MySQL
CAP Focus: Balanced (Consistency and Availability)
Reasoning: Relational databases typically offer a balanced approach between consistency and availability. They ensure data integrity through ACID properties (Atomicity, Consistency, Isolation, Durability) while providing reasonable availability.
Document-oriented Database for HTML and CSS Content:
Example: MongoDB
CAP Focus: Balanced (Consistency and Availability)
Reasoning: Document-oriented databases, while supporting eventual consistency, are designed to provide a balance between consistency and availability. They are suitable for semi-structured data like HTML and CSS content.
Distributed File System for Media Content:
Example: Hadoop HDFS or Amazon S3
CAP Focus: Availability
Reasoning: Distributed file systems are optimized for availability and partition tolerance, making them suitable for storing large media files. In a web crawling scenario, ensuring availability for continuous crawling is critical.

Application of CAP theorem on the web crawler: Focus on Availability and Partition Tolerance:

Reasoning: In the context of a web crawler, availability is crucial to handle a large volume of concurrent requests and ensure continuous crawling even in the presence of network partitions. While consistency is essential, a slight delay in updating crawled data due to eventual consistency is acceptable as long as the system remains available for crawling.
The focus on availability and partition tolerance implies that the system might experience eventual consistency in the data store, meaning that updates to the database may take some time to propagate across the entire system. This trade-off allows the web crawler to continue functioning even in the face of network partitions or temporary database unavailability.

Data Partitioning Strategy:

Hash-Based Partitioning:
Reasoning: Hash-based partitioning is the best strategy for this problem, as it evenly distributes URLs across multiple servers, ensuring a balanced workload and efficient utilization of resources. This approach facilitates horizontal scalability and helps avoid hotspots, providing consistent and predictable performance for the web crawling system.
Algorithm: For URL-based hashing in a web crawler system, a cryptographic hash function such as SHA-256 (Secure Hash Algorithm 256-bit) can be a suitable choice.

Sharding Strategy:

Hash-Based Sharding:
Reasoning: Hash-based sharding is the best strategy for this problem, as it distributes URLs, crawled data, and media files across shards based on a consistent hash function. This approach ensures even distribution, minimizes hotspots, and facilitates horizontal scalability, allowing the web crawler system to efficiently handle a large volume of data with balanced workloads.

Read/Write Separation:

Implementing read/write separation is beneficial. Since web crawling involves heavy read operations for retrieving data, separating read and write operations can enhance performance, allowing the system to scale horizontally for read-intensive tasks without affecting write operations.

High-level design

Frontend Component:
Responsibility: Provides a user interface to the admin for interacting with the web crawler system, allowing users to initiate crawls using seed URLs, monitor progress, and configure crawling parameters.
Technologies: Web framework (e.g., React, Angular), API communication.
Crawler Controller:
Responsibility: Orchestrates the overall crawling process, manages the URL queue, and coordinates communication between components. This can be a scheduled or a queue based job which keeps polling for new urls in the queue.
Technologies: Node.js, Python, or another suitable backend language.
URL Queue:
Responsibility: Manages the queue of URLs to be crawled, enqueues new URLs, and dequeues URLs for processing.
Technologies: Distributed queue system (e.g., RabbitMQ, Apache Kafka).
Crawler Worker Nodes:
Responsibility: Actively crawls web pages, extracts data, and saves information to the database.
Technologies: Node.js, Python, or another suitable language for parallel processing.
URL Deduplication Service:
Responsibility: This service ensures that duplicate URLs are minimized to avoid redundant crawling.
Robots.txt Checker:
Responsibility: Checks robots.txt files for compliance with crawling rules before initiating the crawl. This will be used to define boundaries and rules for the crawl job.
Data Storage (Relational Database):
Responsibility: Stores URLs, crawled data, and metadata in a structured format for further analysis.
Technologies: PostgreSQL, MySQL.
Media Storage (Distributed File System):
Responsibility: Stores media files (images, videos) associated with crawled pages.
Technologies: Hadoop HDFS, Amazon S3.
Load Balancer:
Responsibility: Distributes incoming requests across multiple Crawler Worker Nodes to balance the workload.
Technologies: Nginx, HAProxy.
Monitoring and Logging:
Responsibility: Captures and logs system events, errors, and performance metrics for monitoring and troubleshooting.
Technologies: ELK Stack (Elasticsearch, Logstash, Kibana), Prometheus.
URL Extractor:
Responsibility: Extracts URLs from web pages to identify additional links for crawling.
Technologies: HTML parsing libraries (e.g., BeautifulSoup, jsoup).
Content Parser:
Responsibility: Parses HTML content to extract relevant information such as text, metadata, and links.
Technologies: HTML parsing libraries, regular expressions.
HTML Downloader:
Responsibility: Downloads HTML content from web pages for further processing.
Technologies: HTTP client libraries (e.g., axios, requests).
DNS Resolver:
Responsibility: Resolves domain names to IP addresses for efficient crawling and content retrieval.
Technologies: DNS resolution libraries, system-level DNS resolution.

Request flows

Check the below sequence diagram to see how the flow is orchestrated when a new URL is being crawled.

Detailed component design

In this design, the system admin starts by adding seed URLs in the queue to start the process of web crawling. These URLs are added as messages in the queue which is then picked up by the Orchestration Web Servers, these servers continuously poll the queues for the latest URLs, these URLs are then forwarded to the load balancer which distributes the processing or crawling task to different web crawling worker nodes. These worker nodes call different components that validate and check the URL, download content and media, and store them in the database. Along with that another server parses the content and then extracts child links from the page and adds them back to the queue for being crawled.

Robots.txt

robots.txt is a standard used by websites to communicate with web crawlers and other web robots, specifying which parts of the site should not be crawled or accessed. It is a text file that resides at the root level of a website, and its purpose is to provide guidelines to web crawlers regarding the permissions and restrictions for accessing specific content.

The robots.txt file contains directives for specific web crawlers, identified by their user-agent names. The Disallow directive specifies which parts of the website should not be crawled. The Allow directive, when used, indicates exceptions to the Disallow directive.

Ethical web crawlers respect the rules specified in the robots.txt file and avoid crawling disallowed content.

Algorithm

For our web crawler, we will be using the Breadth-First Search (BFS) algorithm or a First-In-First-Out (FIFO) scheduling approach for link extraction, prioritization, and scheduling. Here are a few reasons for this decision:

Link Extraction: BFS starts by exploring the neighbor nodes of the initial URL before moving on to deeper levels. In the context of a web crawler, this means extracting links from the current web page and adding them to the queue for further processing.
Prioritization: BFS inherently prioritizes links at the same level (found in the same page) equally, ensuring a fair and balanced exploration of URLs across different depths.
Scheduling: The FIFO nature of BFS ensures that URLs are processed in the order they are discovered, promoting a systematic and structured traversal of the web.

Optimization Strategies:

Crawl Depth Control: Implementing crawl depth control allows the system to limit how many levels deep it follows links, preventing the crawler from going too deep into a website's hierarchy.
URL Deduplication: Ensuring that duplicate URLs are not crawled again helps avoid redundant crawling, optimizing the efficiency of the web crawler system.
Dynamic Throttling: Implementing dynamic rate limiting based on the server's response times helps prevent overloading servers and promotes fair and respectful crawling.
Distributing Requests: Requests to the same web server should be distributed across different servers, also there should be some gap between sending requests to the same server to prevent flooding the target servers.

Rate limiting strategies

Generally, a web crawler should avoid sending too many requests to the same hosting server within a short period. Sending too many requests is considered as “impolite” or even treated as denial-of-service (DOS) attack. For example, without any constraint, the crawler can send thousands of requests every second to the same website. This can overwhelm the web servers.

The general idea of enforcing politeness is to download one page at a time from the same host. A delay can be added between two download tasks. The politeness constraint is implemented by maintaining a mapping from website hostnames to download (worker) threads. Each downloader thread has a separate FIFO queue and only downloads URLs obtained from that queue.

How to maintain Freshness?

Web pages are constantly being added, deleted, and edited. A web crawler must periodically recrawl downloaded pages to keep our data set fresh. Recrawl all the URLs is time-consuming and resource-intensive. A few strategies to optimize freshness are listed as follows:

Recrawl based on web pages’ update history.
Prioritize URLs and recrawl important pages first and more frequently.

Optimization Strategies

Distributed crawl

To achieve high performance, crawl jobs are distributed into multiple servers, and each server runs multiple threads. The URL space is partitioned into smaller pieces; so, each downloader is responsible for a subset of the URLs.

Cache DNS Resolver

DNS Resolver is a bottleneck for crawlers because DNS requests might take time due to the synchronous nature of many DNS interfaces. DNS response time ranges from 10ms to 200ms. Once a request to DNS is carried out by a crawler thread, other threads are blocked until the first request is completed. Maintaining our DNS cache to avoid calling DNS frequently is an effective technique for speed optimization. Our DNS cache keeps the domain name to IP address mapping and is updated periodically by cron jobs.

Locality and Load Balancing

Distribute crawl servers geographically. When crawl servers are closer to website hosts, crawlers experience faster download time. Design locality applies to most of the system components: crawl servers, cache, queue, storage, etc.

While load balancing the incoming processing requests, we can understand the locations of the target servers and then distribute the requests to the web crawlers that are closest to these target servers.

Spider traps

A spider trap is a web page that causes a crawler in an infinite loop. For instance, an infinite deep directory structure is listed as follows: http://www.spidertrapexample.com/foo/bar/foo/bar/foo/bar/...

Such spider traps can be avoided by setting a maximum length for URLs. However, no one-size-fits-all solution exists to detect spider traps. Websites containing spider traps are easy to identify due to an unusually large number of web pages discovered on such websites. It is hard to develop automatic algorithms to avoid spider traps; however, a user can manually verify and identify a spider trap, and either exclude those websites from the crawler or apply some customized URL filters.

Trade offs/Tech choices

Sharding Challenges and Trade-offs:

Data Distribution:
Challenge: Uneven data distribution can occur if the sharding key is not selected carefully, leading to hotspots where certain shards receive more traffic than others.
Trade-off: Balancing the distribution may require redistributing data or choosing a different sharding strategy, potentially impacting system performance during re-sharding operations.
Complex Queries:
Challenge: Sharding can complicate queries that involve multiple shards, requiring coordination and potentially impacting performance.
Trade-off: Striking a balance between sharding for scalability and maintaining the ability to perform complex queries may involve careful schema design, denormalization, or use of distributed databases with support for distributed joins.
Data Migration:
Challenge: Moving data between shards during scaling or rebalancing can be resource-intensive and may lead to temporary unavailability.
Trade-off: Choosing an efficient data migration strategy involves trade-offs between downtime, resource utilization, and data consistency.

Failure scenarios/bottlenecks

BFS and problems with BFS

BFS is commonly used by web crawlers and is implemented by a first-in-first-out (FIFO) queue. In a FIFO queue, URLs are dequeued in the order they are enqueued. However, this implementation has two problems:

Most links from the same web page are linked back to the same host. for example, all the links in wikipedia.com are internal links, making the crawler busy processing URLs from the same host (wikipedia.com).
When the crawler tries to download web pages in parallel, Wikipedia servers will be flooded with requests. This is considered as “impolite”.

Standard BFS does not consider the priority of a URL. The web is large and not every page has the same level of quality and importance. Therefore, we may want to prioritize URLs according to their page ranks, web traffic, update frequency, etc.