Assumptions
Storage Required for Web Pages:
Let us assume each web page on average has 100KB of data, so we need
Storage = Number of Pages * Average Page size * 12 months
Storage = 1 billion * 100 KB * 12
Storage = 1,000,000,000 * 1,00,000 * 12 = 1.2 PB
So, we would need 1.2 PB storage for 1 year
Storage required for media files
Since each web page can have 10 media files and each media file is 1 MB.
Storage = Number of Pages * 1 media file size * 10 files * 12 months
Storage = 1 billion * 1 MB * 10 * 12
Storage = 1,000,000,000 * 1,000,000 * 10 * 12 = 120 PB
So, we would need 120 PB storage for 1 year
Below is the list of API's required for the system, although this might not be the exhaustive list, this provides a good starting point.
For tables required for this problem, refer the below diagram.
Application of CAP theorem on the web crawler: Focus on Availability and Partition Tolerance:
Data Partitioning Strategy:
Sharding Strategy:
Read/Write Separation:
Check the below sequence diagram to see how the flow is orchestrated when a new URL is being crawled.
In this design, the system admin starts by adding seed URLs in the queue to start the process of web crawling. These URLs are added as messages in the queue which is then picked up by the Orchestration Web Servers, these servers continuously poll the queues for the latest URLs, these URLs are then forwarded to the load balancer which distributes the processing or crawling task to different web crawling worker nodes. These worker nodes call different components that validate and check the URL, download content and media, and store them in the database. Along with that another server parses the content and then extracts child links from the page and adds them back to the queue for being crawled.
Robots.txt
robots.txt is a standard used by websites to communicate with web crawlers and other web robots, specifying which parts of the site should not be crawled or accessed. It is a text file that resides at the root level of a website, and its purpose is to provide guidelines to web crawlers regarding the permissions and restrictions for accessing specific content.
The robots.txt file contains directives for specific web crawlers, identified by their user-agent names. The Disallow directive specifies which parts of the website should not be crawled. The Allow directive, when used, indicates exceptions to the Disallow directive.
Ethical web crawlers respect the rules specified in the robots.txt file and avoid crawling disallowed content.
Algorithm
For our web crawler, we will be using the Breadth-First Search (BFS) algorithm or a First-In-First-Out (FIFO) scheduling approach for link extraction, prioritization, and scheduling. Here are a few reasons for this decision:
Optimization Strategies:
Rate limiting strategies
Generally, a web crawler should avoid sending too many requests to the same hosting server within a short period. Sending too many requests is considered as “impolite” or even treated as denial-of-service (DOS) attack. For example, without any constraint, the crawler can send thousands of requests every second to the same website. This can overwhelm the web servers.
The general idea of enforcing politeness is to download one page at a time from the same host. A delay can be added between two download tasks. The politeness constraint is implemented by maintaining a mapping from website hostnames to download (worker) threads. Each downloader thread has a separate FIFO queue and only downloads URLs obtained from that queue.
How to maintain Freshness?
Web pages are constantly being added, deleted, and edited. A web crawler must periodically recrawl downloaded pages to keep our data set fresh. Recrawl all the URLs is time-consuming and resource-intensive. A few strategies to optimize freshness are listed as follows:
Distributed crawl
To achieve high performance, crawl jobs are distributed into multiple servers, and each server runs multiple threads. The URL space is partitioned into smaller pieces; so, each downloader is responsible for a subset of the URLs.
Cache DNS Resolver
DNS Resolver is a bottleneck for crawlers because DNS requests might take time due to the synchronous nature of many DNS interfaces. DNS response time ranges from 10ms to 200ms. Once a request to DNS is carried out by a crawler thread, other threads are blocked until the first request is completed. Maintaining our DNS cache to avoid calling DNS frequently is an effective technique for speed optimization. Our DNS cache keeps the domain name to IP address mapping and is updated periodically by cron jobs.
Locality and Load Balancing
Distribute crawl servers geographically. When crawl servers are closer to website hosts, crawlers experience faster download time. Design locality applies to most of the system components: crawl servers, cache, queue, storage, etc.
While load balancing the incoming processing requests, we can understand the locations of the target servers and then distribute the requests to the web crawlers that are closest to these target servers.
Spider traps
A spider trap is a web page that causes a crawler in an infinite loop. For instance, an infinite deep directory structure is listed as follows: http://www.spidertrapexample.com/foo/bar/foo/bar/foo/bar/...
Such spider traps can be avoided by setting a maximum length for URLs. However, no one-size-fits-all solution exists to detect spider traps. Websites containing spider traps are easy to identify due to an unusually large number of web pages discovered on such websites. It is hard to develop automatic algorithms to avoid spider traps; however, a user can manually verify and identify a spider trap, and either exclude those websites from the crawler or apply some customized URL filters.
Sharding Challenges and Trade-offs:
BFS and problems with BFS
BFS is commonly used by web crawlers and is implemented by a first-in-first-out (FIFO) queue. In a FIFO queue, URLs are dequeued in the order they are enqueued. However, this implementation has two problems:
Standard BFS does not consider the priority of a URL. The web is large and not every page has the same level of quality and importance. Therefore, we may want to prioritize URLs according to their page ranks, web traffic, update frequency, etc.