Assume we'll get 1 B pages per crawl schedule. Assume an average page has 20 KB in text. 20% of pages have media, and average media size is 10 MB.
One crawl cycle = 1 B * (1* 20 KB + 20% * 10 MB) = 20 TB text and 2 PB media
one year of history = * 52 = 1 PB text and 100 PB media
20 TB / 86400 / 7 = 33 MB per second.
We need multiple servers to crawl and multiple DB nodes to store data.
int get(
list extract(
int download(
int write(
Separate media and hypertext. Media requires extra tags to support indexing and searching. They also require thumbnails for different resolutions and frame rates for preview and tiered access, if applicable. Hypertext can be reverse indexed to support searching. Store media in blob DB and hypertext in document DB, e.g., MongoDB.
We keep history of previous crawls. Depend on how frequent we need to visit history, we use tiered storage for current WIP crawl and historical crawls. For instance, older-than-a-month crawls may be moved to cold storage to save cost.
See diagram
URL Fetcher, Extracter; Hypertext Writer; Media Downloader, Processor
We need multiple nodes for each of these services. Note that one physical node may perform multiple tasks in parallel because of multi-core, as long as network bandwidth and disk throughput allow. This will improve overall crawling throughput.
URL queue, Hypertext queue, Media queue
As the service nodes are multiple, these queues need to be MPMC (multi producer multi consumer) queues. They don't have to follow strict ordering as it is not important if one url must be fetched before or after another specific url. However, either a lock shall be present or a lockfree design shall be used, to avoid different nodes writing or reading the same slot in the queue.
DB partitions
DB content size is huge, we partition into multiple DB nodes. Such a partition may be based on the base url domain, or checksum of the content, or system-generated uid of the content, etc.
Queues
We may need a leader-follower pattern, in case the main queue fails and we lose data. The follower queue is not accessed by service nodes, but is self replicating the main queue's content.
DB partitions
If we partition by url base domains, certain partition may be excessive in size because the website scope is huge (e.g. wikipedia), and accessing content from the domain will result in a hot node. The pros is that all content from the same domain are stored together, faster and easier for indexing and searching. If we partition by checksum, random distribution of content is achieved and less likely we will have a hot node. However, indexing and searching content from the same domain will require joining results from multiple nodes, which is extra latency and implementation complexity.
Service nodes
Similar pros and cons regarding we have leader-follower patterns in service nodes, such as URL fetcher. In addition, downloading many contents from the same domain may overstress that domain server, causing the domain slow to respond to regular internet traffic or even bring it down. A mitigation is that we may start with a list of seed URLs and let each service node work on different domains.
As mentioned above, we have replicas and leader-follower patterns almost every place in the design. If one node fails, a follower node will be elected as the new leader.
The media downloader and processor may be throughput bottleneck, as it requires a lot of network bandwidth (for downloader) and computing power (processor). This may make the media queue become full. In such case, we could introduce more media service nodes for larger parallelism.
We may want to decide which URLs to download first, depending on site ranking or other factors.
We may want to finish crawling more quickly, in a day or two, in which case we need to horizontally scale our system much more. However, this means we also stress the internet more. Make sure we should find a balance before doing so.