Assume we download 100 million pages in a week, that's 170 pages per second. Say each page is 1 MB on avg.
Read and write QPS = 170
Write / Read throughput = 170 * 1MB = 170 MB / second
Data = 100 million * 1 MB = 100 TB per week
Say we do web archiving for 10 yrs. This is 100 TB * 52 weeks * 10 yrs = 52 PB
Define what APIs are expected from the system...
Data is too big to fit into RDB. Also, write and read throughput is very high. We use NoSQL.
We should separate metadata (texts) and media (blobs). Texts may be searchable and may need to be present in snapshot, and also we need to extract links. A good option may be document DB like MongoDB. Note that the data scale is large, we need partition. Also note that partition can help fitting indexes into memory for fast lookup and write.
Blobs are less frequently visited (in our case, when users retrieves an archive), we can choose an object storage like AWS S3. S3 can easily scale out, and support tiering by moving old content to cold storage.
See high level diagram
We use queues to achieve async ops, so as not to block critical tasks. The job queues help that link extraction and scheduling and downloader do not block each other. The data queues help link extractor and content storing not block each other.
We initially get page ranking, so that we can extract most popular sites first. For each site, we can start with DFS. However, this may impose high traffic stress on a single site. Our scheduler should have rate limiting algorithm per site basis. If a site has been visited many times, the new links from it should be penalize and thus lower ranked.
On politeness, in addition to rate limit, we should also respect robots.txt. Disallowed subdomains shall not be added to our jobs.
For DB, we shall have replicas for preventing data loss. For partition, we can do based on site domains, because user typically visits archives per domain basis.
See previous sections.
Our scheduling service may be bottleneck, due to so many links, computational heaviness on page ranking and rate limit penalization. We may have many replicas, each of which handles a subset of domains. Caveats:
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?