Assumptions
Storage Required for Web Pages:
Assume we crawl 1 billion pages per month and that each web page averages 100 KB of data. Then:
Storage = Number of Pages per Month * Average Page Size * 12 months
Storage = 1 billion * 100 KB * 12
Storage = 1,000,000,000 * 100,000 bytes * 12 = 1.2 * 10^15 bytes = 1.2 PB
So, we would need 1.2 PB of storage for 1 year.
Storage Required for Media Files:
Assume each web page has 10 media files on average and each media file is 1 MB. Then:
Storage = Number of Pages per Month * Media File Size * 10 files * 12 months
Storage = 1 billion * 1 MB * 10 * 12
Storage = 1,000,000,000 * 1,000,000 bytes * 10 * 12 = 1.2 * 10^17 bytes = 120 PB
So, we would need 120 PB of storage for 1 year.
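The two estimates above can be checked with a short script. This is only a back-of-the-envelope sketch: the constant and function names are illustrative, it assumes the 1 billion pages are crawled every month, and it uses decimal units (1 KB = 10^3 bytes, 1 PB = 10^15 bytes).

```python
# Back-of-the-envelope storage estimate using the figures assumed above.
# Decimal units are assumed: 1 KB = 10**3 bytes, 1 MB = 10**6 bytes, 1 PB = 10**15 bytes.

PAGES_PER_MONTH = 1_000_000_000   # assumption: 1 billion pages crawled each month
AVG_PAGE_BYTES = 100 * 10**3      # 100 KB per page
MEDIA_FILES_PER_PAGE = 10         # media files per page
MEDIA_FILE_BYTES = 10**6          # 1 MB per media file
MONTHS = 12
PB = 10**15


def yearly_page_storage_bytes() -> int:
    """Bytes needed to store one year of crawled page content."""
    return PAGES_PER_MONTH * AVG_PAGE_BYTES * MONTHS


def yearly_media_storage_bytes() -> int:
    """Bytes needed to store one year of media files."""
    return PAGES_PER_MONTH * MEDIA_FILES_PER_PAGE * MEDIA_FILE_BYTES * MONTHS


if __name__ == "__main__":
    print(f"Pages: {yearly_page_storage_bytes() / PB:.1f} PB")
    print(f"Media: {yearly_media_storage_bytes() / PB:.1f} PB")
```

Keeping the inputs as named constants makes it easy to rerun the estimate if the page count or average sizes change.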
Below is the list of APIs required for the system. It may not be exhaustive, but it provides a good starting point.
For the tables required for this problem, refer to the diagram below.
Application of the CAP theorem to the web crawler: focus on Availability and Partition Tolerance:
Data Partitioning Strategy:
Sharding Strategy:
Read/Write Separation:
Refer to the sequence diagram below to see how the flow is orchestrated when a new URL is crawled.
Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...
Explain any trade offs you have made and why you made certain tech choices...
Try to discuss as many failure scenarios/bottlenecks as possible.
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?