List functional requirements for the system (Ask the chat bot for hints if stuck.)...
List non-functional requirements for the system...
Estimate the scale of the system you are going to design...
Assume: 50 billion web pages with an average size of 50 KB each;
10M new pages are created every day.
The storage needed over 5 years should be:
(50B + (10M * 365 * 5)) * 50 KB ≈ 3.4 PB
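The estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the constant names are made up for illustration, the 5-year horizon is read off the `365 * 5` term in the formula, and KB and PB are assumed to be decimal units (1 KB = 1,000 bytes, 1 PB = 10^15 bytes).

```python
# Back-of-the-envelope storage estimate (decimal units assumed).
EXISTING_PAGES = 50_000_000_000   # 50 billion pages today
NEW_PAGES_PER_DAY = 10_000_000    # 10M new pages per day
YEARS = 5                         # horizon implied by the "365 * 5" term
AVG_PAGE_BYTES = 50_000           # 50 KB per page

total_pages = EXISTING_PAGES + NEW_PAGES_PER_DAY * 365 * YEARS
total_bytes = total_pages * AVG_PAGE_BYTES
total_pb = total_bytes / 10**15

print(f"{total_pages:,} pages -> {total_pb:.2f} PB")
```

New pages add about 18 billion to the initial 50 billion, so the existing corpus dominates the total.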
Define what APIs are expected from the system...
This will be offline work and no API is needed.
Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...
key: url string
value: S3 URL of the page content (string)
The table schema is quite simple, and data will be queried and indexed by key. In this case a NoSQL database like DynamoDB would satisfy our needs while scaling more flexibly than a SQL database. The database can be easily sharded by URL string.
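As a rough illustration of sharding by URL string, the sketch below hashes the URL and maps it to a shard index. The shard count and the function name `shard_for_url` are hypothetical; a managed store such as DynamoDB partitions items by hashing the partition key internally, so code like this would only be needed if the sharding were done by hand.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count, chosen only for illustration


def shard_for_url(url: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a URL to a shard index using a stable hash of the URL string."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo shard count.
    return int.from_bytes(digest[:8], "big") % num_shards


example_shard = shard_for_url("https://example.com/index.html")
print(example_shard)
```

A cryptographic hash is used here instead of Python's built-in `hash()` because the built-in is randomized per process for strings, so it would not give the same shard across machines. Plain modulo placement moves most keys when the shard count changes; consistent hashing is a common alternative if shards are added often.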
This store holds the content of each page, which is an HTML file and may be large, so blob storage suits our use case. The data will be indexed, sharded, and retrieved by the URL generated by the blob storage.
key: url string
value: page_count integer - representing the number of pages that point to this url
This is an in-memory key-value data structure, sorted by page_count in descending order. To guarantee durability, each update is persisted to a WAL file before it is applied to the in-memory structure, which is replicated across data centers. An in-memory store like Redis should meet our requirements. Data can be indexed, sharded, and retrieved by URL string, and each shard maintains its own sorted data structure independently.
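The per-shard structure described above might look like the following minimal sketch. Everything here is an assumption made for illustration: the class name `ShardCounter`, the method names, and the use of a plain Python list as a stand-in for the WAL file. It does not use Redis; it only shows the write-log-first ordering and the highest-count-first read.

```python
import heapq
from collections import defaultdict


class ShardCounter:
    """One shard's in-memory url -> page_count map, with a log written first."""

    def __init__(self, wal):
        self.wal = wal                    # list standing in for an append-only WAL file
        self.counts = defaultdict(int)    # url -> number of pages pointing to it

    def add_inbound_link(self, url: str) -> None:
        self.wal.append(("incr", url))    # record the update before touching memory
        self.counts[url] += 1

    def top(self, k: int):
        # Highest page_count first; ties broken by url so the order is stable.
        return heapq.nsmallest(
            k, self.counts.items(), key=lambda kv: (-kv[1], kv[0])
        )


wal = []
shard = ShardCounter(wal)
for url in ["a.com", "b.com", "a.com", "c.com", "a.com", "c.com"]:
    shard.add_inbound_link(url)

top_two = shard.top(2)
print(top_two)
```

Because each shard only sees its own URLs, a global ranking would need to merge the per-shard results, for example by taking the top k from every shard and sorting the combined list.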
You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design. If you are unfamiliar with the tool, you can simply describe your design to the chat bot and ask it to generate a starter diagram for you to modify...
Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...
Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...
Task Queue
Explain any trade offs you have made and why you made certain tech choices...
Try to discuss as many failure scenarios/bottlenecks as possible.
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?