Google Docs design with score: 8/10. Would like some actual non-ChatGPT feedback

by echo7959

System requirements


Functional:

List functional requirements for the system (Ask the chat bot for hints if stuck.)...

  • Users can upload and download documents
  • An interrupted upload can be resumed instead of restarting from scratch
  • Multiple users can collaborate on the same document
  • Other online users are notified of any changes and receive the latest updates
  • The document syncs so everyone sees the latest version


Non-Functional:

List non-functional requirements for the system...

  • Low latency: serve users in different locations in a timely manner
  • Resiliency: the data/files should stay available even when components fail
  • Efficient and secure upload and download
  • Scalability: as the user base grows, the system should be able to scale up easily


Capacity estimation

Estimate the scale of the system you are going to design...

Let's assume we have 5 million daily active users across the world, and an average Google Doc is around 500 KB. At peak we might have about 500k active users per minute, so a worst case is that each of them is working on a separate document at once, which puts the active working set at roughly 500k × 500 KB ≈ 250 GB. Traffic will also be uneven: some regions will have more users than others, and activity will spike during business hours, e.g. 8 am to 5 pm in North America.
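
To make the arithmetic explicit, here is a quick back-of-envelope sketch in Python (every figure below is the assumption above, not a measurement):

```python
# Back-of-envelope numbers; all figures are assumptions from the estimate above.
DAILY_ACTIVE_USERS = 5_000_000
AVG_DOC_SIZE_KB = 500
PEAK_ACTIVE_USERS_PER_MIN = 500_000

# Worst case: every peak user is editing a separate ~500 KB document at once.
peak_working_set_gb = PEAK_ACTIVE_USERS_PER_MIN * AVG_DOC_SIZE_KB / 1_000_000
print(f"Peak working set: ~{peak_working_set_gb:.0f} GB")  # ~250 GB

# If each daily active user owns roughly one document, primary storage is:
total_storage_tb = DAILY_ACTIVE_USERS * AVG_DOC_SIZE_KB / 1_000_000_000
print(f"Primary storage for one doc per DAU: ~{total_storage_tb:.1f} TB")  # ~2.5 TB
```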




API design

Define what APIs are expected from the system...

  • POST /documents/{file_id}/upload/{version_number}

This will return an object like result_object{ is_successful = True/False; version = version_number; user = user_id; time = time }

  • POST /documents/{file_id}/upload/recover/{version_number}

This will return an object like result_object{ is_successful = True/False; version = version_number; user = user_id; time = time }

  • GET /documents/{file_id}/download/{version_number} — this defaults to the latest version, but the user can specify a version they would like to download

This will return an object like result_object{ is_successful = True/False; version = version_number; user = user_id; time = time }

  • POST /documents/{file_id}/notifications — send a notification to online collaborators


  • PATCH /documents/{file_id}/update — send an update/edit to the document

This will return a result like result_object{ is_successful = True/False; version = version_number; user = user_id; time = time; conflict = True/False }


  • POST /documents/{file_id}/conflicts/{accepted_version_number} — resolve conflicts

When we get conflict=True from the result of the update endpoint, we need to prompt the user to resolve the conflict via the notifications endpoint. After that, we use the conflicts endpoint to resolve it.

This will set conflict = False and return a result like result_object{ is_successful = True/False; version = version_number; user = user_id; time = time }
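
To make the response shape and the conflict check concrete, here is a rough Python sketch. The field names mirror the result_object above; the in-memory version table and handle_update function are hypothetical stand-ins for the real file history store:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ResultObject:
    """Mirrors the result_object returned by the endpoints above."""
    is_successful: bool
    version: int
    user: str
    time: str
    conflict: bool = False

# Hypothetical in-memory stand-in for the file history table.
_latest_version = {"doc-123": 7}

def handle_update(file_id: str, user_id: str, base_version: int) -> ResultObject:
    """PATCH /documents/{file_id}/update: accept the edit only if the client
    edited the latest version; otherwise flag a conflict for the client to resolve."""
    latest = _latest_version.get(file_id, 0)
    now = datetime.now(timezone.utc).isoformat()
    if base_version != latest:
        # Client was editing a stale version -> conflict=True, prompt resolution.
        return ResultObject(False, latest, user_id, now, conflict=True)
    _latest_version[file_id] = latest + 1
    return ResultObject(True, latest + 1, user_id, now)

print(handle_update("doc-123", "u-42", base_version=7))  # accepted, version 8
print(handle_update("doc-123", "u-99", base_version=7))  # stale -> conflict
```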


Database design

Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...

We will use a relational database like PostgreSQL. It will have a few basic tables: a file table, a file history table, and a user table. The file table holds information such as file name, owner, last modified time, and file ID. The file history table holds the change history of a file: file ID, modification time, version ID, and the modifier's user ID. The user table holds the user's name, the user's role on a file (owner vs. non-owner), and a file_id column.
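
A rough sketch of those three tables (SQLite syntax purely for illustration; the exact column names and types are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- file table: one row per document
CREATE TABLE file (
    file_id        INTEGER PRIMARY KEY,
    file_name      TEXT NOT NULL,
    owner_user_id  INTEGER NOT NULL,
    last_modified  TEXT NOT NULL
);

-- file history table: one row per saved version of a document
CREATE TABLE file_history (
    version_id     INTEGER PRIMARY KEY,
    file_id        INTEGER NOT NULL REFERENCES file(file_id),
    modifier_id    INTEGER NOT NULL,
    modified_at    TEXT NOT NULL
);

-- user table: who the user is and their role on a given file
CREATE TABLE user (
    user_id        INTEGER NOT NULL,
    user_name      TEXT NOT NULL,
    role           TEXT CHECK (role IN ('owner', 'collaborator')),
    file_id        INTEGER NOT NULL REFERENCES file(file_id),
    PRIMARY KEY (user_id, file_id)
);
""")
print("schema created")
```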




High-level design

You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design. If you are unfamiliar with the tool, you can simply describe your design to the chat bot and ask it to generate a starter diagram for you to modify...


At a high level, a user's request goes through a load balancer that directs traffic to one of our application servers, which then queries the database. We also add a block server to handle uploads and downloads.




Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...


For a regular update (the payload is small), the request goes to the application server and then to the database.

For a file upload or download (the payload is larger), the request goes through the block server before reaching storage and the database.
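
As an illustration of that routing rule, a tiny sketch; the 1 MB threshold is an assumption, not a number from this design:

```python
# Illustrative routing rule only: payload size decides which path a request takes.
SMALL_PAYLOAD_BYTES = 1_000_000  # assumed cutoff for "regular update" traffic

def route_request(operation: str, payload_bytes: int) -> str:
    if operation in ("upload", "download") or payload_bytes > SMALL_PAYLOAD_BYTES:
        return "block server -> object storage"   # chunked transfer path
    return "app server -> database"               # regular metadata/edit path

print(route_request("update", 2_000))       # app server -> database
print(route_request("upload", 50_000_000))  # block server -> object storage
```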



Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...


Both the application servers and the block servers can scale out easily because they sit behind a load balancer. The load balancer will use a least-load or least-response-time algorithm.
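
A minimal sketch of a least-load pick (a real load balancer would track live connection counts and health checks rather than a static dict):

```python
# Assumed snapshot of active connections per server.
active_connections = {"server-a": 12, "server-b": 3, "server-c": 7}

def pick_least_loaded(conns: dict[str, int]) -> str:
    """Return the server currently handling the fewest connections."""
    return min(conns, key=conns.get)

print(pick_least_loaded(active_connections))  # server-b
```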


The block server lets us recover an interrupted transfer quickly: we don't need to re-download or re-upload from the beginning, only restart from the block where the transfer was interrupted. This is especially useful for large files or for users with a poor internet connection.
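
A small sketch of how resuming from the interrupted block could work; the 4 MB block size and function names are assumptions:

```python
BLOCK_SIZE = 4 * 1024 * 1024  # assumed 4 MB blocks

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Chunk the file into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def resume_upload(data: bytes, already_uploaded: set[int]) -> list[int]:
    """Return the indices of blocks that still need to be sent.
    `already_uploaded` stands in for what the block server reports after an interruption."""
    blocks = split_into_blocks(data)
    return [i for i in range(len(blocks)) if i not in already_uploaded]

file_bytes = b"x" * (10 * 1024 * 1024)   # a 10 MB file -> 3 blocks
print(resume_upload(file_bytes, {0, 1}))  # only block 2 is left to upload
```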


For the database, we will use a relational database sharded by user ID. We can also split storage into hot and cold tiers: files that are rarely accessed go to S3 cold storage, while frequently accessed files stay in hot storage, which lowers cost. Because we are using AWS S3, we can store the data in multiple regions, which lets users in different regions upload and download faster. S3 also keeps several copies of each object, so the system is more resilient.
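
A minimal sketch of picking a shard from the user ID (the shard count is an assumption):

```python
import hashlib

SHARD_COUNT = 8  # assumed number of database shards

def shard_for_user(user_id: str, shard_count: int = SHARD_COUNT) -> int:
    """Hash the user ID so the same user always lands on the same shard."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % shard_count

print(shard_for_user("user-42"))  # deterministic shard index in [0, 7]
print(shard_for_user("user-43"))
```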


Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...


  1. Using a load balancer adds an extra layer between the user and the servers. This makes the system harder to attack (e.g., DDoS) and improves efficiency: we can scale out simply by adding more servers, and the load balancer distributes the load so responses are quicker and more reliable.
  2. Database choice. We chose a relational database over NoSQL for storing metadata, because the metadata per user tends to be small and it is easier to query. For storing the file contents themselves, however, a separate non-relational/blob store makes more sense, because files can be large and we don't query inside them.
  3. Hot and cold storage in S3 helps us save cost and improves the efficiency of the database.
  4. The block server improves resiliency because an interrupted download or upload can resume without starting over from zero. However, it may slow transfers down a little, since we need to chunk and encrypt the file along the way.


Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.


  1. Queries may be slower with hot and cold storage, because a miss in hot storage means going back to fetch from cold storage.
  2. The block server can slow down uploads and downloads, because we need to chunk and encrypt the file during the process.
  3. Sharding by user ID can become a bottleneck if some bot users constantly upload and download large files; they would exhaust the resources of their shard quickly.



Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?


  1. If cost is not an issue, we can increase the size of the hot database/storage instead of relying on S3 cold storage.
  2. We could put another load balancer in front of the block servers and increase their count to improve efficiency.
  3. Add security around the system to prevent the APIs from being called by unauthorized machines. We could also add role management to the system for better monitoring.