System requirements


Functional:

-> The users can create accounts, log in and create a personal directory structure.

-> The users should be able to organize their directory by creating folders and subfolders and moving files between levels.

-> The users can upload new files to a given path, download a file or remove it.

-> The users can share folders or files with other users of the system.

-> The users can live-edit a file that is not shared and publish the changes.

-> The users in a shared context can edit the same file. The file is published with a per-user version; if two users edit the same file, the system will try to merge the changes and the second user can publish a merged version.


Non-Functional:

-> The system should be highly available so that the users can access their files anytime.

-> The system should be reliable and the data saved should not be lost.

-> The system should be easily scalable in case it becomes popular.

-> Security should be enforced: the users must not see other users' files unless those files are shared with them.




Capacity estimation

Let's assume the system has 100K daily active users. Each user uploads 3 new files (or restructures existing files), edits 5 existing files, and reads the content of 10 files daily. Each user will also share 2 files on average with 3 users.

We will suppose the content of the files is mainly text and that only text files are edited. We can estimate the average file size at 1 MB per file.

The metadata of files and directory structure:

-> id, filename, filetype, owner, created_date, published_date ~= 100 bytes per file.

=> 3 new files * 100 bytes = 300 bytes per user; 300 bytes * 100 000 users = 30 MB of file metadata daily.

To store each share and its permission we will need:

id, fileName, owner_id, userId, permission type (read, or read and write) ~= 100 bytes

2 files * 3 users = 6 shares daily per user; 6 * 100K = 600K shares daily; 600K * 100 bytes = 60 MB of share records daily.
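
The arithmetic above can be rechecked with a few lines of Python. This is only a back-of-the-envelope sketch: the constant names are made up for illustration, and the file content figure is not stated in the text but derived from the same assumptions (3 uploads of about 1 MB each per user).

```python
# Back-of-the-envelope estimate using the assumptions stated above.
DAILY_ACTIVE_USERS = 100_000
UPLOADS_PER_USER = 3          # new files per user per day
SHARES_PER_USER = 2 * 3       # 2 files shared with 3 users each
METADATA_BYTES = 100          # per file record
SHARE_BYTES = 100             # per share record
AVG_FILE_BYTES = 1_000_000    # 1 MB per file (decimal units)


def daily_estimates() -> dict[str, int]:
    """Return the daily growth in bytes for each kind of data."""
    return {
        "file_metadata": DAILY_ACTIVE_USERS * UPLOADS_PER_USER * METADATA_BYTES,
        "share_records": DAILY_ACTIVE_USERS * SHARES_PER_USER * SHARE_BYTES,
        # Derived, not stated in the text: 3 uploads of ~1 MB per user.
        "file_content": DAILY_ACTIVE_USERS * UPLOADS_PER_USER * AVG_FILE_BYTES,
    }


if __name__ == "__main__":
    for name, size in daily_estimates().items():
        print(f"{name}: {size / 1_000_000:,.0f} MB/day")
```

Under these assumptions the file content (about 300 GB daily) is several orders of magnitude larger than the metadata, which is why the content and the metadata are worth sizing separately.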


Given the calculation, I think a read-heavy database would do the job. A document database like MongoDB allows the directory structure to be stored as a tree: each directory's document holds unstructured data listing its child files and child subfolders.




API design

POST /api/upload with the request data,

{

"objectName" : string,

"objectType" : file, archive

"user_id": uuid,

"path_to_be_stored": string,

} response 201 Created with { "id": uuid, "retrieve_path": string}


PUT /api/file/{id}

response 200 OK {

"id": uuid,

"filename": string,

"created": timestamp,

"shares": [],

}


PUT /api/file/share

{

"file_id": uuid,

"user_ids": [uuid],

"permissions": "read" | "read_write",

} response 200 OK
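
A minimal sketch of how the share endpoint could validate its payload and how later access checks could use the stored permissions. Everything here is hypothetical: the `ShareStore` class, the in-memory dictionaries standing in for the database, the `"read"` / `"read_write"` values, and the rule that only the owner may share are assumptions made for illustration.

```python
from enum import Enum
from uuid import UUID, uuid4


class Permission(Enum):
    READ = "read"
    READ_WRITE = "read_write"


class ShareStore:
    """In-memory stand-in for the shares table (hypothetical)."""

    def __init__(self) -> None:
        self._owners: dict[UUID, UUID] = {}                     # file_id -> owner_id
        self._shares: dict[tuple[UUID, UUID], Permission] = {}  # (file_id, user_id) -> permission

    def add_file(self, file_id: UUID, owner_id: UUID) -> None:
        self._owners[file_id] = owner_id

    def handle_share_request(self, caller_id: UUID, body: dict) -> int:
        """Return an HTTP-like status code for PUT /api/file/share."""
        try:
            file_id = UUID(body["file_id"])
            user_ids = [UUID(u) for u in body["user_ids"]]
            permission = Permission(body["permissions"])
        except (KeyError, ValueError, TypeError, AttributeError):
            return 400  # malformed payload
        if file_id not in self._owners:
            return 404
        if self._owners[file_id] != caller_id:
            return 403  # assumption: only the owner may share
        for user_id in user_ids:
            self._shares[(file_id, user_id)] = permission
        return 200

    def can_read(self, user_id: UUID, file_id: UUID) -> bool:
        """Owners and any user with a share record can read."""
        return self._owners.get(file_id) == user_id or (file_id, user_id) in self._shares

    def can_write(self, user_id: UUID, file_id: UUID) -> bool:
        """Owners and users with a read_write share can write."""
        if self._owners.get(file_id) == user_id:
            return True
        return self._shares.get((file_id, user_id)) is Permission.READ_WRITE
```

Keeping the lookup keyed by (file_id, user_id) makes the "can this user see this file" check a single read, which matches the security requirement that files are invisible unless shared.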





Database design


We have mainly 4 objects which help us to define the relations:

-> Users: the users who log in to the system and create their own directory structure.

-> Objects: represent the directories and files. Each object stores the list of its children ids, which can be folders or files, so a tree structure is built from these links.

-> ObjectVersion: represents an edited and then published version of a file. Each user of a shared file can have their own versions and publish either that version or a version merged with another user's.

-> Shares: for each file or directory, the list of users who have access to the object is precomputed for fast retrieval.
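
The four objects above could be sketched as plain Python dataclasses. The field names and the `resolve` helper are illustrative assumptions, not a fixed schema; the point is only to show how storing children ids lets us walk a path through the tree.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
from uuid import UUID, uuid4


@dataclass
class User:
    name: str
    id: UUID = field(default_factory=uuid4)


@dataclass
class StoredObject:
    """A file or a directory; directories list their children by id."""
    name: str
    kind: str                                   # "file" or "directory"
    owner_id: UUID
    children: list[UUID] = field(default_factory=list)
    id: UUID = field(default_factory=uuid4)


@dataclass
class ObjectVersion:
    object_id: UUID
    author_id: UUID
    number: int
    published_at: Optional[datetime] = None     # None while still a draft
    id: UUID = field(default_factory=uuid4)


@dataclass
class Share:
    object_id: UUID
    user_id: UUID
    permission: str                             # "read" or "read_write"


def resolve(path: str, root: StoredObject,
            index: dict[UUID, StoredObject]) -> Optional[StoredObject]:
    """Walk the tree from root following path segments; None if missing."""
    node = root
    for segment in (s for s in path.split("/") if s):
        match = next((index[c] for c in node.children
                      if index[c].name == segment), None)
        if match is None:
            return None
        node = match
    return node
```

Here `index` stands in for a lookup by id in the database; each path segment costs one lookup per level, which is the price of storing only children ids on the parent.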


For simplicity and easier relational retrieval an SQL database can be used, since the volume of data and transactions allows it for the assumed values. A NoSQL database would make the system easier to scale, so this is a trade-off: I would start with SQL and then migrate to a more scalable database.




High-level design

A gateway is used to handle authentication and authorization; a separate service may be used to manage authorization on individual objects.

A load balancer distributes traffic between copies of the same service, e.g. the Upload service.

Websockets are used for notifications when a new version is published by other users.

A CDN is used to serve file reads and improve latency.

Blob storage is used to store the files and speed up uploads, since files are uploaded there first and the metadata is then saved in the DB.

Blob storage solutions like S3 can also manage chunking (e.g. multipart upload), so there is no need to build a proprietary chunking solution and store information about chunks.









Request flows

Upload of a file: the client initiates the file upload at a path of their choice.

The client app contacts the Gateway to check authorization. The request is then routed to the Upload service to get an S3 pre-signed URL, which is returned to the client.

The client uploads the file to S3.

Upon completion of the upload, the client sends a new request to the Upload service.

This time the Upload service stores the metadata of the uploaded file in the DB.
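
The two calls to the Upload service can be sketched as below. This is a self-contained simulation under loud assumptions: `UploadService`, `SECRET` and `BUCKET_URL` are hypothetical, the dictionaries stand in for the metadata DB, and the HMAC signature only imitates the idea of a pre-signed URL. It is not S3's actual signing scheme; a real service would ask the storage provider's SDK for the URL.

```python
import hashlib
import hmac
import time
from uuid import uuid4

SECRET = b"demo-secret"                               # hypothetical signing key
BUCKET_URL = "https://blob.example.invalid/uploads"   # placeholder, not a real bucket


class UploadService:
    def __init__(self) -> None:
        self.pending: dict[str, dict] = {}      # upload_id -> metadata awaiting completion
        self.metadata_db: dict[str, dict] = {}  # file id -> stored metadata

    def start_upload(self, user_id: str, object_name: str, path: str,
                     ttl_s: int = 300) -> dict:
        """First call: made after the gateway has authorized the user."""
        upload_id = str(uuid4())
        expires = int(time.time()) + ttl_s
        key = f"{user_id}/{upload_id}"
        signature = hmac.new(SECRET, f"{key}:{expires}".encode(),
                             hashlib.sha256).hexdigest()
        self.pending[upload_id] = {"owner": user_id, "filename": object_name,
                                   "path": path, "blob_key": key}
        url = f"{BUCKET_URL}/{key}?expires={expires}&sig={signature}"
        return {"upload_id": upload_id, "url": url}

    def complete_upload(self, user_id: str, upload_id: str) -> dict:
        """Second call: made by the client once the blob upload has finished."""
        meta = self.pending.get(upload_id)
        if meta is None or meta["owner"] != user_id:
            raise PermissionError("unknown upload or wrong user")
        del self.pending[upload_id]
        record = {"id": upload_id, "created": time.time(), **meta}
        self.metadata_db[upload_id] = record    # metadata is saved only now
        return record
```

Saving the metadata only in the second call means an abandoned upload never shows up in the user's directory; the cost is that orphaned blobs may need a periodic cleanup.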





Detailed component design

Sharding of the database comes out of the box since we chose MongoDB, which has built-in scaling capabilities. The sharding can be done using the user's id as the partition key.
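
To make the partition key idea concrete, here is a minimal sketch of hashed routing on the user's id. The `shard_for` function is hypothetical and only illustrates the principle; a database with built-in sharding does this routing internally.

```python
import hashlib


def shard_for(user_id: str, shard_count: int) -> int:
    """Map a user id to a shard index using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a shard index.
    return int.from_bytes(digest[:8], "big") % shard_count
```

Because the hash is stable, all of a user's objects land on the same shard, so listing one user's directory touches a single shard. Objects shared across users on different shards still need a cross-shard lookup.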

Notifications are sent in real time using the websocket protocol.

The database is replicated across multiple availability zones as part of the disaster recovery strategy.


Concurrency is handled using a distributed lock which needs to be acquired before publishing a new version. If a second user wants to publish their own version and a newer version already exists, a new merged version is created by the system and the user can choose which one to publish.

The notification system will notify the user when newer versions are published, so that merges are small but more frequent.
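
The publish step described above can be sketched as follows. This is only an illustration under stated assumptions: `threading.Lock` stands in for the distributed lock, `VersionedFile` and `merge_lines` are hypothetical names, and the merge is a deliberately naive line-by-line three-way merge that only handles edits which keep the number of lines unchanged.

```python
import threading
from dataclasses import dataclass, field


def merge_lines(base: list[str], mine: list[str], theirs: list[str]) -> list[str]:
    """Naive three-way merge for line lists of equal length.

    Takes whichever side changed a line; if both sides changed it
    differently, keeps both with markers so the user can resolve it.
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only merges edits that keep the line count")
    merged = []
    for b, m, t in zip(base, mine, theirs):
        if m == t or t == b:
            merged.append(m)        # same result, or only my side changed
        elif m == b:
            merged.append(t)        # only their side changed
        else:
            merged.extend([f"<<< {m}", f">>> {t}"])
    return merged


@dataclass
class VersionedFile:
    versions: list[list[str]]       # versions[n] is the content of version n
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def publish(self, base_version: int, content: list[str]) -> tuple[str, list[str]]:
        """Publish content that was edited starting from base_version."""
        with self._lock:            # stand-in for the distributed lock
            latest = len(self.versions) - 1
            if base_version == latest:
                self.versions.append(content)
                return "published", content
            # Someone published first: propose a merge instead of overwriting.
            candidate = merge_lines(self.versions[base_version], content,
                                    self.versions[latest])
            return "merge_proposed", candidate
```

The second user receives the merge candidate rather than having it published automatically, which matches the idea that the user chooses which version to publish; publishing the candidate is then a normal `publish` call based on the latest version.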



Trade offs/Tech choices

A significant trade-off is made by storing the shares in a NoSQL database, since the list of users with access to a document needs to be precomputed. Storing a hierarchical structure can also complicate the logic.

A replicated metadata database needs to be involved.

Websocket notifications are hard to scale but are a near-realtime solution.




Failure scenarios/bottlenecks

A user saves offline copies of the document and is not online when it is time to publish.





Future improvements

-> Chunking of the data to allow files bigger than 5MB.

-> The users should be able to review a document not by editing it but by adding comments. The user will highlight the text and can add comments related to that part of the text.