Requirements
Functional Requirements:
- Allow users to upload files to the system.
- Enable users to download uploaded files.
- Ensure synchronization of files between local and server storage.
- Support multiple clients such as web, mobile and synchronization between them
- Enable users to create folder and sub folder to manage files.
- User should be able to share files with other users
Non-Functional Requirements:
- System should be highly available. That means failover of critical services, multi region support and distributed traffic.
- System should be reliable and durable. Once the file is uploaded, it would persist.
- System should have network fault tolerance
- Data should encrypted during rest and transit.
- Implement checksum to ensure data is not altered.
- Have Strong ACL in place to allow actions only from authorized user and role
Capacity Estimation
For a system like Google Drive or dropbox, There would be 100M users. Now if we consider 80:20 rule then 20M users would be active on daily basis.
Total users - 100M
Daily active users - 20M
About 20% would use upload feature so
Daily Uploads by - 4M users
Daily Download/Read by - 16M users
If each daily active user upload atleast one file-
On an average 1 file would be around 1MB and 1 user would be holding around 1000 files.
Network -
Daily Upload Request - 4M files
Upload QPS = 40 files/Sec
Download/Read QPS = 160 files/Sec
Since we want design a system with high availability, come up with number of cores that would be able to achieve ~200 requests/sec
For eg. - if 1 core can handle 10 requests then
= (200/10)*(25/100) cores would be required. 5 cores would be required. In order to achieve even distribution and high availability you would need atleast 3 servers having 5 cores each cluster to handle
Storage -
Total storage needed - 1GB/User i.e. 100M * 1GB = 100MGB = 100 PB
Consider 1% traffic would be growing every month, you would need to add 12PB capacity every year. and in order to achieve reliability, you would replicate this data to cross regions.
so at the end of the 2 year you would need 254PB of capacity for a current year.
at the end of 2 years -
10% more upload traffic i.e. 4.5M daily uploads and 20% more download traffic i.e. ~20M download/read requests.
With that, you would need 1 extra core at each server level.
Metadata Capture-
100M users * 1000 files
1 KB metadata/file
100M * 1000 KB = 100 GB
2 years projections = 20% = 120GB File metadata
120 GB User data
1 Posgres should be enough to handle 240GB data.
API Design
List files and folders on home page
Get /api/v1/content
Request - user id
Response - {Folder:[], Files[]}
List files and folder under a specific folder
Get /api/v1/content
Request - user id, folder id
Response - {Folder:[], Files[]}
Create Folder
PUT /api/v1/folder
Request - user id, parent_folder_id
Response - new_folder_id
Upload File
POST /api/v1/file
Request - user id, folder id
Response - url to show status
Status 201 if successful, 202 if upload in progress
Download File
GET /api/v1/file
Request - user id, file id
Response - url to show status
Status 201 if successful, 202 if download in progress
Change permission of file
POST /api/v1/permission?fileid={}
Request - {permission : [read/write/view]}
Share file
GET /api/v1/link?fileid={}
Request- user id
Response - url
High-Level Design
Client web and mobile would be holding the local app to record local changes.
Write service would drop a message to kafka queue to sync with client.
Postgres would manage file metadata, user data and ACLs. File table would hold the record S3 object url for the respective file.
Database Design
Postgres - to capture user details and files/folders metadata
S3 - To store data as objects
Postgres tables - users, files, folders, user_file
Detailed Component Design
If user is offline and not connected to the network then local client would connect to sync service to first bring new changes to/from server.
System reliability - Read service would be running with another 2 more instances that would help during fault tolerance.
S3 would be running with another replica on another region. This would be async replication to maintain durability.
To maintain latest updates on file, each file and sync service would maintain the version, whichever is having higher version that change would be considerd to be updated. If version numbers are matching then comparing would be done based on the timestamp.