Requirements
Functional Requirements:
- Allow users to upload files to the system.
- Enable users to download uploaded files.
- Ensure synchronization of files between local and server storage.
Non-Functional Requirements:
- Durability: this is cloud storage, so we must guarantee that stored data is never lost
- Availability: we must ensure the highest availability possible
- Scalability: handle a huge amount of data
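Capacity Estimates:
- 500M DAU
- Each user uploads and downloads 5 files a day
- 1/10 upload media
- Average file size: 200 KB
- Average media size: 50 MB
- Storage (documents only): 500 * 10^6 * 5 * 200 * 10^3 bytes = 500 TB per day -> 182 PB per year
- Throughput: 500 TB over ~100k seconds a day = 5 GB/s -> 40 Gbps
- Max concurrent connections: 20% of DAU = 100 * 10^6
- Servers for peak: 100 * 10^6 / (50 * 10^3) = 2 * 10^3 = 2000, assuming 50k connections per server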
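The arithmetic above can be checked with a short script. This is only a sketch: the constant names are made up for illustration, the 50k connections per server is the value implied by the last division, and, like the figures above, it counts the 200 KB documents only and leaves media out.

```python
# Back-of-the-envelope capacity estimate using the figures listed above.
# All constant names are illustrative assumptions, not part of the design.
DAU = 500 * 10**6                 # daily active users
FILES_PER_USER_PER_DAY = 5
AVG_FILE_BYTES = 200 * 10**3      # 200 KB, documents only (media excluded)
SECONDS_PER_DAY = 100_000         # rounded up from 86,400 for easy arithmetic
PEAK_CONNECTION_PERCENT = 20      # share of DAU connected at peak
CONNECTIONS_PER_SERVER = 50_000   # assumed capacity of one app server

daily_bytes = DAU * FILES_PER_USER_PER_DAY * AVG_FILE_BYTES
yearly_bytes = daily_bytes * 365
throughput_bits_per_sec = daily_bytes * 8 // SECONDS_PER_DAY
peak_connections = DAU * PEAK_CONNECTION_PERCENT // 100
servers_for_peak = peak_connections // CONNECTIONS_PER_SERVER

print(f"storage per day:  {daily_bytes / 10**12:.0f} TB")
print(f"storage per year: {yearly_bytes / 10**15:.1f} PB")
print(f"throughput:       {throughput_bits_per_sec / 10**9:.0f} Gbps")
print(f"peak connections: {peak_connections:,}")
print(f"servers for peak: {servers_for_peak}")
```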
API Design
- uploadDocument(user_id, name, content, file_type, metadata):
  - POST, upload a new document to the server
- uploadMedia(user_id, name, content, media_type):
  - POST, upload a media file (image, video)
- updateDocument(user_id, document_id, content):
  - PUT, update an existing document; returns 4xx if it does not exist. The history is saved in the MongoDB document
- downloadDocument(user_id, document_id):
  - GET, retrieve and download a document
- searchDocument(user_id, query):
  - GET, search for a document. This can start with MongoDB queries; a complete search system (e.g. Apache Lucene or Elasticsearch), or tags / directories on documents, can be added later
- sync(user_id):
  - return the latest versions recorded in the MongoDB documents so the client can compare them with its local versions. The client app then uploads or downloads whichever versions differ to restore consistency
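The comparison step of sync could look roughly like the sketch below. It assumes each document carries an integer version counter and that a document missing on the server was created locally; neither is stated in the design, and `SyncPlan` and `plan_sync` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    to_download: list[str]  # document ids where the server copy is newer
    to_upload: list[str]    # document ids where the local copy is newer


def plan_sync(local_versions: dict[str, int],
              server_versions: dict[str, int]) -> SyncPlan:
    """Compare version counters and decide which side must send each document."""
    to_download: list[str] = []
    to_upload: list[str] = []
    for doc_id in sorted(local_versions.keys() | server_versions.keys()):
        local = local_versions.get(doc_id)
        server = server_versions.get(doc_id)
        if local is None or (server is not None and server > local):
            to_download.append(doc_id)   # missing locally or outdated locally
        elif server is None or local > server:
            to_upload.append(doc_id)     # missing on server or outdated on server
        # equal versions: nothing to do
    return SyncPlan(to_download, to_upload)


plan = plan_sync(
    local_versions={"a": 1, "b": 3, "c": 2},
    server_versions={"a": 2, "b": 3, "d": 1},
)
print(plan)
```

Deletions are not handled here: telling "deleted on the server" apart from "created locally" would need extra state such as tombstones.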
High-Level Design
We start with multiple app servers behind an API gateway and a load balancer.
The app servers are geographically distributed, since a given user will almost always connect to the same datacenter. The app servers connect to different storage systems: media files go to blob storage and standard files go to HDFS. All metadata is stored in a NoSQL database; MongoDB seems like a good choice (scalable, flexible schema, and we are okay with eventual consistency). This also lets us keep versioning or history in each file's metadata document if we want to allow it.
A cache does not seem useful here because users are unlikely to request the same thing several times in a row. The same goes for a CDN, which serves little purpose for personal data.
The client and server use a polling system to periodically save the document (or check for server updates). An async garbage collector can check for unreferenced data and delete it.
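The garbage collector's core check is a set difference between what is stored and what the metadata still references. A minimal sketch, assuming each metadata document has hypothetical `current_blob` and `history` fields:

```python
from __future__ import annotations


def find_unreferenced(stored_blob_ids: set[str],
                      metadata_docs: list[dict]) -> set[str]:
    """Return ids of stored blobs that no metadata document points to."""
    referenced: set[str] = set()
    for doc in metadata_docs:
        referenced.add(doc["current_blob"])        # latest version
        referenced.update(doc.get("history", []))  # older versions kept as history
    return stored_blob_ids - referenced


stored = {"b1", "b2", "b3", "b4"}
docs = [
    {"current_blob": "b2", "history": ["b1"]},
    {"current_blob": "b3"},
]
orphans = find_unreferenced(stored, docs)
print(orphans)
```

Because content is written before its metadata, a real collector would also need to skip recently written blobs, otherwise it could delete an upload that is still in progress.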
We want the system to be sharded by user key and replicated at least 3 times: in the same rack, in the same datacenter, and on a different continent.
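Sharding by user key can be sketched as a stable hash of the user id taken modulo the shard count. The shard count and function name below are assumptions for illustration only.

```python
import hashlib

NUM_SHARDS = 64  # hypothetical shard count, not specified in the design


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user id to a shard index using a hash that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # The first 8 bytes of the digest are plenty to spread users evenly.
    return int.from_bytes(digest[:8], "big") % num_shards


for uid in ("user-1", "user-2", "user-3"):
    print(uid, "->", shard_for_user(uid))
```

A plain modulo remaps most users when the shard count changes; consistent hashing is a common alternative if resharding is expected.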
In a later version, to optimize space, we can also use a versioning tool like git to save only the file changes and avoid duplicate storage.
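To illustrate saving only the changes, here is a small line-based delta built with Python's `difflib`. It is a sketch of the idea, not git's actual storage format, and `make_delta` / `apply_delta` are hypothetical helpers.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def make_delta(old: list[str], new: list[str]) -> list[tuple]:
    """Encode `new` as copy/insert operations against `old`."""
    delta: list[tuple] = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))          # reuse lines already stored
        elif tag in ("replace", "insert"):
            delta.append(("insert", new[j1:j2]))    # only the new lines are stored
        # "delete" needs no entry: those old lines are simply never copied
    return delta


def apply_delta(old: list[str], delta: list[tuple]) -> list[str]:
    """Rebuild the new version from the old version and a delta."""
    result: list[str] = []
    for op in delta:
        if op[0] == "copy":
            result.extend(old[op[1]:op[2]])
        else:
            result.extend(op[1])
    return result


v1 = ["title", "first paragraph", "second paragraph"]
v2 = ["title", "first paragraph (edited)", "second paragraph", "new ending"]
delta = make_delta(v1, v2)
print(delta)
print(apply_delta(v1, delta) == v2)
```

The trade-off is that reading an old version means replaying deltas from a stored base, so periodic full snapshots are usually kept as well.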
Detailed Component Design
The file manager server handles the connection with the user and allows uploads and downloads. When a new document is uploaded, it saves the content to HDFS and writes the metadata to MongoDB. An update to a document creates a new file in HDFS and appends to the history list in the MongoDB document. The server also uses MongoDB to check which files to download.
For concurrency, we always consider the last write to be the correct one.
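The upload and update flow, together with the last-write-wins rule, can be sketched with in-memory dictionaries standing in for HDFS and MongoDB. Class and field names are hypothetical, and "last write" is taken here to mean the last write to reach the server.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass
class DocumentMeta:
    user_id: str
    name: str
    current_blob: str                                  # id of the latest content
    history: list[str] = field(default_factory=list)   # ids of older versions


class FileManager:
    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}             # stand-in for HDFS
        self.metadata: dict[str, DocumentMeta] = {}   # stand-in for MongoDB

    def _store_blob(self, content: bytes) -> str:
        blob_id = uuid.uuid4().hex
        self.blobs[blob_id] = content
        return blob_id

    def upload(self, user_id: str, name: str, content: bytes) -> str:
        blob_id = self._store_blob(content)   # content first, then metadata
        document_id = uuid.uuid4().hex
        self.metadata[document_id] = DocumentMeta(user_id, name, blob_id)
        return document_id

    def update(self, document_id: str, content: bytes) -> None:
        meta = self.metadata.get(document_id)
        if meta is None:
            raise KeyError(document_id)       # would map to a 4xx response
        blob_id = self._store_blob(content)   # every update is a new blob
        meta.history.append(meta.current_blob)
        meta.current_blob = blob_id           # last write wins: simply overwrite

    def download(self, document_id: str) -> bytes:
        return self.blobs[self.metadata[document_id].current_blob]


fm = FileManager()
doc_id = fm.upload("user-1", "notes.txt", b"v1")
fm.update(doc_id, b"v2 from laptop")
fm.update(doc_id, b"v2 from phone")   # arrives last, so it becomes current
print(fm.download(doc_id))
```

With this rule the laptop's edit is not lost entirely, since it stays in the history list, but it is no longer the current version.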