Design Dropbox - System Design

Requirements

Functional Requirements:

Allow users to upload files to the system.
Enable users to download uploaded files.
Ensure synchronization of files between local and server storage.

Non-Functional Requirements:

Durability: this is cloud storage, so data that has been stored must never be lost
Availability: the service should be as highly available as possible
Scalability: the system must handle a huge amount of data

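500M DAU

Each user uploads and downloads 5 files a day

1/10 of uploads are media

Avg file size: 200 KB

Avg media size: 50 MB

Storage (standard files only, media not included): 500 * 10^6 * 5 * 200 * 10^3 bytes = 500 TB per day -> 182 PB per year

Throughput: about 100k seconds in a day, so 500 TB / 10^5 s = 5 GB/s -> 40 Gbps

Max concurrent connections: 20% of DAU = 100 * 10^6

Assuming 50k connections per server: (100 * 10^6) / (50 * 10^3) = 2 * 10^3 = 2000 servers for peak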
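
The arithmetic above can be re-checked with a short script. This is only a sketch of the same back-of-the-envelope estimate: the constant names are made up for illustration, and the 50k connections per server figure is an assumption read off the divisor in the notes.

```python
# Back-of-the-envelope capacity estimate using the figures from the notes.
DAU = 500 * 10**6                # daily active users
FILES_PER_USER_PER_DAY = 5       # files uploaded and downloaded per user per day
AVG_FILE_BYTES = 200 * 10**3     # 200 KB average standard file
SECONDS_PER_DAY = 100_000        # 86,400 rounded up for easy arithmetic
PEAK_CONNECTION_PERCENT = 20     # peak concurrent connections as % of DAU
CONNECTIONS_PER_SERVER = 50_000  # assumption implied by the divisor in the notes

# Daily volume for standard files only (media is not counted here).
bytes_per_day = DAU * FILES_PER_USER_PER_DAY * AVG_FILE_BYTES
tb_per_day = bytes_per_day / 10**12
pb_per_year = bytes_per_day * 365 / 10**15

# Average throughput in gigabits per second.
gbps = bytes_per_day * 8 / SECONDS_PER_DAY / 10**9

# Server count needed to hold the peak number of connections.
peak_connections = DAU * PEAK_CONNECTION_PERCENT // 100
servers_for_peak = peak_connections // CONNECTIONS_PER_SERVER

print(tb_per_day, pb_per_year, gbps, peak_connections, servers_for_peak)
```

Changing any single constant (for example the average file size) shows how sensitive the storage and server counts are to that assumption.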

API Design

uploadDocument(user_id, name, content, file_type, metadata):
- POST, upload a new document to the server
uploadMedia(user_id, name, content, media_type):
- POST, upload a media file (image, video)
updateDocument(user_id, document_id, content):
- PUT, update an existing document; returns 4xx if it does not exist. The history is saved in the MongoDB document
downloadDocument(user_id, document_id):
- GET, retrieve and download a document
searchDocument(user_id, query):
- GET, search for a document. This can start with MongoDB queries; a full search system (e.g. Apache Lucene or Elasticsearch) or tags / directories on documents can be added later
sync(user_id):
- return the latest version of each of the user's MongoDB documents so they can be compared with the local versions. The client app then uploads or downloads whichever versions differ to restore consistency
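
As an illustration of what the client could do with the result of sync, here is a minimal sketch. It assumes each document carries an integer version number that increases on every update; that scheme, and the names `SyncPlan` and `plan_sync`, are hypothetical and not part of the API above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    to_upload: list    # document ids where the local copy is newer
    to_download: list  # document ids where the server copy is newer


def plan_sync(local_versions: dict, server_versions: dict) -> SyncPlan:
    """Compare {document_id: version} maps and decide what to transfer."""
    to_upload, to_download = [], []
    for doc_id in sorted(set(local_versions) | set(server_versions)):
        local = local_versions.get(doc_id)
        server = server_versions.get(doc_id)
        if server is None or (local is not None and local > server):
            # Missing on the server, or the local copy is ahead.
            to_upload.append(doc_id)
        elif local is None or server > local:
            # Missing locally, or the server copy is ahead.
            to_download.append(doc_id)
        # Equal versions need no transfer.
    return SyncPlan(to_upload, to_download)


plan = plan_sync({"a": 2, "b": 1, "c": 3}, {"a": 1, "b": 2, "c": 3, "d": 1})
print(plan)
```

A plain version counter cannot detect the case where both sides changed the same document since the last sync; handling that would need extra information such as the last synced version.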

High-Level Design

We start with multiple app servers behind an API gateway and a load balancer.

The app servers will be geographically distributed, since each user will most likely always connect to the same datacenter. The app servers will have to connect to different storages. Media files will be stored in blob storage, and standard files in HDFS. All metadata will be stored in a NoSQL database; MongoDB looks like a good choice (scalable, flexible schema, and we are okay with eventual consistency). That way each file's metadata document can also keep a version history, if we want to allow it.

A cache doesn't seem useful here because users are unlikely to query the same thing several times in a row. The same goes for a CDN, which serves no purpose for personal data.

The client and server will use a polling system to periodically save the document (or check for server updates). An async garbage collector can check for unreferenced data and delete it.
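
The garbage collection step can be sketched as a set difference between what is stored and what the metadata still points to. This is a minimal sketch, assuming each metadata document has a `storage_key` field for the current content and an optional `history` list of older keys; both field names are hypothetical.

```python
def find_unreferenced(stored_keys, metadata_docs):
    """Return storage keys that no metadata document refers to any more."""
    referenced = set()
    for doc in metadata_docs:
        # The current version of the file.
        referenced.add(doc["storage_key"])
        # Older versions kept for history (hypothetical field).
        referenced.update(doc.get("history", []))
    return set(stored_keys) - referenced


stored = {"k1", "k2", "k3", "k4"}
docs = [
    {"storage_key": "k1", "history": ["k2"]},
    {"storage_key": "k3"},
]
orphans = find_unreferenced(stored, docs)
print(orphans)
```

In a real deployment the collector would probably also skip recently written keys, so that an upload whose metadata has not been saved yet is not deleted by mistake.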

We want the system to be sharded by user key and replicated at least 3 times: in the same rack, in the same datacenter, and on a different continent.
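
One simple way to shard by user key is to hash the user id and take it modulo the number of shards. The sketch below assumes that approach; `shard_for_user`, `REPLICA_LOCATIONS`, and `placement_for_user` are illustrative names, not an existing API.

```python
import hashlib

# The three replica locations named in the design.
REPLICA_LOCATIONS = ("same-rack", "same-datacenter", "other-continent")


def shard_for_user(user_id: str, num_shards: int) -> int:
    """Map a user id to a shard index using a stable hash."""
    # sha256 is stable across processes, unlike Python's built-in hash().
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def placement_for_user(user_id: str, num_shards: int) -> list:
    """List the (shard, location) pairs holding a user's data."""
    shard = shard_for_user(user_id, num_shards)
    return [(shard, location) for location in REPLICA_LOCATIONS]


print(placement_for_user("user-42", 16))
```

Plain modulo hashing moves most users to a different shard when the shard count changes; consistent hashing is a common alternative if resharding is expected.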

In a later version, to optimize space, we can also consider using a versioning tool like git to save only the file changes and avoid duplicate storage.

Detailed Component Design

The file manager server handles the connection with the user and allows uploads and downloads. When a new document is uploaded, it saves the content to HDFS and writes the metadata to MongoDB. An update to a document creates a new file in HDFS and updates the MongoDB document with a history list. The server also uses MongoDB to check which files to download.
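
The upload and update flow can be sketched with in-memory stand-ins for the two stores. In this sketch a dict replaces HDFS and another dict replaces MongoDB; the class, the field names, and the use of `KeyError` for the "document does not exist" case are all assumptions made for illustration.

```python
import uuid


class FileManager:
    def __init__(self):
        self.blob_store = {}  # stand-in for HDFS: storage_key -> bytes
        self.metadata = {}    # stand-in for MongoDB: document_id -> dict

    def _write_blob(self, content: bytes) -> str:
        # Every write creates a new object; nothing is overwritten in place.
        key = uuid.uuid4().hex
        self.blob_store[key] = content
        return key

    def _get_doc(self, user_id: str, document_id: str) -> dict:
        doc = self.metadata.get(document_id)
        if doc is None or doc["user_id"] != user_id:
            # The API layer would translate this into a 4xx response.
            raise KeyError(document_id)
        return doc

    def upload_document(self, user_id: str, name: str, content: bytes) -> str:
        document_id = uuid.uuid4().hex
        self.metadata[document_id] = {
            "user_id": user_id,
            "name": name,
            "storage_key": self._write_blob(content),
            "history": [],  # storage keys of previous versions
        }
        return document_id

    def update_document(self, user_id: str, document_id: str, content: bytes) -> None:
        doc = self._get_doc(user_id, document_id)
        new_key = self._write_blob(content)
        # Keep the old version reachable through the history list.
        doc["history"].append(doc["storage_key"])
        doc["storage_key"] = new_key

    def download_document(self, user_id: str, document_id: str) -> bytes:
        doc = self._get_doc(user_id, document_id)
        return self.blob_store[doc["storage_key"]]
```

Writing the new content first and only then updating the metadata means a failure between the two steps leaves an unreferenced object rather than a metadata entry pointing at missing content.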

For concurrency, we always consider the last write to be the correct one (last write wins).
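
A last-write-wins rule can be sketched as picking the write with the highest timestamp. The sketch assumes the server assigns the timestamps, and it breaks ties by client id so the result is deterministic; the `Write` type and both of those choices are assumptions, not part of the design above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    client_id: str
    timestamp_ms: int  # assumed to be assigned by the server on receipt
    storage_key: str   # where the content of this write was stored


def resolve_last_write_wins(writes):
    """Return the write that should become the current version."""
    # Highest timestamp wins; client_id is only a deterministic tie-breaker.
    return max(writes, key=lambda w: (w.timestamp_ms, w.client_id))


concurrent = [
    Write("laptop", 100, "k1"),
    Write("phone", 200, "k2"),
    Write("tablet", 150, "k3"),
]
winner = resolve_last_write_wins(concurrent)
print(winner.storage_key)
```

The losing writes are not merged. With the history list kept in the metadata they remain recoverable, but they no longer appear as the current version.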