Design Dropbox - System Design

Requirements

Functional Requirements:

Allow users to upload files to the system.
Enable users to download uploaded files.
Ensure synchronization of files between local and server storage.
Support multiple clients (mobile, desktop 1, desktop 2)

Non-Functional Requirements:

Highly scalable
Synchronization between different clients should be fast

Capacity Estimation

We have 1 billion users
Each user uploads 1000 files per year
The average file size is around 10 MB
This results in roughly 10 EB of file data per year
The metadata is 1 KB per file
This results in our metadata DB needing around 1 PB of storage per year
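
The arithmetic behind these estimates can be checked with a few lines of Python. This is a rough sketch that assumes decimal units (1 MB = 10^6 bytes) and that the 1000 files are per user per year; the constant names are illustrative only.

```python
# Back-of-the-envelope capacity check (decimal units assumed).
USERS = 1_000_000_000
FILES_PER_USER_PER_YEAR = 1_000       # assumption: the figure is per user
AVG_FILE_BYTES = 10 * 10**6           # 10 MB
METADATA_BYTES_PER_FILE = 10**3       # 1 KB

EB = 10**18
PB = 10**15

files_per_year = USERS * FILES_PER_USER_PER_YEAR
blob_bytes = files_per_year * AVG_FILE_BYTES
metadata_bytes = files_per_year * METADATA_BYTES_PER_FILE

print(f"files per year: {files_per_year:.0e}")
print(f"blob storage:   {blob_bytes / EB:.0f} EB")
print(f"metadata:       {metadata_bytes / PB:.0f} PB")
```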

API Design

Upload_file (file name) -> success with file link
Download_file (file link) -> receive file
Synchronization (list of file IDs with their last updated timestamps) -> list of files to update

High-Level Design

We have the following entities in our design

Users (list of clients, list of files that the user owns)
Files (individual file information): the file ID, a blob ID pointing to the stored object, and metadata such as last updated time, creation date, and privacy settings
Links (URL link, file ID)
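
As a rough illustration, the three entities could be modeled as plain Python dataclasses. The field names and the boolean privacy flag are assumptions made for this sketch, not a fixed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class File:
    file_id: str
    blob_id: Optional[str]       # None until the blob upload is finalized
    created_at: datetime
    last_updated: datetime
    is_private: bool = True      # simplified stand-in for "privacy settings"


@dataclass
class Link:
    url: str
    file_id: str


@dataclass
class User:
    user_id: str
    client_ids: list = field(default_factory=list)  # e.g. mobile, desktop 1
    file_ids: list = field(default_factory=list)    # files the user owns


now = datetime.now(timezone.utc)
alice = User(user_id="u1", client_ids=["mobile", "desktop-1"])
report = File(file_id="f1", blob_id=None, created_at=now, last_updated=now)
alice.file_ids.append(report.file_id)
share = Link(url="https://example.invalid/s/abc", file_id=report.file_id)
```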

Sequence of steps

Upload a file

Client calls the Upload_file API
HTTP request goes to the API gateway
The API gateway routes the request to the file service
The file service creates a File in our database
The file service then returns a blob storage link so the client can upload the file contents directly to the blob DB
After the blob storage upload completes, the client passes the finalized blob ID to the file service
The file service marks the File in our database as complete and generates a link for the file
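
The two-phase upload above can be sketched with an in-memory stand-in for the file service. The class name, method names, status values, and example URLs are all hypothetical; a real service would issue a signed upload URL from the blob store rather than build a string.

```python
import uuid


class FileService:
    """In-memory stand-in for the file service and its metadata database."""

    def __init__(self):
        self.files = {}   # file_id -> metadata record
        self.links = {}   # share link -> file_id

    def start_upload(self, file_name):
        # Create a pending File record and return a blob upload target.
        file_id = uuid.uuid4().hex
        self.files[file_id] = {"name": file_name, "blob_id": None, "status": "pending"}
        upload_url = f"https://blob.example.invalid/upload/{file_id}"  # placeholder
        return file_id, upload_url

    def finalize_upload(self, file_id, blob_id):
        # Record the blob ID, mark the File complete, and generate its link.
        record = self.files[file_id]
        record["blob_id"] = blob_id
        record["status"] = "complete"
        link = f"https://files.example.invalid/f/{uuid.uuid4().hex}"  # placeholder
        self.links[link] = file_id
        return link


service = FileService()
file_id, upload_url = service.start_upload("notes.txt")
# The client would now upload the bytes directly to upload_url
# and receive a blob ID from the blob store.
link = service.finalize_upload(file_id, blob_id="blob-123")
```

Keeping the file bytes out of the file service means it only ever handles small metadata requests, which is what makes the direct-to-blob upload attractive.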

Synchronization

Synchronization is triggered by the upload flow
Once the upload is finalized and the File is marked complete, other devices can be synchronized
Clients automatically poll the server every 30 seconds to 1 minute. When they poll, they submit their list of files and each file's last updated timestamp
If a file's timestamp does not match or some files are missing, the file service looks them up in the database and returns those files and their blob links
The client then updates each file by fetching it from the blob DB
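
A minimal sketch of the comparison the file service performs on each poll, assuming timestamps are comparable numbers and that the server's copy is always the source of truth. The function and variable names are illustrative.

```python
def files_to_sync(server_files, client_files):
    """Return {file_id: blob_link} for files the client lacks or holds stale.

    server_files: file_id -> (last_updated, blob_link)
    client_files: file_id -> last_updated
    """
    stale = {}
    for file_id, (server_ts, blob_link) in server_files.items():
        client_ts = client_files.get(file_id)
        # Missing on the client, or older than the server's version.
        if client_ts is None or client_ts < server_ts:
            stale[file_id] = blob_link
    return stale


server = {
    "a": (100, "blob://a-v2"),
    "b": (50, "blob://b-v1"),
    "c": (70, "blob://c-v1"),
}
client = {"a": 90, "b": 50}   # "a" is outdated, "c" is missing
result = files_to_sync(server, client)
print(result)
```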

Download a file

Client requests to download a file
HTTP request goes to the API gateway
The API gateway routes the request to the file service, which looks up the File in our database
The file service returns the blob link to download the file
Client downloads the file directly from the blob DB

Database Design

Because the files are so large, we abstract most of the storage away by using a global blob DB. The vast majority of the file storage is then handled by something like Amazon S3 or Google Cloud Storage.

The important metadata for files is kept in the file database used by our file service. The database is mostly going to be queried by file URL and by file ID, so we need indexes on both. The URL is used when other users download the file. A user's own clients hold the internal file ID, and we need to be able to return the blob link for that file ID.

Another option is to have a separate table for file URLs that returns the file ID associated with each URL. I prefer this option.
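
The separate-table option can be sketched with SQLite from the Python standard library. The table and column names are assumptions for illustration; the primary keys provide the indexes on URL and file ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE files (
        file_id      TEXT PRIMARY KEY,   -- indexed lookup for a user's clients
        blob_link    TEXT NOT NULL,
        last_updated INTEGER NOT NULL
    );
    CREATE TABLE links (
        url     TEXT PRIMARY KEY,        -- indexed lookup for shared URLs
        file_id TEXT NOT NULL REFERENCES files(file_id)
    );
    """
)
conn.execute("INSERT INTO files VALUES (?, ?, ?)", ("f1", "blob://bucket/f1", 1700000000))
conn.execute("INSERT INTO links VALUES (?, ?)", ("https://example.invalid/s/abc", "f1"))


def blob_link_for_url(url):
    # URL -> file ID -> blob link, used when another user opens a shared link.
    row = conn.execute(
        "SELECT f.blob_link FROM links l "
        "JOIN files f ON f.file_id = l.file_id WHERE l.url = ?",
        (url,),
    ).fetchone()
    return row[0] if row else None


def blob_link_for_file_id(file_id):
    # Direct lookup by internal file ID.
    row = conn.execute(
        "SELECT blob_link FROM files WHERE file_id = ?", (file_id,)
    ).fetchone()
    return row[0] if row else None
```

With this layout a file can have several URLs, or have a URL revoked, without touching the file row itself.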

Detailed Component Design

Since so much data exists, we need to split the data across multiple database instances.

We're going to partition the data by file ID. This splits the data roughly evenly across nodes. Another way is to partition by user, but that lets hot users overload certain nodes. Partitioning by file ID has its own risk: a single hot file can overload the node that holds it.
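
One simple way to partition by file ID is to hash the ID and take it modulo the number of shards. This is a sketch of that idea only; the shard count is hypothetical, and plain modulo hashing forces most keys to move when the shard count changes, which consistent hashing would reduce.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical number of database instances


def shard_for(file_id: str, num_shards: int = NUM_SHARDS) -> int:
    # Use a stable hash so every file service node picks the same shard.
    digest = hashlib.sha256(file_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Count how 10,000 synthetic file IDs spread across the shards.
counts = [0] * NUM_SHARDS
for i in range(10_000):
    counts[shard_for(f"file-{i}")] += 1
print(counts)
```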

Our file service needs to handle roughly 30K file uploads per second on average (10^12 files per year divided by about 3.15 x 10^7 seconds per year). At 1 KB of metadata per file, this is about 30 MB/s of incoming metadata traffic. A single node could carry that bandwidth, but peak traffic will exceed the average and each request also involves database writes and link generation, so we assume the file service needs to be scaled horizontally. We add a load balancer to the API gateway so that it can choose between different nodes of the file service. Since each file from a user is expected to be different, we can generally assume no concurrency issues between file service nodes. In normal operation each node writes the file record into the file database and generates a blob DB link for the user to upload to, so there should be no conflicts between files.

How to make synchronization fast?

Currently, each time one of a user's clients logs in, it sends the file service the list of all the files it has along with their last updated timestamps. The file service then needs to look up each file in the file database and return the relevant blob links if they need to be updated. For users with thousands of files, this can be a very slow process even if only one or two files need updating. One way to solve this is to maintain a queue that has a separate channel for each client. When a file is uploaded from one client, we write the blob link of that file to the channel of each of the user's other clients, so each client only reads the changes it has not yet seen.
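
A minimal in-memory sketch of the per-client channel idea. The class and method names are hypothetical, and skipping the client that made the upload is an assumption of this sketch; a production system would use a durable message queue rather than Python deques.

```python
from collections import defaultdict, deque


class SyncChannels:
    def __init__(self):
        self.clients_by_user = defaultdict(set)  # user_id -> client_ids
        self.channels = defaultdict(deque)       # client_id -> pending blob links

    def register(self, user_id, client_id):
        self.clients_by_user[user_id].add(client_id)

    def publish(self, user_id, source_client_id, blob_link):
        # Fan out to the user's other clients (assumption: the uploading
        # client already has the file, so it is skipped).
        for client_id in self.clients_by_user[user_id]:
            if client_id != source_client_id:
                self.channels[client_id].append(blob_link)

    def drain(self, client_id):
        # Return and clear everything queued for this client.
        pending = list(self.channels[client_id])
        self.channels[client_id].clear()
        return pending


channels = SyncChannels()
for client_id in ("mobile", "desktop-1", "desktop-2"):
    channels.register("u1", client_id)
channels.publish("u1", source_client_id="mobile", blob_link="blob://f1")
```

A client that syncs now calls drain for its own channel and fetches only the listed blobs, regardless of how many files the user owns.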

How to support updates on files that are large?

One issue with our current design is that we rely on the client to upload to and update the blob DB. This is fine for a new file, but for a file that is being updated we don't necessarily want to upload a new blob, and we definitely don't want to re-upload the full file. Instead we should have a separate API for updating a file, distinct from uploading a new one. To update a file, the client sends the file ID and the diff. The file service is then in charge of changing the blob in the blob DB.

Instead of having client 2 re-download whole files from the blob DB when it synchronizes, the update service writes diffs into a pending updates queue. When the client syncs, the update service reads the pending updates queue and returns the list of diffs in order. Client 2 then applies those diffs to its local files. This assumes the files already exist locally; files that the client has never downloaded still have to be retrieved from the blob DB.
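
A rough sketch of the pending updates queue and of a client applying diffs in order. The diff format here (byte offset, number of bytes deleted, bytes inserted) is an assumption chosen for simplicity, and all names are hypothetical; real systems often use block-level or rolling-hash deltas.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Diff:
    offset: int       # byte position where the edit starts
    delete_len: int   # number of bytes removed at offset
    insert: bytes     # bytes written at offset


def apply_diff(data: bytes, diff: Diff) -> bytes:
    return data[:diff.offset] + diff.insert + data[diff.offset + diff.delete_len:]


pending = defaultdict(deque)  # (client_id, file_id) -> queued diffs, oldest first


def record_update(client_ids, file_id, diff):
    # Update service side: queue the diff for every client that must catch up.
    for client_id in client_ids:
        pending[(client_id, file_id)].append(diff)


def sync(client_id, file_id, local_copy: bytes) -> bytes:
    # Client side: apply queued diffs in the order they were recorded.
    queue = pending[(client_id, file_id)]
    while queue:
        local_copy = apply_diff(local_copy, queue.popleft())
    return local_copy


record_update(["client-2"], "f1", Diff(offset=6, delete_len=5, insert=b"there"))
record_update(["client-2"], "f1", Diff(offset=11, delete_len=0, insert=b"!"))
updated = sync("client-2", "f1", b"hello world")
print(updated)
```

Order matters because each diff's offset refers to the file as it looked after the previous diff was applied.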