My Solution for Design Dropbox with Score: 7/10

by utopia4715

System requirements


Functional:


  1. User can upload files to this system.
  2. User can update existing files.
  3. User can organize files in directories.
  4. User can share file or directory with other users.
  5. File sharing options are read-only, or read-write.


Of course there are many other requirements such as preview, collaboration, and so on. But I will start with these requirements.



Non-Functional:


  1. Data durability. Once a file is uploaded, it should never get lost.
  2. Availability
  3. Scalability. Many users.
  4. Consistency. It would be desirable for files to be consistent across data centers.
  5. Security. Files can be accessed only by the owner and by users with explicit sharing permissions. By default, a file is not shared with anybody other than the owner.
  6. Response time: reasonable. Would like to show a home directory quickly (e.g. sub-second); uploads and downloads taking 1-10 s depending on file size would be acceptable.



Key Points:

  1. Data partitioning and backup.
  2. Tradeoff between consistency and scalability.
  3. Privacy concern w.r.t. deleting files.




Capacity estimation


200M DAU

Each person reads about twice a day; assume roughly one in ten users uploads a new file per day.

Average file size: 1MB

20M new files generated per day


20TB new files per day.


In two years, it would be 14.6PB of data
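The estimates above can be checked with quick arithmetic (the 10%-of-DAU write rate is an assumption chosen to match the 20M-files figure):

```python
# Back-of-the-envelope check of the capacity estimates above.
MB = 10**6

new_files_per_day = 20_000_000       # ~10% of 200M DAU upload one file/day
avg_file_size = 1 * MB

daily_ingest = new_files_per_day * avg_file_size  # bytes per day
print(daily_ingest / 10**12)         # -> 20.0 (TB/day)

two_years = daily_ingest * 365 * 2
print(two_years / 10**15)            # -> 14.6 (PB)
```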


Rough Idea:

  • Raw data storage, like Amazon S3, for files themselves.
  • RDB for metadata



API design


  1. upload(user_ID, file_content, file_name, parent_directory_ID) -> returns file_ID in JSON along with some metadata.
  2. update(user_ID, file_ID, file_content) -> replaces the whole file
  3. download(user_ID, file_ID)
  4. preview(user_ID, file_ID)
  5. mkdir(user_ID, parent_directory_ID, directory_name)
  6. share(file_ID, [user_IDs], sharing_type) -> directory_ID can be used in place of file_ID. group_IDs can be used in place of user_IDs. Sharing_type has several options: read only, read-write, can upload (but not download), can comment (but not update), etc.


All APIs return HTTP error codes.
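A minimal in-memory sketch of the upload/update endpoints above — the storage dict, UUID file IDs, and JSON response shape are illustrative assumptions, not a real Dropbox API:

```python
# Hypothetical sketch of the upload/update API surface.
import json
import uuid

_files = {}  # file_ID -> metadata (stand-in for the metadata RDB)

def upload(user_id, file_content, file_name, parent_directory_id):
    """Store a new file and return its file_ID plus metadata as JSON."""
    file_id = str(uuid.uuid4())
    _files[file_id] = {
        "file_ID": file_id,
        "file_name": file_name,
        "parent_directory_ID": parent_directory_id,
        "owner": user_id,
        "size": len(file_content),
    }
    return json.dumps({"status": 200, **_files[file_id]})

def update(user_id, file_id, file_content):
    """Replace the whole file, as the API description specifies."""
    if file_id not in _files or _files[file_id]["owner"] != user_id:
        return json.dumps({"status": 403})  # no permission / unknown file
    _files[file_id]["size"] = len(file_content)
    return json.dumps({"status": 200})

resp = json.loads(upload("alice", b"hello", "notes.txt", "root"))
print(resp["status"])  # -> 200
```

In a real deployment the permission check would consult the sharing tables rather than just the owner field.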


Database design


One approach is to store file contents in a blob store, such as Amazon S3. It would be scalable.


Blob stores generally do not allow in-place editing: S3 objects, for example, are immutable, so if I have a doc file and add a couple of lines, I would have to upload the entire object again instead of storing the delta. I will leave the details as homework. If a blob store allows delta updates, no problem; if none do, we might have to build a special-purpose file storage layer on top.


Metadata would include:

  • File name
  • Location (e.g. parent directory ID)
  • Sharing options
  • Groups of users
  • What kind of access rights (e.g. read/write, read-only, comment only ...) each group has
  • Owner
  • Created time
  • Last edited time
  • File checksum


I would store this in a relational database. The metadata volume will be well within the comfort zone of an RDB, and ACID consistency for metadata would be highly desirable.
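A minimal sketch of the metadata tables, using SQLite as a stand-in for the RDB; the table and column names are my assumptions based on the list above:

```python
# Sketch of the metadata schema (SQLite standing in for the real RDB).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (
    file_id    TEXT PRIMARY KEY,
    file_name  TEXT NOT NULL,
    parent_dir TEXT,              -- location: parent directory ID
    owner_id   TEXT NOT NULL,
    checksum   TEXT,              -- file checksum for integrity checks
    created_at TEXT,
    edited_at  TEXT
);
CREATE TABLE shares (             -- sharing options per user or group
    file_id    TEXT REFERENCES files(file_id),
    grantee_id TEXT,              -- user ID or group ID
    access     TEXT CHECK (access IN ('read-only','read-write','comment-only'))
);
""")

conn.execute("INSERT INTO files VALUES "
             "('f1','notes.txt','root','alice','abc123','2024-01-01','2024-01-01')")
conn.execute("INSERT INTO shares VALUES ('f1','bob','read-only')")

row = conn.execute(
    "SELECT access FROM shares WHERE file_id='f1' AND grantee_id='bob'"
).fetchone()
print(row[0])  # -> read-only
```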


High-level design


It would be smart to use a CDN. Not all, but some files will show "write-once, read many times" characteristics. For example, a famous person uploads some announcement and shares it with many users. In such a case, the CDN would cache this file, and the many readers would access the cached copy instead of hitting the origin service.


An important design aspect is splitting up a big file into multiple chunks (e.g. 1MB each chunk). This would:

  • Help upload and download efficiency, as we can parallelize transfers and retry only the failed chunks.
  • Help deduplication when the same chunk appears in multiple files.
  • Help storage management, as we can move chunks around for replication and partitioning.
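The chunking-plus-dedup idea can be sketched as content-addressed storage: chunks are keyed by their hash, so identical chunks across files are stored once. A minimal sketch (1 MB chunks as suggested above; SHA-256 as an assumed hash choice):

```python
# Sketch of fixed-size chunking with content-addressed deduplication.
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MB per chunk, as in the text

def chunk_file(data: bytes):
    """Split data into chunks; key each chunk by its SHA-256 digest."""
    chunks = {}   # digest -> chunk bytes (the dedup'd store)
    order = []    # ordered digests needed to reassemble the file
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        chunks.setdefault(digest, chunk)  # identical chunks dedup for free
        order.append(digest)
    return order, chunks

# Two identical 1 MB chunks plus a small tail: 3 logical, 2 unique.
data = b"a" * CHUNK_SIZE * 2 + b"b" * 100
order, store = chunk_file(data)
print(len(order), len(store))  # -> 3 2
```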



Request flows






Detailed component design

Write path consists of:

  • File upload. This has to provide enough upload bandwidth. Chunked, resumable uploads would help; SFTP-based upload could be offered as well, though HTTP-based chunked upload is the more common choice at this scale.
  • Some metadata, e.g., user ID, creation time, etc., will be written to the RDB.
  • File upload will publish a message to a Message Queue, which triggers several async jobs: scanning for banned or malicious content, virus checks, generating preview files, etc. The key is that the async jobs can run independently of the main upload path.
  • Another async task replicates uploaded files to multiple locations.
  • File sharing, along with other metadata updates, is handled by the Metadata service.
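The decoupling of the upload path from the async jobs can be illustrated with an in-process queue; `queue.Queue` stands in for a real message broker (e.g. Kafka or SQS), and the worker names are illustrative:

```python
# Sketch: upload path publishes an event; async workers consume it later.
import queue

upload_events = queue.Queue()  # stand-in for the Message Queue

def upload_handler(file_id):
    # ...write blob + metadata, then publish an event and return quickly.
    upload_events.put({"type": "file_uploaded", "file_ID": file_id})

def run_async_workers():
    """Drain the queue; each job acts independently on the same event."""
    done = []
    while not upload_events.empty():
        event = upload_events.get()
        done.append(f"virus_scan:{event['file_ID']}")
        done.append(f"preview:{event['file_ID']}")
    return done

upload_handler("f1")
results = run_async_workers()
print(results)
```

The point is that `upload_handler` returns as soon as the event is enqueued; scanning and preview generation happen off the critical path.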



Read path consists of:

  • Downloading is taken care of by File Download Service.
  • Preview is shown by Preview service.


Services are split as follows:

  • Upload Service as the workhorse: a high-throughput service handling the critical path of uploading files.
  • Async tasks that can run independently of the Upload Service.
  • Read paths split between the Download Service and the Preview Service.


Data replication:

  • Each uploaded file should be replicated to: one copy in the same data center for quick recovery, one copy in another data center in the same region, and finally one in a data center in another region in case of a regional disaster.


Data partition:

  • Partitioning data so that clients are close to the data (e.g. files uploaded by a customer in Japan can be stored in a data center in Japan) would improve read performance.
  • Geo-location-based partitioning also helps with disaster recovery, so this is an idea worth exploring. We should measure performance and see whether the benefit outweighs the challenges (more complex architecture, more data-center cost).


The API Gateway authorizes access requests via OIDC: the client presents the scopes its token carries, and the gateway verifies that those scopes are sufficient to grant access to the requested files.
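A hedged sketch of the scope check the gateway would perform; the scope names (`files.read`, `files.write`) are made up for illustration, not standard OIDC claims:

```python
# Sketch of an API-gateway scope check (scope names are illustrative).
REQUIRED_SCOPES = {
    "download": {"files.read"},
    "upload":   {"files.write"},
}

def authorize(operation: str, token_scopes: set) -> bool:
    """Grant access only if the token carries every required scope."""
    return REQUIRED_SCOPES.get(operation, set()) <= token_scopes

print(authorize("download", {"files.read", "profile"}))  # -> True
print(authorize("upload", {"files.read"}))               # -> False
```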


Algorithm:


One of the most important aspects is to make sure files are uploaded correctly, and stay intact afterwards.


  1. Before a file is uploaded, the client should compute a checksum of the file (e.g. SHA-256; SHA-1 is no longer collision-resistant) and send it to the server along with the file itself.
  2. One of the async jobs should read the file from Blob Store, calculate the checksum, and make sure it matches the checksum sent in (1).
  3. Periodically, another batch job should calculate this checksum and make sure the data stays intact. This is particularly important when data moves (e.g. recovered from a backup copy, or moved to a different data center).
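Steps 1 and 2 above amount to computing the same digest on both sides and comparing; a minimal sketch (using SHA-256 rather than the SHA-1 mentioned in the text, since SHA-1 is weaker):

```python
# Sketch of the end-to-end integrity check from steps 1-2 above.
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# 1. Client side: compute the checksum before upload.
payload = b"file contents"
client_sum = checksum(payload)

# 2. Async job: read the blob back from storage and verify it matches.
stored = payload  # what the blob store returns
assert checksum(stored) == client_sum, "corruption detected"
print("integrity OK")
```

The periodic batch job in step 3 runs the same comparison against the checksum stored in the metadata RDB.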


Trade offs/Tech choices


Another interesting trade-off is privacy vs data durability.


Privacy laws require that users can delete their data completely.

However, to be able to recover data from a disaster, it's important that the files are copied for backup in multiple places.


We need a consistent and accurate index so that we can find all the copies before deleting them. Such an index should be stored in an RDB for strong consistency.


It would be important to have a good UI that makes clear to the user that, once removed, the data is completely wiped out and cannot be recovered.



Failure scenarios/bottlenecks






Future improvements
