Design Dropbox - System Design

System requirements

Functional:

Put Data: Users can store any kind of file, such as photos, documents, and videos
Get Data: Users can download any file they previously uploaded, at any time
Delete Data: Users can delete files from their storage
List Data: Users can see a list of their files
Share Data: Users can share files with groups of people and assign access rights

Non-Functional:

Availability: highly available
Durability: data, once uploaded, must not be lost unless the user deletes it
Scalability: The system should be capable of handling billions of users
Reliability: Since failures are a norm in distributed systems, our design should detect and recover from failures promptly.
Security: data can only be read or written with permission. By default, a file is shared with no one other than its owner
Response time: reasonable. Listing files should be fast; uploads and downloads may take several minutes depending on the size of the file.

Capacity estimation

20 million new files will be uploaded per day

Storage:

Average file size = 3 MB, metadata size = 256 bytes per file

20M * 3 MB = 60 TB/day

20M * 256B = 5 GB/day

For 2 years: 60 TB/day * 365 * 2 ≈ 43.8 PB of file content (plus about 3.7 TB of metadata)

Bandwidth: write : read = 1:2

Write: 20M * 3 MB / (24 * 60 * 60 s) ≈ 694 MB/s

Read: 2 * 694 MB/s ≈ 1.4 GB/s
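
The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes) and a uniform upload rate over the day; the constant names are illustrative only.

```python
# Back-of-the-envelope capacity estimate, using decimal units (1 TB = 10**12 bytes).
FILES_PER_DAY = 20_000_000
AVG_FILE_BYTES = 3 * 10**6      # 3 MB average file
METADATA_BYTES = 256            # metadata per file
DAYS = 365 * 2                  # two-year horizon
READ_TO_WRITE = 2               # write : read = 1 : 2
SECONDS_PER_DAY = 24 * 60 * 60

content_per_day = FILES_PER_DAY * AVG_FILE_BYTES      # bytes/day
metadata_per_day = FILES_PER_DAY * METADATA_BYTES     # bytes/day
content_two_years = content_per_day * DAYS            # bytes
metadata_two_years = metadata_per_day * DAYS          # bytes
write_bandwidth = content_per_day / SECONDS_PER_DAY   # bytes/s, assuming uniform traffic
read_bandwidth = write_bandwidth * READ_TO_WRITE      # bytes/s

print(f"content per day:     {content_per_day / 10**12:.1f} TB")
print(f"metadata per day:    {metadata_per_day / 10**9:.2f} GB")
print(f"content in 2 years:  {content_two_years / 10**15:.1f} PB")
print(f"metadata in 2 years: {metadata_two_years / 10**12:.1f} TB")
print(f"write bandwidth:     {write_bandwidth / 10**6:.0f} MB/s")
print(f"read bandwidth:      {read_bandwidth / 10**6:.0f} MB/s")
```

Real traffic is rarely uniform, so peak bandwidth would likely be a multiple of these averages.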

Number of servers

Storage * 1000 / 500 RPS

API design

POST upload(user_ID, file_content, file_name, parent_directory_ID) -> returns file_ID in JSON along with some metadata. This initiates an upload process. It also returns the URL to which upload_chunk() should send chunks.
POST upload_chunk(user_ID, file_ID, chunk_ID, chunk_content) -> uploads a specific chunk of a file
PATCH update_chunk(user_ID, file_ID, chunk_ID, chunk_content) -> replaces a specific chunk of a file
GET download(user_ID, file_ID)
POST mkdir(user_ID, parent_directory_ID, directory_name) -> creates a directory, so it should not be a GET
GET preview(user_ID, file_ID) # Nice-to-have
POST share(file_ID, [user_IDs], sharing_type) -> directory_ID can be used in place of file_ID. group_IDs can be used in place of user_IDs. Sharing_type has several options: read only, read-write, can upload (but not download), can comment (but not update), etc. # Nice-to-have
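
One way to model sharing_type is as a small set of named options, each mapped to the actions it permits. The sketch below is only an illustration: the names SharingType, ALLOWED_ACTIONS, and is_allowed are hypothetical, and the exact action set behind each option (for example, whether "can comment" also implies download) is an assumption.

```python
from enum import Enum


class SharingType(Enum):
    """Sharing options named in the API description (identifiers are hypothetical)."""
    READ_ONLY = "read_only"
    READ_WRITE = "read_write"
    UPLOAD_ONLY = "upload_only"      # can upload but not download
    COMMENT_ONLY = "comment_only"    # can comment but not update


# Assumed mapping from each sharing type to the actions it permits.
ALLOWED_ACTIONS = {
    SharingType.READ_ONLY: {"download"},
    SharingType.READ_WRITE: {"download", "upload", "update", "comment"},
    SharingType.UPLOAD_ONLY: {"upload"},
    SharingType.COMMENT_ONLY: {"download", "comment"},
}


def is_allowed(sharing_type: SharingType, action: str) -> bool:
    """Return True if the given sharing type permits the action."""
    return action in ALLOWED_ACTIONS[sharing_type]


print(is_allowed(SharingType.UPLOAD_ONLY, "download"))
print(is_allowed(SharingType.READ_WRITE, "update"))
```

Keeping the mapping in one table makes it easy to add further options later without touching the check itself.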

Database design

There are two kinds of data that need to be stored:

File content
1. Blob (object) storage, such as Amazon S3 or Azure Blob Storage
2. because the data is unstructured and simple, eventual consistency is sufficient
File metadata
1. Relational database (RDB)
2. because the data is structured, requires strong consistency, and is small in size

Metadata

File name
File version
File chunks list (chunk IDs in S3)
Location (parent directory ID)
Owner
Creation time
Last edit time
Share option
- groups of users
- access rights (read-only, read/write)
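
The metadata fields above could map onto relational tables roughly as follows. This is a minimal sketch using SQLite in memory; the table names, column names, and the choice to split chunks and shares into their own tables are assumptions, not part of the original design.

```python
import sqlite3

# Hypothetical schema: one row per file, one row per chunk, one row per share grant.
SCHEMA = """
CREATE TABLE files (
    file_id             INTEGER PRIMARY KEY,
    name                TEXT NOT NULL,
    version             INTEGER NOT NULL DEFAULT 1,
    parent_directory_id INTEGER,
    owner_id            INTEGER NOT NULL,
    created_at          TEXT NOT NULL,
    last_edited_at      TEXT NOT NULL
);
CREATE TABLE file_chunks (
    file_id     INTEGER NOT NULL REFERENCES files(file_id),
    chunk_index INTEGER NOT NULL,   -- position of the chunk within the file
    chunk_id    TEXT NOT NULL,      -- key of the chunk in blob storage
    PRIMARY KEY (file_id, chunk_index)
);
CREATE TABLE shares (
    file_id      INTEGER NOT NULL REFERENCES files(file_id),
    group_id     INTEGER NOT NULL,
    access_right TEXT NOT NULL CHECK (access_right IN ('read-only', 'read/write')),
    PRIMARY KEY (file_id, group_id)
);
"""


def create_metadata_db() -> sqlite3.Connection:
    """Create an in-memory metadata database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
    conn.executescript(SCHEMA)
    return conn


def chunk_ids_for(conn: sqlite3.Connection, file_id: int) -> list:
    """Return the blob-storage chunk IDs of a file, in file order."""
    rows = conn.execute(
        "SELECT chunk_id FROM file_chunks WHERE file_id = ? ORDER BY chunk_index",
        (file_id,),
    )
    return [row[0] for row in rows]
```

Storing the chunk list in its own table keeps the ordering explicit and lets a single chunk be replaced without rewriting the whole file row.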

High-level design

Client

Request flows

Upload File:

The client sends a request to upload a file. The request passes the auth and rate-limiter checks of the API gateway and is forwarded to the File Upload service, which writes the file metadata to the RDB.
The File Upload service returns the file_ID, some metadata, and the URL of the Chunk Uploader to the client.
The client splits the file into chunks and sends them to the Chunk Uploader.
The Chunk Uploader saves the chunks to blob storage.
After the Chunk Uploader has saved all the chunks, it marks the file as ready in the metadata DB.
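
The chunking and "mark as ready" steps of the upload flow can be sketched as below. Plain dictionaries stand in for blob storage and the metadata DB, and the 4 MB chunk size, the key format, and the function names are all assumptions made for illustration.

```python
from typing import Dict, Iterator, Tuple

CHUNK_SIZE = 4 * 1024 * 1024  # assumed chunk size; the design does not fix one


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (chunk_index, chunk_bytes) pairs covering the whole file."""
    for offset in range(0, len(data), chunk_size):
        yield offset // chunk_size, data[offset:offset + chunk_size]


def upload_file(file_id: str, data: bytes, blob_store: Dict[str, bytes],
                metadata: Dict[str, dict], chunk_size: int = CHUNK_SIZE) -> None:
    """Store every chunk, then mark the file as ready in the metadata store."""
    metadata[file_id] = {"chunks": [], "status": "uploading"}
    for chunk_index, chunk in split_into_chunks(data, chunk_size):
        chunk_id = f"{file_id}/{chunk_index}"   # hypothetical blob key format
        blob_store[chunk_id] = chunk
        metadata[file_id]["chunks"].append(chunk_id)
    # Only flip the status once all chunks are saved, so readers never see a partial file.
    metadata[file_id]["status"] = "ready"


blob_store: Dict[str, bytes] = {}
metadata: Dict[str, dict] = {}
upload_file("f1", b"abcdefghij", blob_store, metadata, chunk_size=4)
print(metadata["f1"])
```

In a real deployment the client would send each chunk over the network and retry failed chunks individually, which is the main reason for chunking in the first place.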

Download File:

The client sends a request to download a file. If the request hits the cache (for example, at a CDN edge), the address of the file is returned directly.
On a cache miss, the request passes the auth and rate-limiter checks of the API gateway and is forwarded to the File Download service. The File Download service looks up the URLs of the file's chunks in the metadata DB and uses the Chunk Downloader to fetch the chunks and reassemble the file.
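
On the download side, reassembly amounts to reading the chunk list from the metadata store and concatenating the chunks in order. The following is a rough sketch with dictionaries standing in for the metadata DB and blob storage; the record layout and the function name download_file are hypothetical, and caching, auth, and rate limiting are left out.

```python
from typing import Dict


def download_file(file_id: str, blob_store: Dict[str, bytes],
                  metadata: Dict[str, dict]) -> bytes:
    """Look up the chunk list for a file and reassemble its content in order."""
    record = metadata[file_id]
    if record["status"] != "ready":
        # A file whose upload has not finished must not be served.
        raise RuntimeError(f"file {file_id} is not ready for download")
    return b"".join(blob_store[chunk_id] for chunk_id in record["chunks"])


# Example data, as it might look after a completed upload.
blob_store = {"f1/0": b"abcd", "f1/1": b"efgh", "f1/2": b"ij"}
metadata = {"f1": {"chunks": ["f1/0", "f1/1", "f1/2"], "status": "ready"}}
print(download_file("f1", blob_store, metadata))
```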

Detailed component design

Trade offs/Tech choices

Failure scenarios/bottlenecks

Error Check

A network failure during upload can corrupt chunks, and chunks can also become corrupted after they are stored, so it is important to verify that each stored chunk is correct.

Before a chunk is uploaded, the client should compute its checksum and send the checksum to the server along with the chunk itself.
An async job, triggered during file upload, should read the chunks from blob storage, recompute their checksums, and check that they match the ones stored in the metadata DB.
Periodically, another async job should recompute these checksums to make sure the data stays intact. This is particularly important when data moves (e.g. recovered from a backup copy, or moved to a different data center).
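
The checksum comparison performed by these jobs might look like the sketch below. SHA-256 is an assumed choice of checksum (the design does not name one), and the dictionaries and function names are stand-ins for blob storage, the metadata DB, and the verification job.

```python
import hashlib
from typing import Dict, List


def chunk_checksum(chunk: bytes) -> str:
    """Checksum a chunk; SHA-256 is an assumption, any strong hash would do."""
    return hashlib.sha256(chunk).hexdigest()


def find_corrupted_chunks(blob_store: Dict[str, bytes],
                          expected_checksums: Dict[str, str]) -> List[str]:
    """Return IDs of chunks that are missing or whose checksum no longer matches."""
    corrupted = []
    for chunk_id, expected in expected_checksums.items():
        stored = blob_store.get(chunk_id)
        if stored is None or chunk_checksum(stored) != expected:
            corrupted.append(chunk_id)
    return corrupted


# The client computes checksums before upload; the server stores them with the metadata.
blob_store = {"f1/0": b"hello", "f1/1": b"world"}
expected = {chunk_id: chunk_checksum(data) for chunk_id, data in blob_store.items()}

blob_store["f1/1"] = b"w0rld"  # simulate corruption at rest
print(find_corrupted_chunks(blob_store, expected))
```

The same function can serve both the upload-time check and the periodic scrub, since both only compare stored bytes against the recorded checksum.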

Data Replication

It is also important to replicate data for robustness.

Each uploaded file should have one replica in the same data center, one in another data center in the same region, and one in a data center in a different region in case of disaster. All replicas should be stored as chunks with version numbers, so that only the lost or corrupted chunks need to be restored to the production system.

Future improvements

Add file folders for organizing data
Support search
Support preview
Support collaboration