Design Dropbox - System Design

Requirements

Functional Requirements:

Allow users to upload files to the system.
Enable users to download uploaded files.
Ensure synchronization of files between local and server storage.

Non-Functional Requirements:

Durability: once a file is successfully uploaded, it is not lost even if a server crashes
Throughput: per-user upload and download speed should not be significantly lower than the user's network bandwidth
Scalability: the system can be easily scaled to support more users
Security: a user's files should not be accessible to other users

API Design

POST login(user_credential) -> token
- On first login, the server returns a refresh token and a JWT (a short-lived token for accessing the APIs, denoted as "token" for brevity).
- When the JWT expires, the client uses the refresh token to obtain a new one; if the refresh token is no longer valid either, force a re-login
GET view_files(token) -> file_url[]
- Returns all files and directories under the user's root directory
POST upload_file(token, file) -> file_url
- This API initiates the upload process and returns the URL of the uploaded file
- The user should see a progress bar during the upload
GET download_file(token, file_url) -> file
- When the user clicks a file to download, the server initiates the download (GET, since a download does not modify server state)
- Returns code 404 if the file does not exist
POST sync(token, local_file, file_url)
- Synchronizes a local file with its copy in server storage.
- The user must grant the client the permissions needed to access the local file.
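
One way sync could decide what to do for a single file is to compare checksums and fall back to timestamps when they differ. The sketch below is only an illustration under assumptions: SHA-256 as the checksum, a "newer copy wins" conflict policy, and the hypothetical names `sha256_of` and `sync_action`; none of these are fixed by the API above.

```python
import hashlib
from typing import Optional


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest, used here as the file checksum (assumption)."""
    return hashlib.sha256(data).hexdigest()


def sync_action(local: Optional[bytes], local_mtime: float,
                server_checksum: Optional[str], server_mtime: float) -> str:
    """Decide what sync should do for one file: "noop", "upload", or "download"."""
    if local is None and server_checksum is None:
        return "noop"        # file exists on neither side
    if server_checksum is None:
        return "upload"      # file exists only locally
    if local is None:
        return "download"    # file exists only on the server
    if sha256_of(local) == server_checksum:
        return "noop"        # contents already match
    # Contents differ: the newer copy wins (assumed conflict policy).
    return "upload" if local_mtime > server_mtime else "download"
```

Comparing checksums first avoids transferring a file whose content has not changed, even if its timestamp has.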

User flow: after logging in, the user can view their existing files on Dropbox (/view_files), then upload a file via /upload_file or download a file via /download_file.

High-Level Design

Main components:

Client: web UI
Nginx: reverse proxy + load balancer
Server: main business logic (write to DB, write to S3)
S3 storage: store actual file content
CDN: cache frequently-requested files
- We cache small-to-medium files (e.g. documents) because they are the most frequently requested (a user is unlikely to download multi-GB files multiple times a day)
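
The caching rule above could be expressed as a simple predicate. This is a sketch only: the 100 MB cutoff, the minimum download count, and the function name `should_cache_on_cdn` are assumptions made for illustration, not values from the design.

```python
CDN_MAX_BYTES = 100 * 1024 * 1024  # assumed size cutoff for "small-to-medium" files


def should_cache_on_cdn(size_bytes: int, downloads_last_day: int,
                        min_downloads: int = 3) -> bool:
    """Cache a file on the CDN only if it is small enough and requested often."""
    return size_bytes <= CDN_MAX_BYTES and downloads_last_day >= min_downloads
```

A 2 MB PDF opened five times a day would qualify; a 4 GB archive would not, regardless of how often it is requested.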

Capacity estimation:

Assume 100M users, each allowed up to 5TB of storage but using 50GB on average
- 50GB * 100M users = 5,000 PB (5 EB)
- => Use S3 for this scale
Each user uploads 1GB per day but downloads 10GB per day => read-heavy workload (10:1)
- For example, a user uploads a PDF file from a computer and views it multiple times on a phone
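We also need to store the list of files (URLs) that each user has => Assume each user has 500 files, each URL is a VARCHAR(225) => each record is about 300 bytes => total 100M users * 500 files * 300 bytes = 15TB.
- This is likely more than a single unsharded MySQL instance can handle comfortably, so plan to shard the file table by user_id (all files of one user then live on the same shard)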
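
The estimates can be recomputed directly from the stated assumptions. The sketch below uses decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes); the constant names are chosen here for illustration.

```python
USERS = 100_000_000                     # 100M users
AVG_STORAGE_PER_USER = 50 * 10**9       # 50 GB average per user
UPLOAD_PER_USER_PER_DAY = 1 * 10**9     # 1 GB uploaded per user per day
DOWNLOAD_PER_USER_PER_DAY = 10 * 10**9  # 10 GB downloaded per user per day
FILES_PER_USER = 500
BYTES_PER_FILE_RECORD = 300

# Total file storage, in PB.
total_storage_pb = USERS * AVG_STORAGE_PER_USER // 10**15                # 5000

# Size of the file-list metadata, in TB.
metadata_tb = USERS * FILES_PER_USER * BYTES_PER_FILE_RECORD // 10**12   # 15

# Read-to-write ratio per user.
read_write_ratio = DOWNLOAD_PER_USER_PER_DAY // UPLOAD_PER_USER_PER_DAY  # 10

# Aggregate upload volume per day, in PB.
daily_upload_pb = USERS * UPLOAD_PER_USER_PER_DAY // 10**15              # 100
```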

Table schema:

file table: file_id, user_id, url, cdn_url, check_sum, last_updated
- Index user_id for fast lookup of all files belonging to the same user
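
A minimal sketch of this table and its index follows. It uses Python's built-in `sqlite3` with an in-memory database only so the example is self-contained; the production choice in this design is MySQL. The column types, the index name `idx_file_user_id`, and the helper `list_files` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file (
    file_id      INTEGER PRIMARY KEY,
    user_id      INTEGER NOT NULL,
    url          VARCHAR(225) NOT NULL,
    cdn_url      VARCHAR(225),            -- NULL when the file is not on the CDN
    check_sum    CHAR(64) NOT NULL,       -- assumed: hex SHA-256 of the content
    last_updated TIMESTAMP NOT NULL
);
-- Secondary index so that "all files of one user" does not scan the table.
CREATE INDEX idx_file_user_id ON file (user_id);
""")


def list_files(db: sqlite3.Connection, user_id: int) -> list:
    """Return the URLs of all files owned by user_id, in insertion order."""
    rows = db.execute(
        "SELECT url FROM file WHERE user_id = ? ORDER BY file_id", (user_id,)
    )
    return [row[0] for row in rows]
```

This is the query that view_files would run for the logged-in user.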

Detailed Component Design

Load balancing & Scalability & Availability:

Upload, download, and synchronization requests are long-running and I/O-heavy (network transfer and writes to S3, plus some CPU for checksums), so the cost of a request varies widely => we need to distribute the load evenly => load balancing based on server load rather than plain round-robin
Each server should respond to regular health checks and report its load to the load balancer (Nginx), so that Nginx can route the next request to the healthy server with the lowest load
Set up auto-scaling for the servers to handle more load
Set up multiple instances of the load balancer and server across different data centers to ensure availability
S3 offers virtually unlimited storage and is designed for 99.999999999% durability, so we don't need to build storage capacity or durability ourselves
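
The load-based routing described above can be sketched as "pick the healthy server with the lowest reported load". This is a conceptual illustration, not how Nginx is implemented; the `Backend` type and `pick_backend` function are hypothetical. In Nginx itself, the closest built-in policy is the `least_conn` directive, which uses the number of active connections as the load signal.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Backend:
    name: str
    load: float           # e.g. fraction of capacity in use, as reported by the server
    healthy: bool = True  # result of the most recent health check


def pick_backend(backends: Sequence[Backend]) -> Optional[Backend]:
    """Return the healthy backend with the lowest load, or None if none is healthy."""
    candidates = [b for b in backends if b.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda b: b.load)
```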

Security:

Use pre-signed URLs for S3 / CDN objects and return them only to authenticated users. Because a pre-signed URL has an expiry time, a stolen URL cannot be used for long.
We let S3 / CDN handle the case when a malicious user tries to guess the URL of a random object: objects are kept private, so any request without a valid signature is rejected.
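
To illustrate why an expiring signed URL is hard to guess or reuse, here is a minimal sketch of the idea using an HMAC over the path and expiry time. It is not the signing scheme S3 or any CDN actually uses; `SECRET_KEY`, `sign_url`, `verify_url`, and the query parameter names are all hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"demo-secret"  # hypothetical; known only to the signing service


def _signature(path: str, expires_at: int) -> str:
    """HMAC-SHA256 over the path and expiry, so neither can be altered."""
    message = f"{path}\n{expires_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def sign_url(path: str, expires_at: int) -> str:
    """Return path plus an expiry timestamp and a signature covering both."""
    query = urlencode({"expires": expires_at, "sig": _signature(path, expires_at)})
    return f"{path}?{query}"


def verify_url(url: str, now: Optional[float] = None) -> bool:
    """Accept the URL only if it is unexpired and its signature matches."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires_at = int(query["expires"][0])
        sig = query["sig"][0]
    except (KeyError, ValueError):
        return False  # missing or malformed parameters
    if (time.time() if now is None else now) > expires_at:
        return False  # URL has expired
    return hmac.compare_digest(sig, _signature(parsed.path, expires_at))
```

Without the secret key, an attacker cannot produce a valid signature for another object's path, and changing the expiry invalidates the existing signature.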