Requirements
Functional Requirements:
- Allow users to upload files to the system.
- Enable users to download uploaded files.
- Ensure synchronization of files between local and server storage.
Non-Functional Requirements:
- Durability: once a file is successfully uploaded, it is not lost even if a server crashes
- Throughput: per-user transfer throughput should not be significantly lower than the user's network bandwidth
- Scalability: the system can be easily scaled to support more users
- Security: a user's files should not be accessible to other users
API Design
- POST login(user_credential) -> token
  - On first login, the server returns a refresh token and a JWT (a short-lived token for accessing the APIs, denoted as "token" for brevity).
  - When the JWT expires, the client uses the refresh token to obtain a new one; if that fails, the user must log in again
- GET view_files(token) -> file_url[]
  - Returns all files and directories under the user's root directory
- POST upload_file(token, file) -> file_url
  - Initiates the upload and returns the URL of the uploaded file
  - The user should see a progress bar during the upload
- GET download_file(token, file_url)
  - When the user clicks a file to download, the server initiates the download (GET rather than POST, since the request does not modify server state)
  - Returns 404 if the file does not exist
- POST sync(token, local_file, file_url)
  - Synchronizes a local file with its copy in server storage.
  - The user must grant the client the permissions needed to access the local file.
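The sync API above does not say how the two copies are compared or which one wins. One minimal sketch, assuming the server keeps a checksum and a last-updated timestamp per file and that the most recently modified copy wins (both are assumptions, and `FileState`, `SyncAction`, and `decide_sync` are hypothetical names):

```python
import hashlib
from dataclasses import dataclass
from enum import Enum


class SyncAction(Enum):
    NONE = "none"          # contents already match
    UPLOAD = "upload"      # local copy is newer, push it to the server
    DOWNLOAD = "download"  # server copy is newer, pull it to the client


@dataclass
class FileState:
    check_sum: str       # hex digest of the file content
    last_updated: float  # Unix timestamp of the last modification


def sha256_hex(content: bytes) -> str:
    """Checksum used to detect whether two copies differ."""
    return hashlib.sha256(content).hexdigest()


def decide_sync(local: FileState, remote: FileState) -> SyncAction:
    """Compare checksums first; fall back to timestamps only on mismatch."""
    if local.check_sum == remote.check_sum:
        return SyncAction.NONE
    # Assumption: last writer wins. A real client may need conflict handling.
    if local.last_updated > remote.last_updated:
        return SyncAction.UPLOAD
    return SyncAction.DOWNLOAD
```

Comparing checksums before timestamps avoids transferring a file whose modification time changed but whose content did not.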
User flow: after logging in, the user can view their existing files on Dropbox (/view_files), then upload a file via /upload_file or download a file via /download_file.
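The user flow can be walked through with a toy in-memory model of the API. This is only a sketch: there is no real authentication, HTTP layer, or storage, the `FileService` class and the `name` parameter are hypothetical, and the status codes other than 404 are assumptions.

```python
import secrets
from typing import Dict, List, Optional, Tuple


class FileService:
    """Toy in-memory model of login / view_files / upload_file / download_file."""

    def __init__(self) -> None:
        self._sessions: Dict[str, str] = {}             # token -> user_id
        self._files: Dict[str, Dict[str, bytes]] = {}   # user_id -> {file_url: content}

    def login(self, user_id: str) -> str:
        # Stand-in for credential checking and JWT issuance.
        token = secrets.token_hex(16)
        self._sessions[token] = user_id
        return token

    def view_files(self, token: str) -> Tuple[int, List[str]]:
        user = self._sessions.get(token)
        if user is None:
            return 401, []
        return 200, sorted(self._files.get(user, {}))

    def upload_file(self, token: str, name: str, content: bytes) -> Tuple[int, Optional[str]]:
        user = self._sessions.get(token)
        if user is None:
            return 401, None
        file_url = f"/{user}/{name}"
        self._files.setdefault(user, {})[file_url] = content
        return 201, file_url

    def download_file(self, token: str, file_url: str) -> Tuple[int, Optional[bytes]]:
        user = self._sessions.get(token)
        if user is None:
            return 401, None
        # Only the caller's own files are searched, so another user's URL
        # looks the same as a missing file.
        content = self._files.get(user, {}).get(file_url)
        if content is None:
            return 404, None
        return 200, content
```

Looking up files only under the caller's own user ID is one simple way to meet the security requirement: a valid token for one user never resolves another user's file.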
High-Level Design
Main components:
- Client: web UI
- Nginx: reverse proxy + load balancer
- Server: main business logic (write to DB, write to S3)
- S3 storage: store actual file content
- CDN: cache frequently requested files
  - Cache only small-to-medium files (e.g. documents), since they are requested most often; a user is unlikely to download a multi-GB file several times a day
Capacity estimation:
- Assume 100M users, each allowed up to 5TB of storage but averaging 50GB
  - 50GB * 100M users = 5,000 PB
  - => Use S3 for this scale
- Each user uploads 1GB per day but downloads 10GB per day => read-heavy workload
  - For example, a user uploads a PDF from a computer and then views it multiple times on a phone
- We also need to store the list of files (URLs) for each user => Assume each user has 500 files, each URL is a VARCHAR(225) => each record is about 300 bytes => total 100M users * 500 files * 300 bytes = 15TB.
  - This is too large to handle comfortably on a single MySQL instance, so the file table should be sharded, e.g. by user_id
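The arithmetic behind these estimates can be checked with a short script. It uses only the assumptions stated in the list (the constant names are my own); note that the metadata total counts one record per file, so it multiplies users by files per user by bytes per record.

```python
# Assumptions taken from the capacity estimation list.
USERS = 100_000_000
AVG_STORAGE_GB = 50
FILES_PER_USER = 500
BYTES_PER_RECORD = 300
UPLOAD_GB_PER_DAY = 1
DOWNLOAD_GB_PER_DAY = 10

# Total file storage: GB -> PB (1 PB = 1,000,000 GB in decimal units).
total_storage_pb = USERS * AVG_STORAGE_GB / 1_000_000

# Metadata: one record per file, converted from bytes to TB.
metadata_tb = USERS * FILES_PER_USER * BYTES_PER_RECORD / 1e12

# Read/write ratio per user.
read_write_ratio = DOWNLOAD_GB_PER_DAY / UPLOAD_GB_PER_DAY

print(f"file storage: {total_storage_pb:,.0f} PB")   # file storage: 5,000 PB
print(f"metadata:     {metadata_tb:,.0f} TB")        # metadata:     15 TB
print(f"read:write =  {read_write_ratio:.0f}:1")     # read:write =  10:1
```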
Table schema:
- file table: file_id, user_id, url, cdn_url, check_sum, last_updated
  - Index on user_id for fast lookup of all files belonging to the same user
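As a rough illustration of this schema, the sketch below builds the table in an in-memory SQLite database (standing in for MySQL) and confirms that a per-user lookup goes through the index. The column types, the index name `idx_file_user_id`, and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file (
    file_id      INTEGER PRIMARY KEY,
    user_id      INTEGER NOT NULL,
    url          TEXT NOT NULL,
    cdn_url      TEXT,            -- NULL when the file is not cached on the CDN
    check_sum    TEXT NOT NULL,
    last_updated TEXT NOT NULL
);
CREATE INDEX idx_file_user_id ON file (user_id);
""")

# Made-up sample rows: two files for user 42, one for user 7.
rows = [
    (1, 42, "/42/a.pdf", None, "ab" * 32, "2024-01-01T00:00:00"),
    (2, 42, "/42/b.txt", None, "cd" * 32, "2024-01-02T00:00:00"),
    (3, 7, "/7/c.png", None, "ef" * 32, "2024-01-03T00:00:00"),
]
conn.executemany("INSERT INTO file VALUES (?, ?, ?, ?, ?, ?)", rows)

# All files of one user, served through the user_id index.
urls = [r[0] for r in conn.execute(
    "SELECT url FROM file WHERE user_id = ? ORDER BY url", (42,))]

# Each plan row ends with a human-readable description of the access path.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT url FROM file WHERE user_id = ?", (42,)).fetchall()
```

Here `urls` holds only the two entries for user 42, and the query plan text names the index rather than a full table scan.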
Detailed Component Design
Load balancing & Scalability & Availability:
- Upload, download, and sync requests move file data through the server (e.g. `write` and `fsync` calls when data is buffered on disk). This work is mostly I/O-bound rather than CPU-intensive, and its cost varies widely with file size => we need to distribute the load evenly => load balancing based on server load rather than request count
- Each server regularly reports its health and load to the load balancer (Nginx), which routes the next request to the healthy server with the lowest load
- Set up auto-scaling for the servers to handle more load
- Run multiple instances of the load balancer and servers across different data centers to ensure availability
- S3 provides virtually unlimited storage with strong durability guarantees, so storage capacity and durability need no extra design work on our side
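The load-based routing described above can be sketched as a small selection routine. This is not Nginx configuration: it is a hypothetical `LeastLoadBalancer` class that assumes each server reports a load figure with a heartbeat and that a server is treated as unhealthy once its heartbeat is older than a timeout.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ServerStatus:
    load: float            # e.g. fraction of capacity in use, 0.0 to 1.0
    last_heartbeat: float  # timestamp of the most recent report


class LeastLoadBalancer:
    def __init__(self, heartbeat_timeout: float = 5.0) -> None:
        self.heartbeat_timeout = heartbeat_timeout
        self.servers: Dict[str, ServerStatus] = {}

    def report(self, server_id: str, load: float, now: float) -> None:
        """Called by a server on every health check."""
        self.servers[server_id] = ServerStatus(load, now)

    def pick(self, now: float) -> Optional[str]:
        """Return the healthy server with the lowest load, or None if none is healthy."""
        healthy = {
            sid: status
            for sid, status in self.servers.items()
            if now - status.last_heartbeat <= self.heartbeat_timeout
        }
        if not healthy:
            return None
        return min(healthy, key=lambda sid: healthy[sid].load)
```

In practice a simpler approximation is often enough: open-source Nginx offers the `least_conn` balancing method, which routes to the upstream with the fewest active connections instead of relying on self-reported load.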
Security:
- Use pre-signed URLs for S3 / CDN objects and return them only to authenticated users. Because a pre-signed URL has an expiry time, a stolen URL cannot be used for long.
- We rely on S3 / CDN to handle a malicious user who tries to guess the URLs of random objects (they should block such a user)
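To show why an expiring signed URL limits the damage from a leak and resists guessing, here is a simplified HMAC-based sketch. It is not the actual S3 signing scheme, and the host name, secret key, query parameter names, and function names are all hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"demo-signing-key"  # hypothetical; known only to the server


def _sign(path: str, expires: int) -> str:
    """Signature covers both the object path and the expiry time."""
    message = f"{path}\n{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def presign(path: str, ttl_seconds: int, now: int) -> str:
    """Build a URL that is valid until now + ttl_seconds."""
    expires = now + ttl_seconds
    query = urlencode({"expires": expires, "signature": _sign(path, expires)})
    return f"https://files.example.com{path}?{query}"


def is_valid(url: str, now: int) -> bool:
    """Reject expired URLs and URLs whose path or expiry was altered."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    if now >= expires:
        return False
    expected = _sign(parsed.path, expires)
    return hmac.compare_digest(signature.encode(), expected.encode())
```

Because the signature depends on the path, a URL issued for one object cannot be edited to fetch another, and guessing a valid signature without the key is impractical.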