Requirements


Functional Requirements:


  • Allow users to upload files to the system.
  • Enable users to download uploaded files.
  • Ensure synchronization of files between local and server storage.



Non-Functional Requirements:


  • Durability: once a file is successfully uploaded, it is not lost even if a server crashes
  • Throughput: per-user transfer speed should not be significantly lower than the user's network bandwidth
  • Scalability: the system can be easily scaled to support more users
  • Security: a user's files should not be accessible to other users


API Design

  • POST login(user_credential) -> token
    • On first login, the server returns a refresh token and a JWT (a short-lived token for accessing the APIs, denoted as "token" for brevity).
    • When the JWT expires, use the refresh token to obtain a new one, or force a re-login.
  • GET view_files(token) -> file_url[]
    • Returns all files and directories under the user's root directory
  • POST upload_file(token, file) -> file_url
    • This API initiates the upload process and returns the URL of the file
    • The user should see a progress bar during the upload
  • POST download_file(token, file_url)
    • When the user clicks a file to download, the server initiates the download
    • Returns 404 if the file does not exist
  • POST sync(token, local_file, file_url)
    • Synchronizes a local file with its copy in server storage.
    • The user must grant the client the permissions it needs to access the local file.
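
The login flow above relies on a short-lived signed token. Below is a minimal sketch of issuing and checking such a token with only the Python standard library. The signing secret, the 15-minute lifetime, and the helper names `issue_token` and `verify_token` are assumptions for illustration; a real deployment would more likely use an established JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"demo-secret"      # assumption: server-side signing key
TOKEN_TTL_SECONDS = 15 * 60  # assumption: lifetime of the short-lived token


def _b64(data: bytes) -> str:
    """URL-safe base64 without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    """Decode URL-safe base64, restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _signature(header: str, payload: str) -> str:
    signing_input = f"{header}.{payload}".encode()
    return _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())


def issue_token(user_id: str, now: Optional[float] = None) -> str:
    """Return a JWT-style HS256 token: header.payload.signature."""
    now = time.time() if now is None else now
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": user_id, "exp": int(now) + TOKEN_TTL_SECONDS}
    payload = _b64(json.dumps(claims).encode())
    return f"{header}.{payload}.{_signature(header, payload)}"


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the user id if the token is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        header, payload, signature = token.split(".")
    except ValueError:
        return None  # not in header.payload.signature form
    expected = _signature(header, payload)
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return None  # signature mismatch: token was forged or altered
    claims = json.loads(_unb64(payload))
    if claims["exp"] <= now:
        return None  # expired: fall back to the refresh token or re-login
    return claims["sub"]
```

Every API call would run `verify_token` first and reject the request when it returns `None`.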


User flow: after logging in, the user can view their existing files on Dropbox (/view_files), then upload a file via /upload_file or download one via /download_file.
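
One way the sync call could decide which direction to transfer a file is to compare a checksum of the local copy with the checksum recorded for the server copy, and fall back to modification times when they differ. The sketch below assumes SHA-256 checksums; `file_checksum`, `sync_action`, and the returned action names are hypothetical, and the "newest copy wins" rule is one simple policy among several.

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def sync_action(
    local_checksum: Optional[str],
    local_mtime: float,
    remote_checksum: Optional[str],
    remote_mtime: float,
) -> str:
    """Decide what sync should do; None means that side has no copy."""
    if local_checksum is None and remote_checksum is None:
        return "noop"
    if remote_checksum is None:
        return "upload"
    if local_checksum is None:
        return "download"
    if local_checksum == remote_checksum:
        return "noop"  # identical content, nothing to transfer
    # Contents differ: assume the more recently modified copy wins.
    return "upload" if local_mtime > remote_mtime else "download"
```

Comparing checksums first avoids transferring a file whose timestamp changed but whose content did not.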


High-Level Design



Main components:

  • Client: web UI
  • Nginx: reverse proxy + load balancer
  • Server: main business logic (write to DB, write to S3)
  • S3 storage: store actual file content
  • CDN: cache frequently-requested files
    • We cache small-to-medium files (e.g. documents) because they are the most frequently requested (a user is unlikely to download multi-GB files multiple times a day)


Capacity estimation:

  • Assume each user may store up to 5TB, with an average of 50GB per user across 100M users
    • 50GB * 100M users = 5,000 PB
    • => Use S3 for this scale
  • Each user uploads 1GB per day but downloads 10GB per day => read-heavy workload
    • For example, a user uploads a PDF from a computer and then views it multiple times on a phone
  • We also need to store the list of files (URLs) for each user => assume each user has 500 files and each URL is a VARCHAR(225) => each record is about 300 bytes => 100M users * 500 files * 300 bytes = 15TB in total.
    • At this size a single unsharded MySQL instance is likely to struggle, so plan to shard the file table by user_id
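
The storage and traffic figures above can be checked with a few lines of arithmetic. This sketch uses decimal units (1 GB = 10^9 bytes) and the per-user assumptions from the list. The per-second rates are derived here and assume traffic is spread evenly over the day, which real traffic is not.

```python
GB = 10**9
TB = 10**12
PB = 10**15
SECONDS_PER_DAY = 86_400

USERS = 100_000_000
AVG_STORAGE_PER_USER = 50 * GB
UPLOAD_PER_USER_PER_DAY = 1 * GB
DOWNLOAD_PER_USER_PER_DAY = 10 * GB

# Total stored data across all users.
total_storage_pb = USERS * AVG_STORAGE_PER_USER / PB

# Daily traffic in each direction.
upload_pb_per_day = USERS * UPLOAD_PER_USER_PER_DAY / PB
download_pb_per_day = USERS * DOWNLOAD_PER_USER_PER_DAY / PB
read_write_ratio = download_pb_per_day / upload_pb_per_day

# Average rates if the traffic were perfectly even over 24 hours.
avg_upload_tb_per_s = USERS * UPLOAD_PER_USER_PER_DAY / SECONDS_PER_DAY / TB
avg_download_tb_per_s = USERS * DOWNLOAD_PER_USER_PER_DAY / SECONDS_PER_DAY / TB

print(f"total storage : {total_storage_pb:,.0f} PB")
print(f"upload        : {upload_pb_per_day:,.0f} PB/day (~{avg_upload_tb_per_s:.2f} TB/s)")
print(f"download      : {download_pb_per_day:,.0f} PB/day (~{avg_download_tb_per_s:.2f} TB/s)")
print(f"read:write    : {read_write_ratio:.0f}:1")
```

The 10:1 read-to-write ratio is what motivates putting a CDN in front of the object store.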


Table schema:

  • file table: file_id, user_id, url, cdn_url, check_sum, last_updated
    • Index on user_id for fast lookup of all files belonging to the same user
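
A minimal sketch of the file table and its `user_id` index, using Python's built-in `sqlite3` module so that it runs without a database server. The column types, the index name `idx_file_user_id`, and the helper `list_files` are assumptions; the design itself targets MySQL, where the DDL would differ slightly.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE file (
        file_id      INTEGER PRIMARY KEY,
        user_id      INTEGER NOT NULL,
        url          TEXT    NOT NULL,
        cdn_url      TEXT,
        check_sum    TEXT    NOT NULL,
        last_updated INTEGER NOT NULL   -- Unix timestamp (assumption)
    );
    -- Secondary index so that listing one user's files avoids a full scan.
    CREATE INDEX idx_file_user_id ON file (user_id);
    """
)


def list_files(db: sqlite3.Connection, user_id: int) -> List[str]:
    """Return the URLs of all files owned by one user, oldest first."""
    rows = db.execute(
        "SELECT url FROM file WHERE user_id = ? ORDER BY file_id",
        (user_id,),
    )
    return [row[0] for row in rows]


# Example data: two files for user 1, one file for user 2.
conn.executemany(
    "INSERT INTO file (user_id, url, cdn_url, check_sum, last_updated) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, "s3://bucket/u1/a.pdf", None, "abc", 1700000000),
        (1, "s3://bucket/u1/b.png", None, "def", 1700000100),
        (2, "s3://bucket/u2/c.txt", None, "ghi", 1700000200),
    ],
)
print(list_files(conn, 1))
```

Filtering by `user_id` in every query also keeps one user's rows from leaking into another user's listing.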



Detailed Component Design



Load balancing & Scalability & Availability:

  • Servers call `write` when handling file uploads and downloads and `fsync` when persisting synchronized data; both are expensive, mostly I/O-bound operations => we need to distribute the load evenly => load balance based on server load
  • Servers should report their health and load to the load balancer (Nginx) at regular intervals so that Nginx can route the next request to the server with the lowest load
  • Set up auto-scaling for the servers to handle more load
  • Run multiple instances of the load balancer and the servers across different data centers to ensure availability
  • S3 offers virtually unlimited storage with a durability guarantee, so we don't need to handle storage capacity or durability ourselves
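
The load-based routing described above can be sketched as a small in-memory balancer: servers report their load through periodic health checks, and each request goes to the healthy server with the lowest reported load. The class name `LeastLoadBalancer`, the 10-second staleness window, and the 0-to-1 load metric are assumptions for illustration; this is not how Nginx itself is configured.

```python
import time
from typing import Dict, Optional, Tuple


class LeastLoadBalancer:
    """Route each request to the healthy server with the lowest load."""

    def __init__(self, stale_after: float = 10.0) -> None:
        # A server whose last report is older than this is treated as down.
        self.stale_after = stale_after
        # server name -> (reported load, time of report)
        self._reports: Dict[str, Tuple[float, float]] = {}

    def report_health(
        self, server: str, load: float, now: Optional[float] = None
    ) -> None:
        """Record a health check carrying the server's current load."""
        now = time.monotonic() if now is None else now
        self._reports[server] = (load, now)

    def pick_server(self, now: Optional[float] = None) -> Optional[str]:
        """Return the least-loaded healthy server, or None if none is healthy."""
        now = time.monotonic() if now is None else now
        healthy = {
            server: load
            for server, (load, seen) in self._reports.items()
            if now - seen <= self.stale_after
        }
        if not healthy:
            return None
        return min(healthy, key=healthy.get)


balancer = LeastLoadBalancer()
balancer.report_health("server-a", load=0.7, now=0.0)
balancer.report_health("server-b", load=0.2, now=0.0)
print(balancer.pick_server(now=1.0))  # server-b has the lower reported load
```

Treating silent servers as unhealthy means a crashed server drops out of rotation without any extra signal.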


Security:

  • Use pre-signed URLs for S3 / CDN objects and return them only to authenticated users. Because a pre-signed URL has an expiry time, a stolen URL cannot be used for long.
  • We rely on S3 / CDN to handle malicious users who try to guess the URLs of random objects (they should block such users)
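
The idea behind an expiring pre-signed URL can be illustrated with an HMAC over the object path and an expiry timestamp. This is a simplified stand-in, not the actual S3 signing algorithm; the signing key, the `files.example.com` host, and the helpers `presign` and `is_valid` are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit

SECRET = b"demo-signing-key"  # assumption: known only to the storage service


def _sign(path: str, expires: int) -> str:
    """HMAC-SHA256 over the object path and its expiry timestamp."""
    message = f"{path}:{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def presign(path: str, now: float, ttl: int = 300) -> str:
    """Return a URL for `path` that stops working `ttl` seconds after `now`."""
    expires = int(now) + ttl
    query = urlencode({"expires": expires, "signature": _sign(path, expires)})
    return f"https://files.example.com{path}?{query}"


def is_valid(url: str, now: float) -> bool:
    """Check that the URL is unexpired and its signature matches its path."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False  # missing or malformed query parameters
    if now >= expires:
        return False  # a stolen URL is useless after expiry
    expected = _sign(parts.path, expires)
    return hmac.compare_digest(expected.encode(), signature.encode())


url = presign("/u42/report.pdf", now=1000)
print(is_valid(url, now=1001))   # within the 300-second window
print(is_valid(url, now=1300))   # at or past the expiry time
```

Because the path is part of the signed message, changing the URL to point at another user's object invalidates the signature.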