Requirements


Functional Requirements:


  • Allow users to upload files to the system.
  • Enable users to download uploaded files.
  • Ensure synchronization of files between local and server storage.
  • Support multiple clients per user (mobile, desktop 1, desktop 2).




Non-Functional Requirements:


  • Highly scalable
  • Synchronization between different clients should be fast


Capacity Estimation

  • We have 1 billion users
  • Each user uploads 1000 files per year
  • The average file size is around 10 MB
  • This results in roughly 10 EB of new data per year
  • The metadata is 1 KB per file
  • This means our metadata DB needs around 1 PB of storage per year
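
The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the 1000 files are per user per year and uses decimal units (1 MB = 10^6 bytes); the variable names are only illustrative.

```python
# Back-of-the-envelope capacity check using the figures listed above.
USERS = 1_000_000_000              # 1 billion users
FILES_PER_USER_PER_YEAR = 1_000    # assumed to be per user, per year
AVG_FILE_BYTES = 10 * 10**6        # 10 MB in decimal units
METADATA_BYTES = 10**3             # 1 KB of metadata per file
SECONDS_PER_YEAR = 365 * 24 * 3600

files_per_year = USERS * FILES_PER_USER_PER_YEAR
blob_bytes_per_year = files_per_year * AVG_FILE_BYTES
metadata_bytes_per_year = files_per_year * METADATA_BYTES
avg_uploads_per_second = files_per_year / SECONDS_PER_YEAR

print(f"files per year:        {files_per_year:.1e}")
print(f"blob storage per year: {blob_bytes_per_year / 10**18:.0f} EB")
print(f"metadata per year:     {metadata_bytes_per_year / 10**15:.0f} PB")
print(f"average uploads/s:     {avg_uploads_per_second:,.0f}")
```

Under these assumptions the average upload rate comes out at a little over 30 thousand files per second.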



API Design

  • Upload_file (file name) -> success with file link
  • Download_file (file link) -> receive file
  • Synchronization (list of file IDs, each with its last-updated timestamp) -> list of files to update
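
One possible way to write these three calls down as typed signatures is shown below. It is only a sketch: `FileApi`, `FileRef`, and the exact parameter and return types are assumptions made for illustration, not something the design fixes.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class FileRef:
    """What a client reports about one file when it synchronizes."""
    file_id: str
    last_updated: float  # Unix timestamp in seconds (assumed representation)


class FileApi(Protocol):
    """Hypothetical interface mirroring the three API calls listed above."""

    def upload_file(self, file_name: str) -> str:
        """Register a new file and return its file link on success."""
        ...

    def download_file(self, file_link: str) -> bytes:
        """Return the file contents for a file link."""
        ...

    def synchronize(self, known_files: List[FileRef]) -> List[FileRef]:
        """Return the files that the calling client needs to update."""
        ...
```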



High-Level Design

We have the following entities in our design


  1. Users (list of clients, list of files that the user owns)
  2. Files (individual file information): the file ID, a blob ID pointing to the stored blob, and metadata such as last-updated time, creation date, and privacy settings
  3. Links (URL link, file ID)
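
As a rough illustration, the three entities could be modeled as plain records. The field names and types below are assumptions chosen to match the descriptions above; in particular, `is_private` is a simplified stand-in for "privacy settings".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class File:
    file_id: str
    blob_id: str             # points at the object in blob storage
    created_at: float        # Unix timestamp, seconds
    last_updated: float      # Unix timestamp, seconds
    is_private: bool = True  # simplified stand-in for privacy settings


@dataclass
class Link:
    url: str
    file_id: str


@dataclass
class User:
    user_id: str
    client_ids: List[str] = field(default_factory=list)  # e.g. mobile, desktop 1
    file_ids: List[str] = field(default_factory=list)    # files the user owns


user = User("u1", client_ids=["mobile", "desktop-1"])
doc = File("f1", blob_id="blob-123", created_at=1.0, last_updated=2.0)
user.file_ids.append(doc.file_id)
share = Link(url="https://files.example.com/abc", file_id=doc.file_id)
```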



Sequence of steps

Upload a file

  1. The client calls the Upload_file API
  2. HTTP request goes to the API Gateway
  3. The API gateway routes the request to the file service
  4. The file service creates a File in our Database
  5. The file service then returns a blob storage link, which the client uses to upload the file directly to the blob DB
  6. After the blob storage upload completes, the client passes the finalized blob ID to the file service
  7. The file service marks the File in our database as complete and generates a link for the file
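
The two-phase upload in steps 4 to 7 can be sketched with an in-memory stand-in for the file service and its database. The class, method names, status values, and example.com URLs are all hypothetical; a real service would hand out a link issued by the blob store itself.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class FileRecord:
    file_id: str
    name: str
    status: str = "pending"       # stays "pending" until the blob upload is confirmed
    blob_id: Optional[str] = None
    link: Optional[str] = None


class FileService:
    """In-memory stand-in for the file service and its database."""

    def __init__(self) -> None:
        self.db: Dict[str, FileRecord] = {}

    def start_upload(self, name: str) -> Tuple[str, str]:
        # Steps 4-5: create the File record and hand back a blob upload link.
        file_id = uuid.uuid4().hex
        self.db[file_id] = FileRecord(file_id, name)
        upload_url = f"https://blob.example.com/upload/{file_id}"  # hypothetical URL
        return file_id, upload_url

    def finalize_upload(self, file_id: str, blob_id: str) -> str:
        # Steps 6-7: store the blob ID, mark the file complete, generate its link.
        record = self.db[file_id]
        record.blob_id = blob_id
        record.status = "complete"
        record.link = f"https://files.example.com/f/{file_id}"  # hypothetical URL
        return record.link


service = FileService()
file_id, upload_url = service.start_upload("report.pdf")
# ... the client uploads the bytes directly to upload_url and gets a blob ID back ...
link = service.finalize_upload(file_id, blob_id="blob-123")
print(service.db[file_id].status, link)
```

Keeping the record in a "pending" state until the client confirms the blob upload means a failed or abandoned upload never produces a shareable link.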


Synchronization

  1. Synchronization is triggered as part of the file upload flow
  2. After step 6 of the upload flow is complete, we can start preparing to synchronize the user's other devices
  3. Clients automatically poll the server every 30 to 60 seconds. With each poll they submit their list of files and each file's last-updated timestamp
  4. If any file is out of date or missing on the client, the file service looks up the database and returns those files and their blob links
  5. The client then fetches each of those files from the blob DB and updates its local copy
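
The comparison the file service performs on each poll can be sketched as follows. This assumes the client reports a mapping of file ID to last-updated timestamp and that the server keeps a timestamp and blob link per file; `ServerFile` and `files_to_sync` are illustrative names.

```python
from typing import Dict, List, NamedTuple


class ServerFile(NamedTuple):
    last_updated: float
    blob_link: str


def files_to_sync(client_files: Dict[str, float],
                  server_files: Dict[str, ServerFile]) -> List[str]:
    """Return blob links for files the client is missing or holds a stale copy of."""
    links = []
    for file_id, server in server_files.items():
        client_ts = client_files.get(file_id)
        if client_ts is None or client_ts < server.last_updated:
            links.append(server.blob_link)
    return links


server = {
    "a": ServerFile(100.0, "blob://a"),
    "b": ServerFile(250.0, "blob://b"),
    "c": ServerFile(300.0, "blob://c"),
}
client = {"a": 100.0, "b": 200.0}     # "b" is stale, "c" is missing
print(files_to_sync(client, server))  # ['blob://b', 'blob://c']
```

Note that the work is proportional to the number of files the user has, even when nothing changed.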


Download a file

  1. Client requests to download a file
  2. HTTP request goes to API gateway
  3. The API gateway routes the request to the file service, which looks up the File in our database
  4. The file service returns the blob link to download the file
  5. The client downloads the file directly from the blob DB





Database Design

Because the files are so large, we abstract most of the storage away by using a global blob DB. The vast majority of file storage is then handled by something like Amazon S3 or Google Cloud Storage (GCS).


The important metadata for files is kept in the file database used by our file service. The database will mostly be queried by file URL and by file ID, so we need indexes on both. The URL lets other users download the file. A user's own clients hold the internal file ID, and we need to be able to return the blob link for that file ID.


Another option is to have a separate table for file URLs that maps each URL to its file ID. I like that more.
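
A minimal sketch of the two-table layout, using SQLite purely for illustration (the actual metadata store is not specified here, and the table and column names are assumptions). Declaring `file_id` and `url` as primary keys gives each lookup path its own index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (
    file_id      TEXT PRIMARY KEY,   -- indexed lookup by internal file ID
    blob_link    TEXT NOT NULL,
    last_updated REAL NOT NULL
);
CREATE TABLE links (
    url     TEXT PRIMARY KEY,        -- indexed lookup by shared URL
    file_id TEXT NOT NULL REFERENCES files(file_id)
);
""")
conn.execute("INSERT INTO files VALUES (?, ?, ?)",
             ("f1", "blob://bucket/f1", 1700000000.0))
conn.execute("INSERT INTO links VALUES (?, ?)",
             ("https://files.example.com/abc", "f1"))


def blob_link_for_url(url):
    """Resolve a shared URL to its blob link through the links table."""
    row = conn.execute(
        "SELECT f.blob_link FROM links AS l "
        "JOIN files AS f ON f.file_id = l.file_id WHERE l.url = ?",
        (url,),
    ).fetchone()
    return row[0] if row else None


def blob_link_for_file_id(file_id):
    """Resolve an internal file ID directly against the files table."""
    row = conn.execute(
        "SELECT blob_link FROM files WHERE file_id = ?", (file_id,)
    ).fetchone()
    return row[0] if row else None


print(blob_link_for_url("https://files.example.com/abc"))
print(blob_link_for_file_id("f1"))
```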





Detailed Component Design



Since so much data exists, we need to split the data across multiple database instances.

We're going to partition the data by file ID. This splits the data roughly evenly across nodes. Another way is to partition by user, but that lets hot users overload certain nodes. Partitioning by file ID can still result in a hot file overloading a node.
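
One simple way to partition by file ID is to hash the ID and take it modulo the number of shards. The sketch below assumes 16 shards and SHA-256 as the hash; both are arbitrary choices for illustration.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 16  # assumed shard count, for illustration only


def shard_for(file_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a file ID to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(file_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Count how 10,000 synthetic file IDs spread across the shards.
counts = Counter(shard_for(f"file-{i}") for i in range(10_000))
print(sorted(counts.items()))
```

With plain modulo hashing, changing the shard count remaps most keys, so a scheme such as consistent hashing may be worth considering if shards are added often.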


Our file service needs to handle roughly 30K file uploads per second on average (10^12 files per year spread over about 3 × 10^7 seconds). At 1 KB of metadata per file, this is about 30 MB/s of incoming metadata traffic, which a single node could carry in terms of bandwidth alone. But each request also makes the node do some work, and real traffic is unlikely to arrive evenly, so we assume the file service needs to be scaled. We add a load balancer to the API gateway so that it can choose between different nodes of the file service. Since each file from a user is expected to be different, we can generally assume no concurrency issues between the file service nodes. In normal operation each node writes the file's metadata into the file database and generates a blob DB link for the user to upload to, so there should be no conflicts between files.


How to make synchronization fast?

Currently, each time a new client of a user logs in, it sends the file service every file it has along with each file's last-updated timestamp. The file service then needs to look up each file in the file database and return the relevant blob links if they need to be updated. For users with thousands of files, this can be a very slow process even if only one or two files need updating. One way to solve this is to maintain a queue that has a separate channel for each client. When a file is uploaded from one client, we write the blob link of that file to the channel that each of the user's clients is subscribed to, so a client only has to read its own channel to learn what changed.
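
The per-client channel idea can be sketched with one in-memory queue per client. `SyncChannels` and its methods are hypothetical names, and skipping the uploading client during fan-out is an assumption; a production system would use a durable message queue rather than process memory.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Set


class SyncChannels:
    """One pending-update queue (channel) per client, grouped by user."""

    def __init__(self) -> None:
        self.clients_by_user: Dict[str, Set[str]] = defaultdict(set)
        self.queues: Dict[str, Deque[str]] = defaultdict(deque)

    def register(self, user_id: str, client_id: str) -> None:
        self.clients_by_user[user_id].add(client_id)

    def publish(self, user_id: str, source_client: str, blob_link: str) -> None:
        # Fan out to every client of the user except the one that uploaded (assumption).
        for client_id in self.clients_by_user[user_id]:
            if client_id != source_client:
                self.queues[client_id].append(blob_link)

    def drain(self, client_id: str) -> List[str]:
        # Return and clear everything waiting for this client.
        pending = list(self.queues[client_id])
        self.queues[client_id].clear()
        return pending


channels = SyncChannels()
for client in ("mobile", "desktop-1", "desktop-2"):
    channels.register("user-1", client)

channels.publish("user-1", source_client="mobile", blob_link="blob://f1")
print(channels.drain("desktop-1"))  # ['blob://f1']
print(channels.drain("mobile"))     # []
```

A sync now costs work proportional to the number of changes rather than the number of files the user owns.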


How to support updates on files that are large?

One issue with our current design is that we rely on the user to upload and update the blob DB. This is fine for a new file, but when a file is updated we don't necessarily want to upload a new blob into the DB, and definitely not the full file. Instead we should have a separate API for updating a file, distinct from uploading a new one. To update a file, the client sends the file ID and the diff. The file service is then in charge of changing the blob in the blob DB.


Instead of having client 2 redownload whole files from the blob DB when it synchronizes, the update service writes diffs into the pending updates queue. When the client tries to sync, the update service reads the pending updates queue and returns the diff history. Client 2 then applies those diffs to its local files. This assumes the files already exist locally; files that the client has never downloaded still have to be retrieved from the blob DB.
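
To make the diff-based sync concrete, here is a minimal sketch of a client applying a diff history to its local copy. The diff format (a list of offset/delete/insert edits) is an assumption made for illustration; the design above does not specify one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Edit:
    offset: int    # position in the current content
    delete: int    # number of bytes to remove at offset
    insert: bytes  # bytes to insert at offset


def apply_diff(content: bytes, diff: List[Edit]) -> bytes:
    """Apply edits in order; each offset refers to the content after earlier edits."""
    for edit in diff:
        content = (content[:edit.offset]
                   + edit.insert
                   + content[edit.offset + edit.delete:])
    return content


local_copy = b"hello world"
# Diff history as it might be returned from the pending updates queue, oldest first.
pending_diffs = [
    [Edit(offset=0, delete=5, insert=b"HELLO")],
    [Edit(offset=11, delete=0, insert=b"!")],
]
for diff in pending_diffs:
    local_copy = apply_diff(local_copy, diff)
print(local_copy)  # b'HELLO world!'
```

The diffs have to be applied in the order they were produced, so the pending updates queue would need to preserve ordering per file.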