System requirements


Functional:

  1. Put Data: Users can store any kind of file, such as photos, documents, and videos.
  2. Get Data: Users can download files they previously uploaded to Dropbox at any time.
  3. Delete Data: Users can delete files from their storage.
  4. List Data: Users can see a list of their files.
  5. Share Data: Users can share files with groups of people and assign access rights.



Non-Functional:

  1. Availability: The service should be highly available.
  2. Durability: Once uploaded, data should not be lost unless the user deletes it.
  3. Scalability: The system should be capable of handling billions of users.
  4. Reliability: Since failures are the norm in distributed systems, the design should detect and recover from failures promptly.
  5. Security: Data can only be read or written with permission. By default, a file is not shared with anyone other than its owner.
  6. Response time: Listing files should be fast; uploads and downloads may take several minutes depending on the size of the file.


Capacity estimation

  1. 20 million new files are uploaded per day

Storage:

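Average file size = 3 MB, metadata size = 256 bytes per file

File content: 20M * 3 MB = 60 TB/day

Metadata: 20M * 256 B = 5.12 GB/day (about 5 GB/day)

For 2 years: 60 TB/day * 365 * 2 = 43.8 PB of file content (about 44 PB), plus about 3.7 TB of metadata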


Bandwidth: write : read = 1:2

Write: 20M * 3 MB / (24*60*60 s) = 60 TB / 86,400 s ≈ 694 MB/s

Read: 2 * 694 MB/s ≈ 1.4 GB/s
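
The arithmetic above can be double-checked with a short script. This is only a sketch: it assumes decimal units (1 MB = 10**6 bytes), and the constant names are made up for illustration.

```python
# Back-of-the-envelope capacity estimate from the figures stated above.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 TB = 10**12 bytes).
FILES_PER_DAY = 20_000_000
AVG_FILE_BYTES = 3 * 10**6        # 3 MB average file size
METADATA_BYTES = 256              # metadata per file
RETENTION_DAYS = 365 * 2          # 2 years
SECONDS_PER_DAY = 24 * 60 * 60
READS_PER_WRITE = 2               # write : read = 1 : 2

content_per_day = FILES_PER_DAY * AVG_FILE_BYTES      # bytes of file content per day
metadata_per_day = FILES_PER_DAY * METADATA_BYTES     # bytes of metadata per day
content_two_years = content_per_day * RETENTION_DAYS  # bytes of file content after 2 years
write_bandwidth = content_per_day / SECONDS_PER_DAY   # bytes per second
read_bandwidth = write_bandwidth * READS_PER_WRITE    # bytes per second

print(f"content/day:   {content_per_day / 10**12:.1f} TB")
print(f"metadata/day:  {metadata_per_day / 10**9:.2f} GB")
print(f"content/2y:    {content_two_years / 10**15:.1f} PB")
print(f"write traffic: {write_bandwidth / 10**6:.0f} MB/s")
print(f"read traffic:  {read_bandwidth / 10**6:.0f} MB/s")
```

These are averages; peak traffic is not modeled here.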


Number of servers

Storage * 1000 / 500 RPS


API design

  1. POST upload(user_ID, file_content, file_name, parent_directory_ID) -> initiates an upload. Returns the file_ID in JSON along with some metadata and the URL to which upload_chunk() should send chunks.
  2. POST upload_chunk(user_ID, file_ID, chunk_ID, chunk_content) -> uploads a specific chunk of a file
  3. PATCH update_chunk(user_ID, file_ID, chunk_ID, chunk_content) -> replaces a specific chunk of a file
  4. GET download(user_ID, file_ID)
  5. POST mkdir(user_ID, parent_directory_ID, directory_name) -> creates a directory (POST rather than GET, because the call changes state)
  6. GET preview(user_ID, file_ID) # Nice-to-have
  7. POST share(file_ID, [user_IDs], sharing_type) -> directory_ID can be used in place of file_ID, and group_IDs in place of user_IDs. sharing_type has several options: read only, read-write, can upload (but not download), can comment (but not update), etc. # Nice-to-have
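
As a rough illustration of how sharing_type could gate requests, the sketch below maps each option to a set of allowed actions. The Action names, the string keys, and the exact action sets are assumptions made for this example; the notes above only name the options.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical actions that an API call may require."""
    DOWNLOAD = "download"
    UPLOAD = "upload"
    UPDATE = "update"
    COMMENT = "comment"


# Assumed mapping from sharing_type to the actions it grants.
SHARING_TYPES = {
    "read_only": frozenset({Action.DOWNLOAD}),
    "read_write": frozenset({Action.DOWNLOAD, Action.UPLOAD, Action.UPDATE, Action.COMMENT}),
    "upload_only": frozenset({Action.UPLOAD}),                     # can upload but not download
    "comment_only": frozenset({Action.DOWNLOAD, Action.COMMENT}),  # can comment but not update
}


def is_allowed(sharing_type: str, action: Action) -> bool:
    """Return True if the sharing type grants the action; unknown types grant nothing."""
    return action in SHARING_TYPES.get(sharing_type, frozenset())


print(is_allowed("read_only", Action.DOWNLOAD))
print(is_allowed("upload_only", Action.DOWNLOAD))
```

Denying unknown sharing types by default matches the requirement that files are private unless explicitly shared.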



Database design

There are two kinds of data that need to be stored:

  1. File content
    1. NoSQL blob storage, such as Amazon S3 or Azure Blob Storage
    2. The data is unstructured and simple, so eventual consistency meets the requirement
  2. File metadata
    1. Relational database (RDB)
    2. The data is structured, requires strong consistency, and is small


Metadata

  • File name
  • File version
  • File chunks list (chunk IDs in S3)
  • Location (parent directory ID)
  • Owner
  • Creation time
  • Last edit time
  • Share option
    • groups of users
    • access rights (read-only, read/write)


High-level design

Client





Request flows

Upload File:

  • The client sends a request to upload a file. The request passes the authentication and rate-limiter checks of the API gateway and is forwarded to the File Upload service.
  • The File Upload service writes the metadata to the RDB and returns the file_ID, some metadata, and the URL of the Chunk Uploader to which upload_chunk() should send chunks.
  • The client splits the file into chunks and sends them to the Chunk Uploader.
  • The Chunk Uploader saves the chunks to blob storage.
  • After the Chunk Uploader has saved all the chunks, it marks the file as ready in the metadata DB.
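
The upload flow above can be simulated in memory as follows. This is a minimal sketch, not the real services: the dictionaries stand in for the metadata DB and blob storage, start_upload and upload_file are hypothetical helper names, and the 4-byte chunk size is deliberately tiny so the example is easy to follow.

```python
import uuid
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # bytes; unrealistically small, for illustration only

metadata_db: Dict[str, dict] = {}               # stand-in for the metadata RDB
blob_store: Dict[Tuple[str, int], bytes] = {}   # stand-in for blob storage


def start_upload(user_id: str, file_name: str, parent_directory_id: str, total_chunks: int) -> str:
    """File Upload service: write the metadata and return a new file ID."""
    file_id = uuid.uuid4().hex
    metadata_db[file_id] = {
        "owner": user_id,
        "name": file_name,
        "parent": parent_directory_id,
        "total_chunks": total_chunks,
        "received": set(),
        "ready": total_chunks == 0,  # an empty file has nothing left to upload
    }
    return file_id


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> List[bytes]:
    """Client side: cut the file into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def upload_chunk(user_id: str, file_id: str, chunk_id: int, chunk_content: bytes) -> None:
    """Chunk Uploader: store one chunk and mark the file ready once all chunks have arrived."""
    record = metadata_db[file_id]
    if record["owner"] != user_id:
        raise PermissionError("only the owner may upload chunks in this sketch")
    blob_store[(file_id, chunk_id)] = chunk_content
    record["received"].add(chunk_id)
    if len(record["received"]) == record["total_chunks"]:
        record["ready"] = True


def upload_file(user_id: str, file_name: str, parent_directory_id: str, data: bytes) -> str:
    """Client side: run the whole flow for one file."""
    chunks = split_into_chunks(data)
    file_id = start_upload(user_id, file_name, parent_directory_id, len(chunks))
    for chunk_id, chunk in enumerate(chunks):
        upload_chunk(user_id, file_id, chunk_id, chunk)
    return file_id


demo_id = upload_file("u1", "hello.txt", "root", b"hello world")
print(metadata_db[demo_id]["ready"])
```

Tracking received chunk IDs in a set makes a retried chunk harmless, since storing the same chunk twice does not change the count.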


Download File:

  • The client sends a request to download a file. If the request hits the cache in DNS, the address of the file is returned.
  • On a cache miss, the request passes the authentication and rate-limiter checks of the API gateway and is forwarded to the File Download service. The File Download service looks up the URLs of the file's chunks in the DB and uses the Chunk Downloader to download the file.
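
A simplified version of the download path might look like the sketch below. The dictionaries, the sample record, and the download function are assumptions for illustration. Unlike the flow described above, this sketch checks ownership before consulting the cache, so that a cached private file is never served to another user; that ordering is a choice made for the example, not something the notes prescribe.

```python
from typing import Dict

# Stand-ins for the metadata DB, blob storage, and a cache of assembled files.
metadata_db: Dict[str, dict] = {
    "f1": {"owner": "u1", "ready": True, "chunk_ids": ["c0", "c1"]},
}
blob_store: Dict[str, bytes] = {"c0": b"hello ", "c1": b"world"}
cache: Dict[str, bytes] = {}


def download(user_id: str, file_id: str) -> bytes:
    """File Download service: authorize, then serve from the cache or reassemble the chunks."""
    record = metadata_db.get(file_id)
    if record is None:
        raise FileNotFoundError(file_id)
    if record["owner"] != user_id:
        raise PermissionError("user may not read this file")
    if not record["ready"]:
        raise RuntimeError("upload has not finished yet")
    if file_id in cache:                      # cache hit: skip blob storage
        return cache[file_id]
    data = b"".join(blob_store[chunk_id] for chunk_id in record["chunk_ids"])
    cache[file_id] = data                     # cache miss: remember the assembled file
    return data


print(download("u1", "f1"))
```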



Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...






Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...






Failure scenarios/bottlenecks


Error Check

A network failure during upload can corrupt chunks, and chunks can also become corrupted after they are stored. It is important to verify that each uploaded chunk is correct.

  1. Before uploading a chunk, the client computes its checksum and sends it to the server along with the chunk itself.
  2. An async job, triggered during file upload, reads the chunks from blob storage, recomputes their checksums, and checks that they match the ones stored in the metadata DB.
  3. Periodically, another async job recomputes the checksums to make sure the data stays intact. This is particularly important when data moves (e.g. when it is recovered from a backup copy or moved to a different data center).
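
The checksum steps above could be sketched as follows. SHA-256 is an assumption here (the notes do not name a hash function), and the dictionaries stand in for blob storage and the checksums kept in the metadata DB.

```python
import hashlib
from typing import Dict, List


def checksum(chunk: bytes) -> str:
    """Checksum the client computes before upload and the async jobs recompute later."""
    return hashlib.sha256(chunk).hexdigest()


def find_corrupted(blob_store: Dict[str, bytes], expected: Dict[str, str]) -> List[str]:
    """Return the IDs of chunks that are missing or whose content no longer matches."""
    return sorted(
        chunk_id
        for chunk_id, digest in expected.items()
        if chunk_id not in blob_store or checksum(blob_store[chunk_id]) != digest
    )


# Client side: checksums are recorded alongside the chunks at upload time.
chunks = {"c0": b"hello ", "c1": b"world"}
expected = {chunk_id: checksum(data) for chunk_id, data in chunks.items()}

# Later: one stored chunk is damaged, and the verification job detects it.
stored = dict(chunks)
stored["c1"] = b"w0rld"
print(find_corrupted(stored, expected))
```

The same function serves both the job triggered during upload and the periodic scan; only the schedule differs.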


Data Replication

It is also important to replicate data for robustness.

Each uploaded file should have one replica in the same data center, one in another data center in the same region, and one in a data center in another region in case of disaster. All copies should be stored as chunks with version numbers, so that only the lost or corrupted chunks need to be restored to the production system.
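
To illustrate restoring only the lost or corrupted chunks, the sketch below repairs each storage location from the first intact copy it can find. The location names, the use of SHA-256, and the repair function are assumptions for this example; chunk version numbers are left out to keep it short.

```python
import hashlib
from typing import Dict, List, Tuple


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def repair(replicas: Dict[str, Dict[str, bytes]], expected: Dict[str, str]) -> List[Tuple[str, str]]:
    """Overwrite missing or corrupted chunks with an intact copy from another location.

    Returns the (location, chunk_id) pairs that were rewritten. Chunks with no
    intact copy anywhere are left untouched.
    """
    repaired = []
    for chunk_id, digest in expected.items():
        healthy = next(
            (store[chunk_id] for store in replicas.values()
             if chunk_id in store and sha256(store[chunk_id]) == digest),
            None,
        )
        if healthy is None:
            continue  # no intact copy left; needs escalation, not handled in this sketch
        for location, store in replicas.items():
            if store.get(chunk_id) != healthy:
                store[chunk_id] = healthy
                repaired.append((location, chunk_id))
    return repaired


expected = {"c0": sha256(b"aaaa"), "c1": sha256(b"bbbb")}
replicas = {
    "local_dc": {"c0": b"aaaa", "c1": b"bbbb"},
    "same_region_dc": {"c0": b"aXaa", "c1": b"bbbb"},  # c0 is corrupted here
    "other_region_dc": {"c0": b"aaaa"},                # c1 is missing here
}
print(repair(replicas, expected))
```

Only the two damaged chunks are copied; intact chunks are never rewritten.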



Future improvements

  1. Add folders for organizing files
  2. Support search
  3. Support preview
  4. Support collaboration