System requirements


Functional:

  • upload/download files from multiple devices
  • file changes synced among all the devices
  • file versioning


Non-Functional:

  1. Data durability. Once file is uploaded, it should never get lost.
  2. Availability
  3. Scalability. Many users.
  4. Strong Consistency. All clients should see the same files.
  5. Security. Files can be accessed by the owner and users with explicit sharing permissions. By default, its't not shared by anybody other than the owner.
  6. Response time: reasonable. Would like to show a home directory quickly (e.g. sub second), but if upload & download takes 1 - 10s depending on file size, that would be acceptable.


Capacity estimation

Suppose we have 100M users/10M DAU.

  • Each of them create 10 files a day. 10MB in average and it could be 1KB - 100GB
  • 100 reads a day


So around 1.5kwrite QPS and 15k read QPS.




API design


Metadata Server

  1. upload(user_ID, file_content, file_name, parent_directory_ID) -> returns file_ID in JSON along with some metadata. This initiates an upload process. It returns the URL to which update_chunk() should send chunks to.
  2. update(user_ID, file_ID, file_content) -> replaces the whole file
  3. mkdir(user_ID, parent_directory_ID, directory_name)
  4. download(user_ID, file_ID)


Chunk Server

  1. upload_chunk(user_ID, file_ID, chunk_ID, chunk_content) -> uploads a specific chunk of a file
  2. update_chunk(user_ID, file_ID, chunk_ID, chunk_content) -> replaces a specific chunk of a file
  3. download_chunk(user_ID, file_ID, chunk_ID, chunk_content) -> uploads a specific chunk of a file




Database design


User(user_id, email....)

File(file_id,file_name, file_version, user_id)

Chunk(chunk_id, file_id, )

Devices(device_id, user_id, )



High-level design


We basically split into 3 big modules:


Metadata Server:

  • CRUD files metadata
  • Fanout any file updates to notification server

Chunk Server:

  • deduplicate file chunks
  • upload chunks to cloud storage
  • update per-file chunk information to metadata server


notification server

  • poll the file updates from the metadata server and send notification to the clients through long polling





Request flows


Uploading path:

  • The clients calls the metadata server first to creat a file and marked it as `Uploading`
  • The clients calls the chunk server to send data
  • The chunk server forward data to cloud storage
  • when fowarding is down, the chunk server call metadata server to update the chunk information and mark it as `Ready`. Also, it responds client with OK


Notification path:

  • The notification server gets called and it notifies the client with the updates



Downloading path:

  • the clients call the metadata server to get the metadata informtion
  • download the chunks from the chunk server
  • the chunk server gets the chunk information from metadata server
  • the client downloads all the chunk from the chunk server and compose the file in local





Detailed component design


Chunk servers.

  • parallel uploading: first of all we can split a big files into small chunks, let's say 10MB each. It can be done on the client side so that the client can fully utilize the network bandwidth. Same for the chunk server to the cloud storage hop
  • data deduplication: chunk servers can calculate the hash value of each chunks and do a dedup by looking up the metadata server. This should be pretty common in file versioning where there could be very small edition on one files.
  • data compression: it can greatly save the storage size by compressing the file blocks
  • data integrity: we can ask the client to calculate the hash value, sent to the metatadata server and the chunk server to calculate that again to ensure the data isn't corrupted. If yes, we have to ask clients to resend corrupted chunks.


Polling/Long-polling/Websockets

  • The notification between the notification server and the client can be supported by multiple ways. Here polling may be not ideal as it is waste of resource. As for long-polling and websockets, it depends. Websockets is more performant but relatively complicated protocol than long-polling.



How to save storage cost?

  • data deduplicate: mentioned above
  • set version limit: we can set a limit for versions. Beyond certain limit, we don't support them.
  • use cold storage: this is a feature of Cloud Storage, we can let Cloud Storage to transfer the data that hasn't been accessed for 2 weeks to cold storage.



Data replication/durability


Normally, we can relies on the Cloud Storage's data replication for our durability. Usually they already did that by replication and we don't need to do that again.



Since metadata server does lots of read, we can have a in-memory cache to help with that


The database choice on metadata DB: we can choose relational DB as the write QPS ins't high and we need strong consistency






Failure scenarios/bottlenecks


It is possible transimission failed between client and chunk server side or the chunk server to cloud storage side. As mentioned above, we use the database status tracking if the uploading is sucess or not and use checksum to decide if the data is corrupted or not. Otherwise we have to ask clients to retry with exponential back-off



Sharding: we need to shard our metadata DB in the future. Apparently we can shard by file_id. The another option is user_id. The advantage of using file_id is to prevent hotspot issue though sharding by user_id is better during fanout.



Scalab


Future improvements


We can add monitoring to monitor our service SLA. It should be very reliable.