Requirements


Functional Requirements:


  • Allow users to upload files to the system.
  • Enable users to download uploaded files.
  • Ensure synchronization of files between local and server storage.



Non-Functional Requirements:


    List the key non-functional requirements (eg low latency, scalability, reliability, etc.)...


API Design

  1. HTTP_PUT upload/<path_to_file>?chunk_offset=0&chunk_size=1024&total_size=131072&token=some_base64_token uploads one chunk of the file.
  2. HTTP_GET file_info/<path_to_file>?token=some_base64_token returns a JSON object of file info, including size, file time, etc.
  3. HTTP_GET download/<path_to_file>?chunk_offset=0&chunk_size=1024&token=... returns the corresponding chunk.
  4. HTTP_GET changes/<client_device_id>?limit=500&token=... returns a list of new, deleted, and changed files that are newer than the last sync recorded for that client_device_id.
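
As an illustration of how a client might drive the chunked upload endpoint, here is a minimal Python sketch that enumerates the requests for one file. It assumes chunk_offset is a byte offset and that the final chunk may be shorter than chunk_size; neither is stated above. The function name upload_requests is hypothetical, and the sketch only builds URLs, it does not send anything.

```python
from urllib.parse import quote, urlencode


def upload_requests(path, total_size, token, chunk_size=1024):
    """Yield (url, start, end) for each chunk PUT of a file of total_size bytes.

    Assumption: chunk_offset is a byte offset, not a chunk index.
    """
    for offset in range(0, total_size, chunk_size):
        size = min(chunk_size, total_size - offset)  # last chunk may be short
        query = urlencode({
            "chunk_offset": offset,
            "chunk_size": size,
            "total_size": total_size,
            "token": token,
        })
        # The client would PUT bytes [offset, offset + size) to this URL.
        yield f"upload/{quote(path)}?{query}", offset, offset + size


reqs = list(upload_requests("docs/report.pdf", 131072, "some_base64_token"))
print(len(reqs))   # number of chunk requests
print(reqs[0][0])  # URL of the first chunk
```

Carrying total_size on every request lets the server decide when a file is complete without a separate "finish" call, at the cost of repeating the value in each chunk request.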



High-Level Design

  1. Auth service gives a token for the session. The token embeds / determines the user's or project's root directory.
  2. Sync is initiated by the client. It first queries for new changes, and then downloads the changed and new files or chunks.
  3. Internally, files are stored as chunks.
  4. Write path 1, data: the web tier performs chunking and RAID 5 parity encoding, shards the chunks, and sends them to the storage nodes.
  5. Write path 2, metadata: for each chunk, the web tier computes basic info such as file path, size, time, owner, and project, and writes it to multiple metadata storage services.
  6. Write path 3, changelog: a changelog service works as a write queue that can merge events. It is also responsible for updating the completeness flag in the metadata.
  7. Garbage collection: a background service scans the metadata; if a file has been incomplete for more than 7 days, it is deleted.
  8. Read path 1, metadata: the web tier queries one of the metadata services and returns the result.
  9. Read path 2, file content: the web tier fetches the contents from the storage nodes using the same sharding rules as the write path.
  10. Read path 3, changelog: the changelog of a project / root directory can be queried from the changelog service. To be extra safe, timestamps should be computed at the server, not the client. The changelog service keeps reading the changelog until it sees the specified client_device_id, and then returns everything seen so far. This can be optimized by keeping last_time for each client_device_id separately.
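
The changelog behaviour in write path 3 and read path 3 can be sketched as a small in-memory structure. This is only an illustration under stated assumptions: a server-side counter stands in for the server timestamp, "merging" is taken to mean that a newer event for a path replaces the older one, and the class and method names (Changelog, record, changes) are hypothetical. It uses the per-device last_time optimization rather than scanning the log.

```python
from dataclasses import dataclass, field


@dataclass
class Changelog:
    clock: int = 0                                  # server-side logical time
    latest: dict = field(default_factory=dict)      # path -> (time, kind)
    last_time: dict = field(default_factory=dict)   # device_id -> last synced time

    def record(self, path, kind):
        """Log an event; a newer event for the same path replaces the older one."""
        self.clock += 1  # time is assigned by the server, never by the client
        self.latest[path] = (self.clock, kind)

    def changes(self, device_id, limit=500):
        """Return up to `limit` (path, kind) events newer than the device's last sync."""
        since = self.last_time.get(device_id, 0)
        pending = sorted((t, p, k) for p, (t, k) in self.latest.items() if t > since)
        batch = pending[:limit]
        if batch:
            # Advance the cursor only as far as what was actually returned.
            self.last_time[device_id] = batch[-1][0]
        return [(p, k) for _, p, k in batch]


log = Changelog()
log.record("a.txt", "new")
log.record("b.txt", "new")
log.record("a.txt", "changed")   # merges with the earlier "new" event for a.txt
print(log.changes("laptop"))
```

Because the cursor advances only past the returned batch, a client that hits the limit can simply call changes again to fetch the remainder.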

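The chunking, parity, and sharding steps of write path 1 and read path 2 can be illustrated with a simplified Python sketch. It assumes a single stripe of k data chunks plus one XOR parity chunk (real RAID 5 also rotates the parity position across stripes), and a placement rule that spreads the chunks of one file over consecutive nodes. The helper names and the placement rule are hypothetical, not part of the design above.

```python
import hashlib
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_stripe(data: bytes, k: int) -> list:
    """Split data into k equal-size chunks (zero-padded) plus one XOR parity chunk."""
    size = -(-len(data) // k)  # ceiling division
    chunks = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    return chunks + [reduce(xor_bytes, chunks)]


def recover(stripe: list, missing: int) -> bytes:
    """Rebuild the chunk at index `missing` by XOR-ing all surviving chunks."""
    return reduce(xor_bytes, [c for i, c in enumerate(stripe) if i != missing])


def node_for(path: str, chunk_index: int, num_nodes: int) -> int:
    """Deterministic placement: the same rule serves both writes and reads.

    Chunks of one stripe land on distinct nodes as long as the stripe
    width (k + 1) does not exceed num_nodes.
    """
    base = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    return (base + chunk_index) % num_nodes


stripe = encode_stripe(b"hello world!", 3)
print(recover(stripe, 1))  # rebuilt from the other data chunks and the parity
```

Keeping the chunks of a stripe on distinct nodes is what makes the parity useful: losing one node then costs at most one chunk per stripe, which the XOR of the survivors can rebuild.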

Detailed Component Design

Deep dive into 2-3 key components. Explain how they work, how they scale, discuss tradeoffs, capacity, and any relevant algorithms or data structures.