System requirements


Functional:

  1. File Upload: Users should be able to upload files to the system, which includes handling different file types and sizes.
  2. File Organization: Users need the ability to organize files into directories or folders to manage their content effectively.
  3. File Sharing: Users should have options to share files or directories with other users, specifying permissions for access (e.g., read-only or read-write).
  4. Version Control: The system should support versioning of files, allowing users to revert to previous versions if needed, and to see edit history.



Non-Functional:

  • The contents of files and directories must be durable: we cannot lose data.
  • Files should be secure: users should not be able to access files without the right permissions.
  • The system should be highly available for reads and writes. Eventual consistency is fine; we don't expect much concurrent access or real-time concurrent modification of files.
  • We should have a per-user throughput target. Assume 100 Mbps for reads and 10 Mbps for writes.




Capacity estimation


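Assume 100M DAU.

Each user reads 2 files and writes 1 file every day. Each file is 1 MB.

Storage added is 100M x 1 MB = 100 TB per day.

About 36.5 PB per year.

Avg read bandwidth: 200 TB / 86,400 s = about 2.3 GB/s, or about 18.5 Gbps

Avg write bandwidth: 100 TB / 86,400 s = about 1.2 GB/s, or about 9.3 Gbps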

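Assume an average of 5,000 users actively transferring at any moment.

Each one gets 100 Mbps for downloads and 10 Mbps for uploads.

From the servers' side: 500 Gbps uplink (serving downloads), 50 Gbps downlink (receiving uploads).

Average network bandwidth is about 500 Gbps.

Assuming peak is 10x average, we should provision 5,000 Gbps (5 Tbps).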
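The arithmetic above can be checked with a short script. This is only a back-of-envelope sketch: the inputs are the assumptions stated in this section (100M DAU, 2 reads and 1 write per user per day, 1 MB files, 5,000 concurrently active users, 10x peak factor), decimal units are assumed (1 MB = 10^6 bytes), and the variable names are made up for illustration.

```python
# Back-of-envelope capacity estimate. Every input is an assumption from the text.
DAU = 100_000_000             # daily active users
READS_PER_USER = 2            # files read per user per day
WRITES_PER_USER = 1           # files written per user per day
FILE_SIZE_BYTES = 1_000_000   # 1 MB per file (decimal units)
SECONDS_PER_DAY = 86_400


def daily_bytes(ops_per_user: int) -> int:
    """Total bytes moved per day for a given per-user operation count."""
    return DAU * ops_per_user * FILE_SIZE_BYTES


def avg_gbps(bytes_per_day: int) -> float:
    """Average bandwidth in gigabits per second, spread evenly over one day."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e9


written_per_day = daily_bytes(WRITES_PER_USER)       # 100 TB
read_per_day = daily_bytes(READS_PER_USER)           # 200 TB
storage_per_year_pb = written_per_day * 365 / 1e15   # petabytes per year

# Provisioning from the per-user throughput targets.
ACTIVE_USERS = 5_000          # users transferring at any moment (assumed)
DOWNLOAD_MBPS = 100
UPLOAD_MBPS = 10
PEAK_FACTOR = 10

egress_gbps = ACTIVE_USERS * DOWNLOAD_MBPS / 1000    # serving downloads
ingress_gbps = ACTIVE_USERS * UPLOAD_MBPS / 1000     # receiving uploads
peak_gbps = egress_gbps * PEAK_FACTOR

if __name__ == "__main__":
    print(f"storage/day:  {written_per_day / 1e12:.0f} TB")
    print(f"storage/year: {storage_per_year_pb:.1f} PB")
    print(f"avg read:     {avg_gbps(read_per_day):.1f} Gbps")
    print(f"avg write:    {avg_gbps(written_per_day):.1f} Gbps")
    print(f"provision:    {peak_gbps:.0f} Gbps at peak")
```

Note the two bandwidth figures answer different questions: the daily-volume average describes what the workload needs, while the per-user target describes what we promise each active user, so the second is the one that drives provisioning.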






API design

POST /create-folder

{

path

}

DELETE /content/:path

GET /content/:path


POST /start-upload -> version-number

{

base-version # null, if it's a new file

name

path

total-number-of-chunks

}


POST /chunk-upload -> pass/fail

{

name

path

version-number

u64-encoded-content

chunk-number # which of the chunks is in this request

}
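A minimal client-side sketch of the two upload calls above, showing how a file could be split into chunks and turned into request bodies. Several things here are assumptions, not part of the API as written: the 4 MiB chunk size, chunk numbers starting at 0, and reading `u64-encoded-content` as base64. The helper names are hypothetical, and no HTTP is performed.

```python
import base64
import math

CHUNK_SIZE = 4 * 1024 * 1024  # hypothetical chunk size (4 MiB)


def start_upload_body(name, path, size_bytes, base_version=None,
                      chunk_size=CHUNK_SIZE):
    """Build the body for POST /start-upload."""
    return {
        "base-version": base_version,  # None (null) for a new file
        "name": name,
        "path": path,
        # An empty file is still sent as one (empty) chunk.
        "total-number-of-chunks": max(1, math.ceil(size_bytes / chunk_size)),
    }


def chunk_upload_bodies(name, path, version_number, data, chunk_size=CHUNK_SIZE):
    """Yield one POST /chunk-upload body per chunk, in order."""
    offsets = range(0, max(len(data), 1), chunk_size)
    for chunk_number, offset in enumerate(offsets):
        piece = data[offset:offset + chunk_size]
        yield {
            "name": name,
            "path": path,
            "version-number": version_number,
            # Assumption: the content field carries base64 text.
            "u64-encoded-content": base64.b64encode(piece).decode("ascii"),
            "chunk-number": chunk_number,
        }
```

Because each chunk request carries its own chunk-number, a client could retry a failed chunk on its own without restarting the whole upload.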

DELETE /content/:path/:name


GET /content/:path/:name?version=vvv&page=xxx

GET /metadata/:path/:name -> {

version-history[] # pairs of version-number and timestamp

users-with-read-permission

users-with-write-permission

}


POST /share/:path/:name {

reads: user-email[]

writes: user-email[]

}
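One way the server could apply a /share request and later enforce it is sketched below. The `owner` field, the rule that only the owner may share, and the rule that write permission implies read permission are all assumptions made for this example; the design above does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class FileAcl:
    """Per-file access lists, keyed by user email."""
    owner: str                                   # hypothetical field
    readers: set = field(default_factory=set)
    writers: set = field(default_factory=set)


def apply_share(acl, caller, reads=(), writes=()):
    """Apply the body of POST /share/:path/:name to a file's access lists."""
    if caller != acl.owner:
        # Assumed policy: only the owner may change sharing.
        raise PermissionError(f"{caller} may not share this file")
    acl.readers.update(reads)
    acl.writers.update(writes)


def can_write(acl, user):
    return user == acl.owner or user in acl.writers


def can_read(acl, user):
    # Assumed policy: anyone who can write can also read.
    return can_write(acl, user) or user in acl.readers
```

The front end would call `can_read` before serving GET /content and `can_write` before accepting /start-upload.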



Database design

Path:

  • path-id (primary key)
  • parent-path-id
  • fully-qualified-path

File:

  • file-id (primary key)
  • name
  • path-id
  • fully-qualified-path
  • user-emails-read-permission[]
  • user-emails-write-permission[]
  • file-version-id[]

File-Version:

  • file-version-id (primary key)
  • file-id
  • name
  • path-id
  • chunk-id[]

Chunk:

  • chunk-id (primary key)
  • chunk-server-leader-name
  • chunk-server-follower-name

Chunk-Server:

  • chunk-server-name (primary key)
  • chunk-id[]
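If these entities lived in a relational store, the array fields could become join tables. The sketch below shows that for part of the schema, using SQLite purely for illustration. The choice of SQLite, the underscore column names, the `chunk_number` ordering column, and the `file_version_chunk` table are assumptions, not part of the design above.

```python
import sqlite3

# Partial schema: chunk-id[] on File-Version becomes the file_version_chunk table.
SCHEMA = """
CREATE TABLE path (
    path_id              INTEGER PRIMARY KEY,
    parent_path_id       INTEGER REFERENCES path(path_id),
    fully_qualified_path TEXT NOT NULL UNIQUE
);
CREATE TABLE file (
    file_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    path_id INTEGER NOT NULL REFERENCES path(path_id),
    UNIQUE (path_id, name)
);
CREATE TABLE file_version (
    file_version_id INTEGER PRIMARY KEY,
    file_id         INTEGER NOT NULL REFERENCES file(file_id)
);
CREATE TABLE file_version_chunk (
    file_version_id INTEGER NOT NULL REFERENCES file_version(file_version_id),
    chunk_number    INTEGER NOT NULL,
    chunk_id        TEXT NOT NULL,
    PRIMARY KEY (file_version_id, chunk_number)
);
"""


def open_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def chunk_ids_for_version(conn, file_version_id):
    """Return the chunk ids of one file version in chunk order."""
    rows = conn.execute(
        "SELECT chunk_id FROM file_version_chunk "
        "WHERE file_version_id = ? ORDER BY chunk_number",
        (file_version_id,),
    )
    return [row[0] for row in rows]
```

Keeping each version's chunk list separate means reverting to an older version only changes which file_version row is current; no chunk data has to move.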


High-level design

The front end receives requests from clients to read and write files. To create a file, it sends a request to the chunk allocation service.

This service finds the chunk servers on which chunks for this file can be created and sends the corresponding allocation requests to the chunk server managers.

Each chunk server manager manages multiple chunk servers. It sends requests to create and delete local files to those chunk servers. For each chunk, it allocates two chunk instances on different chunk servers: one master (the leader) and one replica (the follower). It also receives heartbeats from its chunk servers. If a chunk server is unhealthy, the manager has to allocate replacement chunk instances on a different chunk server.
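The heartbeat and re-allocation behaviour of a chunk server manager can be sketched in memory as follows. This is an illustration under assumptions: the 10-second timeout, the least-loaded placement rule, and promoting the surviving instance to leader are all hypothetical choices, and copying the chunk bytes to the new server is not modelled.

```python
from dataclasses import dataclass, field

HEARTBEAT_TIMEOUT_S = 10.0  # hypothetical: silent for longer means unhealthy


@dataclass
class ChunkServerManager:
    last_heartbeat: dict = field(default_factory=dict)  # server -> last seen time
    placements: dict = field(default_factory=dict)      # chunk-id -> [leader, follower]

    def heartbeat(self, server, now):
        self.last_heartbeat[server] = now

    def healthy(self, now):
        """Servers whose last heartbeat is recent enough, in name order."""
        return sorted(s for s, seen in self.last_heartbeat.items()
                      if now - seen <= HEARTBEAT_TIMEOUT_S)

    def allocate(self, chunk_id, now):
        """Place a chunk on the two least-loaded healthy servers."""
        servers = self.healthy(now)
        if len(servers) < 2:
            raise RuntimeError("need two healthy chunk servers")
        load = {s: 0 for s in servers}
        for pair in self.placements.values():
            for s in pair:
                if s in load:
                    load[s] += 1
        pair = sorted(servers, key=lambda s: (load[s], s))[:2]
        self.placements[chunk_id] = pair
        return pair

    def repair(self, now):
        """Re-home chunks that lost exactly one instance; return what moved."""
        healthy = self.healthy(now)
        moved = {}
        for chunk_id, pair in self.placements.items():
            survivors = [s for s in pair if s in healthy]
            if len(survivors) != 1:
                continue  # both instances fine, or both gone (not recoverable here)
            spares = [s for s in healthy if s not in survivors]
            if spares:
                # The survivor becomes leader; a spare becomes the new follower.
                moved[chunk_id] = [survivors[0], spares[0]]
        self.placements.update(moved)
        return moved
```

With only two instances per chunk, losing both servers before `repair` runs loses the chunk, which is worth weighing against the durability requirement.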


The front end, the chunk allocation service, the chunk server manager and the chunk server all communicate via Kafka.

The chunk allocation service performs stateful allocation across chunk server managers and manages directory structure creation. It must checkpoint the state of these allocations back to Kafka so that it can recover them if it fails. It needs read/write access to the cache, which records where the chunks of each file live.

The chunk server manager is partitioned according to chunk server addresses, so that each chunk server is served by only one chunk server manager at a time. It must move chunks to new chunk servers when a chunk server fails. It needs read/write access to the cache.
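A simple way to get this one-manager-per-chunk-server property is to hash the chunk server address and take it modulo the number of managers. The sketch below assumes a fixed manager count and SHA-256 as the stable hash; both are illustrative choices, and the function names are hypothetical.

```python
import hashlib


def manager_for(chunk_server_address, manager_count):
    """Map a chunk server address to exactly one manager partition."""
    # Python's built-in hash() is salted per process, so use a stable digest.
    digest = hashlib.sha256(chunk_server_address.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % manager_count


def partition(addresses, manager_count):
    """Group chunk server addresses by the manager partition that owns them."""
    assignment = {i: [] for i in range(manager_count)}
    for address in sorted(addresses):
        assignment[manager_for(address, manager_count)].append(address)
    return assignment
```

Plain modulo hashing reassigns most chunk servers when the manager count changes; consistent hashing would limit that movement at the cost of more bookkeeping.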


Since the front end performs reads and writes directly against the chunk servers, it must consult the cache to find out where the chunks of a file live.
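The read path described here could look roughly like the sketch below. The cache is modelled as two plain dictionaries and each chunk server as an in-memory dictionary of chunk bytes; the key shapes, the leader-first read order, and the fallback to the follower are assumptions for illustration only.

```python
class ChunkUnavailable(Exception):
    """Raised when no instance of a chunk can be read."""


def read_file(file_key, chunk_index, chunk_locations, chunk_servers):
    """Reassemble a file by reading each chunk straight from a chunk server.

    chunk_index:     cache entry, file key -> ordered list of chunk ids
    chunk_locations: cache entry, chunk id -> (leader name, follower name)
    chunk_servers:   server name -> {chunk id: bytes}; None models an
                     unreachable server
    """
    parts = []
    for chunk_id in chunk_index[file_key]:
        for server in chunk_locations[chunk_id]:  # leader first, then follower
            store = chunk_servers.get(server)
            if store is not None and chunk_id in store:
                parts.append(store[chunk_id])
                break
        else:
            # Neither instance could serve the chunk.
            raise ChunkUnavailable(chunk_id)
    return b"".join(parts)
```

Reading from the follower when the leader is down keeps reads available, which fits the stated tolerance for eventual consistency.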



