System requirements


Functional:

User authentication and authorization

File management

Sharing and collaboration

Version control (file versioning)

Search across the file system

Notification of file events

Offline access

Security and encryption


More description of the system: The system acts like a network file system to the end user. Each user has access to a logical storage space, and access control is possible at the folder level. End/edge applications get notifications about modification events on shared files and folders. End users should be able to search across the files they own or have access to.


Non-Functional:

Highly available

Scalable

Secure

High performance



Capacity estimation

User count: 100 Million

Storage size per user: 3 GB on average

Writes per second: 5 per user

Reads per second: 10 per user

60% of all storage operations are reads

40% of all storage operations are writes
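
As a rough check, the figures above can be turned into aggregate numbers. This is a back-of-the-envelope sketch only: the decimal units (1 PB = 1,000,000 GB) and the replication factor of 3 are assumptions that the estimate does not state, and the per-user rates are taken as listed, so the totals are an upper bound that assumes every user is active at the same time.

```python
# Back-of-the-envelope capacity estimate using the figures listed above.
USERS = 100_000_000            # 100 million users
STORAGE_PER_USER_GB = 3        # average storage per user
WRITES_PER_SEC_PER_USER = 5
READS_PER_SEC_PER_USER = 10
REPLICATION_FACTOR = 3         # assumption: not specified in the estimate
GB_PER_PB = 1_000_000          # assumption: decimal units

raw_storage_pb = USERS * STORAGE_PER_USER_GB / GB_PER_PB
replicated_storage_pb = raw_storage_pb * REPLICATION_FACTOR
peak_writes_per_sec = USERS * WRITES_PER_SEC_PER_USER
peak_reads_per_sec = USERS * READS_PER_SEC_PER_USER
read_share = peak_reads_per_sec / (peak_reads_per_sec + peak_writes_per_sec)

print(f"Raw storage:        {raw_storage_pb:.0f} PB")
print(f"With replication:   {replicated_storage_pb:.0f} PB")
print(f"Peak writes/second: {peak_writes_per_sec:,}")
print(f"Peak reads/second:  {peak_reads_per_sec:,}")
print(f"Read share:         {read_share:.0%}")
```

With the per-user rates as listed, reads come to roughly two thirds of all operations, which is close to but not the same as the 60/40 split, so one of the two figures may need adjusting.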






API design

We need the following groups of APIs:


User management REST APIs: CRUD for users. Associate users with groups, groups with roles, and roles with permissions.


File management APIs: CRUD on files. These APIs are protected by role-based access control. Some of the important ones, such as Upload and Download, are highlighted below.


Upload API:

Method : HTTP POST

URL: v1/file/upload

Header: UserName: <>, access_token:<>,

Content-Type: multipart/form-data

Content-Disposition: form-data; filename=<>
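
The storage design later in this document compares the hash of the written file with a hash provided during upload. A minimal sketch of that check is shown here. SHA-256, the chunk size, and the function names are assumptions made for illustration; how the client sends its checksum (for example in a request header) is not specified by this design.

```python
import hashlib
import hmac
import io

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so large files are not held in memory


def sha256_of_stream(stream, chunk_size=CHUNK_SIZE):
    """Return the hex SHA-256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


def verify_upload(stored_stream, client_checksum):
    """Server side: compare the digest of the stored bytes with the client's value."""
    actual = sha256_of_stream(stored_stream)
    # compare_digest runs in constant time for equal-length inputs
    return hmac.compare_digest(actual, client_checksum.lower())


payload = b"hello, file store"
client_checksum = sha256_of_stream(io.BytesIO(payload))  # computed by the client

print(verify_upload(io.BytesIO(payload), client_checksum))         # True: upload intact
print(verify_upload(io.BytesIO(payload + b"!"), client_checksum))  # False: content differs
```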


Download API:

Method : HTTP GET

URL: v1/file/download/:filepath

Header: UserName: <>, access_token:<>

Response headers: Content-Type: application/octet-stream,

Content-Disposition: attachment; filename=<>




Search API:

Method : HTTP GET

URL: v1/file/search/:searchString

Header: UserName: <>, access_token:<>,

Content-Type: application/json

Response: {

"matches": ["files"]

}
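
A search must only return files the caller owns or has been given access to. The sketch below illustrates that rule with an in-memory list and a substring match. The `FileEntry` type, the `search_files` function, and the sample paths are hypothetical; a real deployment would more likely query a dedicated search index.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    path: str
    owner: str
    shared_with: set = field(default_factory=set)


def search_files(index, user, search_string):
    """Return paths matching search_string that the user owns or can access."""
    needle = search_string.lower()
    return sorted(
        entry.path
        for entry in index
        if (entry.owner == user or user in entry.shared_with)
        and needle in entry.path.lower()
    )


index = [
    FileEntry("/alice/reports/q1.pdf", owner="alice"),
    FileEntry("/alice/reports/q2.pdf", owner="alice", shared_with={"bob"}),
    FileEntry("/bob/notes/report-draft.txt", owner="bob"),
]

# bob sees his own file and the one alice shared with him, but not q1.pdf
response = {"matches": search_files(index, "bob", "report")}
print(response)
```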



Notification API:

For real-time file change updates, the client can initiate a WebSocket subscription. Changes to the parts of the file system the client is allowed to access will be sent over the WebSocket channel.
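
The fan-out behind such a subscription can be sketched without any network code. In the example below, an in-memory hub stands in for the WebSocket layer: each subscription is a folder plus a callback, and an event is delivered only to subscribers whose folder contains the changed path. `NotificationHub`, its methods, and the event shape are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


class NotificationHub:
    """In-memory stand-in for the WebSocket fan-out."""

    def __init__(self):
        self._subscriptions = defaultdict(list)  # folder prefix -> callbacks

    def subscribe(self, folder, callback):
        # Normalise to a trailing slash so "/a/b" does not match "/a/bc/..."
        self._subscriptions[folder.rstrip("/") + "/"].append(callback)

    def publish(self, event_type, path):
        """Deliver the event to every subscriber of a containing folder."""
        delivered = 0
        for folder, callbacks in self._subscriptions.items():
            if path.startswith(folder):
                for callback in callbacks:
                    callback({"event": event_type, "path": path})
                    delivered += 1
        return delivered


hub = NotificationHub()
inbox = []  # in a real client, the callback would write to the WebSocket
hub.subscribe("/shared/design", inbox.append)

hub.publish("modified", "/shared/design/spec.md")       # delivered
hub.publish("modified", "/shared/designs-old/spec.md")  # different folder, ignored
print(inbox)
```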




Database design


User and Permission management


An RDBMS will be used to store users and their associated group and permission information.
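
One possible relational layout for the user, group, role, and permission relationships is sketched below, using SQLite only so that the example runs on its own. The table names, column names, and sample permission strings are assumptions, not part of the design.

```python
import sqlite3

SCHEMA = """
CREATE TABLE app_user       (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE app_group      (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE app_role       (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE app_permission (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE user_group (
    user_id  INTEGER NOT NULL REFERENCES app_user(id),
    group_id INTEGER NOT NULL REFERENCES app_group(id),
    PRIMARY KEY (user_id, group_id));
CREATE TABLE group_role (
    group_id INTEGER NOT NULL REFERENCES app_group(id),
    role_id  INTEGER NOT NULL REFERENCES app_role(id),
    PRIMARY KEY (group_id, role_id));
CREATE TABLE role_permission (
    role_id       INTEGER NOT NULL REFERENCES app_role(id),
    permission_id INTEGER NOT NULL REFERENCES app_permission(id),
    PRIMARY KEY (role_id, permission_id));
"""


def permissions_for(conn, user_name):
    """Resolve user -> groups -> roles -> permissions with a single join."""
    rows = conn.execute(
        """
        SELECT DISTINCT p.name
        FROM app_user u
        JOIN user_group ug      ON ug.user_id = u.id
        JOIN group_role gr      ON gr.group_id = ug.group_id
        JOIN role_permission rp ON rp.role_id = gr.role_id
        JOIN app_permission p   ON p.id = rp.permission_id
        WHERE u.name = ?
        ORDER BY p.name
        """,
        (user_name,),
    ).fetchall()
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executescript("""
INSERT INTO app_user VALUES (1, 'alice'), (2, 'bob');
INSERT INTO app_group VALUES (1, 'engineering');
INSERT INTO app_role VALUES (1, 'editor');
INSERT INTO app_permission VALUES (1, 'file:read'), (2, 'file:write');
INSERT INTO user_group VALUES (1, 1);       -- alice is in engineering
INSERT INTO group_role VALUES (1, 1);       -- engineering has the editor role
INSERT INTO role_permission VALUES (1, 1), (1, 2);
""")

print(permissions_for(conn, "alice"))  # ['file:read', 'file:write']
print(permissions_for(conn, "bob"))    # [] because bob is in no group
```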



Files will be stored in a distributed file system that is sharded and replicated.




High-level design








Request flows







Detailed component design

Distributed storage detailed design:


The goal of this system is scalable, reliable, and highly available distributed file storage.


  1. The system will have multiple storage machines with specialised storage hardware (HDD/SSD) attached.
  2. A software layer/agent running on each storage machine pools that machine's storage to form a logical storage layer.
  3. The storage type of the system is file storage.
  4. All storage machines will be connected as a Raft cluster.
  5. Data will be distributed using consistent hashing: storage nodes are placed on the hash ring, and each user's hash determines which node owns that user's data.
  6. File content will be replicated across multiple storage machines to ensure high availability.
  7. A write request can arrive at any node and is routed to the shard that owns the requesting user's storage. The file is written, and its hash is compared with the hash provided during upload. The same content is then replicated to the followers, and a notification of the file activity is sent to the notification service.
  8. When a read request arrives, the distributed file storage routes it to the shard responsible for the user's storage, and a file stream is opened to the client.
  9. The storage system encrypts data on disk; the keys needed for encryption are fetched from the key management system.
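
The routing in steps 5 to 8 can be sketched as a small consistent-hash ring. This is an illustration under stated assumptions, not the design itself: the `HashRing` class, the use of SHA-256, the 64 virtual nodes per machine, the replica count of 3, and the node names are all hypothetical. The first node returned is treated as the owner of the user's storage and the remaining nodes as the followers that receive replicas.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each machine appears at many points so load spreads evenly.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]
        self._node_count = len(set(nodes))

    def nodes_for(self, user_id, replicas=3):
        """Return the owner node followed by the next distinct nodes clockwise."""
        if not self._ring:
            return []
        wanted = min(replicas, self._node_count)
        start = bisect.bisect_right(self._points, _hash(user_id))
        chosen = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == wanted:
                    break
        return chosen


ring = HashRing(["storage-1", "storage-2", "storage-3", "storage-4"])
placement = ring.nodes_for("user-42")
print(placement)  # first entry owns the user's storage, the rest hold replicas
```

Any node can run the same lookup, which is what allows a request that arrives at an arbitrary node to be forwarded to the owning shard. When a machine is added, only the users whose ring position now falls to the new machine change owner; the others stay where they are.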



Trade offs/Tech choices

Instead of building our own custom storage system, we could use a battle-tested storage solution such as S3 or Azure Disk.



Failure scenarios/bottlenecks

The key management service must be very secure and needs more detailed discussion: if a key is lost, the data encrypted with it cannot be recovered.



Future improvements


Disk encryption needs improvement. Cryptographic key management needs a detailed design.