System requirements


Functional:

  1. Authentication
  2. A user can upload files.
  3. A user can download a file.
  4. A user can modify the data.
  5. At any given time multiple people can update a folder.
  6. Sync between different users.
  7. Upload only the delta (the changed part) when a file is updated.
  8. Data storage.
  9. Versioning.



Non-Functional:

  1. Scalable.
  2. Reliable.
    1. Should not lose any update.
  3. Consistency.
    1. No data loss.
  4. User experience



Capacity estimation

Total users : 100 Million

Let's assume per-user data is : 200 Bytes

User space : 100 Million * 200 Bytes --> 20 Billion Bytes --> 20 GB

Storage size for a day (assuming 10 Million uploads per day at 5 MB each) :

10 Million * 5*1000*1000 Bytes --> 50,000 Billion Bytes --> 50 TB/Day

TPS : 10 Million uploads / 86,400 seconds per day --> ~116 TPS
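
The estimates can be sanity-checked with a few lines of arithmetic. This is only a sketch: the 10 Million uploads per day and the 5 MB average file size are assumptions read off the formula above, and decimal units are assumed (1 GB = 10^9 bytes).

```python
# Back-of-the-envelope capacity check (decimal units: 1 GB = 10**9 bytes).
TOTAL_USERS = 100_000_000          # 100 Million users
USER_RECORD_BYTES = 200            # per-user record size
DAILY_UPLOADS = 10_000_000         # assumption: 10 Million uploads per day
AVG_FILE_BYTES = 5 * 1000 * 1000   # assumption: 5 MB per upload
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400

user_space_gb = TOTAL_USERS * USER_RECORD_BYTES / 10**9
daily_storage_tb = DAILY_UPLOADS * AVG_FILE_BYTES / 10**12
upload_tps = DAILY_UPLOADS / SECONDS_PER_DAY

print(f"User space: {user_space_gb:.0f} GB")
print(f"Daily storage: {daily_storage_tb:.0f} TB/day")
print(f"Average upload rate: {upload_tps:.0f} TPS")
```

The TPS figure is an average over the day; peak traffic would be higher.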







API design

POST : /v1/upload --> body: file data as bytes

Response : {documentId: "xxxfstw7623672883"}

GET : /v1/documents?userId=''

Response --> {

docId,

createdAt,

lastUpdated,

lastUpdatedBy

}

PUT : /v1/update/{documentId} --> body: the updated content

POST : /v1/share/{documentId} --> Body {List: [userId]}
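
To make the shape of these endpoints concrete, here is a minimal in-memory sketch. The class `DocumentApi`, its method names, and the rule that shared users may list and update a document are all assumptions for illustration; a real service would sit behind an HTTP framework and persist to the database.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Document:
    doc_id: str
    owner_id: str
    content: bytes
    created_at: datetime
    last_updated: datetime
    last_updated_by: str
    shared_with: set = field(default_factory=set)


class DocumentApi:
    """Hypothetical in-memory stand-in for the four endpoints."""

    def __init__(self):
        self._docs = {}  # doc_id -> Document

    def upload(self, user_id, data):
        # POST /v1/upload
        now = datetime.now(timezone.utc)
        doc = Document(uuid.uuid4().hex, user_id, data, now, now, user_id)
        self._docs[doc.doc_id] = doc
        return {"documentId": doc.doc_id}

    def list_documents(self, user_id):
        # GET /v1/documents?userId=...
        # Assumption: the listing includes documents shared with the user.
        return [
            {
                "docId": d.doc_id,
                "createdAt": d.created_at.isoformat(),
                "lastUpdated": d.last_updated.isoformat(),
                "lastUpdatedBy": d.last_updated_by,
            }
            for d in self._docs.values()
            if d.owner_id == user_id or user_id in d.shared_with
        ]

    def update(self, user_id, doc_id, data):
        # PUT /v1/update/{documentId}
        doc = self._docs[doc_id]
        if user_id != doc.owner_id and user_id not in doc.shared_with:
            raise PermissionError("user may not update this document")
        doc.content = data
        doc.last_updated = datetime.now(timezone.utc)
        doc.last_updated_by = user_id

    def share(self, doc_id, user_ids):
        # POST /v1/share/{documentId}
        self._docs[doc_id].shared_with.update(user_ids)


api = DocumentApi()
doc_id = api.upload("alice", b"first draft")["documentId"]
api.share(doc_id, ["bob"])
api.update("bob", doc_id, b"second draft")
print(api.list_documents("bob"))
```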





Database design

User table ->{

userId: string(PK)

email : string(UK)

createdAt : timestamp

updatedAt : timestamp

}




userDocumentMeta ->{

docId : PK

contentAddress : --> location of the data

createdBy : userId --> indexed

createdAt : timestamp

}


DocumentVersion ->{

docId : PK (composite key together with version)

version : String

comment : String

createdAt : timestamp

updatedBy : userId

documentAddress : blockId

}



Because updates are allowed, block storage is recommended.

ContentTable ->{

blockId : PK

content : String

offset :

nextBlock : --> blockId of the next block

}
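
As a rough illustration of the block table, the sketch below splits content into fixed-size blocks chained through `nextBlock` and reads them back. The helper names `store_blocks` and `read_blocks`, the tiny block size, and the reading of `offset` as the block's start position within the document are assumptions, not part of the design above.

```python
import uuid

BLOCK_SIZE = 4  # assumption: tiny value for illustration only


def store_blocks(table, content, block_size=BLOCK_SIZE):
    """Split content into linked blocks; return the first blockId (None if empty)."""
    chunks = [content[i:i + block_size] for i in range(0, len(content), block_size)]
    next_id = None
    # Build from the tail so each row can point at its successor.
    for index in range(len(chunks) - 1, -1, -1):
        block_id = uuid.uuid4().hex
        table[block_id] = {
            "content": chunks[index],
            "offset": index * block_size,  # assumed: start position in the document
            "nextBlock": next_id,
        }
        next_id = block_id
    return next_id


def read_blocks(table, first_block_id):
    """Follow the nextBlock chain and reassemble the document content."""
    parts = []
    block_id = first_block_id
    while block_id is not None:
        row = table[block_id]
        parts.append(row["content"])
        block_id = row["nextBlock"]
    return "".join(parts)


content_table = {}
head = store_blocks(content_table, "hello world")
print(len(content_table), read_blocks(content_table, head))
```

With this layout an update only needs to rewrite the blocks that changed and relink the chain, rather than rewriting the whole document.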



High-level design


Component ->

Client.


DropBoxServer

DB

BroadCastService : queue-based notification (clients connect to it and receive the latest metadata updates)
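
A queue-based notification component could look roughly like the following. This is a single-process sketch using Python's standard `queue` module; the class and method names (`connect`, `disconnect`, `publish`) are hypothetical, and a production system would more likely use a message broker.

```python
import queue


class BroadcastService:
    """Hypothetical in-process fan-out: one queue per connected client."""

    def __init__(self):
        self._queues = {}  # client_id -> queue.Queue

    def connect(self, client_id):
        q = queue.Queue()
        self._queues[client_id] = q
        return q

    def disconnect(self, client_id):
        self._queues.pop(client_id, None)

    def publish(self, event, exclude=None):
        """Put the event on every connected client's queue except `exclude`."""
        delivered = 0
        for client_id, q in self._queues.items():
            if client_id == exclude:
                continue  # the client that made the change already has it
            q.put(event)
            delivered += 1
        return delivered


service = BroadcastService()
inbox_a = service.connect("client-a")
inbox_b = service.connect("client-b")
service.publish({"docId": "doc1", "version": 2}, exclude="client-a")
print(inbox_b.get_nowait())
```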



Client :

  1. It maintains data locally and stores it on the device.
  2. Once the client is online, it connects to the backend and pushes its changes.


Server

  1. The client logs in to Dropbox.
  2. After authentication, the user uploads a document.
  3. The client connects to the server and uploads the document.
  4. Once the server has received the whole document, it starts processing.
  5. When an update arrives, the server creates a new version and updates the data.
  6. The server pushes the update to each client.






Request flows

Client APK/Client side:

  1. Open any folder; it is fetched from the server if not available on the client.
  2. Update the document/folder.
  3. Save the changes locally.
  4. Publish the changes to the backend.


BackEnd:

Upload/update flow

  1. Receive changes from client1 for doc1.
  2. Record a new version of the data in the metadata table.
  3. Update the data in the document.
  4. Broadcast the change to all connected clients.
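
The upload/update steps above can be sketched as follows. `VersionStore`, `apply_update`, and the `notify` callback are hypothetical names; the version is kept as an integer counter for simplicity (the schema lists it as a String), and the content is assumed to have been written to storage already, with `document_address` pointing at it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DocumentVersion:
    doc_id: str
    version: int
    updated_by: str
    created_at: datetime
    document_address: str


class VersionStore:
    """Hypothetical in-memory version history, one list per document."""

    def __init__(self):
        self._versions = {}  # doc_id -> list of DocumentVersion, oldest first

    def apply_update(self, doc_id, user_id, document_address, notify):
        history = self._versions.setdefault(doc_id, [])
        record = DocumentVersion(
            doc_id=doc_id,
            version=len(history) + 1,
            updated_by=user_id,
            created_at=datetime.now(timezone.utc),
            document_address=document_address,
        )
        history.append(record)  # record the new version in metadata
        notify({"docId": doc_id, "version": record.version})  # broadcast step
        return record

    def latest(self, doc_id):
        return self._versions[doc_id][-1]


events = []
store = VersionStore()
store.apply_update("doc1", "client1", "block-a", events.append)
store.apply_update("doc1", "client2", "block-b", events.append)
print(store.latest("doc1").version, events)
```

Keeping every version record, rather than overwriting, is what allows older versions to be restored later.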


Download flow:

  1. The user authenticates.
  2. Query all documents from the table by userId.
  3. Send all docIds to the client.
  4. The user chooses a docId.
  5. The client calls the download API with that docId.




Detailed component design

How does the client maintain data locally?

How is updated data stored in the backend?

How is data synced from one client to another?

How is the delta stored at the backend?


How does the client maintain data locally?

  1. The client connects to the server and downloads the list of all documents for the user.
  2. The user can pick any document and start updating it.
  3. Changes are saved locally and periodically pushed to the backend.
    1. There are two approaches. The first is to push each change as soon as it is made. Problem --> scalability: every minor change must be stored at the backend and broadcast to every user.
    2. The second is to push changes periodically.
      1. Advantage : scalable, and the push interval can be tuned to keep cost (DB writes) down.
      2. Disadvantage : not real time, which matters if someone needs changes immediately.
  4. Fetch updates to a document version from the server and resolve conflicts.
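
The periodic-push option can be sketched as a small client-side buffer. `LocalChangeBuffer`, the `push` callback, and the 30-second default interval are assumptions for illustration; the clock is injectable so the behaviour can be exercised without waiting.

```python
import time


class LocalChangeBuffer:
    """Hypothetical client-side buffer that batches edits between pushes."""

    def __init__(self, push, interval_seconds=30.0, clock=time.monotonic):
        self._push = push                  # callable that sends a batch to the backend
        self._interval = interval_seconds  # tunable push interval
        self._clock = clock
        self._pending = {}                 # doc_id -> latest local content
        self._last_flush = clock()

    def record_change(self, doc_id, content):
        # Later edits to the same document overwrite earlier ones,
        # so a burst of small edits costs a single push.
        self._pending[doc_id] = content
        self.flush_if_due()

    def flush_if_due(self):
        if self._pending and self._clock() - self._last_flush >= self._interval:
            self.flush()
            return True
        return False

    def flush(self):
        if not self._pending:
            return
        self._push(dict(self._pending))  # if this raises, changes stay pending
        self._pending.clear()
        self._last_flush = self._clock()


pushed = []
buffer = LocalChangeBuffer(pushed.append, interval_seconds=0.0)
buffer.record_change("doc1", "hello")
print(pushed)
```

Setting the interval to zero degenerates into the push-immediately approach, which makes the trade-off between the two options a single tunable.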

How is updated data stored in the backend?

How is data synced from one client to another?

  1. The client authenticates.
  2. The backend receives the changes from one client.
  3. It creates a new version of the data and writes it to the datastore.
  4. It also detects what changed and updates the content of the document.
  5. It broadcasts the changes to the notification servers.
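
One possible way to detect what changed is to compare per-block hashes of the old and new content. The sketch below assumes fixed-size blocks and SHA-256 from Python's `hashlib`; the function names are hypothetical and the notes above do not prescribe this technique.

```python
import hashlib


def block_hashes(content, block_size):
    """Hash each fixed-size block of the content (bytes)."""
    return [
        hashlib.sha256(content[i:i + block_size]).hexdigest()
        for i in range(0, len(content), block_size)
    ]


def changed_blocks(old, new, block_size=4):
    """Return indexes of blocks in `new` that differ from `old` or are new.

    Blocks removed by truncation are not reported; the caller can compare
    block counts to detect that case.
    """
    old_hashes = block_hashes(old, block_size)
    new_hashes = block_hashes(new, block_size)
    return [
        index
        for index, digest in enumerate(new_hashes)
        if index >= len(old_hashes) or digest != old_hashes[index]
    ]


print(changed_blocks(b"aaaabbbbcccc", b"aaaaBBBBcccc"))
```

A known limitation of fixed-size blocks is that inserting bytes near the start shifts every later block, so all of them are reported as changed.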


How is the delta stored at the backend?



Trade offs/Tech choices

DB : For user and metadata storage : Relational DB --> PostgreSQL/MySQL

For data content storage : NoSQL --> DynamoDB/Cassandra (chosen for their availability)



Trade Offs

Client pull vs. client push for syncing data with the backend.







Failure scenarios/bottlenecks



bottlenecks:

  1. Notifying all clients of changes. This approach works only at small scale.





Future improvements

Scale the notification of changes to all clients; the current approach works only at small scale.