System requirements


Functional:

  1. Authentication (role- and permission-based).
  2. A user can upload files.
  3. A user can download a file.
  4. A user can modify the data.
  5. Multiple people can update a folder at the same time.
  6. Sync between different users.
  7. Upload only the delta (the changed part) when a file is updated.
  8. Data storage.
  9. Versioning.
  10. Edge cases:
    1. Document is only partially updated.
    2. Document upload fails.
    3. Client is offline.




Non-Functional:

  1. Scalable.
  2. Reliable.
    1. Should not lose any update.
  3. Consistent.
    1. No data loss.
  4. Good user experience.



Capacity estimation

Total users : 100 Million


Let's assume each user record is 200 Bytes (this table holds only user metadata).

User space : 100 Million * 200 Bytes --> 20 Billion Bytes --> 20 GB

Storage size for a day (assuming 10 Million uploads per day at 5 MB each) :

10 Million * 5*1000*1000 Bytes --> 50,000 Billion Bytes --> 50 TB/Day

TPS : 10 Million / 86,400 seconds per day --> ~116 TPS (roughly 100 TPS)



For data storage

For 5 years : 50 TB * 365 * 5 --> 91,250 TB --> ~100 PB

Let's assume master-slave replication with 3 slaves.

So replica storage : 3 * 100 PB = 300 PB (400 PB if the master copy is also counted)
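
As a sanity check, the estimates can be recomputed from the stated inputs. This is only a sketch: the constant names are made up here, the 10 Million uploads per day and 5 MB per upload are read off the daily-storage formula, and decimal units (1 GB = 10^9 bytes) are assumed.

```python
# Inputs taken from the estimate; the names are illustrative only.
USERS = 100_000_000            # total users
USER_ROW_BYTES = 200           # metadata per user
UPLOADS_PER_DAY = 10_000_000   # assumed meaning of "10Million" in the formula
AVG_UPLOAD_BYTES = 5_000_000   # 5 MB per upload
SECONDS_PER_DAY = 86_400
YEARS = 5
SLAVES = 3                     # replica copies, not counting the master

user_meta_gb = USERS * USER_ROW_BYTES / 1e9            # user table size in GB
daily_tb = UPLOADS_PER_DAY * AVG_UPLOAD_BYTES / 1e12   # new data per day in TB
upload_tps = UPLOADS_PER_DAY / SECONDS_PER_DAY         # average uploads per second
five_year_pb = daily_tb * 365 * YEARS / 1000           # raw data after 5 years in PB
replica_pb = five_year_pb * SLAVES                     # storage held by the slaves in PB

print(f"user metadata: {user_meta_gb:.0f} GB")
print(f"daily uploads: {daily_tb:.0f} TB/day, about {upload_tps:.0f} TPS")
print(f"5 years: {five_year_pb:.2f} PB raw, {replica_pb:.2f} PB on replicas")
```

Without rounding, five years of uploads come to 91.25 PB, so the round 100 PB figure leaves a little headroom.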






API design

Post : /v1/upload --> data as bytes

Response will be {documentId:"xxxfstw7623672883"}

Get : /v1/documents?userId=''

Response --> {

docId,

createdAt,

lastUpdated,

lastUpdatedBy

}

Put : /v1/update/{documentId} --> body will contain the updated content

Post : /v1/share/{documentId} Body {List:{userId}}
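
The request and response shapes of the four endpoints can be sketched with an in-memory stand-in. This is not a real HTTP server. The class name `DocumentApi`, the `user_id` arguments (assumed to come from authentication) and the response of `share` are hypothetical, since the notes do not specify them.

```python
import uuid
from datetime import datetime, timezone


def _now():
    return datetime.now(timezone.utc).isoformat()


class DocumentApi:
    """In-memory stand-in for the four endpoints (illustrative only)."""

    def __init__(self):
        self.docs = {}    # documentId -> metadata and content
        self.shares = {}  # documentId -> set of userIds with access

    def upload(self, user_id, data):
        """POST /v1/upload with the data as bytes."""
        doc_id = uuid.uuid4().hex
        now = _now()
        self.docs[doc_id] = {
            "docId": doc_id, "content": data, "createdAt": now,
            "lastUpdated": now, "lastUpdatedBy": user_id,
        }
        self.shares[doc_id] = {user_id}
        return {"documentId": doc_id}

    def list_documents(self, user_id):
        """GET /v1/documents?userId=... returns metadata only, no content."""
        fields = ("docId", "createdAt", "lastUpdated", "lastUpdatedBy")
        return [
            {key: doc[key] for key in fields}
            for doc_id, doc in self.docs.items()
            if user_id in self.shares[doc_id]
        ]

    def update(self, document_id, user_id, data):
        """PUT /v1/update/{documentId} with the new content in the body."""
        if user_id not in self.shares.get(document_id, ()):
            raise PermissionError("user has no access to this document")
        doc = self.docs[document_id]
        doc["content"] = data
        doc["lastUpdated"] = _now()
        doc["lastUpdatedBy"] = user_id
        return {"documentId": document_id}

    def share(self, document_id, user_ids):
        """POST /v1/share/{documentId} with a list of userIds in the body."""
        self.shares[document_id].update(user_ids)
        return {"documentId": document_id,
                "sharedWith": sorted(self.shares[document_id])}
```

A document shows up in another user's listing only after `share` has been called for that user, which is where the role and permission checks would hook in.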





Database design


User table ->{

userId: string(PK)

email : string(UK)

createdAt:

updatedAt

}




userDocumentMeta ->{

docId : PK

contentAddress: --> Location of data

createdby : userId; --> Index

createdAt :

}


DocumentVersion{

id : PK

docId : FK --> parent table userDocumentMeta

version : String

comment : String

createdAt : timestamp

updatedBy : userId

documentAddress: blockId

}



Because updates are allowed, block storage is recommended.

ContentTable {

id : PK

blockId : FK --> parent table DocumentVersion

content: String

offset:

nextBlock:


}


Foreign keys and indexes :

userDocumentMeta.createdby --> FK to User.userId

userDocumentMeta --> index on createdby (the userId)
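
The relational part of the schema can be tried out with an in-memory SQLite database. This is a sketch under assumptions: the notes suggest Postgres/MySQL for production, the column types and the table name `users` are guesses, and the content blocks are left out because the notes place them in a separate store.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    userId    TEXT PRIMARY KEY,
    email     TEXT NOT NULL UNIQUE,
    createdAt TEXT,
    updatedAt TEXT
);
CREATE TABLE userDocumentMeta (
    docId          TEXT PRIMARY KEY,
    contentAddress TEXT,                                   -- location of the data
    createdby      TEXT NOT NULL REFERENCES users(userId),
    createdAt      TEXT
);
CREATE INDEX idx_meta_createdby ON userDocumentMeta(createdby);
CREATE TABLE documentVersion (
    id              INTEGER PRIMARY KEY,
    docId           TEXT NOT NULL REFERENCES userDocumentMeta(docId),
    version         TEXT NOT NULL,
    comment         TEXT,
    createdAt       TEXT,
    updatedBy       TEXT REFERENCES users(userId),
    documentAddress TEXT                                   -- first blockId of this version
);
"""


def open_db():
    """Create an in-memory database with foreign key checks switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    return conn


conn = open_db()
conn.execute("INSERT INTO users (userId, email) VALUES ('u1', 'a@example.com')")
conn.execute("INSERT INTO userDocumentMeta (docId, createdby) VALUES ('d1', 'u1')")
# The "list documents for a user" query is served by the createdby index.
print(conn.execute(
    "SELECT docId FROM userDocumentMeta WHERE createdby = ?", ("u1",)
).fetchall())
```

Inserting a document whose `createdby` does not exist in `users` is rejected with `sqlite3.IntegrityError`, which is the behaviour the foreign key is meant to give.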



High-level design


Components ->

Client

LB (load balancer)

DropBoxServer

DB

BroadCastService : Queue-based notification (clients connect to it to receive the latest metadata updates)



Client :

  1. It maintains the data locally and stores changes on the device.
  2. Once the client is online, it connects to the backend and pushes the changes.
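
The offline behaviour of the client can be sketched as an outbox that keeps local changes until the backend accepts them. The class name `Outbox` and the convention that `send` raises `ConnectionError` while offline are assumptions made for this example.

```python
from collections import deque


class Outbox:
    """Holds local changes until the backend has accepted them."""

    def __init__(self, send):
        self.send = send        # callable(change); raises ConnectionError on failure
        self.pending = deque()  # changes not yet accepted, oldest first

    def record(self, change):
        """Save a change locally; nothing is sent yet."""
        self.pending.append(change)

    def drain(self):
        """Push pending changes in order and stop at the first failure.

        Returns the number of changes that were sent successfully.
        """
        sent = 0
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                break               # still offline: keep the change for later
            self.pending.popleft()  # drop it only after a successful send
            sent += 1
        return sent
```

Calling `drain()` whenever connectivity returns gives the "push once online" behaviour, and because a change is removed only after a successful send, a failed upload does not lose it.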


Server

  1. The client logs in to DropBox.
  2. After authentication, the user uploads a document.
  3. The client connects to the server and uploads the document.
  4. Once the server has received the whole document, it starts processing it.
  5. When an update arrives, the server creates a new version and updates the data.
  6. The server pushes the update to each client.


How it will scale for 1 Million users :

Because the DropBoxServer services run behind the LB, they can be scaled horizontally on demand.

To scale the database, we can use sharding on the basis of userId and a multi-master model.
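
Routing a request to a shard by userId can be sketched as a hash followed by a modulo. The function name `shard_for` and the choice of SHA-256 are assumptions; the notes only say that sharding is done on userId.

```python
import hashlib


def shard_for(user_id: str, shard_count: int) -> int:
    """Map a userId to a shard number in the range [0, shard_count)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # A stable hash is used instead of Python's built-in hash(), which is
    # randomised per process and would route the same user differently.
    return int.from_bytes(digest[:8], "big") % shard_count


for uid in ("user-1", "user-2", "user-3"):
    print(uid, "-> shard", shard_for(uid, 8))
```

Plain modulo routing moves most users to a different shard when `shard_count` changes; consistent hashing is a common alternative if shards are expected to be added often.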

To identify which blocks changed, we can use a tree-based approach that is traversed to find the offsets where changes occurred.
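
One possible reading of the tree approach is a hash tree (Merkle tree) over fixed-size blocks, where only subtrees whose hashes differ are visited. The notes do not name the tree, so this is an assumption, as are the 4 KB block size and the function names.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size in bytes
MISSING = b""      # placeholder "hash" for a block that one version lacks


def leaf_hashes(data, count):
    """Hash each block and pad with MISSING up to `count` leaves."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    hashes = [hashlib.sha256(block).digest() for block in blocks]
    return hashes + [MISSING] * (count - len(hashes))


def build_levels(leaves):
    """Return the tree as a list of levels: leaves first, root last."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([hashlib.sha256(b"".join(prev[i:i + 2])).digest()
                       for i in range(0, len(prev), 2)])
    return levels


def changed_offsets(old, new):
    """Byte offsets of the blocks that differ between two versions."""
    count = max(-(-len(old) // BLOCK_SIZE), -(-len(new) // BLOCK_SIZE))
    if count == 0:
        return []
    a = build_levels(leaf_hashes(old, count))
    b = build_levels(leaf_hashes(new, count))
    changed = []
    stack = [(len(a) - 1, 0)]  # (level, index), starting at the root
    while stack:
        level, idx = stack.pop()
        if a[level][idx] == b[level][idx]:
            continue  # identical subtree: none of its blocks changed
        if level == 0:
            changed.append(idx * BLOCK_SIZE)
            continue
        for child in (2 * idx, 2 * idx + 1):
            if child < len(a[level - 1]):
                stack.append((level - 1, child))
    return sorted(changed)
```

For two versions that differ in a single block, the traversal follows one path from the root to that leaf instead of comparing every block hash.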





Request flows

Client APK/Client side:

  1. Open any folder; it is fetched from the server if it is not available on the client.
  2. Update the document/folder.
  3. Save the changes locally.
  4. Publish the changes to the backend.


BackEnd:

Upload/update flow

  1. Receive the changes from client1 for doc1.
  2. Record the new version in the metadata table.
  3. Update the data in the document.
  4. Broadcast the change to all connected clients.


Download flow:

  1. The user authenticates.
  2. Query all documents from the table on the basis of userId.
  3. Publish all docIds to the client.
  4. The user chooses one docId.
  5. The client calls the download API with that docId.




Detailed component design

How does the client maintain the data locally?

How is updated data stored in the backend?

How is data synced from one client to another?

How is the delta stored at the backend?


How does the client maintain the data locally?

  1. The client connects to the server and downloads the list of all docs for a user.
  2. The user can pick any document and start updating it.
  3. Changes are saved locally and periodically sent to the backend.
    1. We have two approaches. Either we push changes immediately, as soon as they are made. Problem --> scalability: every minor change has to be stored at the backend and broadcast to every user.
    2. Or we push changes periodically.
      1. Advantage : Scalable, and the push interval can be tuned for cost-effectiveness (fewer DB writes).
      2. Disadvantage : Not real time, which is a problem if someone needs changes in real time.
  4. Fetch the latest version of the document and resolve conflicts.
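
The periodic push option can be sketched as a small buffer that coalesces edits and sends them as one batch once an interval has passed. The class name `PeriodicPusher`, the 30-second default and the `(doc_id, offset)` key are assumptions made for illustration.

```python
import time


class PeriodicPusher:
    """Buffers local edits and pushes them as one batch per interval."""

    def __init__(self, push, interval_seconds=30.0, clock=time.monotonic):
        self.push = push                  # callable(batch_dict) that sends to the backend
        self.interval = interval_seconds  # tunable: longer means fewer backend writes
        self.clock = clock                # injectable so the timing can be tested
        self.dirty = {}                   # (doc_id, offset) -> latest bytes at that offset
        self.last_push = clock()

    def record(self, doc_id, offset, data):
        """Save an edit locally; a later edit at the same offset replaces it."""
        self.dirty[(doc_id, offset)] = data
        self.maybe_push()

    def maybe_push(self):
        """Send the batch if the interval has elapsed; return True if it was sent."""
        if self.dirty and self.clock() - self.last_push >= self.interval:
            batch, self.dirty = self.dirty, {}
            self.push(batch)
            self.last_push = self.clock()
            return True
        return False
```

Several edits to the same region inside one interval reach the backend as a single write, which is the cost saving of this option; the price is that other users see the change up to one interval late. A real client would also call `maybe_push` from a timer and keep the batch until the backend acknowledges it.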

How is updated data stored in the backend?

How is data synced from one client to another?

  1. The client authenticates.
  2. The backend receives the changes from one client.
  3. It creates a new version of the data and updates the datastore.
  4. It also detects the changes and updates the content of the document.
  5. It broadcasts all changes to the notification servers.

How is the delta stored at the backend?

The data is split into small chunks (assume 4 KB each), and these chunks are stored as a linked list.
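
A minimal sketch of the chunked layout, with field names modelled on the content table (id, content, offset, nextBlock). The helper names and the use of a dict as the block store are assumptions; a real store would key blocks by blockId in the database.

```python
from dataclasses import dataclass
from typing import Dict, Optional

CHUNK_SIZE = 4 * 1024  # 4 KB, the chunk size assumed in the text


@dataclass
class Block:
    id: int
    content: bytes
    offset: int                # byte offset of this chunk in the document
    next_block: Optional[int]  # id of the following chunk, None for the last one


def to_blocks(data: bytes) -> Dict[int, Block]:
    """Split data into chunks linked by next_block (ids are 0, 1, 2, ...)."""
    count = (len(data) + CHUNK_SIZE - 1) // CHUNK_SIZE
    blocks = {}
    for n in range(count):
        start = n * CHUNK_SIZE
        nxt = n + 1 if n + 1 < count else None
        blocks[n] = Block(n, data[start:start + CHUNK_SIZE], start, nxt)
    return blocks


def read_all(blocks: Dict[int, Block], head: int = 0) -> bytes:
    """Follow next_block pointers from the head and join the contents."""
    out = bytearray()
    current = head if blocks else None
    while current is not None:
        out += blocks[current].content
        current = blocks[current].next_block
    return bytes(out)


def apply_delta(blocks: Dict[int, Block], block_id: int, new_content: bytes) -> None:
    """Store a delta by rewriting one chunk; all other chunks stay untouched."""
    blocks[block_id].content = new_content
```

With this layout an edit inside one 4 KB region rewrites a single block, so the amount of data sent and stored is proportional to the change and not to the file size.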



Trade offs/Tech choices

DB : For user and metadata storage : Relational DB --> Postgres/MySQL

For data content storage : NoSQL --> DynamoDB/Cassandra (chosen for availability)



Trade Offs

Client pull vs. push for syncing data with the backend.







Failure scenarios/bottlenecks

What happens if the client is not able to upload the changes (or the client is offline)?


Bottlenecks:

  1. Notifying all clients of changes. This approach works at small scale.


Data consistency : As data is updated asynchronously, we can pick strong consistency.



Future improvements

Improve how changes are notified to all clients; the current approach works at small scale.