My Solution for Design Google Doc with Score: 8/10

by kraken_pinnacle338

System requirements


Functional:

+ users can create new docs

+ users can update docs

+ multiple users can edit the doc live (limit hundreds)

+ changes should be saved automatically


Non-Functional:

+ latency should be as low as possible

+ consistency is critical

+ authentication


Capacity estimation

+ we should expect billions of documents

+ the number of users who can edit each doc concurrently is limited to hundreds, because that's roughly the number of live connections a single server can hold for one document

+ docs are expected to be big (hundreds of megabytes)


API design

/api/docs/{docId} [GET]

response: { document_metadata }

/api/docs/{docId} [PUT]

req: { document_metadata, actions, author }

response: { timestamp }

/api/docs [POST]

req: { doc_id, doc_name, author }

response: { document_metadata }

/api/docs/{docId}/users [PUT]

req: { allowed_users }
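To make the payloads above concrete, here is a minimal sketch of the request/response shapes as Python dataclasses. All field names are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass, field

# Hypothetical payload shapes for the endpoints above.

@dataclass
class DocumentMetadata:
    doc_id: str
    doc_name: str
    author: str
    allowed_users: list[str] = field(default_factory=list)

@dataclass
class Action:
    action_id: str
    doc_id: str
    payload: str          # e.g. "added 2 rows", "changed the font"
    created_time: float   # server-side timestamp, set on receipt

@dataclass
class PutDocRequest:
    document_metadata: DocumentMetadata
    actions: list[Action]
    author: str

meta = DocumentMetadata(doc_id="d1", doc_name="notes", author="alice")
req = PutDocRequest(document_metadata=meta, actions=[], author="alice")
```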


Database design

Table Documents

DocumentId

Metadata


Table Actions

ActionId

DocumentId

Action (added 2 rows, changed the font, etc...)

CreatedTime


we can partition by document id (and, within a document, by action id). Some docs will be more popular than others; we can mitigate those hotspots by adding a cache in front of the DB.


Actions are immutable: once an action is created, it can only be deleted (by the pipeline) or undone by a future action, but never modified.
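As a sketch of the two tables, here they are in SQLite for illustration; a production system would use MySQL or Spanner as discussed below, partitioned by DocumentId.

```python
import sqlite3

# In-memory SQLite stand-in for the Documents and Actions tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Documents (
    DocumentId TEXT PRIMARY KEY,
    Metadata   TEXT              -- e.g. JSON blob: name, author, allowed users
);
CREATE TABLE Actions (
    ActionId    TEXT PRIMARY KEY,
    DocumentId  TEXT REFERENCES Documents(DocumentId),
    Action      TEXT,            -- e.g. 'added 2 rows', 'changed the font'
    CreatedTime REAL             -- server timestamp; rows are never updated
);
""")
conn.execute("INSERT INTO Documents VALUES ('doc1', '{}')")
conn.execute("INSERT INTO Actions VALUES ('a1', 'doc1', 'added 2 rows', 1.0)")
rows = conn.execute(
    "SELECT Action FROM Actions WHERE DocumentId = 'doc1' ORDER BY CreatedTime"
).fetchall()
```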


High-level design

At a high level, users open a doc and start a persistent connection. We create a broadcast channel keyed by doc id; every time a change is made to the doc, we broadcast it to everyone currently editing.

Once a doc is created and a user starts making changes, the system generates an action for every change. Each change is sent to the server, which broadcasts it to all connected users and simultaneously persists it in the DB. The DB is just another subscriber.


If a connection is lost, the user fetches the document from the DB. A document is reconstructed from all the actions that led to it; in the background, a service replays the actions and materializes the doc.


An alternative would be to always send the full doc, but that means a much larger payload; small diffs over an open connection are faster.
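The "DB is just another subscriber" idea can be sketched with a toy in-process broadcaster keyed by doc id; in the real design this sits behind a WebSocket server, and the names here are made up.

```python
from collections import defaultdict

class Broadcaster:
    """Toy pub/sub channel: one topic per doc_id."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # doc_id -> list of callbacks

    def subscribe(self, doc_id, callback):
        self.subscribers[doc_id].append(callback)

    def publish(self, doc_id, action):
        # Fan the action out to every subscriber of this doc.
        for cb in self.subscribers[doc_id]:
            cb(action)

received, persisted = [], []
bus = Broadcaster()
bus.subscribe("doc1", received.append)   # a connected editor
bus.subscribe("doc1", persisted.append)  # the DB writer: just another subscriber
bus.publish("doc1", "insert 'h' at 0")
```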



Request flows

1.- User creates doc

2.- User makes changes to the doc, and a queue is created for broadcasting them

3.- All subscribers of that queue receive the change; by default, the DB is one of the subscribers.

4.- User leaves doc.

5.- Eventually, a pipeline will reconstruct the doc in the DB using all the actions
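Step 5's reconstruction can be sketched as a replay of actions in timestamp order. The action format here (insert/delete with a position) is a hypothetical simplification; real actions would be richer.

```python
def reconstruct(actions):
    """Rebuild a document by replaying actions ordered by server timestamp."""
    doc = ""
    for a in sorted(actions, key=lambda a: a["ts"]):
        if a["op"] == "insert":
            doc = doc[:a["pos"]] + a["text"] + doc[a["pos"]:]
        elif a["op"] == "delete":
            doc = doc[:a["pos"]] + doc[a["pos"] + a["len"]:]
    return doc

actions = [
    {"op": "insert", "pos": 0, "text": "helo", "ts": 1},
    {"op": "insert", "pos": 3, "text": "l", "ts": 2},
]
result = reconstruct(actions)
```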






Detailed component design

The DB needs to be relational, since we need strong consistency and the data (documents and their actions) is hierarchical. I would choose either MySQL or Spanner because I've used them.


For the pipeline, I would choose Apache Beam or Apache Spark.


Users connect through a WebSocket, which allows bidirectional data flow.

When users upload rich content (images/video), we don't send the full file over the socket; we just send a link to the content.


For the load-balancing algorithm, we need to route all users of a doc to the same server, so a hashing function based on the doc id is ideal. If that server already has too many connections, we choose another (available) server instead. We store the doc-to-server mapping (probably in Redis); when a user requests a connection, we query Redis and redirect them to that server if it's available.
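The routing logic just described might look like the following sketch. The plain dicts stand in for the Redis mapping, and the server names and capacity limit are made-up assumptions.

```python
import hashlib

SERVERS = ["ws-1", "ws-2", "ws-3"]
MAX_CONNECTIONS = 100                      # illustrative per-server cap
connection_counts = {s: 0 for s in SERVERS}
doc_to_server = {}                         # stands in for the Redis mapping

def route(doc_id: str) -> str:
    if doc_id in doc_to_server:
        # Existing doc: keep all its editors on the same server.
        server = doc_to_server[doc_id]
    else:
        # Hash the doc id onto a preferred server...
        idx = int(hashlib.sha256(doc_id.encode()).hexdigest(), 16) % len(SERVERS)
        server = SERVERS[idx]
        # ...but fall back to the least-loaded server if it is full.
        if connection_counts[server] >= MAX_CONNECTIONS:
            server = min(SERVERS, key=connection_counts.get)
        doc_to_server[doc_id] = server
    connection_counts[server] += 1
    return server

first = route("doc-123")
second = route("doc-123")   # same doc -> same server
```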


Trade offs/Tech choices

I chose to increase complexity when it comes to building documents in exchange for lower latency: instead of sending the whole document over the wire, I only send small actions (diffs).



Failure scenarios/bottlenecks

what happens when more users than the limit (hundreds) want to edit the doc?

we should display a warning letting them know they can't edit the doc right now because too many users already are.


how do we handle docs that have thousands of viewers?

we serve them the version of the doc stored in the DB, without the latest live changes.


what happens when the server managing the connections dies?

in this case, users will get a notification in the UI saying their connection dropped, and the client will automatically try to reconnect them to a new server.


Future improvements

making sure that the doc and the server hosting its pub/sub channel are colocated with the users starting the connections. If most of the users are in a region like the west coast, we should provision a server on the west coast.


ensure that the reconstruction pipeline has a smart triggering strategy beyond being purely periodic. For example, when a doc accumulates over 1k pending changes, we want the pipeline to run. We could have an additional watchdog that fires more often and queries the number of visitors and pending changes per doc to decide when to trigger the pipeline.
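A hypothetical trigger check for that watchdog: run the pipeline when either enough changes have accumulated or a periodic deadline has passed. The thresholds are illustrative assumptions.

```python
PENDING_ACTIONS_THRESHOLD = 1000   # e.g. the "over 1k changes" case
MAX_INTERVAL_SECONDS = 3600        # periodic fallback, once an hour

def should_run_pipeline(pending_actions: int, seconds_since_last_run: float) -> bool:
    """Trigger on backlog size OR elapsed time, whichever comes first."""
    return (pending_actions >= PENDING_ACTIONS_THRESHOLD
            or seconds_since_last_run >= MAX_INTERVAL_SECONDS)
```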


how do we handle conflict resolution? Imagine two users make conflicting changes (one adds a letter, and the other removes it).

Unlike git, we can't ask the user to resolve the conflict manually, and we can't rely on client timestamps to determine who wrote first. Instead, we should trust server timestamps: when a user makes a change, the server replies with a timestamp, and the client reconstructs the document using those timestamps.
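Server-side timestamping can be sketched as follows: the server stamps each incoming action with a monotonically increasing value and echoes it back, so every client replays actions in the same order regardless of client clocks. The function and field names are assumptions for illustration.

```python
import itertools

_clock = itertools.count(1)  # server-side monotonic counter

def accept_action(log, action):
    """Server: stamp the action, append it to the log, return the timestamp."""
    stamped = {**action, "ts": next(_clock)}
    log.append(stamped)
    return stamped["ts"]

log = []
# Two conflicting edits arrive; server arrival order, not client clocks, decides.
accept_action(log, {"op": "insert", "pos": 0, "text": "a", "client": "u1"})
accept_action(log, {"op": "delete", "pos": 0, "len": 1, "client": "u2"})
ordered = sorted(log, key=lambda a: a["ts"])
```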