Design Pastebin - System Design

Requirements

Functional Requirements:

Create and share paste
- the system should be create paste for user and share that paste via a unique url
Expiration of paste
- allow the users to set the expiration of paste, which, after exceeding assigned value the paste will be deleted
Unique URL/ID generation
- the user should be able to robustly generate unique id for each url coupling with collision prevention mechanism
Paste Retrieval
- users should be able to retrieval paste using unique url
User anonymity
- Allow users to create paste without needing to create an account to preserve simplicity.

Non-Functional Requirements:

Low response time
- < 100 ms for saving text
- < 100 ms for retrieve text
Durability
- the data must be retained until it expires
High availability
- 99.99 % uptime
Scalable
- able to handle traffic spike
Security
- data rest: user text is encrypted with standard algorithm such as RSA
- data transit: implement TLS for the website
Consistent of user experience
- the paste must be available to user immediately after its creation.

Capacity estimation

DB:

Assuming the Size of text on average =50kb per paste

Assuming 100,000 pastes per day * 50 kb =5GB per day of storage

5GB *365 = 1.825TB of storage for text files in our data base.

Traffic:

Writes:

(100,000/(24*360))=1.16 pastes per second

1.16* 50 kb, roughly 60 kb per second for writes.

Reads:

Assuming 1:1000 read to write ratio:

60 kb*1000= 60 MB

Cache:

1.825TB * 0.2= 256 gb

API Design

Create paste:

POST /{version}/paste

body: { text: str,

ttl: Optional[datetime]

}

response: { state: ENUM,

url_id: str

}

Get paste:

GET /version/paste/{url_id}

reponse: { status: ENUM,

paste_text: str

}

High-Level Design

ID generation:

the id will be alphanumeric string with length = 8

using secure random algorithm

Data storage:

database used will be noSQL db (Key Value Database to be specific) as we just need to map each paste text (value) to the unique_url (id)

Request flows

Same as above high level design

Detailed Component Design

Gen UID Service:

For paste id, we will use secure random to generate alpha numeric string whose length equal to 8 (1/62^8 probability for collision). In case of collision happen we will random string again.

Data consistency:

as we will use multiple database instance for high availability, we will need to ensure that the data is sync between every instance. In this case, we will apply raft algorithm to manage multiple instance of database. We will elect one of the instance to be a leader which will handle all the write operation and the rest will become follower that will try to sync the data with the leader. In case of leader has become unhealthy, we will select number of follower as candidates of new leaders.

Paste expiration:

As we allow the user to set expiration condition for every paste we need efficient way to invalidate a balk of paste. For time condition we just need to map expiration_date with the id. If there is request after expiration timestamp we can easily know that this paste is invalid and delete it. In addition we can have the cron job that will periodically check if at the current time are there any paste that is expired if so we can delete them all for reducing storage cost.

Cache:

we can use cache write through to keep the data in cache sync with the data base. Also we need test and set Time To Live for data in cache so that after the reasonable time the item will be clear from cache saving space. In case the data is deleted in the database we also need to set up trigger to make sure that the data is also deleted from cache.
Algorithm
- LRU: least recently use cache is one of the more popular cache algorithm thanks to its simplicity. Basically, it just have predefine size and if cache surpasses this limit it will remove the least recently used item from cache to make room for new item. This algorithm is one of built in algorithm that Redis has, however, the usage pattern isn't entirely suitable for this algorithm as it doesn't account for the how frequent it is used.
- LFU: least frequently use cache. As the name suggest it is like an extended version of LRU as it also take the number of item usage into account. So, it is more suitable for our usecase.

Scalability:

We will deploy stateless service on container orchestration service such as AWS ECS. In this case we can scale up/down our container on a whim. We can also set up condition for scaling such as when the request amount of certain service exceed our predefine value. They will add more instance to serve the upcoming load. Also we can adopt fault tolerant architecture by using spot instance to save the cost of infrasturcutre.

Failure scenarios/bottlenecks

Running out of hashes for our paste url
Data base goes down
Database gets large
Too many requests

Future improvements

We can use a larger hashing function to create more rows
Utilize a write ahead log to restore
We can horizantally scale with each node serving a range of hash values, we are also running jobs to clean our db after expiration.
If they are reads we can add more servers, if they are writes we need to scale our db and add more servers