Design Pastebin - System Design

System requirements

Functional:

Users should be able to paste text in different formats
Users should be able to view the pasted text by hitting a unique URL
Pastes should have an expiration policy based on date/view count
Track analytics like views and clicks
Users should be able to search for pastes

Non-Functional:

System should be scalable
Low latency when retrieving a paste

Capacity estimation

1 million DAU

load: 100 million redirects (reads) and 1 million writes per day, i.e. roughly 11 writes/sec and 1,100 reads/sec

storage: 5 MB avg file size × 1 million writes/day = 5 TB/day; over 2 years (730 days) that is about 3,650 TB of DB storage

bandwidth: about 5.5 GB/s of read traffic (roughly 1,100 reads/sec × 5 MB)

resources: assuming 10 ms of CPU time per request, about 1,111 requests/sec need 11,110 ms of CPU time every second, so 12 cores are needed
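
A quick way to sanity-check these estimates is a short script. This is only a sketch: it uses an exact 86,400-second day and takes 2 years as 730 days (both assumptions), so the results differ a little from the rounded figures above. All variable names are made up for illustration.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

reads_per_day = 100_000_000      # redirects (views) per day
writes_per_day = 1_000_000       # new pastes per day
avg_paste_mb = 5                 # average paste size in MB
retention_days = 2 * 365         # assumption: 2 years = 730 days
cpu_ms_per_request = 10          # assumed CPU time per request

writes_per_sec = writes_per_day / SECONDS_PER_DAY
reads_per_sec = reads_per_day / SECONDS_PER_DAY

# 1 TB = 1,000,000 MB (decimal units)
storage_tb_per_day = writes_per_day * avg_paste_mb / 1_000_000
storage_tb_total = storage_tb_per_day * retention_days

read_bandwidth_mb_s = reads_per_sec * avg_paste_mb

# CPU-milliseconds consumed per wall-clock second, divided by 1000 ms per core
cores = math.ceil(reads_per_sec * cpu_ms_per_request / 1000)

print(f"writes/sec: {writes_per_sec:.1f}, reads/sec: {reads_per_sec:.1f}")
print(f"storage: {storage_tb_per_day} TB/day, {storage_tb_total} TB total")
print(f"read bandwidth: {read_bandwidth_mb_s:.0f} MB/s, cores: {cores}")
```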

API design

createPaste(apiKey, metaData, pasteXML)

redirectPaste(apiKey, url)

searchPaste(apiKey, keyWords)
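
To make the three calls concrete, here is a minimal in-memory sketch. It is not the real service: the class name, the placeholder URL scheme, the return types and the naive substring search are all assumptions for illustration, and the API key is accepted but never validated.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class InMemoryPasteApi:
    # url -> (meta_data, paste_xml); stands in for the real database
    pastes: Dict[str, Tuple[dict, str]] = field(default_factory=dict)

    def create_paste(self, api_key: str, meta_data: dict, paste_xml: str) -> str:
        # Placeholder URL scheme (hypothetical), just a running counter.
        url = f"paste{len(self.pastes) + 1}"
        self.pastes[url] = (meta_data, paste_xml)
        return url

    def redirect_paste(self, api_key: str, url: str) -> Optional[str]:
        # Returns the paste body, or None when the URL is unknown.
        entry = self.pastes.get(url)
        return entry[1] if entry is not None else None

    def search_paste(self, api_key: str, key_words: List[str]) -> List[str]:
        # Naive scan: a paste matches when it contains every keyword.
        words = [w.lower() for w in key_words]
        return [
            url
            for url, (_, body) in self.pastes.items()
            if all(w in body.lower() for w in words)
        ]
```

Usage would look like `url = api.create_paste(key, {"file_type": "xml"}, "<paste>hi</paste>")` followed by `api.redirect_paste(key, url)`.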

Database design

About 3,650 TB of storage for 2 years (5 TB/day × 730 days)

Read heavy (about 100 reads for every write)

Cassandra for storing analytics:

Analytics Table:

paste_id_pk

url_clicks_nbr

views_nbr

User Table:

user_id_pk

name

MetaData:

metaData_id_txt

file_type_txt

file_size

We can use MongoDB to store our pastes

Paste schema:

File contents

expiry date
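
The expiration requirement (date or view count) could be checked against this schema roughly as follows. This is a sketch under assumptions: `max_views` and `views` are hypothetical fields that are not in the schema above (view counts live in the analytics table), and the current time is passed in explicitly to keep the check easy to test.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Paste:
    contents: str
    expiry_date: Optional[datetime] = None  # None means no date-based expiry
    max_views: Optional[int] = None         # hypothetical: None means unlimited views
    views: int = 0                          # hypothetical: copy of the analytics counter

    def is_expired(self, now: datetime) -> bool:
        # Expired once the expiry date has passed...
        if self.expiry_date is not None and now >= self.expiry_date:
            return True
        # ...or once the view budget has been used up.
        if self.max_views is not None and self.views >= self.max_views:
            return True
        return False


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
one_day_paste = Paste("hello", expiry_date=now + timedelta(days=1))
print(one_day_paste.is_expired(now))                      # False
print(one_day_paste.is_expired(now + timedelta(days=2)))  # True
```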

High-level design

flowchart TD

B[client] --> C{LB}

C --> D[Create service]

C --> H[Redirect service]

C --> I[Search service]

D --> E[(Database)]

E --> F[(Replica 1)]

E --> G[(Replica 2)]

H --> J[(CDN)]

J --> F

I --> J

Request flows

Creating a paste goes through the create service, which saves it to the DB and the replica DBs.

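URLs can be created by hashing the creation timestamp and paste title. This makes collisions unlikely but does not strictly guarantee uniqueness, so the create service should check that the key is unused before saving.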
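
One possible way to do this, as a sketch: hash the timestamp and title with SHA-256, keep the first few URL-safe characters, and retry with a suffix if the key is already taken. The function names, the 8-character length and the `existing_keys` set (standing in for a DB lookup) are assumptions.

```python
import base64
import hashlib


def make_paste_key(title: str, created_at_ms: int, length: int = 8) -> str:
    """Derive a short URL key from the creation timestamp and title."""
    digest = hashlib.sha256(f"{created_at_ms}:{title}".encode("utf-8")).digest()
    # URL-safe base64 uses only A-Z, a-z, 0-9, '-' and '_'
    return base64.urlsafe_b64encode(digest).decode("ascii")[:length]


def unique_paste_key(title: str, created_at_ms: int, existing_keys: set,
                     length: int = 8) -> str:
    """Retry with a numeric suffix until the key is not already in use."""
    attempt = 0
    while True:
        salted = f"{title}#{attempt}" if attempt else title
        key = make_paste_key(salted, created_at_ms, length)
        if key not in existing_keys:
            existing_keys.add(key)
            return key
        attempt += 1
```

Truncating the hash keeps URLs short but raises the collision chance, which is why the second function checks before returning.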

Redirecting from a URL hits the redirect service, which checks the CDN for the file. On a miss, it pulls the paste from the replica DB.
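
The hit/miss flow can be sketched as a small read-through cache. The class name and the `load_from_replica` callback are hypothetical stand-ins for the CDN and the replica DB; a real CDN would handle this at the edge rather than in application code.

```python
from typing import Callable, Dict, Optional


class ReadThroughCache:
    def __init__(self, load_from_replica: Callable[[str], Optional[str]]):
        self._load = load_from_replica   # called only on a cache miss
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> Optional[str]:
        if url in self._cache:
            self.hits += 1
            return self._cache[url]
        # Miss: fall back to the replica and remember the result.
        self.misses += 1
        paste = self._load(url)
        if paste is not None:
            self._cache[url] = paste
        return paste


replica = {"abc123": "hello world"}      # pretend replica DB
cache = ReadThroughCache(replica.get)
cache.get("abc123")                      # miss, loaded from the replica
cache.get("abc123")                      # hit, served from the cache
print(cache.hits, cache.misses)          # 1 1
```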

Search hits the search service, which uses Elasticsearch to search the cache for keywords.

Detailed component design

Our DB for file storage will use MongoDB, and we can use Cassandra to save analytics, metadata, etc.

Our DB can be sharded based on file type and region.
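
A minimal sketch of routing a paste to a shard from its file type and region, assuming a fixed shard count and a hash of the combined key. The function name and the choice of SHA-256 are illustrative assumptions, not part of the design above.

```python
import hashlib


def shard_for(file_type: str, region: str, num_shards: int) -> int:
    """Map a (region, file type) pair to a shard index in [0, num_shards)."""
    key = f"{region}:{file_type}".lower().encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


shard = shard_for("text", "us-east", 16)
print(0 <= shard < 16)   # True
```

Since there are only a few distinct file types and regions, many pastes will share a shard under this key, which is worth keeping in mind when picking the shard count.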

Our CDN will have edge servers in multiple regions and will use a refresh-after policy so that expired pastes are removed from the cache.
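
As a sketch of what such a policy could look like, the toy cache below drops an entry once it is older than a fixed time-to-live. Reading "refresh after" as a TTL is an assumption, the class name is hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
from typing import Dict, Optional, Tuple


class TtlCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._entries: Dict[str, Tuple[str, float]] = {}  # key -> (value, stored_at)

    def put(self, key: str, value: str, now: float) -> None:
        self._entries[key] = (value, now)

    def get(self, key: str, now: float) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if now - stored_at >= self.ttl:
            # Too old: evict so the next request refetches from the origin.
            del self._entries[key]
            return None
        return value


edge = TtlCache(ttl_seconds=60)
edge.put("abc123", "hello", now=0)
print(edge.get("abc123", now=30))   # hello
print(edge.get("abc123", now=61))   # None
```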

Our load balancers can use a combination of least connections and round robin for simplicity.
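
One way to combine the two, assuming round robin is only used to break ties between equally loaded servers: pick the server with the fewest active connections, scanning from a rotating start position. The class and method names are hypothetical.

```python
from typing import Dict, List


class LeastConnectionsBalancer:
    def __init__(self, servers: List[str]):
        self.servers = list(servers)
        self.active: Dict[str, int] = {s: 0 for s in self.servers}
        self._next = 0  # round-robin pointer used for tie-breaking

    def acquire(self) -> str:
        n = len(self.servers)
        # Scan from the round-robin pointer so ties rotate between servers;
        # min() returns the first server with the lowest count in this order.
        order = [self.servers[(self._next + i) % n] for i in range(n)]
        chosen = min(order, key=lambda s: self.active[s])
        self.active[chosen] += 1
        self._next = (self.servers.index(chosen) + 1) % n
        return chosen

    def release(self, server: str) -> None:
        # Call when a connection to `server` closes.
        self.active[server] -= 1


lb = LeastConnectionsBalancer(["a", "b", "c"])
print([lb.acquire() for _ in range(4)])   # ['a', 'b', 'c', 'a']
lb.release("b")
print(lb.acquire())                       # b
```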

Our search service can use Elasticsearch to find matches based on keywords.

Trade offs/Tech choices

We use Cassandra for analytics because that workload is write heavy (a counter update on every view), and a SQL DB would generally be slower at sustained writes.

We use database replicas, which adds write latency because each write has to be copied to the replicas as well as the primary.

Our CDN will speed up reads by caching files on edge servers.

Failure scenarios/bottlenecks

Our LB can fail, so we need multiple instances of it just in case.

Our services can fail, but since they are stateless we can easily add more instances.

Our CDN can fail, so we need multiple edge servers in case of a failure.

The DB may fail, so we use multiple replicas to keep the data available.

Our replicas may fail, so we can use Kubernetes to bring them back.

Future improvements

We can improve our searching algorithms in the future to allow more searching options.

How we create urls for each paste can also be updated in the future.

Adding data centers in different regions will speed things up in cache miss scenarios.

We can update our LB algorithm in the future if need be.