System requirements


Functional:

  • Users should be able to paste text in different formats
  • Users should be able to view the pasted text by hitting a unique URL
  • Pastes should have an expiration policy based on date/view count
  • Track analytics like views and clicks
  • Users should be able to search for pastes


Non-Functional:

  • System should scale horizontally as read and write traffic grows
  • Low latency when retrieving a paste



Capacity estimation

1 million DAU


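load: 100 million redirects (reads) and 1 million writes a day, or roughly 11 writes/sec and 1100 reads/sec

storage: 5MB avg file size x 1 million writes = 5TB/day; 2 years (730 days) of DB storage = 3650TB

bandwidth: (1100 reads/sec + 11 writes/sec) x 5MB = 5555MB/s

resources: 10ms of CPU time per request x 1111 requests/sec = 11110ms of CPU time per second, so 12 cores are needed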
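
The arithmetic behind these estimates can be reproduced with a short script. This is only a sketch: the variable names are made up for illustration, and it uses unrounded per-second rates, so the bandwidth figure comes out slightly higher than the rounded numbers above and the two-year storage total comes to about 3650TB.

```python
import math

SECONDS_PER_DAY = 86_400

reads_per_day = 100_000_000   # redirects per day
writes_per_day = 1_000_000    # new pastes per day
avg_paste_mb = 5              # average paste size in MB
retention_days = 2 * 365      # two years of storage
cpu_ms_per_request = 10       # CPU time spent per request

reads_per_sec = reads_per_day / SECONDS_PER_DAY     # about 1157
writes_per_sec = writes_per_day / SECONDS_PER_DAY   # about 11.6
requests_per_sec = reads_per_sec + writes_per_sec

# Storage grows only with writes (decimal units: 1 TB = 1,000,000 MB).
storage_tb_per_day = writes_per_day * avg_paste_mb / 1_000_000
storage_tb_total = storage_tb_per_day * retention_days

# Bandwidth and CPU are driven by reads and writes together.
bandwidth_mb_per_sec = requests_per_sec * avg_paste_mb
cpu_ms_per_sec = requests_per_sec * cpu_ms_per_request
cores_needed = math.ceil(cpu_ms_per_sec / 1000)  # one core supplies 1000 ms of CPU per second

print(f"{reads_per_sec:.0f} reads/sec, {writes_per_sec:.1f} writes/sec")
print(f"{storage_tb_per_day:.0f} TB/day, {storage_tb_total:.0f} TB over two years")
print(f"{bandwidth_mb_per_sec:.0f} MB/s, {cores_needed} cores")
```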



API design

createPaste(apiKey, metaData, pasteXML)


redirectPaste(apiKey, url)


searchPaste(apiKey, keyWords)
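
To make the three calls concrete, here is a toy in-memory sketch. The class name, the `/p/<n>` URL format, the key check, and the substring-based search are all assumptions made for illustration; the real services would sit behind the load balancer and use the databases described below.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class InMemoryPasteApi:
    """Toy stand-in for createPaste, redirectPaste and searchPaste."""
    valid_keys: Set[str]
    pastes: Dict[str, str] = field(default_factory=dict)
    next_id: int = 0

    def _check_key(self, api_key: str) -> None:
        if api_key not in self.valid_keys:
            raise PermissionError("unknown apiKey")

    def create_paste(self, api_key: str, meta_data: dict, paste_xml: str) -> str:
        """Store the paste and return its URL (metadata is ignored in this sketch)."""
        self._check_key(api_key)
        self.next_id += 1
        url = f"/p/{self.next_id}"  # hypothetical URL format
        self.pastes[url] = paste_xml
        return url

    def redirect_paste(self, api_key: str, url: str) -> str:
        """Return the paste contents stored under the URL."""
        self._check_key(api_key)
        return self.pastes[url]

    def search_paste(self, api_key: str, key_words: List[str]) -> List[str]:
        """Return URLs of pastes that contain every keyword (case-insensitive)."""
        self._check_key(api_key)
        return [
            url
            for url, body in self.pastes.items()
            if all(word.lower() in body.lower() for word in key_words)
        ]


api = InMemoryPasteApi(valid_keys={"demo-key"})
url = api.create_paste("demo-key", {"file_type": "xml"}, "<paste>hello world</paste>")
print(url, api.search_paste("demo-key", ["hello"]))
```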


Database design

3650TB for 2 years storage (5TB/day x 730 days)

read heavy (about 100 reads per write)


Cassandra for storing analytics:


Analytics Table:

paste_id_pk

url_clicks_nbr

views_nbr


User Table:

user_id_pk

name


MetaData:

metaData_id_txt

file_type_txt

file_size


We can use MongoDB to store our pastes

Paste schema:

File contents

expiry date
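
The paste schema and the date/view-count expiration requirement could be modeled roughly as follows. This is a sketch under assumptions: the `paste_id` and `max_views` fields and the `is_expired` helper are not part of the schema above and are introduced only for illustration, and the view count is assumed to come from the analytics table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Paste:
    paste_id: str                           # assumed key linking to the analytics table
    contents: str                           # file contents
    expires_at: Optional[datetime] = None   # expiry date; None means no date limit
    max_views: Optional[int] = None         # assumed field for view-count expiry


def is_expired(paste: Paste, views: int, now: datetime) -> bool:
    """A paste expires once its expiry date passes or its view limit is reached."""
    if paste.expires_at is not None and now >= paste.expires_at:
        return True
    if paste.max_views is not None and views >= paste.max_views:
        return True
    return False


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
paste = Paste("abc123", "hello", expires_at=created + timedelta(days=7), max_views=100)
print(is_expired(paste, views=5, now=created + timedelta(days=1)))
```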



High-level design

flowchart TD

    B[client] --> C{LB}

    C --> D[Create service]

    C --> H[Redirect service]

    C --> I[Search service]

    D --> E[(Database)]

    E --> F[(Replica 1)]

    E --> G[(Replica 2)]

    H --> J[(CDN)]

    J --> F

    I --> J






Request flows

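Creating a paste goes through the create service, which saves it to the primary DB; the write is then replicated to the replica DBs.

URLs can be created by hashing the creation timestamp and paste title. A hash makes collisions unlikely but cannot rule them out, so the create service should check for an existing key before saving to guarantee uniqueness.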
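
One possible way to derive a URL key from the timestamp and title is sketched below. The choice of SHA-256, the 8-character base-62 encoding, and the retry counter are assumptions made for this example; the set `taken` stands in for a lookup against the database.

```python
import hashlib
import string

ALPHABET = string.digits + string.ascii_letters  # 62 URL-safe characters


def make_key(title: str, created_at_ms: int, attempt: int = 0, length: int = 8) -> str:
    """Hash timestamp + title (+ retry counter) and encode part of the digest in base 62."""
    raw = f"{created_at_ms}:{title}:{attempt}".encode("utf-8")
    number = int.from_bytes(hashlib.sha256(raw).digest()[:8], "big")
    chars = []
    for _ in range(length):
        number, remainder = divmod(number, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    return "".join(chars)


def allocate_key(title: str, created_at_ms: int, taken: set) -> str:
    """Retry with a new attempt number until the key is not already in use."""
    attempt = 0
    while True:
        key = make_key(title, created_at_ms, attempt)
        if key not in taken:
            taken.add(key)
            return key
        attempt += 1


taken_keys = set()
first = allocate_key("my notes", 1_700_000_000_000, taken_keys)
second = allocate_key("my notes", 1_700_000_000_000, taken_keys)  # same inputs, forced retry
print(first, second)
```

The retry loop is what actually provides uniqueness; the hash only keeps retries rare.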

Redirecting from a URL hits the redirect service, which checks for the file in the CDN.

On a miss, it pulls the file from the replica DB.
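
The read path can be sketched as a cache-aside lookup. This is a simplification: the CDN is modeled as a plain dictionary and the replica as a function, both hypothetical stand-ins, and the sketch assumes the service repopulates the CDN after a miss.

```python
from typing import Callable, Dict, Optional


class RedirectService:
    def __init__(self, cdn: Dict[str, str], read_replica: Callable[[str], Optional[str]]):
        self.cdn = cdn                    # stand-in for the CDN edge cache
        self.read_replica = read_replica  # stand-in for a query against the replica DB

    def get(self, url: str) -> Optional[str]:
        cached = self.cdn.get(url)
        if cached is not None:
            return cached                 # cache hit: the DB is not touched
        contents = self.read_replica(url)
        if contents is not None:
            self.cdn[url] = contents      # fill the cache for later requests
        return contents


replica_rows = {"/p/1": "hello"}
replica_reads = []


def read_from_replica(url: str) -> Optional[str]:
    replica_reads.append(url)             # record each DB read for the demo
    return replica_rows.get(url)


service = RedirectService(cdn={}, read_replica=read_from_replica)
print(service.get("/p/1"), service.get("/p/1"), replica_reads)
```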

Search hits the search service, which uses Elasticsearch to search the cache for keywords.




Detailed component design

Our DB for file storage will use MongoDB, and we can use Cassandra to save analytics, metadata, etc.

Our DB can be sharded based on file type and region.
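
A shard could be picked from file type and region with a stable hash, as in the sketch below. The function name, the key format, and the fixed shard count are assumptions for illustration. Python's built-in `hash` is avoided because it is randomized per process for strings.

```python
import hashlib


def shard_for(file_type: str, region: str, num_shards: int) -> int:
    """Map a (region, file type) pair to a shard index that is stable across processes."""
    key = f"{region}|{file_type}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


print(shard_for("json", "us-east", 16), shard_for("xml", "eu-west", 16))
```

With this key, every paste of the same type in the same region lands on the same shard, so a popular combination can become a hot shard; adding the paste ID to the key would be one way to spread it out.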

Our CDN will have edge servers in multiple regions and a refresh-after policy to remove expired pastes.
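
A refresh-after policy on an edge server can be sketched as a cache whose entries are dropped once they are older than a fixed age. The class name and the explicit `now` parameter (used so the example is deterministic) are assumptions; a real CDN would configure this through its own TTL settings.

```python
from typing import Dict, Optional, Tuple


class EdgeCache:
    def __init__(self, refresh_after_s: float):
        self.refresh_after_s = refresh_after_s
        self._entries: Dict[str, Tuple[str, float]] = {}  # url -> (contents, stored_at)

    def put(self, url: str, contents: str, now: float) -> None:
        self._entries[url] = (contents, now)

    def get(self, url: str, now: float) -> Optional[str]:
        entry = self._entries.get(url)
        if entry is None:
            return None
        contents, stored_at = entry
        if now - stored_at >= self.refresh_after_s:
            del self._entries[url]  # too old: evict so the next read goes to the origin
            return None
        return contents


cache = EdgeCache(refresh_after_s=60)
cache.put("/p/1", "hello", now=0)
print(cache.get("/p/1", now=30), cache.get("/p/1", now=60))
```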

Our load balancers can use a combination of least-connections and round-robin balancing for simplicity.
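
One reading of "least connections + round robin" is to pick the server with the fewest active connections and break ties in round-robin order. The sketch below assumes that interpretation; the class and method names are hypothetical.

```python
from typing import Dict, List


class LeastConnectionsBalancer:
    def __init__(self, servers: List[str]):
        self.servers = list(servers)
        self.active: Dict[str, int] = {s: 0 for s in self.servers}
        self._next = 0  # round-robin pointer, used only to break ties

    def acquire(self) -> str:
        """Pick the least-loaded server and count one more connection on it."""
        n = len(self.servers)
        # Scan from the round-robin pointer; min() keeps the first of equal candidates.
        order = [self.servers[(self._next + i) % n] for i in range(n)]
        chosen = min(order, key=lambda s: self.active[s])
        self.active[chosen] += 1
        self._next = (self.servers.index(chosen) + 1) % n
        return chosen

    def release(self, server: str) -> None:
        """Call when a connection to the server closes."""
        self.active[server] -= 1


lb = LeastConnectionsBalancer(["a", "b", "c"])
picks = [lb.acquire() for _ in range(3)]  # all idle, so ties rotate through the servers
lb.release("b")
after_release = lb.acquire()              # "b" now has the fewest connections
print(picks, after_release)
```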

Our search service can use Elasticsearch to find matches based on keywords.



Trade offs/Tech choices

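We use Cassandra for analytics instead of a SQL DB because every view and click is a write, and Cassandra is optimized for high write throughput, whereas a SQL DB would typically be slower under that write load.

We use database replicas to spread reads and protect against data loss; the trade-off is higher write latency, since each write also has to reach the 2 replicas.

Our CDN will speed up reads by caching files on edge servers close to users.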



Failure scenarios/bottlenecks

Our LB can fail, so we need multiple instances of it just in case.

Our services can fail, but since they are stateless we can easily add more instances.

Our CDN can fail, so we need multiple edge servers in case of a failure.

The DB may fail, so we use multiple replicas to keep the data available and durable.

Our replicas may fail, so we can use Kubernetes to bring them back.




Future improvements

We can improve our search algorithms in the future to allow more search options.

How we create URLs for each paste can also be updated in the future.

Adding data centers in different regions will reduce latency in cache miss scenarios.

We can update our LB algorithm in the future if need be.