System requirements
Functional:
- Users should be able to paste text in different formats
- Users should be able to view the pasted text by hitting a unique URL
- Pastes should have an expiration policy based on date/view count
- Track analytics like views and clicks
- Users should be able to search for pastes
Non-Functional:
- The system should scale horizontally to handle the estimated load (about 1,100 reads/sec)
- Low latency when retrieving a paste
Capacity estimation
1 million DAU
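load: 100 million redirects (reads) and 1 million writes a day, or roughly 1,100 reads/sec and 11 writes/sec
storage: 5MB avg file size x 1 million writes = 5TB/day; 2 years (730 days) of DB storage = about 3,650TB
bandwidth: about 1,111 requests/sec x 5MB = 5,555MB/s
resources: 10ms of CPU time per request x 1,111 requests/sec = 11,110ms of CPU time per second, so 12 cores are needed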
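The estimates above can be reproduced with a few lines of arithmetic. The sketch below is only a sanity check: the function name `estimate` and its parameter names are made up for illustration, and it uses an exact 86,400-second day, so the per-second figures come out slightly higher than the rounded ones in the notes.

```python
import math

SECONDS_PER_DAY = 86_400


def estimate(reads_per_day=100_000_000, writes_per_day=1_000_000,
             avg_paste_mb=5, cpu_ms_per_request=10, retention_days=730):
    """Back-of-the-envelope capacity numbers (inputs taken from the notes)."""
    reads_per_sec = reads_per_day / SECONDS_PER_DAY
    writes_per_sec = writes_per_day / SECONDS_PER_DAY
    requests_per_sec = reads_per_sec + writes_per_sec

    # Decimal units: 1 TB = 1,000,000 MB.
    storage_tb_per_day = writes_per_day * avg_paste_mb / 1_000_000
    storage_tb_total = storage_tb_per_day * retention_days

    # Every request is assumed to move one average-sized paste.
    bandwidth_mb_per_sec = requests_per_sec * avg_paste_mb

    # CPU-milliseconds consumed per wall-clock second, divided by 1000 ms per core.
    cores = math.ceil(requests_per_sec * cpu_ms_per_request / 1000)

    return {
        "reads_per_sec": reads_per_sec,
        "writes_per_sec": writes_per_sec,
        "storage_tb_per_day": storage_tb_per_day,
        "storage_tb_total": storage_tb_total,
        "bandwidth_mb_per_sec": bandwidth_mb_per_sec,
        "cores": cores,
    }
```

Calling `estimate()` with the defaults gives 5TB/day, 3,650TB over two years, and 12 cores, which matches the hand calculation.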
API design
createPaste(apiKey, metaData, pasteXML)
redirectPaste(apiKey, url)
searchPaste(apiKey, keyWords)
Database design
about 3,650TB of storage over 2 years (5TB/day x 730 days)
read heavy
Cassandra for storing analytics:
Analytics Table:
paste_id_pk
url_clicks_nbr
views_nbr
User Table:
user_id_pk
name
MetaData:
metaData_id_txt
file_type_txt
file_size
We can use MongoDB to store our pastes
Paste schema:
File contents
expiry date
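As a rough illustration of how a paste record could carry the date/view-count expiration policy from the requirements, here is a minimal sketch. The field names (`expires_at`, `max_views`, `views`) and the `is_expired` helper are assumptions for illustration, not part of the schema above, which only lists file contents and an expiry date.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Paste:
    paste_id: str
    contents: str
    expires_at: Optional[datetime] = None  # date-based expiry (None = never)
    max_views: Optional[int] = None        # view-count expiry (hypothetical field)
    views: int = 0

    def is_expired(self, now: datetime) -> bool:
        """A paste expires when either limit is reached."""
        if self.expires_at is not None and now >= self.expires_at:
            return True
        if self.max_views is not None and self.views >= self.max_views:
            return True
        return False


# Usage: a paste that expires after one day or three views, whichever comes first.
created = datetime(2024, 1, 1, tzinfo=timezone.utc)
paste = Paste("abc123", "hello", expires_at=created + timedelta(days=1), max_views=3)
```

Keeping both limits on the record lets the redirect path decide expiry with a single lookup.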
High-level design
flowchart TD
B[client] --> C{LB}
C --> D[Create service]
C --> H[Redirect service]
C --> I[Search service]
D --> E[(Database)]
E --> F[(Replica 1)]
E --> G[(Replica 2)]
H --> J[(CDN)]
J --> F
I --> J
Request flows
Creating a paste goes through the create service, which saves it to the primary DB; the write is then replicated to the replica DBs.
URLs can be created by hashing the creation timestamp and paste title. A hash makes collisions unlikely but does not guarantee uniqueness, so the create service should check for an existing key before saving.
Redirecting from a URL hits the redirect service, which checks for the file in the CDN.
On a miss, it pulls the file from the replica DB.
Search hits the search service, which uses Elasticsearch to match keywords against the cached pastes.
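The URL scheme in the create flow (hash the creation timestamp and paste title) can be sketched as follows. This is one possible approach under stated assumptions: SHA-256, base62 encoding, an 8-character key, and an in-memory `existing` set standing in for a database lookup are all illustrative choices, and `make_key` is a hypothetical helper name.

```python
import hashlib
import string
from typing import Set

ALPHABET = string.digits + string.ascii_letters  # 62 URL-safe characters


def base62(n: int) -> str:
    """Encode a non-negative integer using the 62-character alphabet."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))


def make_key(title: str, created_at_ms: int, existing: Set[str], length: int = 8) -> str:
    """Derive a short key; retry with a counter if the key is already taken."""
    attempt = 0
    while True:
        raw = f"{created_at_ms}:{title}:{attempt}".encode("utf-8")
        digest = hashlib.sha256(raw).digest()
        # Use the first 8 bytes of the digest, then pad/trim to a fixed length.
        key = base62(int.from_bytes(digest[:8], "big")).rjust(length, "0")[:length]
        if key not in existing:  # stand-in for a uniqueness check in the DB
            existing.add(key)
            return key
        attempt += 1
```

The retry loop is what actually provides uniqueness; the hash only keeps collisions rare.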
Detailed component design
Our DB for file storage will use MongoDB, and we can use Cassandra to save analytics, metadata, etc.
Our DB can be sharded based on file type and region.
Our CDN will have edge servers in multiple regions and a refresh-after (TTL) policy so that expired pastes are evicted.
Our load balancers can use least connections, with round robin to break ties, for simplicity.
Our search service can use Elasticsearch to find matches based on keywords.
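To make the load-balancing policy concrete, here is a minimal in-memory sketch of least connections with round-robin tie-breaking. The class name `LeastConnectionsBalancer` and the backend names are hypothetical; a real load balancer would track connections from live traffic rather than explicit `acquire`/`release` calls.

```python
from typing import Dict, List


class LeastConnectionsBalancer:
    def __init__(self, backends: List[str]):
        self.order = list(backends)
        self.active: Dict[str, int] = {b: 0 for b in backends}
        self.next_start = 0  # rotating offset used to break ties round-robin

    def acquire(self) -> str:
        """Pick the backend with the fewest active connections."""
        n = len(self.order)
        best = None
        # Scan from the rotating offset so equally loaded backends take turns.
        for i in range(n):
            candidate = self.order[(self.next_start + i) % n]
            if best is None or self.active[candidate] < self.active[best]:
                best = candidate
        self.next_start = (self.order.index(best) + 1) % n
        self.active[best] += 1
        return best

    def release(self, backend: str) -> None:
        """Mark one connection on the backend as finished."""
        self.active[backend] = max(0, self.active[backend] - 1)
```

With all backends idle this behaves like plain round robin; once connection counts diverge, the least-loaded backend wins.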
Trade offs/Tech choices
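We use Cassandra for analytics because view and click counters generate a lot of write operations, and its storage engine is optimized for writes; a SQL DB would likely become a write bottleneck sooner.
We use database replicas, which increases write latency because each write must also reach the 2 replicas, in exchange for more read capacity and better availability.
Our CDN will speed up reads by caching files on edge servers close to the user.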
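The read path behind this trade-off (check the CDN first, fall back to a replica on a miss, and let cached entries age out) follows the cache-aside pattern. The sketch below is a simplified stand-in: `TTLCache` and `read_paste` are hypothetical names, a plain dict plays the replica DB, and the injectable clock exists only to make the TTL behavior easy to test.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny stand-in for a CDN edge cache with a refresh-after policy."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self.entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at >= self.ttl:
            del self.entries[key]  # entry is stale, evict it
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self.entries[key] = (self.clock(), value)


def read_paste(key: str, cache: TTLCache, replica: Dict[str, str]) -> Tuple[Optional[str], str]:
    """Return (contents, source) where source is 'cache', 'replica', or 'miss'."""
    value = cache.get(key)
    if value is not None:
        return value, "cache"
    value = replica.get(key)
    if value is None:
        return None, "miss"
    cache.put(key, value)  # populate the cache for later readers
    return value, "replica"
```

Only the first read after a miss or a TTL expiry pays the replica round trip; later reads are served from the cache.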
Failure scenarios/bottlenecks
Our LB can fail, so we need multiple instances of it just in case.
Our services can fail, but since they are stateless we can easily add more instances.
Our CDN can fail, so we need multiple edge servers in case of a failure.
The DB may fail, so we keep multiple replicas so that the data stays available.
Our replicas may fail, so we can use Kubernetes to bring them back.
Future improvements
We can improve our searching algorithms in the future to allow more searching options.
How we create urls for each paste can also be updated in the future.
Adding data centers in different regions will speed things up in cache miss scenarios.
We can update our LB algorithm in the future if need be.