Design Pastebin - System Design

System requirements

Functional:

Store a maximum of 1 MB of data per paste
Store each paste for at most one week per session ID
Options to set paste expiration times
Privacy settings to make pastes public (accessible via link) or private (accessible only to the owner), with the option to password-protect a paste
Support for basic formatting (bold, italics, lists, etc.) or Markdown for enhanced presentation
Support for images/files
Search functionality for users to find their own pastes quickly with a tagging system to categorize pastes
Ability to view or revert to previous versions of a paste
Timestamps for when paste was created or last modified
Analytics to view statistics such as view counts on a paste or shares of a paste
An API for developers who want to create, retrieve, or manage pastes

Non-Functional:

Keep the user's session ID in the browser's local storage
Must be highly scalable
Must be highly consistent

Capacity estimation

Let's assume each user can have a maximum of 5 different texts, each capped at 1 MB of data. Each text can have at most 5 tags. Let us also assume that this application is available in 10 countries with an average of 1000 users per country, and has a lifespan of 10 years. Uploaded images are compressed to 100 KB each. We will keep up to 10 previous versions of each text, deleting the oldest when the limit is exceeded.
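The 10-version cap with oldest-first eviction can be sketched with a bounded queue. This is only an illustration under the assumption that whole copies of the text are stored per version; the class and method names (VersionedText, update, revert) are hypothetical and not part of the design above.

```python
from collections import deque

MAX_VERSIONS = 10  # cap taken from the assumptions above


class VersionedText:
    """Holds the current content plus a bounded history of earlier versions."""

    def __init__(self, content: str) -> None:
        self.content = content
        # deque(maxlen=N) discards the oldest entry once N is exceeded.
        self.history: deque = deque(maxlen=MAX_VERSIONS)

    def update(self, new_content: str) -> None:
        """Save the current content to the history, then replace it."""
        self.history.append(self.content)
        self.content = new_content

    def revert(self, steps_back: int = 1) -> None:
        """Restore the version saved `steps_back` updates ago.

        The content being replaced is itself pushed onto the history,
        so a revert can be undone by another revert.
        """
        if not 1 <= steps_back <= len(self.history):
            raise ValueError("no such version")
        self.update(self.history[-steps_back])
```

Storing full copies matches the 10 MB text_history budget used below; storing diffs instead would shrink that figure at the cost of a more involved revert.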

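User {
    user_id: 10 bytes
    username: 10 bytes
    password: 10 bytes
    text_ids: 10 bytes * 5
}

user_id + text_ids: 10 + 50 = 60 bytes (username and password are not counted here; they would add another 20 bytes, which does not affect the rounded total)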

Text {
    text_id: 10 bytes
    user_id: 10 bytes
    image_ids and text: 1 MB maximum, shared between the two fields
    total_memory: 5 bytes
    privacy_status (public or private): 10 bytes
    text_link: 100 bytes
    text_password: 50 bytes
    tags: 50 bytes * 5
    last_modified: 8 bytes
    text_history: 10 MB
}

10 + 10 + 5 + 10 + 50 + 250 + 8 = 343 bytes of fixed fields (the 100 bytes of text_link are not counted here), plus 1 MB of content and 10 MB of history = 11.000343 MB

ToDelete {
    text_id: 10 bytes
    last_modified: 8 bytes
}

10 + 8 = 18 bytes

Tag {
    tag_id: 10 bytes
    tag_name: 50 bytes
}

10 + 50 = 60 bytes

Image {
    image_id: 10 bytes
}

10 bytes + 100 KB of compressed image data = 100010 bytes

Search {
    text_id: 10 bytes
    text_header: 20 bytes
}

10 + 20 = 30 bytes

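Total per text: 60 + 30 + 100010 + 60 + 18 bytes + 11.000343 MB = 11.100521 MB

Rounded: 11 MB

1000 users per country * 10 countries * 11 MB * 10 years = 1.1 TB. This counts one 11 MB text per user for each year; at the assumed maximum of 5 texts per user, the same formula gives 5.5 TB.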
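The final multiplication can be checked with a few lines of integer arithmetic. The sketch below uses decimal units (1 MB = 1,000,000 bytes), as the byte counts above do. The parameter texts_per_user_per_year is a hypothetical knob added for illustration: 1 reproduces the figure above, and 5 corresponds to the per-user maximum stated in the assumptions.

```python
MB = 1_000_000        # decimal megabyte, in bytes
TB = 1_000_000 * MB   # decimal terabyte, in bytes

USERS_PER_COUNTRY = 1000
COUNTRIES = 10
YEARS = 10
BYTES_PER_TEXT = 11 * MB  # rounded per-text footprint from the estimate above


def total_storage_bytes(texts_per_user_per_year: int = 1) -> int:
    """Upper-bound storage over the whole lifespan, in bytes.

    Assumes nothing is ever reclaimed, which is pessimistic given
    that texts expire after a week.
    """
    return (
        USERS_PER_COUNTRY
        * COUNTRIES
        * texts_per_user_per_year
        * BYTES_PER_TEXT
        * YEARS
    )


if __name__ == "__main__":
    for n in (1, 5):
        print(n, "text(s) per user per year:", total_storage_bytes(n) / TB, "TB")
```

Because expired texts are deleted, the amount of data held at any one moment should be well below either bound.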

API design

input_search_header(text_id, text_header) -> Called when a user creates a text. It stores up to the first 20 bytes of the text in a key-value store so that the search function can use them.

search(text) -> Uses a WebSocket to receive the user's input and searches for stored headers that match it.
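A minimal sketch of the two functions above, assuming an in-memory dictionary stands in for the key-value store and omitting the WebSocket transport. The helper make_header and the class SearchIndex are hypothetical names, and matching by case-insensitive substring is an assumption, since the matching rule is not specified above. Unlike the signature above, this version of input_search_header takes the full text and derives the header itself.

```python
HEADER_BYTES = 20  # header size stated above


def make_header(text: str, limit: int = HEADER_BYTES) -> str:
    """Return the longest prefix of `text` whose UTF-8 encoding fits in `limit` bytes."""
    raw = text.encode("utf-8")[:limit]
    # Cutting at a byte offset can split a multi-byte character;
    # errors="ignore" drops the incomplete trailing bytes.
    return raw.decode("utf-8", errors="ignore")


class SearchIndex:
    """In-memory stand-in for the key-value store mapping text_id to text_header."""

    def __init__(self) -> None:
        self._headers: dict = {}

    def input_search_header(self, text_id: str, text: str) -> None:
        """Store the first HEADER_BYTES bytes of the text under its id."""
        self._headers[text_id] = make_header(text)

    def search(self, query: str) -> list:
        """Return the ids of texts whose header contains the query (case-insensitive)."""
        needle = query.casefold()
        return sorted(
            text_id
            for text_id, header in self._headers.items()
            if needle in header.casefold()
        )
```

With only 20 bytes indexed, a query can match nothing beyond the opening words of a text; searching by tag would need a separate index.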

create_text(session_id, text, image_ids, privacy_settings, text_link, text_password, tags, text_history, total_memory, last_modified) -> When a user starts creating a text, a WebSocket connection to the server is opened. Through it the server stores the text, image IDs, last-modified time, total memory used so far, and privacy settings in the database.

show_advanced_options(session_id, text_id) -> will show advanced options based on the session_id (privacy settings, edit options, password etc)

set_privacy_level(session_id, text_id, privacy_status) -> Sets the privacy status (public or private) of a given text
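One way the public/private rule could be enforced is sketched below. It assumes the session has already been resolved to a user ID (requester_id), which the design above does not spell out. The Privacy enum, the trimmed-down Text record, and can_view are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Privacy(Enum):
    PUBLIC = "public"    # anyone holding the link may view
    PRIVATE = "private"  # only the owning user may view


@dataclass
class Text:
    text_id: str
    user_id: str
    privacy_status: Privacy = Privacy.PRIVATE


def set_privacy_level(text: Text, requester_id: str, level: str) -> None:
    """Change the privacy of a text; only its owner may do so."""
    if requester_id != text.user_id:
        raise PermissionError("only the owner can change privacy settings")
    text.privacy_status = Privacy(level)  # raises ValueError for unknown levels


def can_view(text: Text, requester_id: Optional[str]) -> bool:
    """Public texts are visible to anyone with the link, private ones only to the owner."""
    return text.privacy_status is Privacy.PUBLIC or requester_id == text.user_id
```

Defaulting to private is a design choice made here so that a text is never exposed by omission.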

set_document_password(session_id, text_id, password) -> password protect a document
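The design does not say how the document password is stored. A common approach, shown here as an assumption rather than as the chosen method, is to keep only a salted hash and compare in constant time. The function names and the iteration count are illustrative.

```python
import hashlib
import hmac
import os
from typing import Optional

ITERATIONS = 600_000  # assumed work factor; tune it for the target hardware


def hash_password(
    password: str,
    salt: Optional[bytes] = None,
    iterations: int = ITERATIONS,
) -> tuple:
    """Return (salt, digest) for the password using PBKDF2-HMAC-SHA256."""
    salt = os.urandom(16) if salt is None else salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest


def verify_password(
    password: str,
    salt: bytes,
    expected: bytes,
    iterations: int = ITERATIONS,
) -> bool:
    """Recompute the digest and compare it without leaking timing information."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected)
```

A 16-byte salt and a 32-byte digest take 48 bytes together, which fits the 50 bytes budgeted for text_password in the capacity estimate.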

delete_text(text_id, session_id) -> Deletes a pasted text by its ID, either on request from the owning user's session or because of the time elapsed since creation

send_to_deletion(text_id, last_modified) -> Stores the text_id in a time series DB under its last_modified date. Once a week has passed, every text stored under that date can be deleted as the window rolls forward.

alert_space_capacity(text_id) -> Once a text exceeds its 1 MB space capacity, sends an alert informing the user and prevents any further additions to the text
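The capacity check itself could look like the sketch below. It assumes the text body is measured in UTF-8 bytes and that every image counts as a flat 100 KB, as in the capacity estimate. The exception class and function names are hypothetical.

```python
MAX_TEXT_BYTES = 1_000_000  # 1 MB cap shared by the text body and its images
IMAGE_BYTES = 100_000       # each image is assumed to be compressed to 100 KB


class SpaceCapacityExceeded(Exception):
    """Raised when an edit would push a text past its 1 MB cap."""


def total_memory(text: str, image_count: int) -> int:
    """Bytes used by the text body plus its attached images."""
    return len(text.encode("utf-8")) + image_count * IMAGE_BYTES


def check_space_capacity(text: str, image_count: int) -> int:
    """Return the bytes used, or raise if the cap is exceeded."""
    used = total_memory(text, image_count)
    if used > MAX_TEXT_BYTES:
        raise SpaceCapacityExceeded(
            f"text uses {used} bytes, limit is {MAX_TEXT_BYTES}"
        )
    return used
```

Running the check before an edit is saved lets the server reject the change and send the alert in the same step.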

Database design


ToDelete {
    last_modified: datetime
    text_id: string
}

Stores each text_id under its last_modified date. If the user updates the text before the expiration policy applies, we look up the entry under its previous last_modified date, remove it from that date group, and add it to the group for the new date.

Text {
    text_id: string
    user_id: string
    text: string
    image_ids: string[]
    total_memory: float
    privacy_status: string
    text_link: string
    text_password: string
    tags: Tag[]
    last_modified: datetime
    text_history: Text[]
}

Tag {
    tag_id: string
    tag_name: string
}

User {
    user_id: string
    text_ids: string[]
}

Search {
    text_id: string
    text_header: string
}

Image {
    image_id: string
}
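For reference, the records above translate fairly directly into typed classes. The sketch below is one possible rendering and not a schema definition. The default values are assumptions, and the text field on Text is included because the capacity estimate counts the text body as part of the record.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now() -> datetime:
    """Current time in UTC, used as the default last_modified value."""
    return datetime.now(timezone.utc)


@dataclass
class Tag:
    tag_id: str
    tag_name: str


@dataclass
class Image:
    image_id: str


@dataclass
class Text:
    text_id: str
    user_id: str
    text: str = ""  # body of the paste (assumed field, see the note above)
    image_ids: list = field(default_factory=list)
    total_memory: float = 0.0
    privacy_status: str = "private"
    text_link: str = ""
    text_password: str = ""
    tags: list = field(default_factory=list)          # list of Tag
    last_modified: datetime = field(default_factory=_now)
    text_history: list = field(default_factory=list)  # earlier Text versions


@dataclass
class User:
    user_id: str
    text_ids: list = field(default_factory=list)


@dataclass
class Search:
    text_id: str
    text_header: str


@dataclass
class ToDelete:
    last_modified: datetime
    text_id: str
```

In a relational schema, the list-valued fields (text_ids, image_ids, tags, text_history) would typically become separate tables joined by foreign keys rather than array columns.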

High-level design

Users visit the website and log into their account
Users can save their text via a button on the screen
Users can delete text via a button
Users can open up to 10 new "tabs" for 10 different texts
Users can set privacy settings and create a link that restricts viewing to selected users
Users can use a search bar to find texts related to the input they provide
When a user wants to access a text, we first check whether it is already in the cache before querying the database
A user's text is deleted 7 days after it was last modified
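The cache-then-database read described in the list above is the cache-aside pattern. A minimal sketch follows, assuming a plain dictionary in place of the cache and a caller-supplied function in place of the database query. Both get_text and load_from_db are hypothetical names.

```python
from typing import Callable, Optional


def get_text(
    text_id: str,
    cache: dict,
    load_from_db: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the text body, consulting the cache before the database."""
    cached = cache.get(text_id)
    if cached is not None:
        return cached             # cache hit: the database is not touched
    body = load_from_db(text_id)  # cache miss: fall back to the database
    if body is not None:
        cache[text_id] = body     # populate the cache for later reads
    return body
```

Since the design asks for high consistency, the cached entry would also need to be updated or evicted whenever the text is edited or deleted; that step is not shown here.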

Request flows


Detailed component design

Text deletion: We handle text deletion with a time series database that groups text_ids by last_modified date. Every text_id stored under a date 7 or more days in the past has its corresponding text deleted. If a user updates a text within the 7-day period, we look up its previous last_modified date in the time series database, remove the text_id from that date, and add it under the new last_modified date.
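The date-bucket scheme can be sketched in memory as follows. Dictionaries stand in for the time series database, and a side map from text_id to date stands in for reading the previous last_modified value from the Text record. The class name DeletionSchedule and the method expired are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

RETENTION = timedelta(days=7)


class DeletionSchedule:
    """Groups text ids by last_modified date so whole days can be swept at once."""

    def __init__(self) -> None:
        self._buckets = defaultdict(set)  # date -> set of text ids
        self._last_modified = {}          # text id -> date of its current bucket

    def send_to_deletion(self, text_id: str, last_modified: date) -> None:
        """Register a text, moving it out of its previous date bucket if it has one."""
        previous = self._last_modified.get(text_id)
        if previous is not None:
            self._buckets[previous].discard(text_id)
            if not self._buckets[previous]:
                del self._buckets[previous]
        self._buckets[last_modified].add(text_id)
        self._last_modified[text_id] = last_modified

    def expired(self, today: date) -> list:
        """Remove and return every text id last modified 7 or more days before `today`."""
        cutoff = today - RETENTION
        doomed = []
        # Collect the due dates first so buckets are not mutated while iterating.
        for day in sorted(d for d in self._buckets if d <= cutoff):
            for text_id in self._buckets.pop(day):
                del self._last_modified[text_id]
                doomed.append(text_id)
        return sorted(doomed)
```

A scheduled job would call expired once per hour or day and then delete the returned texts, along with their images, from the main stores.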

Image upload: We will use an image storage service such as Amazon S3 and assign every uploaded image a unique ID. We will limit how many images can be posted per session ID and per text to contain storage use and keep costs down. Image deletion follows the text deletion policy: when a text is deleted, its associated images are deleted too.
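The upload rules above can be illustrated with an in-memory stand-in for the blob store. No real S3 calls are made. The class ImageStore, its methods, and the limit of 5 images per text are all assumptions chosen for the example.

```python
import uuid

MAX_IMAGES_PER_TEXT = 5  # hypothetical limit; the design does not fix a number


class ImageStore:
    """In-memory stand-in for a blob store such as Amazon S3."""

    def __init__(self) -> None:
        self._blobs = {}    # image id -> image bytes
        self._by_text = {}  # text id -> list of image ids

    def upload(self, text_id: str, data: bytes) -> str:
        """Store an image for a text and return its unique id."""
        image_ids = self._by_text.setdefault(text_id, [])
        if len(image_ids) >= MAX_IMAGES_PER_TEXT:
            raise ValueError("image limit reached for this text")
        image_id = uuid.uuid4().hex  # random 32-character identifier
        self._blobs[image_id] = data
        image_ids.append(image_id)
        return image_id

    def delete_for_text(self, text_id: str) -> int:
        """Delete every image attached to a text; return how many were removed."""
        image_ids = self._by_text.pop(text_id, [])
        for image_id in image_ids:
            del self._blobs[image_id]
        return len(image_ids)
```

A UUID in hex form is 32 characters, longer than the 10 bytes budgeted for image_id in the capacity estimate; a shorter ID scheme would need its own collision handling.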

Trade offs/Tech choices

Image storage: Use dedicated image storage such as Amazon S3 or another blob store to handle image uploads, and reference each image by a unique identifier. This keeps image data out of the primary database, which matters because we anticipate many images being embedded in texts.

Time series database: I chose to include this database since I foresee lots of old text being scheduled for deletion daily and maybe even by the hour. We need a database that can store simple information like last_modified date and text_id to tell the server which texts need to be deleted every hour

Cache: We can use a tool like Redis to cache the data of frequently accessed texts

Relational database: I chose a relational database to hold text data because the schema is fixed, so a flexible document structure isn't needed, and because I need strong consistency with near-live updates on texts. High availability also matters here; a relational database can provide it through replication and failover, such as the hot standby listed under Future improvements.

Failure scenarios/bottlenecks

Amazon S3 goes down, which prevents image data from being accessed
The time series DB goes down, so expired texts are not deleted and the load on the main database grows
The relational DB goes down, making text data unavailable

Future improvements

Keep a hot standby of the relational DB that holds text data and continuously replicates from the master database
Implement a report feature that flags inappropriate texts for automated review of harmful content