System requirements
Functional:
- Store a maximum of 1 MB of data per user
- Store text for a maximum of one week per session id
- Options to set paste expiration times
- Privacy settings to make their pastes public (accessible via link) or private (only accessible to user) and ability to password protect a paste
- Support for basic formatting (bold, italics, list etc) or markdown for enhanced presentation
- Support for images/files
- Search functionality for users to find their own pastes quickly with a tagging system to categorize pastes
- Ability to view or revert to previous versions of a paste
- Timestamps for when paste was created or last modified
- Analytics to view statistics such as view counts on a paste or shares of a paste
- An API for developers who want to create, retrieve or manage pastes
Non-Functional:
- Create a user session using their local storage
- Must be highly scalable
- Must be highly consistent
Capacity estimation
Let's assume each user can have a maximum of 5 different texts, each capped at 1 MB of data. Each text can have a maximum of 5 tags. Let us also assume this application is available in 10 countries and has a lifespan of 10 years, with an average of 1000 users per country. Let's also assume uploaded images are compressed to 100 KB, and that we save up to 10 previous versions of each text, deleting the oldest first.
User {
user_id: 10 bytes
username: 10 bytes
password: 10 bytes
text_ids: 10bytes * 5
}
10 + 10 + 10 + 50 bytes = 80 bytes
Text{
text_id: 10 bytes
user_id: 10 bytes
(image_ids: any, text: any) -> the two fields together are capped at 1 MB
total_memory: 5 bytes
privacy_status: (public,private) 10 bytes
text_link: 100 bytes
text_password: 50 bytes
tags: 50 bytes * 5
last_modified: 8 bytes
text_history: 10mb
}
10 + 10 + 1 MB + 5 + 10 + 100 + 50 + 250 + 8 bytes + 10 MB =
11.000443 MB
To_Delete{
text_id: 10 bytes
last_modified: 8 bytes
}
10 + 8 = 18b
Tag{
tag_id: 10 bytes
tag_name: 50 bytes
}
10 + 50 = 60b
Image{
image_id: 10 bytes
}
10 + 100000 = 100010 bytes
Search {
text_id: 10 bytes
text_header: 20 bytes
}
10 + 20 = 30 bytes
80 (User) + 30 (Search) + 100010 (Image) + 60 (Tag) + 18 (To_Delete) bytes + 11.000443 MB (Text) = 11.100641 MB
Rounded: 11 MB
1000 users per country * 10 countries * 11 MB * 10 years = 1.1 TB
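The arithmetic above can be reproduced with a short script. This is only a sketch that follows the estimate's own assumptions: decimal megabytes, one record of each kind per user, the per-user figure rounded to whole megabytes, and a multiplication by the 10-year lifespan. The variable names are illustrative and not part of the design.

```python
MB = 1_000_000  # decimal megabyte, in bytes (assumption: 1 MB = 10^6 bytes)

# Fixed-size fields per record, in bytes, taken from the field lists above.
user = 10 + 10 + 10 + 10 * 5                        # user_id, username, password, text_ids
text_meta = 10 + 10 + 5 + 10 + 100 + 50 + 50 * 5 + 8  # everything except body and history
text = text_meta + 1 * MB + 10 * MB                 # 1 MB body + 10 MB of stored versions
to_delete = 10 + 8
tag = 10 + 50
image = 10 + 100_000                                # image_id + one compressed image
search = 10 + 20

per_user_bytes = user + text + to_delete + tag + image + search
per_user_mb = round(per_user_bytes / MB)            # rounded to whole megabytes

users = 1000 * 10                                   # users per country * countries
years = 10
total_mb = users * per_user_mb * years
total_tb = total_mb / 1_000_000

print(f"per user: {per_user_mb} MB, total: {total_tb} TB")
```

Because the 1 MB body and 10 MB of history dominate, the small fixed-size fields have no effect on the rounded per-user figure.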
API design
input_search_header(text_id, text_header) -> This function will be called when a user creates some text. It will store up to 20 bytes of data starting at the beginning of the text in a key value store in order to be used by a search function
login(username,password) -> allows a user to login to their account
search(text) -> This function will utilize a websocket to take user input data and search for headers that correspond to the user input text
create_text(session_id, text, image_ids, privacy_settings, text_link,text_password,tags,text_history,total_memory,last_modified) -> When a user starts creating a text, a websocket connection with the database is opened, storing image ids, text, last modified information, total memory used so far and privacy settings.
show_advanced_options(session_id, text_id) -> will show advanced options based on the session_id (privacy settings, edit options, password etc)
set_privacy_level(session_id, text_id) -> set privacy on a certain text
set_document_password(session_id, text_id, password) -> password protect a document
delete_text(text_id, session_id) -> Deletes a pasted text by its id based on the user's session id or based on time since creation
send_to_deletion(text_id, last_modified) -> uses a time series db to store text_id by last_modified date. Once a week has passed, we can delete by the rolling date.
alert_space_capacity(text_id) -> Once a text exceeds its 1 MB space capacity, alert the user and prevent any further additions to the text
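Two of the calls above, password protection and the 1 MB capacity check, can be sketched with the standard library alone. This is a minimal illustration, not the design's actual implementation: the helper names hash_password, can_view and exceeds_capacity are hypothetical, the paste is modelled as a plain dict, text_password is assumed to hold a (salt, digest) pair rather than a plain string, and the owner is assumed to never need the paste password.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

MAX_TEXT_BYTES = 1_000_000  # 1 MB cap per text, as stated in the requirements


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) so the plain password is never stored."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return salt, digest


def can_view(text: dict, viewer_id: Optional[str], password: Optional[str] = None) -> bool:
    """Apply the privacy setting first, then the optional paste password."""
    if viewer_id is not None and viewer_id == text["user_id"]:
        return True  # assumption: the owner always has access
    if text["privacy_status"] == "private":
        return False
    stored = text.get("text_password")
    if stored is None:
        return True  # public paste without a password
    if password is None:
        return False
    salt, expected = stored
    _, actual = hash_password(password, salt)
    return hmac.compare_digest(actual, expected)  # constant-time comparison


def exceeds_capacity(body: str) -> bool:
    """True once the UTF-8 encoded text is over the 1 MB cap."""
    return len(body.encode("utf-8")) > MAX_TEXT_BYTES
```

Measuring the encoded length rather than the character count keeps the cap aligned with the storage estimate, since non-ASCII characters take more than one byte.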
Database design
ToDelete{
last_modified: string,
text_id: string
}
Stores each text_id under its last_modified date. If a user accesses the text again before the expiration policy applies, we look up the text_id under its previous last_modified date, remove it from that date group, and add it to the group for the new date.
Text{
text_id: string
user_id: string
image_ids: string[]
total_memory: float
privacy_status: string
text_link: string
text_password: string
tags: Tag[]
last_modified: datetime
text_history: Text[]
}
Tag{
tag_id: string
tag_name: string
}
User {
user_id: string
text_ids: string[]
}
Search {
text_id: string
text_header: string
}
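The Search record pairs a text_id with a short header taken from the start of the text. A minimal in-memory sketch of how such a key value store could be filled and queried is shown here; the dict stands in for the real store, the 20-byte limit comes from the capacity estimate, and case-insensitive substring matching is an assumption, since the design does not say how headers are matched.

```python
from typing import Dict, List

HEADER_BYTES = 20  # header size used in the capacity estimate

# Stand-in for the key value store: text_id -> text_header.
search_index: Dict[str, str] = {}


def input_search_header(text_id: str, text: str) -> None:
    """Store up to the first 20 bytes of the text as its searchable header."""
    raw = text.encode("utf-8")[:HEADER_BYTES]
    # A multi-byte character cut in half at the boundary is dropped.
    search_index[text_id] = raw.decode("utf-8", errors="ignore")


def search(query: str) -> List[str]:
    """Return the ids of texts whose header contains the query."""
    needle = query.lower()
    return sorted(
        text_id
        for text_id, header in search_index.items()
        if needle in header.lower()
    )
```

A query only matches words that fall inside the stored header, so text beyond the first 20 bytes is not searchable under this scheme.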
Image{
image_id: string
}
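As one way to make the Text and Tag records concrete, they could be written as Python dataclasses. This sketch covers only part of the schema and makes some assumptions that are not in the design: the text itself lives in a body field, text_history stores previous bodies as strings rather than full Text records, and the update and revert methods are hypothetical helpers that enforce the 10-version limit from the capacity estimate.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

MAX_VERSIONS = 10  # previous versions kept per text; the oldest is dropped first


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Tag:
    tag_id: str
    tag_name: str


@dataclass
class Text:
    text_id: str
    user_id: str
    body: str = ""  # assumption: the schema above does not name this field
    image_ids: List[str] = field(default_factory=list)
    privacy_status: str = "private"
    tags: List[Tag] = field(default_factory=list)
    last_modified: datetime = field(default_factory=_now)
    text_history: List[str] = field(default_factory=list)

    def update(self, new_body: str) -> None:
        """Save the current body as a version, then replace it."""
        self.text_history.append(self.body)
        del self.text_history[:-MAX_VERSIONS]  # keep only the newest 10 versions
        self.body = new_body
        self.last_modified = _now()

    def revert(self, versions_back: int = 1) -> None:
        """Restore an earlier version; the revert is itself recorded as an update."""
        self.update(self.text_history[-versions_back])
```

Recording a revert as a new update means the version being replaced is not lost and can itself be restored later.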
High-level design
- User visits the website and logs into their account
- Users can save their text via a button on the screen
- Users can delete text via a button
- Users can open up to 10 new "tabs" for 10 different texts
- Users can set privacy settings and create a link that only allows some users to view
- Users can use a search bar to find texts related to the input they provide
- If a user wants to access a text, we first check whether the text is already in the cache before querying the database
- A user's text is deleted after 7 days
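The cache-then-database lookup in the list above is commonly called the cache-aside pattern. A minimal sketch follows, with a dict standing in for the cache and a callable standing in for the database query; the TextCache class and its method names are hypothetical.

```python
from typing import Callable, Dict, Optional


class TextCache:
    """Cache-aside reads: look in the cache first, fall back to the database."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]) -> None:
        self._load_from_db = load_from_db  # stand-in for the database query
        self._cache: Dict[str, str] = {}   # stand-in for the real cache
        self.hits = 0
        self.misses = 0

    def get(self, text_id: str) -> Optional[str]:
        if text_id in self._cache:
            self.hits += 1
            return self._cache[text_id]
        self.misses += 1
        body = self._load_from_db(text_id)
        if body is not None:
            self._cache[text_id] = body  # populate the cache for the next read
        return body

    def invalidate(self, text_id: str) -> None:
        """Call after an update or delete so stale text is not served."""
        self._cache.pop(text_id, None)
```

Usage: `TextCache(db.get)` wraps any mapping-like database stub. Invalidating on every write matters here because the requirements ask for high consistency.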
Request flows
Detailed component design
Text deletion: We handle text deletion with a time series database that groups text_ids by last_modified date. Once 7 days pass without a change to a text, it expires: every text_id stored under a date 7 or more days old has its corresponding text deleted. If a user updates a text within the 7-day period, we look up its previous last_modified date in the time series database, remove the text_id from that date group, and add it under the new last_modified date.
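The date-grouping scheme can be sketched in memory as follows. This is an illustration under assumptions, not the time series database itself: the DeletionSchedule class is hypothetical, a dict of sets stands in for the date groups, and a second dict stands in for the last_modified value that the design keeps on the Text record.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, List, Set

RETENTION = timedelta(days=7)


class DeletionSchedule:
    """Groups text ids by last_modified date so expiry is a per-day sweep."""

    def __init__(self) -> None:
        self._buckets: Dict[date, Set[str]] = defaultdict(set)
        self._last_modified: Dict[str, date] = {}

    def touch(self, text_id: str, modified_on: date) -> None:
        """Move a text from its previous date group to the new one."""
        previous = self._last_modified.get(text_id)
        if previous is not None:
            self._buckets[previous].discard(text_id)
            if not self._buckets[previous]:
                del self._buckets[previous]
        self._buckets[modified_on].add(text_id)
        self._last_modified[text_id] = modified_on

    def sweep(self, today: date) -> List[str]:
        """Remove and return every text id last modified 7 or more days ago."""
        cutoff = today - RETENTION
        expired: List[str] = []
        for day in sorted(d for d in self._buckets if d <= cutoff):
            for text_id in sorted(self._buckets.pop(day)):
                del self._last_modified[text_id]
                expired.append(text_id)
        return expired
```

The caller would delete the returned texts from the main database. Running the sweep hourly instead of daily only changes how often it is called, not the structure.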
Image upload: We will use an image storage service such as Amazon S3 and assign every uploaded image a unique id. We will limit how many images can be posted per session id and per text to contain storage use and keep costs down. Image deletion follows the text deletion policy: when a text is deleted, its associated images are deleted as well.
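The upload limit and the cascading delete can be sketched without any cloud dependency. In this illustration a dict stands in for the object store, the ImageStore class and its methods are hypothetical, and the limit of 5 images per text is an assumed value, since the design does not fix a number.

```python
import uuid
from typing import Dict, List

MAX_IMAGES_PER_TEXT = 5  # assumed limit; the design does not specify one


class ImageStore:
    """In-memory stand-in for a blob store, keyed by generated image ids."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}
        self._by_text: Dict[str, List[str]] = {}

    def upload(self, text_id: str, data: bytes) -> str:
        """Store an image for a text and return its unique id."""
        image_ids = self._by_text.setdefault(text_id, [])
        if len(image_ids) >= MAX_IMAGES_PER_TEXT:
            raise ValueError("image limit reached for this text")
        image_id = uuid.uuid4().hex
        self._blobs[image_id] = data
        image_ids.append(image_id)
        return image_id

    def delete_text_images(self, text_id: str) -> int:
        """Delete every image attached to a text; returns how many were removed."""
        image_ids = self._by_text.pop(text_id, [])
        for image_id in image_ids:
            self._blobs.pop(image_id, None)
        return len(image_ids)
```

Calling delete_text_images from the same code path that deletes a text keeps the two deletion policies in step.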
Trade offs/Tech choices
Image storage: Use dedicated image storage such as Amazon S3 or another blob store to handle image uploads, and reference each image by a unique identifier. This reduces the strain image uploads place on the main database, since we can anticipate many images being embedded in texts.
Time series database: I chose to include this database since I foresee lots of old text being scheduled for deletion daily and maybe even by the hour. We need a database that can store simple information like last_modified date and text_id to tell the server which texts need to be deleted every hour
Cache: We can use a tool like Redis to cache text information for frequently accessed texts.
Relational database: I chose a relational database to hold text data because the schema is fixed, so a flexible document model isn't needed, and because the system needs strong consistency with near-real-time updates to texts, which a transactional SQL database provides. High availability also matters here; a relational database can provide it through replication and failover.
Failure scenarios/bottlenecks
- Amazon S3 goes down, which prevents image data from being accessed
- Time series db goes down preventing texts from being deleted, increasing load on database
- Relational db goes down
Future improvements
- Have a hot standby of the relational db which holds text data and is constantly replicating data from the master database
- Implement a report feature that flags inappropriate texts for automated review for harmful content