My Solution for Design Pastebin with Score: 8/10
by john_chen
System requirements
Functional Requirements:
- Users should be able to upload or paste their data and receive a unique URL for access.
- Only text data can be uploaded.
- Data and links will automatically expire after a specified timespan, with an option for users to set a custom expiration time.
- Users should have the option to choose a custom alias for their paste.
Non-Functional Requirements:
- The system must be highly reliable to ensure no data is lost.
- The system must be highly available, ensuring continuous access to pastes even if individual servers fail.
- Users should be able to access their pastes in real time with minimal latency.
- Paste links should be unpredictable to prevent guessing.
Capacity estimation
Our service will be read-heavy, with significantly more reads than new paste creations; we assume a 5:1 read-to-write ratio.
Traffic Estimates:
- Expected new pastes: 1 million per day.
- Expected reads: 5 million per day.
New Pastes per Second:
- Calculated as:
1M / (24 hours * 3600 seconds) ≈ 12 pastes/sec
Paste Reads per Second:
- Calculated as:
5M / (24 hours * 3600 seconds) ≈ 58 reads/sec
Storage Estimates:
- Maximum upload size: 10MB.
- Average paste size: 10KB (assuming typical use for sharing source code, configs, or logs).
- Daily storage requirement:
1M * 10KB = 10GB/day
- Total storage for 10 years:
10GB/day * 365 days/year * 10 years ≈ 36TB
- Total number of pastes in 10 years: 1M/day * 365 days/year * 10 years ≈ 3.6 billion.
- Key storage requirement, using six-character base64-encoded keys (64^6 ≈ 68.7 billion unique strings):
3.6B keys * 6 bytes/key ≈ 22GB
- Assuming we want to stay under 70% of raw capacity, total storage needs grow to 36TB / 0.7 ≈ 51.4TB.
Bandwidth Estimates:
- Write requests:
12 pastes/sec * 10KB = 120KB/sec
- Read requests:
58 reads/sec * 10KB ≈ 0.6MB/sec
Memory Estimates:
- Cache frequently accessed (hot) pastes following the 80-20 rule (20% of pastes generate 80% of traffic).
- Memory needed to cache 20% of daily reads:
0.2 * 5M * 10KB = 10GB
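These estimates are easy to sanity-check in code. The short Python sketch below recomputes every number in this section from the stated assumptions (1M writes/day, a 5:1 read-to-write ratio, 10KB average paste); it is only a back-of-the-envelope calculator, not part of the system itself.

    # Back-of-the-envelope capacity estimates; all constants come from the
    # assumptions stated above.
    SECONDS_PER_DAY = 24 * 3600
    writes_per_day = 1_000_000          # new pastes per day
    reads_per_day = 5 * writes_per_day  # 5:1 read-to-write ratio
    avg_paste_kb = 10                   # average paste size in KB

    writes_per_sec = writes_per_day / SECONDS_PER_DAY  # ~12
    reads_per_sec = reads_per_day / SECONDS_PER_DAY    # ~58

    daily_storage_gb = writes_per_day * avg_paste_kb / 1_000_000  # 10 GB/day
    ten_year_storage_tb = daily_storage_gb * 365 * 10 / 1000      # ~36.5 TB
    storage_at_70pct_tb = ten_year_storage_tb / 0.7               # ~52 TB

    write_bw_kb_per_sec = writes_per_sec * avg_paste_kb        # ~120 KB/s
    read_bw_mb_per_sec = reads_per_sec * avg_paste_kb / 1000   # ~0.6 MB/s
    cache_gb = 0.2 * reads_per_day * avg_paste_kb / 1_000_000  # 10 GB hot set

    print(f"writes/sec: {writes_per_sec:.0f}, reads/sec: {reads_per_sec:.0f}")
    print(f"10-year storage: {ten_year_storage_tb:.1f} TB "
          f"({storage_at_70pct_tb:.1f} TB at 70% utilization)")
    print(f"bandwidth: {write_bw_kb_per_sec:.0f} KB/s write, "
          f"{read_bw_mb_per_sec:.1f} MB/s read")
    print(f"hot-cache memory: {cache_gb:.0f} GB")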
API design
We can use SOAP or REST APIs to provide the functionality of our service. Here are the definitions for the APIs to create, retrieve, and delete pastes:
API: addPaste
Parameters:
- api_dev_key (string): The API developer key of a registered account, used for throttling users based on their allocated quota.
- paste_data (string): The textual data of the paste.
- custom_url (string, optional): A custom URL for the paste.
- user_name (string, optional): The user name to be used to generate the URL.
- paste_name (string, optional): The name of the paste.
- expire_date (string, optional): The expiration date for the paste.
Returns: (string)
- On success, returns the URL through which the paste can be accessed.
- On failure, returns an error code.
API: getPaste
Parameters:
- api_dev_key (string): The API developer key of a registered account.
- api_paste_key (string): The key representing the paste to be retrieved.
Returns:
- The textual data of the paste.
API: deletePaste
Parameters:
- api_dev_key (string): The API developer key of a registered account.
- api_paste_key (string): The key representing the paste to be deleted.
Returns:
- On success, returns 'true'.
- On failure, returns 'false'.
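To make these definitions concrete, here is a minimal in-memory Python sketch of the three calls. The function and parameter names mirror the API above; the storage backend, the error string, and the example domain are simplified placeholders, not a real implementation.

    import secrets

    # In-memory stand-in for the real datastore; BASE_URL is a hypothetical domain.
    pastes = {}  # api_paste_key -> (paste_data, expire_date)
    BASE_URL = "http://pastebin.example.com/"

    def add_paste(api_dev_key, paste_data, custom_url=None,
                  user_name=None, paste_name=None, expire_date=None):
        """Store a paste and return its URL, or an error string on failure."""
        key = custom_url or secrets.token_urlsafe(4)  # 4 random bytes -> 6 URL-safe chars
        if key in pastes:
            return "ERROR: key already in use"
        pastes[key] = (paste_data, expire_date)
        return BASE_URL + key

    def get_paste(api_dev_key, api_paste_key):
        """Return the paste's textual data, or None if it doesn't exist."""
        entry = pastes.get(api_paste_key)
        return entry[0] if entry else None

    def delete_paste(api_dev_key, api_paste_key):
        """Delete a paste; returns True on success, False on failure."""
        return pastes.pop(api_paste_key, None) is not None

A real service would also validate api_dev_key against the developer's quota and enforce the 10MB upload limit before storing anything.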
Database design
Observations about the Data:
- We need to store billions of records.
- Each metadata object we store is small (less than 1KB).
- Each paste object can be of medium size (up to the 10MB upload limit).
- There are no relationships between records, except for storing which user created each paste.
- Our service is read-heavy.
Database Schema:
We need two tables: one for storing information about the pastes and another for storing user data.
In this schema:
- URLHash is the unique identifier for the shortened URL.
- ContentKey is a reference to an external object storing the paste contents. We'll discuss the external storage of paste contents in the detailed component design below.
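The schema itself isn't reproduced above, so here is a plausible SQLite sketch of the two tables. URLHash and ContentKey come from the description; the remaining columns (UserID, the date fields, and the User table's columns) are reasonable assumptions rather than a confirmed schema.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Paste metadata; contents live in object storage, referenced by ContentKey.
    CREATE TABLE Paste (
        URLHash        TEXT PRIMARY KEY,  -- unique identifier for the shortened URL
        ContentKey     TEXT NOT NULL,     -- reference to the object-storage blob
        UserID         INTEGER,           -- creator, if any (assumed column)
        CreationDate   TEXT NOT NULL,     -- assumed column
        ExpirationDate TEXT               -- assumed column; NULL = never expires
    );

    -- Registered users (assumed columns).
    CREATE TABLE User (
        UserID       INTEGER PRIMARY KEY,
        Name         TEXT,
        Email        TEXT,
        CreationDate TEXT,
        LastLogin    TEXT
    );
    """)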
High-level design
At a high level, our system needs an application layer to handle all read and write requests. This application layer will interact with a storage layer to store and retrieve data. We can segregate our storage layer into two parts: one for storing metadata related to each paste and user information, and another for storing the actual paste contents in object storage (such as Amazon S3). This separation allows us to scale these components individually.
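To illustrate this separation, a single write touches both layers: the blob goes to object storage and only a small metadata row goes to the database. The interfaces below (metadata_db.insert, object_store.put) are hypothetical placeholders for whatever clients we end up using.

    import uuid

    def store_paste(metadata_db, object_store, url_hash, content):
        """Write path split across the two storage layers (sketch)."""
        content_key = str(uuid.uuid4())            # assumed blob-naming scheme
        object_store.put(content_key, content)     # paste contents -> S3-like storage
        metadata_db.insert(url_hash, content_key)  # small metadata row -> database
        return content_key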
Detailed component design
Application and Data Store Layer Design
a. Application Layer:
Our application layer will handle all incoming and outgoing requests. The application servers will communicate with the backend data store components to serve these requests.
Handling a Write Request:
- Upon receiving a write request, the application server generates a random six-character string as the paste key (unless the user provides a custom key).
- The application server stores the paste content and the generated key in the database.
- If the insertion is successful, the server returns the key to the user.
- If a duplicate key is generated, the server should regenerate a new key and retry until successful. If the user's custom key is already in use, an error is returned.
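A sketch of this write path, assuming six-character keys drawn from a 64-character alphabet and a metadata store whose conditional insert fails on duplicates (insert_if_absent is a hypothetical helper):

    import secrets
    import string

    # A 64-character alphabet: letters, digits, plus '-' and '.' as two extra symbols.
    ALPHABET = string.ascii_letters + string.digits + "-."

    def random_key(length=6):
        """Generate an unpredictable six-character paste key."""
        return "".join(secrets.choice(ALPHABET) for _ in range(length))

    def create_paste(db, content, custom_key=None):
        """Insert the paste, retrying on duplicate random keys."""
        if custom_key is not None:
            if not db.insert_if_absent(custom_key, content):  # hypothetical helper
                raise ValueError("custom key already in use")
            return custom_key
        while True:
            key = random_key()
            if db.insert_if_absent(key, content):  # retry on the rare collision
                return key

Using the secrets module rather than a plain pseudo-random generator also satisfies the non-functional requirement that paste links be unpredictable.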
Key Generation Service (KGS):
- An alternative is to use a standalone KGS that generates random six-character strings in advance and stores them in a key-DB.
- The application server retrieves a pre-generated key from the key-DB, ensuring unique keys without worrying about duplicates.
- KGS maintains two tables: one for unused keys and one for used keys. Keys are moved to the used table once assigned.
- KGS keeps some keys in memory for quick access. If KGS fails, unused keys in memory are lost but this is acceptable given the large number of available keys.
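A minimal sketch of the KGS idea, with the unused and used tables modeled as in-memory sets for brevity; a real KGS would persist both tables and hand keys to application servers in batches inside a transaction:

    import secrets
    import string

    ALPHABET = string.ascii_letters + string.digits + "-."

    class KeyGenerationService:
        """Pre-generates random six-character keys and tracks their use."""

        def __init__(self, pregenerate=1000):
            self.unused = set()  # stand-in for the unused-keys table
            self.used = set()    # stand-in for the used-keys table
            while len(self.unused) < pregenerate:
                key = "".join(secrets.choice(ALPHABET) for _ in range(6))
                if key not in self.used:
                    self.unused.add(key)

        def get_key(self):
            """Hand out a key, moving it to the used table."""
            key = self.unused.pop()  # raises KeyError if we ever run dry
            self.used.add(key)
            return key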
Single Point of Failure:
- To avoid KGS being a single point of failure, we can use a standby replica. If the primary server fails, the standby server takes over.
Key Caching:
- Application servers can cache some keys from key-DB to speed up processing. If an application server dies, any unused keys are lost, which is acceptable given the abundance of unique keys.
Handling a Read Request:
- Upon receiving a read request, the application server queries the datastore. If the key is found, the paste contents are returned; otherwise, an error code is returned.
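In sketch form, again with hypothetical store interfaces, the read path is a metadata lookup followed by an object-storage fetch:

    def read_paste(metadata_db, object_store, url_hash):
        """Resolve a paste key to its contents, or None if the key is unknown."""
        record = metadata_db.get(url_hash)               # look up metadata by URLHash
        if record is None:
            return None                                  # caller maps this to an error code
        return object_store.fetch(record["ContentKey"])  # contents live in S3-like storage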
b. Data Store Layer:
The datastore layer is divided into two components:
Metadata Database:
- We can use a relational database like MySQL or a distributed key-value store like DynamoDB or Cassandra.
Object Storage:
- Paste contents are stored in an object storage solution like Amazon S3, which scales easily as our storage needs grow.