My Solution for Design Pastebin
by nectar4678
System requirements
Functional:
Here are some potential functional requirements based on common features of Pastebin-like services:
- Create Paste: Users can submit text data which is stored and a unique URL is generated for accessing this paste.
- Read Paste: Users can view a paste using the unique URL provided.
- Delete Paste: Users can delete their paste before the expiration if they choose (this might require user accounts or a delete token).
- Set Expiration: Users can set an expiration time for a paste (e.g., 10 minutes, 1 hour, 1 day, never).
- Syntax Highlighting: The service automatically adds syntax highlighting based on the type of code shared.
- Access Control: Pastes can be public or private (accessible only through the URL).
Non-Functional:
These requirements focus on the system characteristics needed to support the functional aspects effectively:
- Scalability: The system should handle a high number of concurrent users and requests seamlessly.
- Performance: Pastes should be quickly accessible after creation, with low retrieval latency.
- Reliability: High availability and fault tolerance to ensure service is up and running at all times.
- Security: Secure storage of pastes to prevent unauthorized access, especially for private pastes.
- Maintainability: Code and system architecture should be easy to maintain and update.
Capacity estimation
Assumptions
To make informed assumptions, we consider user activity, data sizing, and system interaction patterns:
- Daily Active Users (DAU): Let's assume we have 100,000 daily active users.
- Paste Creation Rate: Each user creates approximately 5 pastes per day.
- Average Paste Size: Each paste is about 10 KB in size.
- Read-to-Write Ratio: Each paste is read 10 times more frequently than it is written.
Calculations
Based on the above assumptions, let's calculate the daily and monthly data generation and access patterns:
- Daily Paste Creations: 100,000 users x 5 pastes/user = 500,000 pastes per day.
- Daily Data Generation: 500,000 pastes x 10 KB = 5,000,000 KB = 5 GB per day.
- Monthly Data Generation: 5 GB/day x 30 days = 150 GB/month.
- Daily Read Volume: 500,000 pastes x 10 reads = 5,000,000 reads per day.
Storage Requirements
Considering storage requirements, including redundancy for reliability and backups for data recovery:
- Immediate Storage Need: Assuming a paste lifespan of 30 days on average, we would need about 150 GB of active storage.
- Backup and Redundancy: Adding redundancy (e.g., triple replication), the storage need would be 450 GB.
Bandwidth Estimations
For handling network traffic, especially during peak times:
- Average Write Rate: 500,000 pastes / 86,400 seconds ≈ 6 writes per second.
- Average Read Rate: 5,000,000 reads / 86,400 seconds ≈ 58 reads per second.
- Data Transfer: At these averages, 6 writes/second x 10 KB ≈ 60 KB/s inbound and 58 reads/second x 10 KB ≈ 580 KB/s outbound; peak traffic could be several times higher, so capacity should be provisioned with headroom.
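These back-of-the-envelope figures can be checked with a few lines of arithmetic (all constants come from the assumptions above):

```python
# Capacity estimates for the Pastebin design, using the stated assumptions.
DAU = 100_000            # daily active users
PASTES_PER_USER = 5      # pastes created per user per day
PASTE_SIZE_KB = 10       # average paste size
READS_PER_PASTE = 10     # read-to-write ratio
RETENTION_DAYS = 30      # average paste lifespan
REPLICATION = 3          # copies kept for redundancy

daily_pastes = DAU * PASTES_PER_USER                       # 500,000 pastes/day
daily_data_gb = daily_pastes * PASTE_SIZE_KB / 1_000_000   # 5 GB/day
monthly_data_gb = daily_data_gb * 30                       # 150 GB/month
daily_reads = daily_pastes * READS_PER_PASTE               # 5,000,000 reads/day

writes_per_sec = daily_pastes / 86_400                     # ~6 writes/s average
reads_per_sec = daily_reads / 86_400                       # ~58 reads/s average

active_storage_gb = daily_data_gb * RETENTION_DAYS         # 150 GB
replicated_storage_gb = active_storage_gb * REPLICATION    # 450 GB
```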
API design
We'll outline the primary API endpoints needed to support the basic functionalities of the service: creating, reading, and deleting pastes, as well as setting expiration times.
1. Create Paste
- Endpoint: POST /pastes
- Description: Allows users to create a new paste.
Request Body:
{
"text": "Hello, world!",
"expiresIn": "1h",
"syntax": "plaintext"
}
Response Body:
{
"pasteId": "abc123XYZ",
"url": "http://example.com/pastes/abc123XYZ"
}
2. Read Paste
- Endpoint: GET /pastes/{pasteId}
- Description: Retrieves the content of a paste by its unique ID.
Response Body:
{
"pasteId": "abc123XYZ",
"text": "Hello, world!",
"createdAt": "2024-05-12T12:00:00Z",
"expiresAt": "2024-05-12T13:00:00Z",
"syntax": "plaintext"
}
3. Delete Paste
- Endpoint: DELETE /pastes/{pasteId}
- Description: Deletes a paste before its expiration time, if authorized.
Response Body:
{
"success": true,
"message": "Paste deleted successfully."
}
4. Update Paste Expiration
- Endpoint: PATCH /pastes/{pasteId}
- Description: Updates the expiration time of an existing paste.
Request Body:
{
"expiresIn": "1d"
}
Response Body:
{
"success": true,
"message": "Expiration updated successfully."
}
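To make the endpoints concrete, here is a minimal in-memory sketch of the create/read/delete operations. The storage dict, the helper names, and the `secrets`-based ID generation are illustrative assumptions, not a prescribed implementation; a real service would sit behind the web and application servers described below.

```python
import secrets
import time

# In-memory stand-in for the paste store.
_pastes = {}

_UNITS = {"m": 60, "h": 3600, "d": 86400}

def _parse_expires_in(expires_in):
    """Turn a duration like '10m', '1h', or '1d' into an absolute epoch time."""
    if expires_in == "never":
        return None
    return time.time() + int(expires_in[:-1]) * _UNITS[expires_in[-1]]

def create_paste(text, expires_in="1h", syntax="plaintext"):
    paste_id = secrets.token_urlsafe(6)   # short, URL-safe unique ID
    _pastes[paste_id] = {
        "text": text,
        "syntax": syntax,
        "createdAt": time.time(),
        "expiresAt": _parse_expires_in(expires_in),
    }
    return {"pasteId": paste_id, "url": f"http://example.com/pastes/{paste_id}"}

def read_paste(paste_id):
    paste = _pastes.get(paste_id)
    if paste is None:
        return None
    expires = paste["expiresAt"]
    if expires is not None and expires < time.time():
        del _pastes[paste_id]             # lazily expire on read
        return None
    return {"pasteId": paste_id, **paste}

def delete_paste(paste_id):
    if _pastes.pop(paste_id, None) is None:
        return {"success": False, "message": "Paste not found."}
    return {"success": True, "message": "Paste deleted successfully."}
```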
Database design
A single pastes table, keyed by paste_id and holding the content, syntax, creation time, expiration time, and an optional owning user, supports the primary functionalities: creating, accessing, and deleting pastes, with efficient lookups by paste_id. A relational database (like PostgreSQL or MySQL) would be suitable given the structured nature of the data and the requirement for high reliability and consistency.
Considerations
- Indexing: Index paste_id for fast retrieval. Consider indexing expires_at for efficient cleanup of expired pastes.
- Cleanup Process: Implement a routine to purge expired pastes based on expires_at to free up space and maintain performance.
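One plausible way to express the pastes table and its cleanup routine (exact column names are assumptions; SQLite is used for illustration, and the same DDL translates directly to PostgreSQL or MySQL):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pastes (
    paste_id   TEXT PRIMARY KEY,          -- unique ID used in the URL
    user_id    TEXT,                      -- NULL for anonymous pastes
    content    TEXT NOT NULL,
    syntax     TEXT DEFAULT 'plaintext',
    is_private INTEGER DEFAULT 0,
    created_at INTEGER NOT NULL,          -- epoch seconds
    expires_at INTEGER                    -- NULL means 'never'
);
-- Supports efficient lookup of expired rows by the cleanup job.
CREATE INDEX idx_pastes_expires_at ON pastes (expires_at);
""")

def purge_expired(conn, now=None):
    """Cleanup routine: delete every paste whose expiry has passed."""
    now = now if now is not None else int(time.time())
    cur = conn.execute(
        "DELETE FROM pastes WHERE expires_at IS NOT NULL AND expires_at < ?",
        (now,),
    )
    conn.commit()
    return cur.rowcount
```

Run periodically (e.g. by the background job system described below), this keeps dead rows from accumulating.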
High-level design
This will outline the main components and their interactions, providing a clear overview of how the system is structured to fulfill the requirements.
- Web Server: Handles HTTP requests, serving the static UI and API endpoints.
- Application Server: Processes business logic, including paste creation, retrieval, deletion, and expiration management.
- Database Server: Stores all data related to pastes and users, including text content, user information, and access logs.
- Cache System: Reduces database load by caching frequently accessed pastes.
- Load Balancer: Distributes incoming traffic and requests to multiple application servers, improving reliability and performance.
- Authentication Service: Manages user authentication and authorizes operations like creating and deleting pastes.
- Background Job System: Handles tasks like purging expired pastes and sending notifications if necessary.
Description of Interactions
- Client: Users interact with the system via a web browser or application.
- Load Balancer: Ensures even distribution of load across servers to maintain performance during peak times.
- Web Server: Serves as the entry point for all requests, forwarding to the application server after initial handling.
- Application Server: The core logic layer that processes all requests, interacts with the authentication service for security, manages cache for efficiency, and schedules tasks.
- Authentication Service: Secures the system by authenticating users and ensuring that requests are authorized.
- Cache System: Improves response times and reduces load on the database by caching popular or recently accessed pastes.
- Database Server: Persistent storage of all data, including user information and paste content.
- Background Job System: Manages time-dependent operations like deleting pastes that have expired.
Request flows
1. Creating a Paste
This flow describes how a new paste is created in the system:
Steps:
- The client sends a POST request to create a new paste.
- The web server forwards the request to the application server.
- The application server validates the request and generates a unique paste ID.
- The paste data is stored in the database along with its expiration time.
- The application server constructs the unique URL from the paste ID.
- The web server sends the URL back to the client.
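The design does not specify how the unique URL is generated; one common approach (an assumption here, not part of the original flow) is to base62-encode an integer ID issued by a database sequence or a distributed ID generator:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n):
    """Encode a non-negative integer as a short base62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

# A 9-digit sequence number becomes a compact URL path segment.
paste_url = "http://example.com/pastes/" + base62_encode(123456789)
```

Base62 keeps URLs short (62^6 ≈ 56 billion IDs in six characters) while staying URL-safe.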
2. Retrieving a Paste
This flow illustrates how a paste is accessed:
Steps:
- The client requests a paste via a GET request.
- The request is processed through the web server to the application server.
- The application server first looks up the cache.
- If the paste is in the cache, it is returned immediately.
- If not, the database is queried, the result is cached, and then the content is returned to the client.
3. Deleting a Paste
This flow details how a paste is deleted:
Steps:
- The client sends a DELETE request to remove a paste.
- The web server forwards the request with authentication details to the application server.
- The application server confirms ownership and then instructs the database to delete the paste.
- Upon successful deletion, a confirmation is sent back to the client.
Detailed component design
1. Application Server
Description
The Application Server is the core logic handler in our architecture, processing all requests related to pastes, such as creating, retrieving, updating, and deleting them. It also handles user authentication and authorization.
Scalability & Performance
- Load Balancing: Multiple instances of the application server can be deployed to handle high traffic, with a load balancer efficiently distributing requests based on server load and health.
- Statelessness: Ensures that each server can handle any request independently, increasing reliability and scalability.
Algorithms/Data Structures
- Rate Limiting: To prevent abuse, a token bucket algorithm can be used to rate limit API requests from users.
- Caching Strategies: LRU (Least Recently Used) or LFU (Least Frequently Used) caching strategies might be implemented in-memory for quick access to popular or recent pastes.
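A minimal sketch of the token bucket mentioned above (the rate and capacity values are illustrative):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: `rate` tokens accrue per second up to
    `capacity`; each request consumes one token or is rejected."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

One bucket per user (or API key) allows short bursts up to `capacity` while enforcing the long-run `rate`.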
2. Cache System
Description
The Cache System serves as a temporary storage for frequently accessed pastes, reducing the load on the database and speeding up response times for read operations.
Scalability & Performance
- Distributed Cache: Using solutions like Redis or Memcached, which support horizontal scaling and can be deployed in a cluster mode to handle larger data volumes and more requests.
- Expiration Handling: Automatic expiration of cache entries based on the paste's expiry time or a default TTL (Time To Live) to maintain freshness.
Algorithms/Data Structures
- Hash Tables: Key-value storage of pastes with O(1) access time.
- Expiration Policies: TTL mechanisms ensure cached data does not go stale and stays consistent with the database.
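The hash-table-plus-TTL combination can be sketched as a small in-process cache; the lazy eviction on read mirrors how distributed caches commonly expire entries:

```python
import time

class TTLCache:
    """Hash-table cache whose entries expire after a per-entry TTL."""

    def __init__(self, default_ttl=60):
        self.default_ttl = default_ttl
        self._store = {}   # key -> (value, expiry based on monotonic clock)

    def set(self, key, value, ttl=None):
        ttl = ttl if ttl is not None else self.default_ttl
        self._store[key] = (value, time.monotonic() + ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if time.monotonic() >= expiry:
            del self._store[key]   # lazy eviction: expired entry removed on read
            return None
        return value
```

Setting each paste's TTL to its remaining lifetime keeps the cache from serving a paste past its expiry.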
3. Database
Description
The Database stores all persistent data including user details, pastes, and access logs. It ensures data integrity and provides a reliable source of truth for the application.
Scalability & Performance
- Partitioning/Sharding: Data can be partitioned based on paste_id or user_id to distribute the load across multiple database servers.
- Replication: Read replicas can be used to distribute read load, especially for high read scenarios.
Algorithms/Data Structures
- B+ Trees: Used in indexing to allow quick search, insertion, and deletion of records.
- ACID Transactions: Ensures that all database transactions are processed reliably, especially important for operations like creating users and deleting pastes.
Trade-offs/Tech choices
1. Stateless Application Servers
- Choice: Implementing stateless application servers.
- Trade-off: While this increases scalability by allowing requests to be evenly distributed across multiple servers without session continuity concerns, it can complicate scenarios where state persistence is beneficial (e.g., sessions, multi-step transactions).
- Justification: Facilitates easy scaling and ensures that the failure of one server doesn't disrupt the overall service.
2. Use of a Distributed Cache System
- Choice: Employing a distributed cache system like Redis.
- Trade-off: Adds complexity and overhead in managing another layer in the stack but significantly enhances read performance.
- Justification: Reduces database load and improves response times for frequently accessed data, crucial for a read-heavy environment.
3. Database Replication
- Choice: Using a master-slave replication model for the database.
- Trade-off: Increases data redundancy and resource usage but may introduce replication lag under heavy load.
- Justification: Enhances read capacity and ensures data availability, allowing the system to continue functioning even if the master database fails.
4. Microservices vs. Monolith
- Choice: Structuring the back-end as a set of microservices (Authentication, Application Logic, Background Jobs).
- Trade-off: While microservices increase the complexity of the system and require more sophisticated orchestration and monitoring, they allow for more focused scaling and independent development and deployment.
- Justification: Provides flexibility in managing different aspects of the system and can evolve independently as needs change.
5. Load Balancing Strategy
- Choice: Using a load balancer to distribute incoming traffic.
- Trade-off: Introduces a potential single point of failure and additional network latency.
- Justification: Ensures even distribution of traffic and improves fault tolerance by rerouting traffic if a server fails.
Failure scenarios/bottlenecks
Cache Overload:
- Scenario: If the cache layer is overwhelmed by too many read requests, it might slow down, causing increased latency or even service outages.
- Bottleneck: Limited cache capacity and throughput, especially under spike traffic conditions.
Database Write Throughput:
- Scenario: During peak loads, the master database might struggle to handle a high volume of write operations, affecting performance and potentially causing data write delays.
- Bottleneck: Single master in the database setup can become a write bottleneck.
Single Point of Failure in Load Balancer:
- Scenario: If the load balancer fails, the entire service can become unreachable.
- Bottleneck: Dependency on a central component for traffic distribution.
Future improvements
Enhancing Cache Resilience:
- Improvement: Implement an elastic cache with auto-scaling capabilities to handle variable loads seamlessly.
- Benefit: Maintains performance under spike loads and reduces the risk of cache-related failures.
Database Scalability:
- Improvement: Introduce more sophisticated database sharding and potentially move to a multi-master setup if write volume becomes a limitation.
- Benefit: Increases write throughput and resilience, distributing the load more evenly across the system.