System requirements
Functional:
- Store user text (limit of 50 pastes per user)
- Text must be under 10k characters
- User settable expiration for text (default 30 days; max 365 days)
- Paste access is settable (public, shared, private)
Non-Functional:
- Low latency
- Eventual consistency
Capacity estimation
- 10k DAU
- 2 uploads / user / day
- 20k uploads / day
- Assume 1k characters avg per paste
- 20k uploads * 1k characters * 1B / char = 20MB upload / day
- 20k / 100k = 0.2 write QPS (a day is roughly 100k seconds)
- Read:write ratio = 10
- 2 read QPS
- 20MB * 365 ~= 7.3GB per year
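As a sanity check, the estimate can be recomputed directly from the stated inputs. This is only a back-of-envelope sketch: the variable names are made up for illustration, and the 100k seconds-per-day figure is the same rounding used in the write QPS line.

```python
# Back-of-envelope capacity numbers, recomputed from the stated inputs.
DAU = 10_000                    # daily active users
UPLOADS_PER_USER_PER_DAY = 2
AVG_PASTE_BYTES = 1_000         # 1k characters at 1 byte per character
SECONDS_PER_DAY = 100_000       # rounded up from 86,400 to keep the math simple
READ_WRITE_RATIO = 10

uploads_per_day = DAU * UPLOADS_PER_USER_PER_DAY
bytes_per_day = uploads_per_day * AVG_PASTE_BYTES
write_qps = uploads_per_day / SECONDS_PER_DAY
read_qps = write_qps * READ_WRITE_RATIO
bytes_per_year = bytes_per_day * 365

print(f"uploads/day : {uploads_per_day:,}")
print(f"storage/day : {bytes_per_day / 1e6:.0f} MB")
print(f"write QPS   : {write_qps:.1f}")
print(f"read QPS    : {read_qps:.1f}")
print(f"storage/year: {bytes_per_year / 1e9:.1f} GB")
```

Multiplying 20k uploads by 1 kB gives about 20 MB per day, or roughly 7.3 GB per year, which is still small enough to hold in memory on a single machine.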
API design
Upload a paste to pastebin
POST /v1/pastes/
params: {
username
text
access
expiration (in days)
}
returns: {
url
}
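A minimal sketch of how the web server might validate these params against the functional requirements. The names (CreatePasteRequest, parse_create_paste) are hypothetical, and treating username, text, and access as required fields is an assumption; only the limits themselves come from the requirements.

```python
from dataclasses import dataclass

MAX_TEXT_CHARS = 10_000          # text must be under 10k characters
DEFAULT_EXPIRATION_DAYS = 30
MAX_EXPIRATION_DAYS = 365
ACCESS_LEVELS = ("public", "shared", "private")


@dataclass(frozen=True)
class CreatePasteRequest:
    username: str
    text: str
    access: str
    expiration_days: int


def parse_create_paste(params: dict) -> CreatePasteRequest:
    """Validate POST /v1/pastes/ params; raise ValueError on bad input."""
    username = params.get("username")
    if not isinstance(username, str) or not username:
        raise ValueError("username is required")

    text = params.get("text")
    if not isinstance(text, str) or not text:
        raise ValueError("text is required")
    if len(text) >= MAX_TEXT_CHARS:
        raise ValueError("text must be under 10k characters")

    # ASSUMPTION: access has no default and must be sent explicitly.
    access = params.get("access")
    if not isinstance(access, str) or access not in ACCESS_LEVELS:
        raise ValueError("access must be public, shared, or private")

    # Expiration is optional and falls back to the 30-day default.
    expiration = params.get("expiration", DEFAULT_EXPIRATION_DAYS)
    if isinstance(expiration, bool) or not isinstance(expiration, int):
        raise ValueError("expiration must be a whole number of days")
    if not 1 <= expiration <= MAX_EXPIRATION_DAYS:
        raise ValueError("expiration must be between 1 and 365 days")

    return CreatePasteRequest(username, text, access, expiration)
```

The 50-paste-per-user limit is not checked here, since it needs a lookup of the user's existing pastes rather than a look at the request alone.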
Get a paste
Send a GET request to the URL returned by the upload call
Database design
Pastes:
- Since a year of paste data is only about 7GB (20k pastes / day at about 1kB each), and the maximum expiration time is one year, all paste data fits comfortably in an in-memory store like Redis. Redis is also useful here because of its time-to-live (TTL) feature: each paste can be set to expire automatically at the expiration time chosen by the user. Pastes are keyed by their unique URL, which is a short base-62 string. A paste object in the cache stores the following fields:
- User ID of its author
- Access policy
- Created At (timestamp)
- Body (actual paste text)
- Expiration policy
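The paste object and its TTL could be sketched as below. The field names, the "paste:" key prefix, and the JSON encoding are all assumptions made for illustration; the function only builds the key, value, and TTL so it runs without a Redis server.

```python
import json
from dataclasses import dataclass, asdict

SECONDS_PER_DAY = 86_400


@dataclass
class Paste:
    author_id: int        # user ID of the author
    access: str           # "public", "shared", or "private"
    created_at: float     # Unix timestamp
    body: str             # the actual paste text
    expiration_days: int  # expiration chosen by the user


def to_redis_entry(url_key: str, paste: Paste) -> tuple[str, str, int]:
    """Return (key, JSON value, TTL in seconds) for one paste."""
    ttl_seconds = paste.expiration_days * SECONDS_PER_DAY
    # ASSUMPTION: keys are namespaced with a "paste:" prefix.
    return f"paste:{url_key}", json.dumps(asdict(paste)), ttl_seconds


key, value, ttl = to_redis_entry(
    "aZ3", Paste(42, "public", 1_700_000_000.0, "hello", 30)
)
# With the redis-py client this would be stored roughly as:
#     client.set(key, value, ex=ttl)
```

Passing the TTL along with the write lets Redis drop the paste on its own once the expiration passes, so no separate cleanup job is needed for the cache.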
Users:
- User data won't be modified or read nearly as frequently as the pastes, so a relational DB should be acceptable for Users. A row in the relational DB should store at least the following info for a given user:
- User ID
- Username
High-level design
Functional Blocks:
- Web Server: Handles requests, authentication, authorization, Unique URL generation
- Paste DB: Stores pastes
- User DB: Stores User info
Request flows
Creating a new paste:
- Client uploads a new paste
- Load Balancer routes request to web server
- Web server generates a new unique URL for the paste
- Web server persists the paste to the Paste DB, keyed by its URL
- The web server returns the new URL to the client
Reading a paste:
- The client follows the paste URL
- The GET request is routed through the load balancer to the Web Server.
- The Web Server looks up the paste in the Paste DB with the URL as the key.
- The Web Server checks the paste's access policy and returns the paste body to the client.
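The read flow, including the authorization step the Web Server is responsible for, might look like the sketch below. A plain dict stands in for the Paste DB, and the function names are hypothetical. The meaning of "shared" is not defined in the requirements, so the rule used here (any signed-in user may read) is purely an assumption.

```python
from typing import Optional


def can_read(paste: dict, viewer_id: Optional[int]) -> bool:
    """Apply the paste's access policy to the requesting user."""
    access = paste["access"]
    if access == "public":
        return True
    if access == "shared":
        # ASSUMPTION: "shared" means any signed-in user may read.
        return viewer_id is not None
    # "private": only the author may read.
    return viewer_id is not None and viewer_id == paste["author_id"]


def read_paste(store: dict, url_key: str, viewer_id: Optional[int]) -> tuple[int, str]:
    """Return (HTTP status, response body) for a GET on a paste URL."""
    paste = store.get(url_key)
    if paste is None:
        # Either the key never existed or the paste has already expired.
        return 404, "not found"
    if not can_read(paste, viewer_id):
        return 403, "forbidden"
    return 200, paste["body"]


store = {
    "abc": {"author_id": 1, "access": "private", "body": "secret"},
    "xyz": {"author_id": 1, "access": "public", "body": "hello"},
}
```

An expired paste and a paste that never existed look the same to the reader, since TTL expiry simply removes the key.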
Detailed component design
- Fault tolerance: This system likely fits easily onto one or two physical machines. However, this constitutes a single point of failure. In order to prevent data loss, this system should be replicated over multiple availability zones.
- DB Design: The entire paste DB fits easily into a single Redis instance. However, Redis used purely as an in-memory cache is not durable, so all writes to the cache should also be asynchronously forwarded to a persistent DB. A key-value store or even a simple append-only log should work well here, since this DB is written to constantly but only read when reloading the cache after an outage. To support replicating the system across multiple datacenters, writes to one DB instance should be asynchronously forwarded to its replicas.
- Unique URL generation: Since the system is now distributed geographically, we need to ensure that two pastes are never given the same URL. To achieve this, each URL should contain the unique ID of the web server it was generated on. This could be done by simply appending the web server's ID to the end of the URL.
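One way the URL scheme above could work is sketched here: each web server encodes a local counter in base 62 and appends its own ID. The class name, the per-server counter, and the fixed two-character ID width are assumptions, not part of the design above; the fixed width is what keeps the counter and the server ID from running together ambiguously.

```python
import itertools
import string

# 0-9, a-z, A-Z: 62 symbols in total.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
SERVER_ID_WIDTH = 2  # ASSUMPTION: at most 62**2 = 3844 web servers


def base62(n: int) -> str:
    """Encode a non-negative integer as a base-62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


class UrlKeyGenerator:
    """Generates keys that cannot collide across web servers."""

    def __init__(self, server_id: int):
        if not 0 <= server_id < 62 ** SERVER_ID_WIDTH:
            raise ValueError("server_id out of range")
        # Pad the server ID to a fixed width so the suffix is unambiguous.
        self._suffix = base62(server_id).rjust(SERVER_ID_WIDTH, ALPHABET[0])
        self._counter = itertools.count(1)

    def next_key(self) -> str:
        return base62(next(self._counter)) + self._suffix


server_a = UrlKeyGenerator(server_id=1)
server_b = UrlKeyGenerator(server_id=2)
print(server_a.next_key(), server_b.next_key())  # prints "101 102"
```

The counter here lives in memory, so a real server would need to persist it (or derive it from something like a timestamp) to avoid reusing keys after a restart. Sequential keys are also guessable, which may matter for private pastes.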
Trade offs/Tech choices
- Monolithic architecture: This system has very few components, and the Web Server handles all of the system's business logic. While this architecture can be considered a single point of failure, it avoids much of the complexity that comes with separating functionality into different services.
Failure scenarios/bottlenecks
- Web Server crashes: This could be solved by keeping two or three servers online in each availability zone, or by routing requests to a different availability zone if this one is unresponsive.
- Redis cache crashes: Paste writes will need to be routed to another datacenter for the time being, but once the Redis cache is back online, it can be reloaded from the local Paste DB. The Paste DB should have eventual consistency with its replicas.
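Reloading the cache from the local Paste DB could be as simple as replaying an append-only log. The sketch below assumes a JSON-lines file and hypothetical function names; it skips pastes that expired during the outage and recomputes the remaining TTL for the rest.

```python
import json
import math


def append_write(log_path: str, key: str, paste: dict, expires_at: float) -> None:
    """Append one paste write to the persistent log (one JSON object per line)."""
    record = {"key": key, "paste": paste, "expires_at": expires_at}
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")


def reload_cache(log_path: str, now: float) -> dict:
    """Replay the log and return {key: (paste, remaining TTL in seconds)}."""
    cache = {}
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            record = json.loads(line)
            remaining = record["expires_at"] - now
            if remaining > 0:
                # Round up so a nearly expired paste still gets a TTL of at least 1.
                cache[record["key"]] = (record["paste"], math.ceil(remaining))
    return cache
```

Each surviving entry would then be written back to Redis with its remaining TTL. A log like this grows without bound, so it would need periodic compaction to drop expired records.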
Future improvements
- Integrate a pay-per-use system where users pay a certain rate every time they upload a new paste. This could allow us to relax our limit of 50 pastes per user. If this leads to a significant increase in demand, we would likely need to scale our DB architecture, perhaps by sharding on a hash of the paste URL.
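Sharding on a hash of the paste URL could be sketched as follows. The function name and the choice of SHA-256 are illustrative assumptions; any stable hash would do, as long as every web server computes the same shard for the same key.

```python
import hashlib


def shard_for(url_key: str, num_shards: int) -> int:
    """Map a paste URL key to a shard index in [0, num_shards)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # Use a stable hash; Python's built-in hash() is randomized per process.
    digest = hashlib.sha256(url_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


shard = shard_for("aZ301", num_shards=4)
```

Plain modulo hashing remaps most keys whenever the shard count changes, so a scheme such as consistent hashing may be worth considering if shards are added often.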