System requirements
Functional:
- Users can post either binary or text content to the service and get a unique URL back.
- Users can share the content with others by using the unique URL.
- Users can assign tags and a TTL to their content. Other users can only read it.
- Users can change the saved content.
Non-Functional:
- The service should be highly available and highly reliable.
- The service should scale with load and handle peaks in user traffic.
- When a user posts content, it should become visible within roughly 100 milliseconds.
- Only registered users may post content, but everyone can read it.
Capacity estimation
Let's suppose each post is up to 1 MB and the service has 1 million daily active users (DAU).
The service would be read-heavy: assume every user reads posts five times a day and posts new content once a day. New content then amounts to 1 TB per day, or 365 TB per year. Counting one duplicate copy of the data, the service needs to store 730 TB per year and about 3.65 PB over 5 years.
Reads amount to 5 TB of traffic per day for 1 million users, which is about 58 MB/s on average; peak bandwidth must be provisioned above that.
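The arithmetic behind these figures can be checked with a short script. This is only a back-of-the-envelope sketch: the replication factor of 2 is an assumption inferred from the 730 TB per year figure, and decimal units (1 TB = 10^12 bytes) are assumed throughout.

```python
# Back-of-the-envelope capacity estimate using the figures stated in this section.
# REPLICATION_FACTOR = 2 is an assumption inferred from the 730 TB/year figure.
MB = 10**6
TB = 10**12
PB = 10**15

POST_SIZE_BYTES = 1 * MB
DAILY_ACTIVE_USERS = 1_000_000
POSTS_PER_USER_PER_DAY = 1
READS_PER_USER_PER_DAY = 5
REPLICATION_FACTOR = 2
SECONDS_PER_DAY = 24 * 60 * 60

# Storage: every active user writes one post per day.
daily_write_bytes = DAILY_ACTIVE_USERS * POSTS_PER_USER_PER_DAY * POST_SIZE_BYTES
yearly_storage_bytes = daily_write_bytes * 365 * REPLICATION_FACTOR
five_year_storage_bytes = yearly_storage_bytes * 5

# Read traffic: every active user reads five posts per day.
daily_read_bytes = DAILY_ACTIVE_USERS * READS_PER_USER_PER_DAY * POST_SIZE_BYTES
avg_read_bytes_per_second = daily_read_bytes / SECONDS_PER_DAY

print(f"new content per day: {daily_write_bytes / TB:.0f} TB")
print(f"storage per year:    {yearly_storage_bytes / TB:.0f} TB")
print(f"storage for 5 years: {five_year_storage_bytes / PB:.2f} PB")
print(f"reads per day:       {daily_read_bytes / TB:.0f} TB")
print(f"average read rate:   {avg_read_bytes_per_second / MB:.1f} MB/s")
```

The average read rate is a floor, not a target: traffic peaks will sit well above it.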
API design
post(apiKey, userId, content, contentLength, tags, TTL) posts the content to the backend and returns the unique key and URL. The caller passes the content and its length, plus optional tags and a TTL for the content. The apiKey identifies the caller, which lets the service detect and limit abuse.
getPost(apiKey, URL, tags) requests the content by URL and tags and returns it. The apiKey serves the same purpose here.
deletePost(apiKey, userId, URL) deletes the posted content from the service. The apiKey serves the same purpose here.
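The three calls can be illustrated with a small in-memory stand-in. This is a sketch under stated assumptions, not the real backend: the class name InMemoryPostApi, the placeholder domain paste.example, the snake_case method names, the `now` parameter (added so expiry can be exercised deterministically), and the rule that getPost returns the content only when the post carries all requested tags are all hypothetical.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredPost:
    user_id: str
    content: bytes
    tags: list
    expires_at: Optional[float]  # None means the post never expires


class InMemoryPostApi:
    """Hypothetical in-memory stand-in for the backend, for illustration only."""

    BASE_URL = "https://paste.example/"  # placeholder domain (assumption)

    def __init__(self, valid_api_keys):
        self._valid_api_keys = set(valid_api_keys)
        self._posts = {}  # URL -> StoredPost

    def _check_key(self, api_key):
        if api_key not in self._valid_api_keys:
            raise PermissionError("unknown apiKey")

    def post(self, api_key, user_id, content, content_length,
             tags=None, ttl_seconds=None, now=None):
        """Store the content and return (key, url)."""
        self._check_key(api_key)
        if len(content) != content_length:
            raise ValueError("contentLength does not match content")
        now = time.time() if now is None else now
        key = secrets.token_urlsafe(8)
        url = self.BASE_URL + key
        expires_at = None if ttl_seconds is None else now + ttl_seconds
        self._posts[url] = StoredPost(user_id, content, list(tags or []), expires_at)
        return key, url

    def get_post(self, api_key, url, tags=None, now=None):
        """Return the content, or None if missing, expired, or tags do not match."""
        self._check_key(api_key)
        now = time.time() if now is None else now
        stored = self._posts.get(url)
        if stored is None:
            return None
        if stored.expires_at is not None and now >= stored.expires_at:
            del self._posts[url]  # lazily drop expired content
            return None
        if tags and not set(tags).issubset(stored.tags):
            return None
        return stored.content

    def delete_post(self, api_key, user_id, url):
        """Delete the post if it exists and belongs to user_id."""
        self._check_key(api_key)
        stored = self._posts.get(url)
        if stored is None or stored.user_id != user_id:
            return False
        del self._posts[url]
        return True
```

In this sketch only the owner can delete a post, and an expired post is removed the first time someone tries to read it; a real service would more likely rely on the storage layer to expire items.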
Database design
The service can accept eventual consistency when posting content, so AWS DynamoDB is a suitable store.
We would have three tables: User, Post, and RateLimit.
The table User:
id varchar (100 bytes)
email varchar (200 bytes)
createdAt Date (8 bytes)
blocked Boolean (1 byte)
The table Post:
id varchar (100 bytes)
userId varchar (100 bytes)
tags varchar (1000 bytes)
URL varchar (1000 bytes)
content byte array (1 MB)
contentLength integer (8 bytes)
The table RateLimit:
apiKey varchar (1000 bytes)
timestamp date (8 bytes)
The Post table references the User table through userId. The RateLimit table stores request timestamps per apiKey, which is what the rate-limiting functionality is built on.
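One way request timestamps can drive rate limiting is a sliding-window check per apiKey. The sketch below keeps the timestamps in memory instead of in the RateLimit table; the class name, the limit, and the window length are illustrative assumptions.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Keeps recent request timestamps per apiKey, like rows in the RateLimit table."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = defaultdict(deque)  # apiKey -> timestamps, oldest first

    def allow(self, api_key, now=None):
        """Record the request and return True if it is within the limit."""
        now = time.time() if now is None else now
        window = self._timestamps[api_key]
        # Drop timestamps that have fallen out of the window.
        while window and window[0] <= now - self.window_seconds:
            window.popleft()
        if len(window) >= self.max_requests:
            return False  # over the limit; rejected requests are not recorded
        window.append(now)
        return True


limiter = SlidingWindowRateLimiter(max_requests=2, window_seconds=10)
for t in (0.0, 1.0, 2.0):
    print(t, limiter.allow("demo-key", now=t))
```

With several service instances, the timestamps would have to live in shared storage (the RateLimit table or a cache) so that all instances see the same window.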
High-level design
The API Gateway provides DDoS protection, TLS termination, and request routing to the right service nodes.
The main service is the Post service, which handles requests to publish or read posts. New posts are added to Kafka; the Post service then reads and processes them. The Post service keeps the most requested posts in a cache and writes new posts to AWS DynamoDB.
The URL generation service generates unique URLs for new posts ahead of time and stores them in a cache. When the Post service needs a new unique URL, the URL generation service takes one from that cache.
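The pre-generation idea can be sketched as a pool of ready-to-use keys that is refilled when it runs dry. This is a minimal single-process illustration: the class name UrlKeyPool, the pool size, and the key length are assumptions, and the in-memory set stands in for whatever persistent uniqueness check the real service would use.

```python
import secrets
from collections import deque


class UrlKeyPool:
    """Hypothetical pool of pre-generated unique URL keys."""

    def __init__(self, pool_size=100, key_bytes=6):
        if pool_size < 1:
            raise ValueError("pool_size must be at least 1")
        self._pool_size = pool_size
        self._key_bytes = key_bytes
        self._issued = set()   # stands in for a persistent uniqueness check
        self._ready = deque()  # keys generated ahead of time
        self.refill()

    def refill(self):
        """Generate keys until the pool is full, skipping any duplicates."""
        while len(self._ready) < self._pool_size:
            key = secrets.token_urlsafe(self._key_bytes)
            if key not in self._issued:
                self._issued.add(key)
                self._ready.append(key)

    def take(self):
        """Hand out one unused key, refilling first if the pool is empty."""
        if not self._ready:
            self.refill()
        return self._ready.popleft()


pool = UrlKeyPool(pool_size=5)
print(pool.take())
```

Generating keys ahead of time keeps key creation and the uniqueness check off the request path, so publishing a post only has to pop a key from the pool.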
A CDN lets us serve posts from locations closer to users.
Caches may use LRU or LFU eviction policies.
Request flows
The user sends a request to publish a new post, and the API Gateway forwards it to Kafka. The Post service reads this request, asks the URL generation service for a new URL, and writes the completed post to AWS DynamoDB.
When a user wants to read a post, the closest CDN is checked first; if the post is not there, the API Gateway puts the request on a Kafka topic. The Post service reads this request and checks the cache. On a hit, the post is returned to the user; otherwise, the Post service fetches it from AWS DynamoDB, stores it in the cache, and returns it to the user.
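The read path inside the Post service follows the cache-aside pattern. The sketch below uses plain dictionaries as stand-ins for the cache and the database, so the function name and both arguments are illustrative assumptions and not real client APIs.

```python
def read_post(url, cache, database):
    """Return the post stored under url, or None if it does not exist.

    cache and database are dict-like stand-ins for the cache and the
    database table (assumptions for illustration).
    """
    post = cache.get(url)
    if post is not None:
        return post  # cache hit: the database is not touched
    post = database.get(url)
    if post is not None:
        cache[url] = post  # populate the cache for later readers
    return post


cache = {}
database = {"abc123": b"hello"}
print(read_post("abc123", cache, database))  # first read comes from the database
print(read_post("abc123", cache, database))  # second read comes from the cache
```

Only posts that exist are cached in this sketch, so repeated requests for an unknown URL always reach the database.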
Detailed component design
The performance and scalability of the Post service are critical for this system, so we employ two levels of caching: a CDN close to clients and a Redis cache in the data center. Requests naturally show locality of access, so caching will be effective.
The Post service is stateless, so we can run several instances of it. We can use Kubernetes to scale the service and keep it fault-tolerant, since it is a critical part of the system; Kubernetes monitors each instance's availability and health by probing a configured service endpoint.
The Post service keeps the most requested posts in Redis and uses LRU eviction to drop posts that are no longer being requested. It writes posts to AWS DynamoDB, a scalable, fault-tolerant key-value store managed by AWS. It reads requests from the Kafka cluster, where we would replicate partitions to avoid losing data.
To speed up searching posts by tag, we would build an index on tags in AWS DynamoDB.
Closest to the clients, a CDN will store posts. For example, if a celebrity shares a post's URL on a social network, that post should be in the CDN. CDN nodes can be hosted at Internet Exchange Points (IXPs), which keeps response times for clients short. A CDN node has limited storage space, so it should hold only a small set of the most frequently accessed posts. A high volume of requests is then handled by the CDN without ever reaching the API Gateway, which benefits both scalability and fault tolerance.
In the data center, we will use a caching tier such as Redis. Because we can run multiple Redis nodes with hundreds of GB of memory, this tier can hold a larger set of posts than the CDN. It is still faster than accessing the database, so it provides a performance and scalability gain.
Both the CDN and the Redis cache can use a Least Recently Used eviction policy so that currently popular posts stay cached.
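To make the eviction policy concrete, here is a minimal Least Recently Used cache built on the standard library. It illustrates the policy only and is not how Redis or a CDN implements it; the class name and capacity are assumptions.

```python
from collections import OrderedDict


class LruCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._items = OrderedDict()  # least recently used entry comes first

    def get(self, key):
        """Return the cached value (marking it as recently used) or None."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        """Insert or update an entry, evicting the oldest one if over capacity."""
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)

    def __contains__(self, key):
        return key in self._items


cache = LruCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now the most recently used entry
cache.put("c", 3)  # evicts "b", the least recently used entry
print("b" in cache)
```

A post that keeps being read is moved to the recent end on every access, so it survives while rarely read posts are evicted.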
Partitioning
The database and the cache should be partitioned for improved scalability.
The post's unique URL is a good choice for a partitioning key because:
- The services primarily look up the cache and the database by this URL. With the URL as the partitioning key, a service can quickly find the right cache or database node to access.
- It is randomly generated, so posts would be evenly distributed across nodes.
Other partitioning keys (for example, user ID) would be weaker on these points.
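Mapping a key to a node can be as simple as hashing it and taking the result modulo the number of partitions. The sketch below assumes a fixed partition count and uses SHA-256 only to get a stable, well-spread hash; the function name and the key format are illustrative.

```python
import hashlib
from collections import Counter


def partition_for(url_key, num_partitions):
    """Map a URL key to a partition index in [0, num_partitions)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    digest = hashlib.sha256(url_key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_partitions


# Rough look at how evenly 10,000 illustrative keys spread over 4 partitions.
counts = Counter(partition_for(f"key-{i}", 4) for i in range(10_000))
print(sorted(counts.items()))
```

Plain modulo hashing remaps most keys when the number of partitions changes; consistent hashing is the usual remedy when nodes are added or removed often.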
Trade offs/Tech choices
Explain any trade offs you have made and why you made certain tech choices...
Failure scenarios/bottlenecks
Try to discuss as many failure scenarios/bottlenecks as possible.
Future improvements
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?