System Requirements
Functional
- User Registration and Authentication
- Users can create accounts and delete accounts
- Users can log in and log out
- Passwords can be reset, and account recovery mechanisms are supported
- Posting Content
- Users can submit text posts or links to topic-based communities, called subreddits
- Support for formatting options like Markdown or Rich Text
- Users can also post images, videos and GIFs
- Voting System
- Users can upvote or downvote posts and comments
- Vote fuzzing is implemented to combat manipulation
- Commenting and Discussion
- Users can comment on posts and reply to other comments
- Users can tag other users; a tagged user receives a notification that they were mentioned
- Discussions are threaded so that each reply stays grouped under the comment it responds to
- Subreddit Management
- Users can create new subreddits based on different topics
- Users can subscribe to a subreddit to get posts from that subreddit in their feed
- Subreddits can appoint moderators, who can delete comments, posts etc. within that subreddit
- User Profile
- Users can set profile pictures
- Users can follow other users
- Posts, comments, replies made by a user are displayed in their post history on their profile
- The post history can be sorted by number of likes within a chosen time period
- Users can send each other private messages
- Content Moderation
- A system is set in place to detect and remove spam, offensive content and ensure content quality
- Recommendation Engine
- An engine handles content recommendation based on user activity and likes
- Users can report inappropriate content
- Notification
- Users receive notifications for new comments, likes and other relevant activities
- Notifications are delivered in real time
- Home Feed
- A user's home feed is filled with content based on the recommendation engine and content from subreddits and users they follow
- The home feed can be customized to show popular posts made in the last hour, day, week, month, year etc.
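
The time-window filtering described for the home feed (and for profile history sorting) can be sketched as follows. This is a minimal illustration, not the platform's actual ranking: the `Post` fields, the `top_posts` function, the 30-day month and 365-day year are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Window names follow the list above; month and year lengths are approximations.
WINDOWS = {
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}


@dataclass(frozen=True)
class Post:
    post_id: str
    score: int            # hypothetical field: upvotes minus downvotes
    created_at: datetime


def top_posts(posts, window, now, limit=25):
    """Return the highest-scoring posts created within the given window."""
    cutoff = now - WINDOWS[window]
    recent = [p for p in posts if p.created_at >= cutoff]
    # Highest score first; on equal scores the newer post comes first.
    return sorted(recent, key=lambda p: (p.score, p.created_at), reverse=True)[:limit]
```

In a real deployment this filter would more likely run as an indexed query or against a precomputed ranking rather than over an in-memory list.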
Non-Functional
- Scalability
- The system should be able to handle a large number of posts and users effectively
- Horizontal scaling should be supported to accommodate a growing service
- Performance
- Response times for posting content, voting and commenting should be minimal
- The platform should be responsive even during peak hours
- Reliability
- The system should be resilient to network failures
- Any content posted by the user cannot be lost
- Data should be replicated multiple times over to protect against failures
- Security
- User data should be encrypted and protected against unauthorized access
- Only authorized users should be able to post content
- Consistency
- We aim for eventual consistency here; it is acceptable for users to see posts from a user they follow after a short delay
- Usability
- User experience should be intuitive and smooth
- Easy to navigate
- Compliance
- We need to comply with privacy laws based on regions, such as Europe's GDPR
- Content moderation policies should adhere to community guidelines and legal requirements
Capacity Estimation
- We assume the site has a total of 500 million users, of which 100 million are active daily
- We assume a total of 2 million subreddits
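
To show how these figures could feed a back-of-the-envelope estimate, here is a small worked calculation. Only the daily active user count comes from the assumptions above; the posting rate (one new post per 10 active users per day) and the average post size (about 1 KB of text) are placeholder assumptions chosen purely for illustration and should be replaced with measured values.

```python
DAILY_ACTIVE_USERS = 100_000_000   # from the estimate above
SECONDS_PER_DAY = 86_400

# Placeholder assumptions, NOT from the design notes:
ACTIVE_USERS_PER_POST = 10         # one new post per 10 active users per day
AVG_POST_BYTES = 1_000             # roughly 1 KB of text per post

posts_per_day = DAILY_ACTIVE_USERS // ACTIVE_USERS_PER_POST
avg_post_writes_per_second = posts_per_day / SECONDS_PER_DAY
text_bytes_per_day = posts_per_day * AVG_POST_BYTES

print(posts_per_day)                      # 10000000
print(round(avg_post_writes_per_second))  # 116
print(text_bytes_per_day / 1e9)           # 10.0 (GB of post text per day)
```

Peak traffic is typically several times the average, so these averages are a lower bound for provisioning.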
API Design
- User Management APIs
- /register Creates a new account
- /login Log in to an account
- /logout Log out of an account
- /user/profile Fetch user profile
- /user/update Update user profile
- Post Management APIs
- /post/create Create post
- /post/:postID Get a specific post
- /post/:postID/update Update a post
- /post/:postID/delete Delete a post
- /post/:postID/upvote Upvote a post
- /post/:postID/downvote Downvote a post
- /post/:postID/comments Fetch comments for a post
- Subreddit Management APIs
- /subreddit/create Create a new subreddit
- /subreddit/:subredditID Fetch details of a subreddit
- /subreddit/:subredditID/delete Delete a subreddit
- /subreddit/:subredditID/update Update a subreddit
- /subreddit/:subredditID/posts Fetch posts from a subreddit
- Comment Management APIs
- /comments/:commentID Retrieve details of a specific comment
- /comments/:commentID/update Update a specific comment
- /comments/:commentID/delete Delete a specific comment
- /comments/:commentID/upvote Upvote a comment
- /comments/:commentID/downvote Downvote a comment
- /comments/create Write a new comment
- Moderation and Reporting APIs
- /moderation/posts Retrieve posts reported for moderation
- /moderation/comments Retrieve comments reported for moderation
- /moderation/action Take action on a reported post or comment
- Search & Discovery APIs
- /search/posts Search for posts with a keyword
- /search/users Search for a specific user
- /trending/posts Fetch trending posts
- /recommend/posts Get recommended posts
- Notification APIs
- /notifications Retrieve notifications for the user
- /notification/markread Mark notifications as read
- Misc APIs
- /health Check the health status of the system
- /metrics Retrieve system metrics for monitoring and analytics
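
As a rough sketch of how paths with parameters such as `:postID` could be resolved to handlers, consider the matcher below. The handler names (`create_post`, `get_post`, and so on) and the `resolve` function are hypothetical; the notes above do not specify HTTP methods or a web framework, so neither is assumed here.

```python
import re


def compile_route(template):
    """Turn '/post/:postID/update' into a regex with one named group per parameter."""
    parts = []
    for segment in template.strip("/").split("/"):
        if segment.startswith(":"):
            parts.append(f"(?P<{segment[1:]}>[^/]+)")
        else:
            parts.append(re.escape(segment))
    return re.compile("^/" + "/".join(parts) + "$")


# Static routes are listed before parameterized ones so that
# '/post/create' is not captured by '/post/:postID'.
ROUTES = [
    ("/post/create", "create_post"),
    ("/post/:postID", "get_post"),
    ("/post/:postID/upvote", "upvote_post"),
    ("/post/:postID/comments", "list_post_comments"),
    ("/comments/:commentID", "get_comment"),
]
COMPILED = [(compile_route(template), handler) for template, handler in ROUTES]


def resolve(path):
    """Return (handler_name, path_params) for the first matching route, else None."""
    for pattern, handler in COMPILED:
        match = pattern.match(path)
        if match:
            return handler, match.groupdict()
    return None
```

Because `/post/create` and `/post/:postID` overlap, route order matters in this design; an alternative is to move creation to the collection path so the two never collide.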
Database Design
Below is the diagram that describes the tables.
classDiagram
    class User {
        + userId: string
        + username: string
        + email: string
        + passwordHash: string
    }
    class Post {
        + postId: string
        + userId: string
        + subredditId: string
        + title: string
        + content: string
        + createdAt: date
        + updatedAt: date
    }
    class Comment {
        + commentId: string
        + postId: string
        + userId: string
        + content: string
        + createdAt: date
        + updatedAt: date
    }
    class Subreddit {
        + subredditId: string
        + name: string
        + description: string
        + createdAt: date
        + updatedAt: date
    }
    User "1" --o "*" Post : creates
    User "1" --o "*" Comment : writes
    User "1" --o "*" Subreddit : moderates
    Post "1" --o "*" Comment : has
    Subreddit "1" --o "*" Post : contains
    Comment "1" --o "1" Post : belongs to
- User Data
- Stores user profile data
- Relational Database
- User data is well structured and generally requires ACID transactions. Relational databases provide strong consistency and transactional support, making them suitable for user management, authentication and activity logs.
- We prioritize consistency and partition tolerance
- Posts and Comments Data
- Stores posts and comments data
- Non-relational database
- Posts and comments data is semi-structured and can benefit greatly from sequential reads and writes, so we choose a non-relational database here. Such databases suit read-heavy and write-heavy workloads with loose consistency requirements. Since we are storing posts and comments, a document-style store fits this kind of data well
- We prioritize availability and partition tolerance here while aiming for eventual consistency
- Subreddit Data
- Stores subreddit related data
- Relational Database
- The data is structured and should thus be stored in a relational database, which provides relational modeling capabilities
- We prioritize consistency and partition tolerance here
- For partitioning our relational databases, we can use user IDs to partition data
- For partitioning non-relational databases, we can use subreddit ID or post IDs. This ensures faster reads and writes
- Geographical partitioning is not necessary for a service like Reddit, since it can serve all users globally without any geographical constraints
- If regional regulatory or compliance rules apply, however, data must be partitioned by geographic location across regional data centers
- We scale all our data stores horizontally to ensure performance is not compromised. Also, we replicate data to ensure data reliability
- For storing media files like images, videos etc. we can use an object store
- Content Delivery Networks cache static content such as thumbnails, HTML and CSS close to users worldwide, reducing latency and bandwidth usage
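
The ID-based partitioning described above can be illustrated with a simple hash-to-shard mapping. This is a sketch under assumptions: the `shard_for` function, the key formats and the shard count are hypothetical, and hash-modulo is only one possible scheme.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a partition key (user ID, subreddit ID or post ID) to a shard index.

    A stable hash is used instead of Python's built-in hash(), which is
    randomized per process for strings.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Example: route a user row and a post document to their shards.
user_shard = shard_for("user:12345", 16)
post_shard = shard_for("post:abcde", 16)
```

One known limitation of plain modulo hashing is that changing `num_shards` remaps most keys; consistent hashing is a common alternative when shards are added or removed often.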
High-Level Design
graph TD;
    subgraph Client_Applications
        WebApp(Web Application)
        MobileApp(Mobile Application)
    end
    subgraph Frontend_Services
        LB1(Load Balancer)
        Auth(Authentication Service)
        Notif(Notification Service)
        LB1 --> Auth
        LB1 --> Notif
    end
    subgraph Backend_Services
        API(API Gateway)
        User(User Service)
        Post(Post Service)
        Comment(Comment Service)
        Subreddit(Subreddit Service)
        Search(Search Service)
        Recommend(Recommendation Service)
        Moderate(Moderation Service)
        Notify(Notification Service)
        Analytics(Analytics Service)
        API --> User
        API --> Post
        API --> Comment
        API --> Subreddit
        API --> Search
        API --> Recommend
        API --> Moderate
    end
    subgraph Infrastructure_Components
        LB(Load Balancer)
        Cache(Caching Layer)
        CDN(Content Delivery Network)
        DB(Database Clusters)
        ObjStore(Object Storage)
    end
    Client_Applications --> |Routes Requests| API
    API --> Frontend_Services
    Backend_Services --> LB
    Client_Applications --> CDN
    DB --> ObjStore
    Monitor(Monitoring and Logging) --> Backend_Services
    Security(Security and Firewall) --> Backend_Services
- Client Applications (Web & Mobile)
- Responsible for providing user interfaces for interacting with the platform, allowing users to browse content, submit posts, comment and interact with other users
- Frontend Service
- Authentication Service handles user authentication and authorization, ensuring secure access to the platform
- Notification Service sends real time notifications to users for activities such as new comments on their posts or replies to their comments
- Backend Service
- API Gateway acts as a single entry point for client applications to access backend services, managing requests and routing them to the appropriate service
- User Service handles making changes to user accounts
- Post Service handles creating new posts, updating posts, etc.
- Comment Service is responsible for everything related to comments
- Subreddit Service manages everything related to subreddits
- Search Service helps users search for posts based on keywords
- Recommendation Service recommends posts to a user based on their interests, followed subreddits etc.
- Moderation Service handles content moderation, including detecting and removing spam, offensive content and enforcing community guidelines
- Notification Service sends notifications to a user when someone comments on their comment, replies to them etc.
- Analytics Service collects and analyzes data on user interactions, content popularity and platform usage to provide insights for optimization and decision making
- Infrastructure Components
- Load Balancer distributes incoming traffic across server instances so that no single server is overloaded
- Caching Layer stores frequently accessed data to reduce database load and improve response times
- Content Delivery Network delivers static content to users
- Database Clusters store user, subreddit, post and comment data, replicated for reliability
- Object Stores store media files
- Messaging Queues support asynchronous communication between components of the system, ensuring reliable message delivery and decoupling services
- Monitoring and Logging Services monitor system health and performance, providing insights for troubleshooting, optimization and compliance
- Security and Firewall Components protect the system from unauthorized access, malicious attacks and data breaches
Request Flows
sequenceDiagram
    participant ClientUser as Regular User
    participant ClientMod as Moderator
    participant API as API Gateway
    participant Post as Post Service
    participant Mod as Moderation Service
    participant DB as Database
    ClientUser ->> API: Create New Post
    API ->> Post: Create New Post Request
    Post ->> DB: Save New Post
    DB -->> Post: Post Saved
    Post -->> API: Post Saved Response
    API -->> ClientUser: Post Saved Response
    ClientMod ->> ClientMod: Review Pending Posts
    ClientMod ->> API: Retrieve Pending Posts Request
    API ->> Mod: Retrieve Pending Posts Request
    Mod ->> DB: Retrieve Pending Posts
    DB -->> Mod: Pending Posts Retrieved
    Mod -->> API: Pending Posts Retrieved
    API -->> ClientMod: Pending Posts Retrieved
    ClientMod ->> ClientMod: Review and Approve Post
    ClientMod ->> API: Approve Post Request
    API ->> Mod: Approve Post Request
    Mod ->> DB: Update Post Status to Approved
    DB -->> Mod: Post Status Updated
    Mod -->> API: Post Status Updated
    API -->> ClientMod: Post Status Updated
- User Creates a New Post:
- The client application (web or mobile) sends a request to the API Gateway to create a new post.
- The API Gateway routes the request to the Post Service.
- The Post Service validates the request, creates a new post object, and saves it to the database.
- The Post Service triggers notifications to subscribers of the subreddit where the post was created.
- Moderator Reviews and Approves the Post:
- The moderator accesses the moderation dashboard in the client application.
- The client application sends a request to retrieve pending posts to the API Gateway.
- The API Gateway routes the request to the Moderation Service.
- The Moderation Service retrieves pending posts from the database and presents them to the moderator.
- The moderator reviews the post and decides to approve it.
- The client application sends a request to approve the post to the API Gateway.
- The API Gateway routes the request to the Moderation Service.
- The Moderation Service updates the status of the post in the database to "approved".
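
The two flows above can be read as a small state machine over a post's moderation status. The sketch below is only an illustration: it assumes every new post starts as pending, adds a `REMOVED` state that the flows do not mention, and uses an in-memory dictionary (`InMemoryPostStore`, a hypothetical name) in place of the database.

```python
from enum import Enum


class PostStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REMOVED = "removed"   # assumed extra state; the flow above only names "approved"


# Allowed moves between states (an assumption made for this sketch).
TRANSITIONS = {
    PostStatus.PENDING: {PostStatus.APPROVED, PostStatus.REMOVED},
    PostStatus.APPROVED: {PostStatus.REMOVED},
    PostStatus.REMOVED: set(),
}


class InMemoryPostStore:
    """Stand-in for the database shared by the Post and Moderation services."""

    def __init__(self):
        self._status = {}

    def save_new_post(self, post_id):
        # Post Service: a newly saved post waits for moderation.
        self._status[post_id] = PostStatus.PENDING

    def pending_posts(self):
        # Moderation Service: list posts awaiting review, oldest first.
        return [pid for pid, s in self._status.items() if s is PostStatus.PENDING]

    def update_status(self, post_id, new_status):
        # Moderation Service: apply a moderator decision if the move is allowed.
        current = self._status[post_id]
        if new_status not in TRANSITIONS[current]:
            raise ValueError(
                f"cannot move post {post_id} from {current.value} to {new_status.value}"
            )
        self._status[post_id] = new_status

    def status_of(self, post_id):
        return self._status[post_id]
```

Rejecting invalid transitions in one place keeps a removed post from being silently re-approved by a stale request.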
Detailed Component Design
- User Management Component
- User Roles & Permissions
- Regular users have basic privileges such as creating posts, commenting and voting
- Moderators have additional rights within specific subreddits where they can moderate posts
- Administrators have full control over the platform, including user management, system configuration and policy enforcement
- Threaded discussions are displayed hierarchically, with nested replies indented to visually indicate their relationship to parent comments
- The recommendation engine can recommend content based on what the user follows, and we can use machine learning models to infer what other content the user might like
- We cache trending posts in full. For other posts we cache only the metadata, since users often read the post title without opening the post; the complete post is fetched only when the user opens it
- We can use Least Recently Used (LRU) eviction as our caching strategy: posts that keep being read stay in memory, while posts that have not been accessed recently are evicted over time. The cache holds at most N posts, where N is a configurable parameter
- We implement automated spam detection algorithms to identify and flag suspicious user activity, such as repetitive or low-quality posts and excessive link sharing. Flagged posts can be marked as 'pending moderation'
- We can utilize machine learning and natural language processing (NLP) models to analyze posts for hate speech and harmful content
- We enforce rate limiting and throttling mechanisms to restrict the frequency and volume of user actions, such as posting, commenting etc.
- Set such limits based on average user behavior
- Reporting tools let users report content that slips past our moderation service
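
The LRU caching strategy described above can be sketched in a few lines. This is a minimal in-process illustration with a hypothetical `LRUCache` class; the capacity argument plays the role of the configurable N, and a production system would rely on the cache server's own eviction policy rather than application code like this.

```python
from collections import OrderedDict


class LRUCache:
    """Keeps at most `capacity` entries, evicting the least recently used one."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        # Reading an entry marks it as most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            # Drop the entry that has gone unused the longest.
            self._items.popitem(last=False)
```

For example, with a capacity of 2, storing posts `a` and `b`, reading `a`, then storing `c` evicts `b`, because `a` was touched more recently.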
Trade Offs/Tech choices
- For our relational database of choice, we choose MySQL
- For our non-relational document store of choice, we choose MongoDB
- For our cache, we use Redis, an in-memory key-value store that provides fast reads and writes
- For our object store, we use Amazon S3
- For our CDN, we can use Cloudflare or Akamai
- For our messaging queue, we use Apache Kafka
Failure Scenarios/Bottlenecks
- A hot spot can occur when a celebrity makes a post that draws very high traffic, concentrating a large number of requests on a very small part of our service
Future Improvements
- Implement distributed counters to count upvotes on posts, comments etc.
- Offline support
- Let users give awards to other users
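
The distributed counter idea listed above can be sketched as a sharded counter: writes are spread over several sub-counters so that no single one becomes a hot spot, and reads sum them. The `ShardedCounter` class below is a single-process illustration only; in a real deployment each shard would be a separate key or row in a data store, and the shard count is an assumed tuning parameter.

```python
import random


class ShardedCounter:
    """Spreads increments over several sub-counters and sums them on read."""

    def __init__(self, num_shards=8):
        if num_shards <= 0:
            raise ValueError("num_shards must be positive")
        self._shards = [0] * num_shards

    def increment(self, delta=1):
        # Pick a shard at random so concurrent writers rarely contend on the same one.
        index = random.randrange(len(self._shards))
        self._shards[index] += delta

    def value(self):
        # Reads cost more (one lookup per shard) in exchange for cheaper writes.
        return sum(self._shards)
```

An upvote would call `increment(1)` and a downvote `increment(-1)`; the displayed score is `value()`, which can be cached briefly since eventual consistency is acceptable for vote counts here.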