System requirements


Functional:

  1. User Registration and Authentication:
    • Users should be able to create accounts, log in, and log out securely.
    • Option for email verification and password recovery.
  2. Content Submission:
    • Users can submit different types of content, including text posts, links, images, and videos.
    • Each submission should have a title and description.
  3. Voting Mechanism:
    • Users can upvote or downvote submissions to influence their visibility.
    • Display the score of a submission (net votes).
  4. Commenting and Discussions:
    • Users can comment on posts, and comments should support nesting for threaded discussions.
    • Users should be able to reply to other comments.
  5. Subreddits (Communities):
    • Users can create and join subreddits based on interests or topics.
    • Each subreddit should have its own rules, moderators, and content restrictions.
  6. Moderation Tools:
    • Users (especially moderators) can remove or report harmful content.
    • Implement mechanisms for user flags and reports.
  7. User Profiles:
    • User profiles should display submitted content, upvotes, comments, and awards.
    • Support for user customization (profile picture, bio).
  8. Content Feed and Discovery:
    • A home feed that shows trending posts based on user subscriptions and global trends.
    • Search functionality for finding content and subreddits.
  9. Notification System:
    • Notify users about replies to their comments, mentions, upvotes, and messages.
  10. Analytics and Insights:
    • Provide users and moderators with insights on post performance, engagement metrics, and community growth.


Non-Functional:

1. Performance:

  • Latency: Comment submission and retrieval should respond in under 200 ms so that the experience feels seamless to users.
  • Throughput: The system should handle at least 5,000 concurrent users and support multiple submissions and votes per second.

2. Scalability:

  • The system must be capable of scaling to accommodate growing user bases and increased content submissions, ideally scaling horizontally (adding more servers) rather than vertically.

3. Availability:

  • The system should maintain 99.9% uptime to ensure users can interact with it without interruptions.
  • Implement redundancy to ensure that data is preserved even during component failures.

4. Data Consistency:

  • For comments and votes, ensure eventual consistency to manage discrepancies due to distributed systems.
  • Users should receive immediate feedback for their votes and comments, while background processes manage the overall consistency in the database.

5. Security:

  • All data transmissions should be encrypted (HTTPS) to protect user information and content.
  • Implement proper authentication mechanisms to prevent unauthorized access and ensure user privacy.

6. Usability:

  • The commenting system, feeds, and content submissions should have an intuitive user interface that minimizes the learning curve for new users.
  • Provide clear feedback on actions (e.g., successful submission, error messages).

7. Maintainability:

  • The system should be designed with maintainability in mind, using modular components that facilitate easier updates and debugging.

8. Storage Limits:

  • Set limits on the length of comments and the size of uploaded images to avoid excessive database growth and ensure efficient storage solutions.

9. Interoperability:

  • The system should support integration with third-party services (e.g., image hosting, analytics) and APIs to extend functionality as needed.

10. Accessibility:

  • Ensure the platform adheres to accessibility standards (like WCAG) so that users with disabilities can effectively engage with content.





Capacity estimation

comments are 1,000 to 2,000 characters = up to 2000 bytes per comment (assuming 1 byte per character)

max image size = 5MB

max post size = 20000 chars = 20000 bytes per post

DAU = 1000000 users

3 posts per day per user

10 comments per day per user

2 images per day per user

total amount of data per user per day = 3 * 20000 / 1024 / 1024 + 10 * 2000 / 1024 / 1024 + 2 * 5 ≈ 10 MB

total amount of data per day = total amount of data per user per day * DAU = 10 MB * 1000000 / 1024 / 1024 ≈ 9.54 TB per day

Assuming one server can hold 1 TB of data, we need about 10 servers just to hold one day of data

for a year we need about 3650 servers to hold the data
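
The arithmetic above can be checked with a short script. This is a minimal sketch: the constant and function names are illustrative, and it assumes 1 byte per character and 1 TB of usable space per server, as the estimate does. It keeps the unrounded per-user figure, so the daily total comes out slightly higher than the value obtained from the rounded 10 MB.

```python
import math

MIB = 1024 * 1024

POST_BYTES = 20_000      # max post size, assuming 1 byte per character
COMMENT_BYTES = 2_000    # max comment size, assuming 1 byte per character
IMAGE_MIB = 5            # max image size
DAU = 1_000_000          # daily active users
SERVER_TIB = 1           # assumed usable capacity of one storage server


def per_user_mib(posts: int = 3, comments: int = 10, images: int = 2) -> float:
    """Worst-case data produced by one user in one day, in MiB."""
    return posts * POST_BYTES / MIB + comments * COMMENT_BYTES / MIB + images * IMAGE_MIB


def per_day_tib(dau: int = DAU) -> float:
    """Worst-case data produced by all users in one day, in TiB."""
    return per_user_mib() * dau / MIB  # MiB -> TiB is two divisions by 1024


servers_per_day = math.ceil(per_day_tib() / SERVER_TIB)
servers_per_year = servers_per_day * 365

if __name__ == "__main__":
    print(f"per user per day: {per_user_mib():.2f} MiB")
    print(f"per day:          {per_day_tib():.2f} TiB")
    print(f"servers per day:  {servers_per_day}")
    print(f"servers per year: {servers_per_year}")
```

Images dominate the total, so the image size limit and the number of uploads per user are the inputs worth revisiting first.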


API design

POST /user/create

DELETE /user?id

PATCH /user?id


POST /subreddit

PATCH /subreddit/

PATCH /subreddit/


POST /subreddit//comment

DELETE /subreddit//


POST /subreddit//upvote

POST /subreddit//downvote

POST /subreddit//report

POST /subreddit//flag


POST /subreddit///upvote

POST /subreddit///downvote


POST /subreddit/create?name


GET /feed/top

GET /feed/latest


POST /notifications/subscribe?id

POST /notifications/unsubscribe?id


GET /analytics?subreddit



Database design

classDiagram


  User: +int UserID PK

  User: +String Name INDEX

  User: +String Password

  User: +String Email INDEX

  User: +DateTime createdAt

  User: +Role Role


  Subreddit: +String Name INDEX

  Subreddit: +int SubredditID PK

  Subreddit: +DateTime createdAt

  Subreddit: +String Description

  Subreddit: +int TotalUsersCount


Post <|-- Subreddit

Post <|-- User

  Post: +int UserID FK

  Post: +int PostID PK

  Post: +int SubredditID FK

  Post: +DateTime createdAt

  Post: +String Body

  Post: +String Title


Comment <|-- Post

Comment <|-- User

Comment <|-- Comment

  Comment: +int CommentID PK

  Comment: +int UserID FK

  Comment: +int PostID FK

  Comment: +int ParentCommentID FK

  Comment: +DateTime createdAt

  Comment: +String Body


Image <|-- ImageMeta

Image <|-- Post

  Image: +int ImageID PK

  Image: +int ImageMetaID FK

  Image: +int PostID FK

  Image: +String S3Link


ImageMeta <|-- User

  ImageMeta: +int UserID FK

  ImageMeta: +int ImageMetaID PK

  ImageMeta: +String Title

  ImageMeta: +DateTime createdAt


PostVotes <|-- Post

  PostVotes: +int PostID FK

  PostVotes: +int upvotes

  PostVotes: +int downvotes

CommentVotes <|-- Comment

  CommentVotes: +int CommentID FK

  CommentVotes: +int upvotes

  CommentVotes: +int downvotes
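
The Comment table stores threads as flat rows, each pointing at its parent comment. One way to turn those rows into a nested thread for display is sketched below. The class and function names are hypothetical, and the sketch assumes all comments of a post have already been loaded into memory; a comment whose parent is missing from the result set is treated as a top-level comment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CommentRow:
    """One row of the Comment table (only the columns needed here)."""
    comment_id: int
    post_id: int
    parent_comment_id: Optional[int]  # None for a top-level comment
    body: str


@dataclass
class CommentNode:
    row: CommentRow
    replies: List["CommentNode"] = field(default_factory=list)


def build_thread(rows: List[CommentRow]) -> List[CommentNode]:
    """Group flat comment rows into a tree, preserving the input order."""
    nodes: Dict[int, CommentNode] = {r.comment_id: CommentNode(r) for r in rows}
    roots: List[CommentNode] = []
    for r in rows:
        node = nodes[r.comment_id]
        parent = None
        if r.parent_comment_id is not None:
            parent = nodes.get(r.parent_comment_id)
        if parent is None:
            # Top-level comment, or the parent was not loaded (assumption:
            # show it at the top level rather than dropping it).
            roots.append(node)
        else:
            parent.replies.append(node)
    return roots
```

Building the tree takes one pass over the rows plus one dictionary lookup per row, so it stays linear in the number of comments on the post.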

 


High-level design







Request flows

When creating a post:

1) The user contacts an entry point

2) The request is load balanced to an appropriate instance of the post service

3) The user creates a post in the subreddit in question

4) The post is persisted in the closest write replica and is replicated to the read replicas asynchronously

5) The write is acknowledged to the user as soon as it is persisted. We can do that because it is fine if other users briefly see stale data; having the most recent posts is not that important
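
The acknowledge-first, replicate-later behaviour in steps 4 and 5 can be sketched as follows. This is a toy in-memory model with hypothetical class names, not a real database client: the write replica acknowledges immediately and queues the change, and a separate step ships the queue to a read replica.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class WriteReplica:
    def __init__(self) -> None:
        self.posts: Dict[int, str] = {}
        # Changes that have been persisted locally but not yet replicated.
        self.backlog: Deque[Tuple[int, str]] = deque()

    def create_post(self, post_id: int, body: str) -> str:
        self.posts[post_id] = body            # persist locally (step 4)
        self.backlog.append((post_id, body))  # replicate later
        return "acknowledged"                 # acknowledge right away (step 5)


class ReadReplica:
    def __init__(self) -> None:
        self.posts: Dict[int, str] = {}


def replicate(writer: WriteReplica, reader: ReadReplica) -> int:
    """Ship the pending changes to a read replica; returns how many were sent."""
    shipped = 0
    while writer.backlog:
        post_id, body = writer.backlog.popleft()
        reader.posts[post_id] = body
        shipped += 1
    return shipped
```

Between `create_post` returning and `replicate` running, a reader served by the read replica does not see the new post. That window is the stale read the design accepts.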


When getting the feed:

1) After a user creates a post and it is written to disk (write acknowledged), the feed service is notified with the updated relevance score of the subreddits

2) The feed service updates its in-memory heap data structure to reflect the newest hottest entries (if any)

3) The top entries of the feed (number chosen by the user) are shown to the user





Detailed component design

The feed and post DB layers consist of multiple read and write replicas. Write replicas are for increased durability and read replicas are for increased read throughput. The focus is on read replicas because most users read posts rather than create them. Replication is asynchronous because it is not critical to show users the very latest posts or comments. Durability on every write replica node is achieved with a write-ahead log, and crash recovery and replication are facilitated by the transaction log. Conflict resolution within the write replica cluster is done through version clocks kept for every record. If a version clock grows too large (the limit is adjustable), we remove its older entries.
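
A minimal sketch of how two versions of a record could be compared is shown below. It interprets "version clock" as a vector clock with one counter per write replica, which is an assumption about the design; the function names and the replica ids `w1` and `w2` are hypothetical. Pruning of old entries is left out.

```python
from typing import Dict

Clock = Dict[str, int]  # write replica id -> number of writes seen from it


def increment(clock: Clock, node_id: str) -> Clock:
    """Return a copy of the clock advanced by one write on node_id."""
    updated = dict(clock)
    updated[node_id] = updated.get(node_id, 0) + 1
    return updated


def compare(a: Clock, b: Clock) -> str:
    """Classify a relative to b: 'before', 'after', 'equal' or 'concurrent'."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"  # a real conflict: neither version saw the other
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"


def merge(a: Clock, b: Clock) -> Clock:
    """Clock for a record after a conflict between a and b has been resolved."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
```

Only the `"concurrent"` case needs an application-level resolution rule. In the other cases the newer version simply replaces the older one.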


The top posts component consists of an aggregator service that holds an in-memory data structure (a heap, that is, a binary tree kept in heap order) containing the most popular posts across the whole system. When a new post is written to a write replica, the write is also propagated to the top feed service, and if the post is hot enough to rank among the top posts it is inserted into the heap. To deduplicate posts arriving from potentially multiple replicas, we can use a set data structure with first-entry-wins logic.
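
The aggregator described above can be sketched with Python's `heapq` module. The class and method names are hypothetical, and the sketch assumes each post arrives with a single numeric score. It keeps a bounded min-heap so the weakest retained post sits at the root, and a set so that the first delivery of a post wins.

```python
import heapq
from typing import List, Set, Tuple


class TopPosts:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Min-heap of (score, post_id): the root is the weakest post we keep.
        self._heap: List[Tuple[float, int]] = []
        # Post ids already offered. This grows without bound in the sketch;
        # a real service would need to expire old ids.
        self._seen: Set[int] = set()

    def offer(self, post_id: int, score: float) -> bool:
        """Try to add a post; returns True if it entered the top list."""
        if post_id in self._seen:
            return False  # duplicate delivery from another replica
        self._seen.add(post_id)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, (score, post_id))
            return True
        if score > self._heap[0][0]:
            heapq.heapreplace(self._heap, (score, post_id))  # evict the weakest
            return True
        return False

    def top(self, n: int) -> List[int]:
        """Post ids of the n highest-scoring posts, best first."""
        return [post_id for _, post_id in heapq.nlargest(n, self._heap)]
```

Each `offer` costs O(log capacity), and memory for the heap is bounded by `capacity`, which is what lets the structure fit on one machine.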





Trade offs/Tech choices

For posts and subreddits we can use SQL storage because the data is structured.

For images we use S3 as a standard for blob storage.

We use read replicas for posts, comments, and top feeds to increase read throughput, as reads dominate writes in our system.

For the cache we use Redis as an industry standard.

For top feeds we use an in-memory heap data structure whose size is limited by the global number of top feed entries that we show (adjustable). This is done to make sure that the top feed data structure fits on one machine. The data on this machine is replicated to other machines for durability and for increased read throughput.

To limit the amount of data sent over the network when we upload a post or a comment, we use stored procedures in the SQL DB.

To facilitate a secure connection between the client and our system we use the HTTPS protocol.




Failure scenarios/bottlenecks

If a read replica of any of our services fails, we can restore its data once it is back up by sending it the updated transaction log. The last timestamp in the replica will be earlier than the latest timestamp in the log, so the new data (the delta) is restored from the log.
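
The catch-up step can be sketched as replaying only the log entries that are newer than what the replica last applied. The types and names below are hypothetical, and the sketch assumes log timestamps are unique and strictly increasing, like a log sequence number.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class LogEntry:
    timestamp: int  # assumed unique and strictly increasing
    key: str
    value: str


@dataclass
class Replica:
    data: Dict[str, str] = field(default_factory=dict)
    last_applied: int = 0  # timestamp of the newest entry already applied

    def apply(self, entry: LogEntry) -> None:
        self.data[entry.key] = entry.value
        self.last_applied = entry.timestamp


def catch_up(replica: Replica, log: List[LogEntry]) -> int:
    """Replay the delta from the transaction log; returns entries applied."""
    delta = [e for e in log if e.timestamp > replica.last_applied]
    for entry in sorted(delta, key=lambda e: e.timestamp):
        replica.apply(entry)
    return len(delta)
```

Running `catch_up` twice with the same log is harmless: the second call finds an empty delta and changes nothing.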

If the top feed data structure is down (including all replicas), we can rebuild it from the latest data using a map-reduce framework: we divide the data to be processed across different machines, and every machine produces its top elements (a small subset of the data). The whole job will take a long time, but it only needs to run once.
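
The rebuild can be sketched as a local top-k per partition (map) followed by a top-k over the partial results (reduce). The function names are hypothetical and the partitions are plain in-memory lists here; in a real job each `map_partition` call would run on a different machine.

```python
import heapq
from typing import Iterable, List, Tuple

Post = Tuple[int, int]  # (score, post_id)


def map_partition(posts: Iterable[Post], k: int) -> List[Post]:
    """Map step: each machine keeps only its own k best posts."""
    return heapq.nlargest(k, posts)


def reduce_partials(partials: Iterable[List[Post]], k: int) -> List[Post]:
    """Reduce step: merge the small partial lists into the global top k."""
    merged = (post for partial in partials for post in partial)
    return heapq.nlargest(k, merged)


def rebuild_top(partitions: List[List[Post]], k: int) -> List[Post]:
    return reduce_partials([map_partition(p, k) for p in partitions], k)
```

This works because any post in the global top k must also be in the top k of its own partition, so discarding the rest during the map step loses nothing.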





Future improvements

1) Add an AI analytics module

2) If write traffic increases, create more write replicas

3) Add a suggestion module that uses AI models to recommend subreddits relevant to the user
