My Solution for Design a Nested Comments System with Score: 9/10

by serenade_xenon816

System requirements


Functional:


  1. A registered user can create a thread
  2. A user can comment on another user's thread
  3. Any user (authenticated or anonymous) can view a thread and its comments
  4. A user can edit or delete their own comment within a 15-minute window


System Behaviour

  1. Every top-level comment on a thread is a parent comment
  2. A reply to a comment forms a nested (child) comment
  3. The system organises comments hierarchically, maintaining the parent-child relationship, and displays them in reverse chronological order
  4. A comment can be nested at most 5 levels deep
  5. Pagination - comments are lazy loaded, 100 comments per page
  6. Users must be authenticated to comment or create a thread


Non-Functional:


  1. A thread and its comments are rendered within 200ms
  2. The system scales to hundreds of thousands of threads and their comments
  3. Data consistency - comments appear in reverse chronological order; a new comment may appear with a slight delay, and this eventual consistency is acceptable
  4. Security - only authenticated users can post comments. Comments are moderated for foul language and sanitised to prevent XSS vulnerabilities


CAP

The system prioritises availability over consistency. A comment may appear with a slight delay, but the system remains available for reads and commenting.



Capacity estimation


  1. Total users = 100M
  2. Daily active users (10% of total) = 10M
  3. Comments per user = 15 a day, so 10M x 15 = 150M comments/day
  4. Spike on viral topics = +20% of 150M = 180M, rounded up to 200M
  5. Each comment is at most 250 chars, with ~100 chars of metadata (threadId, authorId, commentId) = 200M x 350 = 70,000,000,000 chars = ~70GB per day
  6. WPS = 200M / 100,000 secs per day = 2,000 WPS with a ~700-byte payload; with a 50% spike on viral topics, ~3,000 WPS
  7. Storage requirements - writes to the database
    1. 70GB per day x 30 = ~2TB per month; 2TB x 12 = 24TB per year, rounded to 25TB
    2. In a distributed system writes are replicated to 3 nodes, making it 70GB x 3 = 210GB per day
  8. RPS
    1. Reads far exceed writes; assuming 10 users read each comment on average, 200M x 10 = 2B reads/day = 20K RPS on average, with a 50% spike = 30K RPS
    2. Reads are typically 10x to 100x the write rate, i.e. 30K RPS to 300K RPS
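As a sanity check, the estimates above can be reproduced in a few lines of Python (the constants are the assumptions from this section, with a day rounded to 100,000 seconds for easy mental math):

```python
total_users = 100_000_000
dau = total_users // 10                 # 10% daily active = 10M
comments_per_day = dau * 15             # 150M comments/day
peak_comments = 200_000_000             # 150M + ~20% viral spike, rounded up

bytes_per_comment = 350                 # 250 chars content + ~100 chars metadata
write_volume_per_day = peak_comments * bytes_per_comment  # 70,000,000,000 = ~70GB

wps = peak_comments // 100_000          # ~2,000 writes/sec before spike factor
reads_per_day = peak_comments * 10      # ~10 readers per comment
rps = reads_per_day // 100_000          # ~20,000 reads/sec before spike factor
```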




API design



Get a Thread

GET /1/threads/{threadId} Response : 200 OK Response Body : { a Thread instance }


List All Threads

Get a paginated list of threads with parameters,

  1. cursor: optional, a timestamp, returns threads after this timestamp
  2. limit: optional, a number of threads to return


GET /1/threads?cursor={timestamp}&limit=50 Response : 200 OK Response Body : { threads:[], nextCursor: <timestamp> }


Create a Thread

POST /1/threads Request Header: { Authorization: Bearer <JWT token> } Request body: { A thread instance } Response: 201 CREATED


Post a comment

POST /1/threads/{threadId}/comments Request Header: { Authorization: Bearer <JWT token> } Request Body: { a Comment instance } Response: 201 Created Response Body: { the created Comment instance }


Get comments for a thread

Get the top-level comments of a thread, 100 at a time. If a cursor is specified, the system returns comments after that timestamp; otherwise it returns the first 100 comments along with their sub-comments.


GET /1/threads/{threadId}/comments?cursor={timestamp}&limit=100 Response : 200 OK Response Body: { one or more Comment instances; each comment may contain sub-comments in its children attribute }
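A minimal sketch of how a service might fold flat comment rows from the database into the nested `children` shape this endpoint returns (field names follow the Comment entity in this document; sorting is assumed to happen in the query):

```python
def build_comment_tree(rows: list) -> list:
    """Assemble flat comment rows into a nested tree via the
    parentCommentId field. Rows with no parent (or a parent
    outside this page) become roots."""
    by_id = {r["commentId"]: {**r, "children": []} for r in rows}
    roots = []
    for comment in by_id.values():
        parent = by_id.get(comment["parentCommentId"])
        if parent is not None:
            parent["children"].append(comment)   # nest under its parent
        else:
            roots.append(comment)                # top-level comment
    return roots
```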

Get nested comments of a comment

GET /1/comments/{commentId}/replies?cursor={timestamp}&limit=20 Response : 200 OK Response Body: { one or more Comment instances; each comment may contain sub-comments in its children attribute }



Edit a Comment

PUT /1/comments/{commentId} Request Header: { Authorization: Bearer <JWT token> } Request Body: { content of the comment } Response : 200 OK Response Body: { updated Comment instance }


Delete a Comment

DELETE /1/comments/{commentId} Request Header: { Authorization: Bearer <JWT token> } Response : 204 NO CONTENT (no response body)



The cursor query parameter may be an ISO timestamp or the UUID of a comment.
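One common way to implement such a cursor is to make it opaque: base64-encode the timestamp together with a tie-breaking comment id, so clients cannot depend on its internals. A sketch (helper names are illustrative, not part of the API above):

```python
import base64
import json

def encode_cursor(created_on_ms: int, comment_id: str) -> str:
    """Pack timestamp + id into an opaque, URL-safe cursor string."""
    raw = json.dumps({"t": created_on_ms, "id": comment_id})
    return base64.urlsafe_b64encode(raw.encode()).decode()

def decode_cursor(cursor: str):
    """Recover (timestamp, comment_id) from an opaque cursor."""
    data = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return data["t"], data["id"]
```

The id acts as a tie-breaker when two comments share the same timestamp, which keeps pagination stable under the reverse-chronological ordering.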



Database design



Relationship Overview

  1. A thread can have many comments
  2. A comment can have many nested comments
  3. A user can create many threads and post many comments


Core Entities

  1. Thread {threadId, title, content, createdOn, updatedOn, authorId, totalComments, isDeleted, lastCommentedOn}
  2. Comment {commentId, threadId, parentCommentId, content, createdOn, updatedOn, isDeleted, depth, lastCommentedOn, upVotes, sortOrder, replyCount, threadIdPath, children}
  3. Author {userId, name, handle, .....}
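The Comment entity above can be sketched as a Python dataclass (field names follow the entity list; the defaults and types are assumptions for illustration):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Comment:
    comment_id: str                  # snowflake id, time-sortable
    thread_id: str
    parent_comment_id: Optional[str] # None for a top-level comment
    content: str
    created_on: int                  # epoch millis
    updated_on: int = 0
    depth: int = 0                   # 0 = top-level, capped at 5 by the service
    reply_count: int = 0
    up_votes: int = 0
    is_deleted: bool = False         # soft-delete flag
    children: list = field(default_factory=list)  # nested Comment instances
```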


Fault Tolerance and Redundancy

The system deploys multiple database nodes with a replication factor of 3, meaning every write is copied to 2 more nodes, giving 3 copies in total. When a node goes down, one of the replicas can serve the data. Writes are eventually consistent and synchronised in the background. This helps with,

  1. Disaster recovery
  2. Redundancy
  3. High availability


Sharding

Threads are sharded by threadId, distributing them across database nodes: "write node = hash(threadId) % N". This spreads the data evenly over the N nodes.
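A minimal sketch of that routing rule, using a stable hash so every service node picks the same shard for a given thread (MD5 is an arbitrary choice here; Python's built-in `hash` is unsuitable because it varies per process):

```python
import hashlib

def shard_for(thread_id: str, num_nodes: int) -> int:
    """Route a thread to a database node: hash(threadId) % N."""
    digest = hashlib.md5(thread_id.encode()).hexdigest()
    return int(digest, 16) % num_nodes
```

Note that plain modulo sharding remaps most keys when N changes; consistent hashing (which this design already uses for the cache tier) avoids that at the cost of some complexity.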


Partitioning

Within a node, data is partitioned by threadId to enable parallel reads/writes. A thread and its comments reside in the same partition, optimising the access pattern.


Replication

When a thread or comment is written, it is replicated to two more database nodes for redundancy, disaster recovery and high availability.


Hot Spot mitigation and caching

Reads are much higher than writes, so hot threads are cached. When a thread goes viral, the thread and its comments are cached in Memcached, an in-memory cache distributed across multiple nodes and sharded by threadId. The cache uses an LRU eviction strategy and is updated asynchronously via a message bus.
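The read-through pattern described here can be sketched with a tiny in-process stand-in for a Memcached node (a real deployment would use a Memcached client; `LRUCache` below is illustrative only):

```python
from collections import OrderedDict
from typing import Optional

class LRUCache:
    """Tiny stand-in for a Memcached node with LRU eviction."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as recently used
        return self._data[key]

    def set(self, key: str, value: str) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict least recently used

def get_thread(cache: LRUCache, thread_id: str, db_fetch) -> str:
    """Read-through: serve from cache, fall back to the database."""
    key = f"thread:{thread_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached                        # cache hit
    value = db_fetch(thread_id)              # cache miss: hit the database
    cache.set(key, value)                    # populate for subsequent reads
    return value
```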


High-level design








Request flows



Create a thread

User { create thread } > Load Balancer - L7 > API Gateway > ThreadService > Database > CDC: Stream > CDC Processor > Kafka > Flink {verify/filter/Top K adjustments} > Cache update > Notification Service > Notify user


Read top threads

User {Read threads} > Load Balancer - L7 > API Gateway > ListingService > Top 30 threads from cache


Detailed component design



  1. ListingService : Reads top threads and top-level comments from Memcached. When a user clicks "read more" on nested comments, the service reads from the database and populates the in-memory cache. Further search interactions from the user go through this service to the database.
  2. NotificationService : A microservice invoked by Flink to send aggregated daily updates, notifying users via email, SSE to mobile, or WebSocket to the browser
  3. Authoriser : Verifies that write requests are authenticated and rejects them if not
  4. RateLimiter : Provides graceful degradation under very high load. Requests are denied with a retry period so clients can back off and retry
  5. Content Filter/Sanitizer : A first-level defence that filters content for profanity or harmful material; extensive filtering happens later in the background via stream processing, which analyses multiple factors including profanity, sentiment, racism and pornography, and alerts an admin. Such comments are temporarily disabled and won't show up in the UI.
  6. AdminService : A service that lists the alerts produced by Flink for sensitive comments and threads requiring extra attention; the admin can approve or deny the content. If denied, it is marked as harmful in the database and removed from the cache if present.


Concurrent Writes

When multiple users post comments on a thread at the same time, the requests might hit different ThreadService/CommentService nodes in Kubernetes. Each piece of content is assigned a unique identifier (threadId or commentId) generated as a Twitter-snowflake ID, which lets the system order comments and threads by time of receipt. This avoids collisions in the database: even if two comments get the same created/updated timestamp from two different nodes, the unique identifier is still sortable.
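A sketch of a snowflake-style generator along the lines described above (the bit layout follows Twitter's original 41-bit timestamp / 10-bit node / 12-bit sequence split; the epoch constant is Twitter's, chosen here purely for illustration):

```python
import threading
import time

class SnowflakeId:
    """Time-sortable 64-bit ids: 41 bits of millisecond timestamp,
    10 bits of node id, 12 bits of per-millisecond sequence."""
    def __init__(self, node_id: int, epoch_ms: int = 1_288_834_974_657):
        assert 0 <= node_id < 1024          # must fit in 10 bits
        self.node_id = node_id
        self.epoch_ms = epoch_ms
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:       # 4096 ids this ms: wait for next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.epoch_ms) << 22) \
                | (self.node_id << 12) | self.sequence
```

Because the timestamp occupies the high bits, ids sort by creation time even when generated on different nodes, which is exactly the collision-avoidance property relied on here.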


Spike and Handling Scale

When a thread goes viral, reads and writes for that thread spike and hit the microservices running in Kubernetes pods. Kubernetes auto-scaling is used to scale up the number of pods and balance the traffic automatically. Each node can generate unique identifiers on its own.


Decluttering the Database and Data Retention

The retention period for a thread is 90 days. After 90 days, threads and their comments are moved to hot S3 storage in Parquet format. This data can be queried for analytics, or by users searching old threads, using Apache Trino + SQL. The data in S3 is further tiered from hot > warm > archive with a 1-year retention period.





Trade offs/Tech choices



Tech Choices


  1. Database : AWS-managed DynamoDB for fast range queries, sharding and partitioning.
  2. Stream Processing : Flink aggregates thread/comment activity in 1-5 minute windows to notify users based on their topic preferences; it also finds the top 30 threads via a Top-K computation and updates the cache whenever a comment lands on any of those 30 threads
  3. Broker : Kafka carries changed/added threads and comments, feeding them into Flink and on to in-memory cache updates
  4. In-memory cache : Memcached, a distributed cache for threads and comments. It is a simple key/value store: top threads are stored compactly as "threadId: JSON", and for each top thread the top 10 comments are stored as "threadId#<id>#comments: JSON", each comment carrying its first 5 nested comments.
  5. Microservices : all microservices run stateless in Kubernetes




Trade Offs

  1. Eventual consistency : New or updated threads become visible with a slight delay, but the data eventually converges across replicas.
  2. Memcached : This is a cache only; writes are persisted in the database, not the cache. When a cache node goes down its data is lost, but with clustering and data distribution via consistent hashing, the data can be served from another node. Memcached is fast for high read throughput.






Failure scenarios/bottlenecks


  1. Hotspots : When a thread goes viral, the partition where it resides receives a flood of read requests. Caching reduces the load on the database servers, though the cache node may still see heavy load since the cache is also sharded by threadId
  2. RateLimiter : Some reads might fail or be delayed when the system is under heavy load due to the rate limiter; however, this preserves system availability and counters spam requests





Future improvements



  1. Moderate thread and comment contents - let admins hide comments
  2. Sentiment analysis - if a thread or comment is harmful to the public or incites racism, deny it
  3. ML analytics - predict which threads are likely to go viral and pre-cache them in memory.