Design Twitter - System Design

System Requirements

Functional:

Users can send messages (called tweets( for up to 200 characters
Users can follow other users, called friends
Users can like another user's message
A user's home page will show the messages of the users it is following in chronological order, with the newest one appearing first
The user's home page will also show the top K messages from the users it is following. The popularity of the message will be decided based on the number of likes it has accumulated so far
Users can block posts of a certain user from appearing on their home page

Non-Functional:

Scalability - We expect a high number of daily active users, probably 500 million with the number growing with time, so the system must be highly scalable
Availability - The service needs to be highly available
Latency - The service should have a low latency, a user should see their homepage populated in no more than 500ms
Consistency - This can be a bit relaxed as the user does not need to see the posts in his/her homepage the moment their friend makes a post, a slight delay is acceptable. Instead of strong consistency, we aim for eventual consistency
Security - Content moderation and anti-abuse protection can be implemented, although we do not go over that in this design

Capacity Estimation

We assume 500M daily active users and a maximum message size of 200 characters or 200 bytes
We assume that the average user sends 2 messages a day. Total number of messages that we deal with everyday would be 1 billion = 10^9
200 bytes of content data in a message plus other overheads, let us assume a message takes 512 bytes of storage
Total data generated in a day would be 1 TB
In one year we would need 365 TB data
Factoring in replication and other overheads, we can assume we would need 1000 TB data every year or 1 PB
Over 10 years, we would generate 10 PB data

API Design

Send Message
1. POST
2. Takes in the user ID and message content as parameters
3. Allows user having userID to post a message with the given content. The contents of them message would be passed on a JSON, parsed by the server when displaying
4. Every post will have unique ID associated with it
Like Message
1. PUT
2. Takes in userID and postID as the parameter
3. Allows the user to like a particular post. This increases the like counter of the post by 1
Follow
1. POST
2. Takes in two arguments, followerID, followeeID
3. User with user ID followerID starts following the user with the user ID followeeID
Unfollow
1. PUT
2. Takes in two arguments, followerID, followeeID
3. User with user ID followerID starts following the user with the user ID followeeID
Home Feed
1. GET
2. Takes in the argument user's ID and offset, an integer. Offset determines the number of posts that need to be returned by the request
3. Fetches the home feed of the user having ID userID. The feed is fetched in the form of a JSON that is parsed on the client. The post could have links to media that would be fetched from an object store or the CDN
4. When a user opens his home page, the offset will be 0. In case the user refreshes the home page, the offset can be a small number like 5 to reduce server load. Also, the offset can be device specific, something like a mobile device does not need a large offset figure. As the user keeps scrolling lower in their feed, we can keep requesting more posts using this API

Database Design

Based on the capacity estimations done earlier, a relational database is not advised
We do not utilize joins here in any capacity here, another reason to not use a relational database
We opt for a non-relational database that is based on the document style store to effectively store posts made by the user
MongoDB is a good choice here since it is non-relational, a document style store and allows schema flexibility
The data model will be as given below.
1. messages
  1. Stores post related data
  2. messageID Primary Key
  3. userID Index
  4. created_at
  5. links
  6. content
  7. like_count
2. user
  1. Stores user information
  2. userID Primary Key
  3. email
  4. phone
  5. name
  6. date_of_birth
3. follow
  1. Stores user follower relationship. This can be used to quickly find the users that follow a particular user. We can then quickly know that when a user makes a post, all the users that follow it need to see the post
  2. userID Primary Key
  3. followerList List of users that follow this particular user
Instead of storing this in a document style format, we can store this relationship in a graph based non-relational database like neo4j
The like counter would be a massive scalability challenge. If a celebrity person makes a post, it will be liked by a large number of people. This will make the like counter for that particular post a hot spot for change. To protect against this, we would implement a distributed counter service for users with over let's say 10,000 users. This is a number that can be configured later. In this case, shared counters would do the counting using multiple different counters implemented in memory, using an in-memory store like Redis and once in a while, let's say every 5 minutes, we would hit our document store and update the like counter

High-level design

A content delivery network service like Akamai or Cloudfare can be used to serve all the static content
A rate limiter can protect against denial of service attacks, intentional or otherwise
Tweet service's job is to receive requests for new tweets and populate them in the document store. It can queue up messages in a message queue, which would decouple components and the same message can be utilized by analytics or other services as well
Tweet analyzer pulls messages from the message queue and analyzes them for inappropriate content, hashtags etc.
Home feed service handles populating the home feed of a user
Data should be shared by region as users of a certain region would be more interested in posts of that region. Globally trending posts could be cached in the CDN and served to all regions
Data should be replicated among data centers for greater fault tolerance

flowchart TD CL[client] --> CDN[CDN] CDN --> RL[Rate Limiter] RL --> LB[Load Balancer] LB --tweet()--> TW[Tweet Service] TW --Store--> DB[DB: MongoDB] TW --> MQ[Message Queue] TA[Tweet Analyzer] --Pull Message--> MQ TA --Update-->DB CA[Content Analyzer] --Pull Message-->MQ LB --home_feed()--> Home[Home Feed Service] Home --> DB Home --> Cache[Redis Cache] LB --like()-->CT[Sharded Counter] CT-->DB

Request flows

Populate Home Feed
1. Get the user IDs of the users following the person making the post.
2. Populate the post in the cache of all the users.
Get Home Feed
1. Check if posts are present in the post cache, if yes, get them, if not, hit the DB
Follow User
1. Go to the user ID which our current user wants to follow
2. Add ID to the follower list

Detailed component design

Score of Tweet = weight_like * number_of_likes + weight_followers * number of followers on the tweet's author + weight_retweet * number_of_retweets

The top-K posts to be shown on a user's home page can cached for quick retrieval. The top posts can be found by inserting all of the posts in a max heap data structure and then picking the top K posts. The max heapify operation would run based on the tweet score calculated above.

The cached data can be stored in a caching system like Redis, this does not need to be persisted on disk since it is subject to change frequently.

For users that do not log in frequently, getting top K posts can be considered a waste of computing power. We can stop calculating the top K posts for users that have not logged in in the last 7 days, where the number of days is a configurable parameter. For these users, the posts can be populated on login and then these users can be added to the circulation of users for whom posts are cached.

Trade offs/Tech choices

We a non-relational database instead of a relational one, the relational database would have provided better consistency although it would not have as scalable as a non-relational database like MongoDB
Cassandra would be better than MongoDB in terms of scalability but a column store would not have allowed us flexibility in storing our data
Storing post data becomes very easy because of a document store like MongoDB
Since at no point in time can we lose messages, we opt for a log based message queue like Apache Kafka instead of something like RabbitMQ since we can never lose messages and the order of messages should be preserved as well

Failure scenarios/bottlenecks

A celebrity makes a post that is liked by millions of other users and is retweeted by thousands of users, this can overwhelm the server. A distributed counter service can help in this situation
If a celebrity with millions of followers makes a post, then post must be populated in the feed of millions of users, this can overwhelm the post service

Future Improvements

Adding the functionality for a user to re-post a post made by a different user
Adding filters to post to filter content on basis of certain words
Post moderation service to prevent hateful speech, graphic content
Ability to hide posts from a certain user
Ability to handle deletion of posts by a user, dealing with choices like deletion completely or archival
Ability to handle deleting a user's account, dealing with choices like complete deletion or archival
Ability to have public posts or to have posts directed to a certain segment of chosen users