System Requirements
Functional:
- Users can send (tweet) a message for up to 200 characters.
- Users can follow their friends.
- Users can like another user's tweet.
- User's home page will show tweets of the users it is following in chronological order, with the most recent post appearing first.
- User's home page will also show the top K messages from the users it is following. The popularity of the tweet will be decided based on the number of likes the post has.
- User can block posts of a certain user from appearing on their home page.
Non-Functional:
- Scalability - We expect a high number of daily active users, probably 500 million.
- Availability - The service needs to be highly available.
- Response Time - A user upon entering their home page, should be able to see their feed in under 500 milliseconds.
- Consistency - This can be a bit more relaxed, a user does not immediately need to see the posts from the users it is following, a delay is acceptable. Instead of strong consistency, we aim for eventual consistency here.
- Security - Content moderation and anti-abuse protection can be implemented, although, here I am not going to delve into it.
Capacity Estimation
- We assumed having 500 million daily active users and a maximum tweet size of 200 characters, or 200 bytes.
- We assume that the average user makes 2 tweets a day, total number of tweets made everyday would be 1 billion = 10^9.
- 200 Bytes of message data plus other data in a tweet, we can assume a tweet would consume 512 bytes of storage.
- Data generated each day = 2 * 512 * 10^9 bytes = 10^12 bytes = 1 TB.
- In one year, we generate 365 TB data.
API Design
- tweet (userID, content)
- Allows the user with userID to post a tweet with the given content. Contents of the post would be passed on as a JSON and would be parsed by the server when displaying.
- Every post will have a unique ID auto generated and tagged to it.
- like (userID, postID)
- Allows user having userID to like the post having postID. This increases the like counter of the post with ID postID by 1.
- follow (followerID, followeeID)
- User with the ID followerID starts following the user with ID followeeID.
- unfollow (followerID, followeeID)
- User with the ID followerID unfollows the user with ID followeeID.
- homeFeed (userID, offset)
- Returns a JSON document containing the post data that should be shown on the user's home page.
- offset represents the number of tweets that need to be returned. When a user opens his home page, the offset will be 0. In case the user refreshes the home page, the offset can be a small number like 5 to reduce server load. Also, the offset can be device specific, something like a mobile device does not need a large offset figure.
- As the user scrolls lower in their feed, we can keep requesting more posts using this API.
Database Design
- Based on the capacity calculations done earlier, a relational database is not advised.
- We do not utilize joins in any capacity here, another reason to not use relational databases.
- We opt for a non-relational database that uses a document style store to effectively store the posts made by users.
- MongoDB is a good choice here since it a document store and will allow us schema flexibility.
- The data models are given here.
- Tweet
- Stores the post related data
- postID Primary Key
- userID Index
- created_at Index
- content
- links
- like_count
- user
- Stores user information
- userID Primary
- phone
- name
- date_of_birth
- follow
- Stores user follower relationship. This can be used to quickly fetch the users that follow the user with ID userID. We then know that when that user makes a most, all follows need to see the post
- userID Primary Key
- follows List of users that this user follows
- Tweet
- Instead of implementing the follow relationship in a document store, we could have a graph database that could map the user follower relationship
- The like counter would be a scalability challenge. If a celebrity person makes a post, it will be liked by a large number of people. This will make the like counter for that particular post a hot spot for change. To protect against this, we would implement a distributed counter service for users with over, let's say 10,000 users. This is a number that can be configured later. In this case, shared counters would do the counting, using multiple separate counters implemented in memory, using something like Redis and then once in a while, let's say every 5 minutes, we would hit our document store and update the like counter.
High-level design
- A Content Delivery Network service like Akamai or Cloudfare can be used to serve all the static content.
- A rate limiter can protect against denial of service attacks, intentional or otherwise.
- Tweet service's job is to receive requests for new tweets and populate them in the document store. It can also queue the message in a message queue when can then be utilized for some other service, like analytics.
- Tweet Analyzer pulls messages from the message queue that is populated by the tweet service and analyzes them for hashtags, inappropriate content.
- Home Feed service handles populating the home feed of a user.
- Data should be sharded by region as users of a certain region would be more interested in posts of that region. Globally trending posts could be cached in the CDN and served to all regions.
- Data should be replicated among data centers for greater fault tolerance.
flowchart TD CL[client] --> CDN[CDN] CDN --> RL[Rate Limiter] RL --> LB[Load Balancer] LB --tweet()--> TW[Tweet Service] TW --Store--> DB[DB: MongoDB] TW --> MQ[Message Queue] TA[Tweet Analyzer] --Pull Message--> MQ TA --Update-->DB CA[Content Analyzer] --Pull Message-->MQ LB --home_feed()--> Home[Home Feed Service] Home --> DB Home --> Cache[Redis Cache] LB --like()-->CT[Sharded Counter] CT-->DB
Request flows
- Populate Home Feed
- Get the user IDs of the users following the person making the post.
- Populate the post in the cache of all the users.
- Get Home Feed
- Check if posts are present in the post cache, if yes, get them, if not, hit the DB
- Follow User
- Go to the user ID which our current user wants to follow
- Add ID to the follower list
Detailed component design
Score of Tweet = weight_like * number_of_likes + weight_followers * number of followers on the tweet's author + weight_retweet * number_of_retweets
The top-K posts to be shown on a user's home page can cached for quick retrieval. The top posts can be found by inserting all of the posts in a max heap data structure and then picking the top K posts. The max heapify operation would run based on the tweet score calculated above.
The cached data can be stored in a caching system like Redis, this does not need to be persisted on disk since it is subject to change frequently.
For users that do not log in frequently, getting top K posts can be considered a waste of computing power. We can stop calculating the top K posts for users that have not logged in in the last 7 days, where the number of days is a configurable parameter. For these users, the posts can be populated on login and then these users can be added to the circulation of users for whom posts are cached.
Trade offs/Tech choices
- We a non-relational database instead of a relational one, the relational database would have provided better consistency although it would not have as scalable as a non-relational database like MongoDB.
- Cassandra would be better than MongoDB in terms of scalability but a column store would not have allowed us flexibility in storing our data.
- Storing post data becomes very easy because of a document store like MongoDB.
Failure scenarios/bottlenecks
- A celebrity makes a post that is liked by millions of other users and is retweeted by thousands of users, this can overwhelm the server. A distributed counter service can help in this situation.
- If a celebrity with millions of followers makes a post, then post must be populated in the feed of millions of users, this can overwhelm the post service.
Future Improvements
- Adding the functionality for a user to re-post a post made by a different user.
- Adding filters to post to filter content on basis of certain words.
- Post moderation service to prevent hateful speech, graphic content.
- Ability to block posts from a certain user.