System requirements


Functional:

  • User should be able to see posts/tweet by the people they follow.
  • User should be able to like the tweet.
  • User should be able to post a tweet.
  • User should be able to share a tweet.
  • User should be able to follow other people
  • User should be able to comment on the post.



Non-Functional:

  • The system should be having high availability
  • Eventual-consistency will work fine here
  • The service should be having low latency(especially for read/view operations).
  • Fault tolerance



Capacity estimation

Since this is going to be a read heavy kind of system, we can assume that we are here having 100:1 read:write ratio.

Since we want read-operations to be of very low-latency, we can segregate read servers from write ones.


Assuming that we are having 1e7 DAU(for read operations). So read TPS would be around: 100, and since read/write ratio is 100:1, write TPS is 1TPS.

For saving images/multimedia, we would use S3


API design

PS: I am using tweet and post words interchangibly.


Given the functional req., we could think of following APIs

  • GET: showMeFeed(<userID>, <x-api-key>, offSet) --> Customer facing
    • useriD: represents the id of the user
    • x-api-key: we will leverage this for throttling purpose, if we want to apply throttling/rate-limiting based upon a particular api-key,.
    • Response:
      • <List<Feeds>, next_offset>
        • List<Feeds> indicates the list of Feed object of size 25
        • next_offset: would help in pagination and we will call the showMeFeed second time with the offset(indicates the start of next feed to be shown).
    • The above offset is at client level and could pose some issues if user wants to open the application from browser and tablet, it will show the same feeds again, which isn't wanted. To Counter this, we will maintain the offset at the background(kind of a flag indicating the start pointer of the feed that needs to be shown to a particular user). In that case, our API would look like:
      • GET: showMeFeed(<userID>, <x-api-key>)
        • Response: List<Feeds>
          • Feed: An Object containing author information, timestamps, links to images/videos(S3 url), like-count, List<comments>
            • Comment: As of now, just assuming that it is text only.
  • Post: PostTweet(<userId>, <x-api-key>, <S3_image_url, text>, meta-information, <other_tweet_id>)
    • userId and x-api-key are same as above
    • S3_image_url:- contains uploaded multimedia url
    • meta-info:- contains information like timestamp, location etc.
    • <other_tweet_id>: this would contain the id of another post which user wanted to tweet.
    • Response: 201, created
  • Put: LikeTheTweet(<userID>, <x-api-key>, <tweet_id>
    • Response: 201, created
  • Put: Commentonthetweet(<userId, <x-api-key>, <tweet_id>, <comment_text>)
    • comment_text: the text that user: userID wanted to comment on the tweet_id
    • Response: 201, created.
  • Put: FollowTheUser(followee_user_id, follower_user_id)
    • Response: 201, created.




Database design

We will be needing following Tables:

  • User-> UserID, user_name, meta_information, List<PostIDs>
    • UserId would be the partition-key, I am not making PostID as a separate sort-key as we aren't expecting queries like get me this post from this user rather it would be like, get me all posts from the user: <user_id>, so the above one makes sense.
    • Since, I am not expecting queries on any other field except the user_id, the document-based DB(cosmos) would make sense here. With this I am avoiding multiple duplicated records(in case of key-value DB).
  • Tweet-> (tweet_id, meta_information, tweet_text, tweet_multimedia_url, user_id)
    • tweet_id: id of the tweet.
    • meta_info: contains information like tweet from what device, location, time
    • tweet_text: text in the tweet
    • tweet_multimedia_url: S3 url for the multimedias
    • user-id: id of the user who have post this tweet(this would be the partition key)
    • We could use either key/value based or document based DB here, since document-based DB is more used for analytical purposes(not many write operations are there), I would like to go with document-based DB only.
    • I am not using SQL as we are fine with eventual consistency and doesn't really worry about ACID here.
  • Activity-> (tweet_id, activity_by_user_id, isActivityActive, text, isContainsText)
    • Here activity signifies both like and comment on a post.
    • tweet_id: id of the tweet
    • activity_by_user_id: user_id of the tweet-activity
    • isActivityActive: if user presses the like button twice/deleted the commment, it undoes the record and mark it as in-active
    • isContainsText: this means that the current activity is not of reaction type
    • text: contains text(in case of comment on the feed)
    • We could go with key/value DB(DDB) here as this would be write-heavy kind of table as compared to other tables. tweet_id will be the partition key and activity_by_user_id will be the sort-key
  • Follower-> (user_id_of_followee, user_id_of_follower)
    • Both the user_id_of_followee and user_id_of_follower are indexed
    • This would help us in answering what all users I follow, and what all users follow me
    • We could use graph based DB for connections as well(that would be ideal for such case), but since I don't have hands-on on that, For now, I will stick to using key-value based DB(Can research and compare it with graph-based DDB)



High-level design

Api-gateway: It acts as authz/authn/reverse-proxy server, does throttling for us. We have added a LB in between feedService and api-gateway which helps in segregating read and servers in addition to consistent-hashing within a single type of servers.

FeedService: This service is responsible for showing the feed to the user, updating the feed for a particular user.

FollowService: This service is used to get the followers details, manager the followers list, update followers. Basically manage the lifecycle for connections service

ActivityService: This service is responsible for maintaining the activities performed on a particular tweet. Since this is a write heavy kind of service, we could leverage using a queue(kafka) to handle high volume of write requests.

RankingService: This service interacts with feed-service and performs a pre-defined ranking algo- to show what order the feeds should be shown to the user.

UserService: This maintains the life-cycle of user-profiles.



Request flows

  • showMeFeed-> it interacts with feed-service, which first of all calls the user-service to get the user-details, then it calls the follower-service to gets all of its followers, then for those followers, it gets the feed(from tweet DB), for the list<tweet> it makes a call to rank-service to get the ordered tweets, it sends the top 10(a ball park number) entries of the tweet to the activity-service to get like-count, comments on each of the tweet. Then it responds back with the List<feed_with_activity> to the feed-service which in-turn responds back to the client.
  • PostTweet-> it interacts with feed-service, which first of all calls the user-service to get the user-details, then using this user-information, it make an entry to the tweet table using feed-service. It could leverage two approaches here to notify the followers:
    • push-> where it will get all the followers from follow-service and then notifies them. The feed would be live but it has the risk of fanning out(in case of a celebrity).
    • pull-> whenever a follower comes online or does a poll at regular interval, it fetches the feed from the followee list and using ranking-service to order them and shows to the customer. It has the risk of not getting live updates or un-necessary making poll in case there is no update from followee list.
  • We could leverage a combination of both the above approaches where for non-celebrity, push based approach is fine, for celebrity, a pull based approach is better.
  • LikeTheTweet: It deals with the activity-service
  • Commentonthetweet-> it deals with the activity service.
  • FollowTheUser-> it interacts with follow-service which further integrates with user-service to serve the request.








Detailed component design

Api-gateway: It acts as authz/authn/reverse-proxy server, does throttling for us. We have added a LB in between feedService and api-gateway which helps in segregating read and servers in addition to consistent-hashing within a single type of servers.

FeedService: This service is responsible for showing the feed to the user, updating the feed for a particular user.

FollowService: This service is used to get the followers details, manager the followers list, update followers. Basically manage the lifecycle for connections service

ActivityService: This service is responsible for maintaining the activities performed on a particular tweet

RankingService: This service interacts with feed-service and performs a pre-defined ranking algo- to show what order the feeds should be shown to the user.

UserService: This maintains the life-cycle of user-profiles.





Trade offs/Tech choices

We are segregating read and write servers as read request call-volume would be a lot more as compared to write request.

Since, we want low-latency, especially for read-operations, we can pre-load the feed-to-be-shown to a specific(first 25 records) beforehand only. such that when customer comes online, it doesn't have to perform all the operations at that time. This would help in saving the latenty.

Also we could cache the feed-list for a particular user with some TTL, it is fine if a post is shown with some delay to the customer(after cache expiration), as we are fine with eventual consistency.



Failure scenarios/bottlenecks

We need to emit metrics, have some canaries(imitating customers) running all the time to pre-detect the failure scenarios. We will have multiple replicas for each micro-service to avoid single point of failure and support high availability and fault-tolerance.


Future improvements

Our services interacts interally via means of Grpc clients with exponential back-off retry strategy in place..

We could have a few customer-imitating user-ids available which would fire calls and validate the outputs. These imitators we could use to create auto-cut tickets in case of any failure.

We will be emitting faults/error/P90 latency spike metrics and will have graphs/dashboards built over these(with alarm thresholds set) to pre-emptly catch any failure/fault/error.