My Solution for Design Twitter

by zinan

System requirements


Functional:

  1. Publish/delete a tweet
  2. Type of the tweet user can publish? Text and image
  3. View news feed
  4. Like a tweet
  5. Follow other people



Non-Functional:

  1. High scalability. We should make our system highly scalable to handle the hign volume of traffic (read and write)
  2. High available. Twitter should be highly available in order that user can view and publish tweets without any issues.
  3. Eventual consistency. It's ok if one person seeing a stale tweet and other person seeing the update to date one, as long as the tweet consistent eventually. Also we can build our system based on this and have high performance.




Capacity estimation

  1. DAU - 10 million users per day
  2. Read/Write radio - 8:2
  3. Size of tweets: Text - 100 bytes, image - 5kb
  4. Radio of text tweets and image tweets that user going to be published: 8:2
  5. Total usage per day: 10000000 * 0.2 * 5100 = 2000000 * 5100 = 10100000000 bytes = 10 GB
  6. If we store the tweets for 5 years, we would need 10 * 365 * 5 = 18 TB
  7. Bandwidth: 10000000 * 0.2 * 5100 / 24 / 60 / 60 = 117 kb / s





API design

We can have a version in the url such that we can have more that one version of APIs can be called.


1.Post a tweet.

/user/tweets/v1/post

Type: Post

Parameter: userId, tweetId, postTime, tweetType

2.View news feed - globally

/user/tweets/v1/view

Type: Post

Return: List of twitter news feed, each feed may contains tweetId, userId, postTime, tweetType

3.View news feed friends

/user/tweets/v1/view

Type: Post

Parameter: friendId

Return: List of twitter news feed, each feed may contains tweetId, userId, postTime, tweetType

4.Like/unlike a tweet.

/user/tweets/v1/like or /user/tweets/v1/unlike

Type: Post

Parameter: userId, tweetId

5.Follow other people

/user/tweets/v1/follow

Type: Post

Parameter: userId, friendId, follow time




Database design

Due to the big volume of traffic I prefer non-relational database over relational databases, although relational database can be replicated and scaled properly, it would introduce additional complexity and table split or database split, which have to design the better key of this distributed databases.

Non-relational database would be highly scalable to deal with the high volume of traffic, and we can choose Cassandra for storing the datal.

Also we can store the image resources to the Object storage such as Amazon S3 to reduce the storage usage of our database.


Tweet table:

TweetId - primary key

TweetType - text, image, video, etc.

TweenContent - original content if text, url for the path of S3 for the image and video.


User table:

UserId - primary key

UserName

Sex

Email

Password - encrpted

Register time

Status - Activated/inActivated


User-friend relationship table:

UserId

FriendId

Followed time


User-liked tweets table:

UserId

TweetId

Event - like/unlike




High-level design

CDN: For fast lookup by storing the static resources and other hot resources in advance

API Gateway: We rely on the rate-limit from API gateway to reduce the complexity of our back-end services, and distribute the requests to the binded application via load balancer.


Publishing service: Deal with publishing/deleting tweets, view news feed, like/unlike tweet, follow/unfollow a frient, etc.


Amazon S3: Storing the image or video that user published


Fan-out service: If some influncers have published a tweet, it will be delievered to fan-out service to update the Redis cache for the tweet details for his followers rather than fetching it while user requesting it. it can significantly reduce the pressure for the database and make the request more efficient. For other normal tweet we just update the Redis cache after completely retrived from the database.


Feed service: Take and process of the loop up of twiter news feed


In-memory message queue: Receive the message published from publishing service, and let the fanout service consume from it.


Notification service: When you followed someone, or some of your friends posted tweets, it's better for you can receive the notifications so that you can check and view.


Redis: Global cache that storing data using LRU strategies with expiration time, we update the cache via Cache-aside


Database: Non-relational database to handle the high volume of requests and ensure read/write speed






Request flows

Tweet publishing flow

User published a tween with image, the request will be sent to API Gateway and check whether limited or not, and routes to the publishing service.

Publishing service would take and process the request and then publish to message queue.

Fanout service would consume the message and persist the metadata into tweet table, and meanwhile asynchronously send the image to Amazon S3, and update the url in tweet table and then update the Redis cache. If the request comes from a people that have a lot of friends like a influncer, send the request asynchronously to fan-out service and update the Redis cache.

Moreover asynchronously send the request to the notification service such that it can be delievered to the friends he followed. Notification service will take and send the request carries the content to the Third party services.


View news feed of friends

User make the requests to nearest CDN node and retrieve the resources like image or video if present, and also make the requests for other data through feed service. Then it will check from Redis cache first and get the data if present, or fetch from database, update the cache and return to the client.




Detailed component design

1.What can be stored in CDN?

We can store some static resources in CDN to speed up the loading process such as user's avatar, pictures that less frequently used, etc.


2.Push-based or get-based request type for using CDN?

Actually we can use both based on the situation. For example, if some influencers published a tweet with pictures or videos, it can be expected that it would be viewed by so many people, we can push these data in CDN after published in terms of fast retrival of the users.

Moreover, if CDN does not have the data that user required, the request will be forwarded to the load balancer of the application, and when data completed returned we can update the CDN as well in order that other people interested in the same content can get the fast retrival.


3.Client-end merging or service-end merging?

When client make a request for viewing news feed, we can either let client make one request to CDN to get some image or video resources and make another request to the service to get other data, and then combine them on the client side. This will introduce more complexity on client side and also used more bandwidth, and also need to handle the situation when one request success and the other one failed.

Alternatively, we can use backend merging, after data got fetched we would get the resource url, and then make the request to CDN and merge the response and return it. it would also introduce complexity on the backend side, but it would not used more bandwidth for client


4.Why in memory message queue over disk-based queue?

We selected in-memory queue as we want to have the lower latency for our service, we can use round-robin strategy to consume the message to achieve best performance as it does not matter whether my tweet posted first or the other ones posted first if we both posted at the same time.


5.How to make sure the consumer consume message exactly once to prevent the issue like duplicated tweets?

Firstly we can have a Global Message ID Generator that can be generated the global unique id, and the request would take this messageId and persist the data, and messageId would be the primary key in the tweet table. In this way even due to network error and retried several times, they all have the same messageId and all duplicate request would be rejected by the database due to constriants of primary key.






Trade offs/Tech choices

Soft delete/hard delete of the user record

  1. We can use soft delete over hard delete to basically keep the record but change the status. However, for twiter-liked system there would be a really big amount of users gona be registered and used, so maintaining this status is relatively expensive.
  2. However, we can benifit from it if we use soft delete. Since we did not delete any record from this user, we can easily get all the information back if user decided to come back

Relational databse/non-relational database

  1. Relational database can maintain the relationship status properly, and also can build the foreign key of userId or tweetId in terms of any deletion of user or tweet.
  2. However, our DAU is very high and relational database could not deal with that, so we can utilize non-relational database to get the better read/write performance and scalability, it's easier to scale the instance of database compare to relational database




Failure scenarios/bottlenecks

  1. Failed to fetch the data from CDN due to CDN downtime
  2. Backpressure for the Message queue, when published many tweets but consumed slowly
  3. Bottlenect for publishing service under some festivals
  4. Duplicated tweets due to duplicated consumption




Future improvements

  1. Failed to fetch the data from CDN due to CDN downtime - Make a new request to the Redis cache or database
  2. Backpressure for the Message queue, when published many tweets but consumed slowly - scale out the fan-out service to efficiently consume the message
  3. Bottlenect for publishing service under some festivals - We have load balancer sit before publishing service, so we basically can scale out the publishing service and make the request to be handled by more number of instances
  4. Duplicated tweets - We might saw some duplicate tweets if our fanout service not designed idempotently.