Design Twitter - System Design

System requirements

Functional:

User registration and authentication. Due to limited time, we'll not cover much details of this part.

User send tweet. A tweet has limited amount of characters. Take the real twitter as an example, let's limit it to 140 characters.

User may follow other users. If user A follows user B, A will be able to see B's tweets when A logs in and go to their feeds home page.

User read tweets. These are from other users who the current user follows. User may also explicitly visit another user's home page and see their tweets there.

User may interact with other users' tweets, e.g., like, retweet, and/or comment.

User may receive notifications about their followers posting new tweets.

Non-Functional:

Based on real twitter, we have very ballpark estimates. Say we have 500 million DAU, among which 20% write tweets. A user who writes tweet will post 1 tweet/day on average. Say 10% of total tweets have media. Say each media size is 100 KB on avg. For simplicity, we allow up to 1 media per tweet.

Say an average user follows 500 other users, and has 100 followers.

Say an avg user checks their feeds 5 times a day.

Say among all DAU, we have 1 million celebrities, each of them having on average 100k followers. Say 50% celebrities writes 1 tweet/day.

Say we store all tweets data for 10 yrs. After 10 yrs, they expire and are removed.

Capacity estimation

Write QPS = 500M * 20% * 1 / 86400 = 1.2k QPS

Read QPS = 500M * 5 / 86400 = 29k QPS

Peak W/R QPS is typically 2-5x of avg, let's take 3. Peak W/R QPS = 3.6k / 87k QPS

A single server would be very stressed. We need multiple instances.

Data = 500M * 20% * 1 * (140 byte + 10% * 100 KB) = 1TB/day = 3.65 PB / 10 yrs

Majority (99%) of these are media blobs.

We need multiple DB instances.

User profile data is much smaller compared to tweets. We'll skip it.

There'll be also notification traffic. We'll cover some details if we have time.

API design

Write:

URI: POST /tweets
payload:
user auth token
tweet text
optional media blob uid or url
response:
code 200 with payload containing tweet uid

Read:

URI: GET /tweets
payload: user auth token
query params:
limit: number of tweets to return
page limit. If this is less than <limit>, response will be paginated
page_id: for paginated response, this indicates the next page id to fetch
page_token: for paginated response, this indicates the unique token of the initial read request
response:
non 200 codes on error; or 200 code on success
if not a paginated query, a list of tweet objects
if paginated, a list of tweet objects on page_id, the unique token of the request, and a boolean flag indicating if there's more to fetch from the current page

Database design

User profile data is small scale, well structured, and changes are infrequent. SQL may be good choice. We'll skip details as it's not critical part.

We see 99% of storage is media blobs, but their total count is 10% of all tweets. We need separate DB to store media blobs from texts, so they don't interfere with each other.

Media blob DB:

We may need a storage for original media blobs, and a storage for compressed media to be shown on tweets.

Media compressed object:

media_id

tweet_id

user_id

blob: bytes

created_at: timestamp

original_media_id: id to original blob

Media original object:

id: this object uid

compressed_media_id: id to compressed media blob

user_id

created_at

Data is too big, we need partition. Also for availability and disaster recovery, we need replicas.

Tweet Text DB. Let's call it metadata DB:

Tweet object:

tweet_id: this matches the tweet_id in media DB, if any

user_id

media_id: optional

created_at: timestamp

last_updated_at: timestamp

text: str up to 140 chars

likes: num of likes. Due to limited time, we'll not cover who likes a tweet, but just count a total number

original_tweet_id: if this tweet is a retweet from an original tweet

Comment object:

comment_id

parent_id: the tweet or parent comment this comment belongs to

user_id

created_at

last_updated_at

likes

For simplicity, we don't allow media in comments

Data of metadata is also big. Similarly, we need partition and replicas.

We'll cover details later.

High-level design

See diagrams. Some will be explained in detailed component design section.

Request flows

Write:

Assume user is logged in
User composes a tweet. If the tweet has a media:
user uploads a media. The upload request is sent to media upload service
media service sends the original media to original upload queue. The upload queue will push the original object to the original media blob DB in the background
media service runs some compression algorithm to reduce the media size and transcode the format to be consistent
media service stores compressed object to the compressed DB
media service responds with the compressed media object id
client sends the post tweet request to server (along with optional media object id).
server creates an entry in tweet DB, and responds
user get ack or nack
The notification server may optionally send out notifications to user's followers

Read:

Assume user is logged in
User goes to home page
client sends read request to server
server pulls data from tweets DB, or cache, or CDN, and responds
client renders server response object
If user scrolls down to bottom, this will trigger a paginated request to server

Detailed component design

DB partitions and replicas:

We need leader and followers nodes for availability and disaster recovery.

The leader may push updates to followers and let followers to update async.
Pros: less latency on user side.
Cons: data may be out of sync if the leader goes down.
Mitigation 1: have write-ahead logs that write append-only changes before every update, to reducing the discrepancy gaps.
Mitigation 2: have write consensus from some followers before ack user. The drawback is this will increase latency on user.
We can increase consistency by getting write consensus.
Pros: more consistent data on read
Cons: more latency on user side

We need partitions because the data is too big. We can partition by tweet id. We may generate a tweet id by combining server timestamp, sequence number, and some other factors. Then, we use consistent hashing to map it to a partition node.

We can use similar partition techniques for the media DB. However, the consistent hashing function of media DB needs to consider tweet id as well, such that it can put linked tweet and media to physically close nodes to reduce network latency.

We can have each node to act as a leader for some partition(s), while as followers for other several partitions. This helps to further distribute traffic more evenly. The cons is that this increases design complexity a lot.

We can let read requests go to followers, to reduce stress on leader so that leader can have enough bandwidth handling writes. It's ok for readers to experience a slight data delay/inconsistency. However, for read-after-write, i.e. we still want the request to hit leader, otherwise the user who just wrote a tweet will not see its new tweet. We can achieve this by deterministically route user to a server, as explained below.

Server:

We need multiple server instances due to the high traffic. We distribute traffic based on consistent hashing. Routing options:

based on user id or auth token.
Pros: more evenly distributed traffic. Deterministic routing as long as user is logged in.
Cons: some users may be route to a server very remotely, experiencing latency
based on user geographic locations.
Pros: less network latency
Cons: congested load on each node during peak hours at each node location
A combination of both, based on user id or token, routing to one of nearby servers (i.e. we deploy several servers in each big geo locations).

In addition, we notice read QPS and throughput is high, especially in peak hours. We consider adding CDN. CDN will cache popular content, based on their geographic locations, so that we can save a lot DB round trips to further reduce read latency and thus reduce stress on servers.

Note that during user reads, the original design requires server to pull tweets from user followers, which may reside in different DB instances, a big latency hit.

Alternative: for each tweet, we create another feed object under each follower's feed table. When a follower logs in, the request just need to pull feeds from this single table.

Exception: celebrities have many followers. For them, we can do lazy write that only if a follower pulls feeds, then we write a celebrity feed to their table.

Pros: reduce latency significantly on read.

Cons: we introduced another big chunk of feeds data, and it's not normalized.

Design decision: if we go this route, we choose to use append-only DB or key-value stores. For append-only DBs, they're very fast on write. They may not be fast on read, but mostly for old data. Because people will almost never access old feed data, the read degradation on old data is acceptable. For key-value stores, they're fast on both write and read, and we have unique tweet ids and media ids, so it won't be adding too much complexity.

Handling expiration:

It's totally ok if an expired data is removed after expiration date with some delay. We don't want to do expired data removal on-the-fly during user requests, as the only connection between expired data and new data is the user and timestamp, whereas such op will add latency in request. We run async batch job in background on the daily basis to remove expired data.

Trade offs/Tech choices

As explained in detailed component design section.

Failure scenarios/bottlenecks

Server failure:

We distribute traffic by consistent hashing. If a server is down, these requests will be routed to the next nearest server on the consistent hashing ring. And because we can randomly distribute servers as virtual nodes on the ring, the newly distributed traffic will not be congested in a single node.

Database failure:

If a database leader is down, we need to route its traffic to the nearest virtual node on the consistent hashing ring.

There may be small inconsistency of data (that leader has not propagated to followers yet). This usually is ok for most users, as they will just experience some delay in seeing new tweets, that they may not even notice about. But for people who just post new tweets to the dead leader, they will be confused. To mitigate this, we want to have write consensus from some followers, so that at least some follower has up-to-date data. The drawback is more complexity and latency during write. Also, we have to make sure the temporary traffic redirection should route the traffic to one of these followers that have the most up-to-date data, which is another layer of complexity.

Future improvements

Due to time limit, we didn't cover these items in details:

notifications
likes
comments

Future development of these may change the current design, especially database schema and partition methods.