System requirements


Functional:

a) Users can post text and media tweets including images and videos

b) Tweets are limited to 140 characters

c) Users can search tweets by hashtag

d) Users can follow other users and can view their tweets on their home page

e) Users can save tweets as favourite tweets



Non-Functional:

a) Availability

b) Scalability

c) Reliability

d) Tweets should be delivered within minutes

e) Notifications of tweets should be received within minutes

f) Eventual consistency



Capacity estimation

a) 500 M user base

b) 100 M daily active users

c) Each user follows 100 users on average

d) Each active user posts 1 tweet per day on average

e) Each active user logs in 10 times a day to view tweets

f) One out of 4 tweets includes an image of 5 MB

g) One out of 5 tweets includes a video of 100 MB

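h) Write TPS -> 10^8 tweets/day / 10^5 sec/day -> 1000 writes/sec (a day is 86,400 sec, rounded to 10^5)

i) Read TPS -> 10 * 1000 -> 10,000 reads/sec (10 reads per user for every tweet written)

j) Image Storage -> 25 * 10^6 * 5 MB -> 125 TB / day

k) Video Storage -> 20 * 10^6 * 100 MB -> 2 PB / day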
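The estimates above can be re-derived with a few lines of arithmetic. The sketch below is only a check of the numbers, using the inputs from the list; the variable names are illustrative, and decimal units (1 TB = 10^6 MB) are assumed.

```python
# Back-of-envelope capacity estimate; every input comes from the list above.
DAILY_ACTIVE_USERS = 100_000_000
TWEETS_PER_USER_PER_DAY = 1
LOGINS_PER_USER_PER_DAY = 10
SECONDS_PER_DAY = 10**5           # 86,400 rounded up to keep the math simple
IMAGE_ONE_IN, IMAGE_MB = 4, 5     # one out of 4 tweets carries a 5 MB image
VIDEO_ONE_IN, VIDEO_MB = 5, 100   # one out of 5 tweets carries a 100 MB video
MB_PER_TB = 10**6                 # assumption: decimal units

tweets_per_day = DAILY_ACTIVE_USERS * TWEETS_PER_USER_PER_DAY
write_tps = tweets_per_day // SECONDS_PER_DAY
read_tps = DAILY_ACTIVE_USERS * LOGINS_PER_USER_PER_DAY // SECONDS_PER_DAY
image_tb_per_day = (tweets_per_day // IMAGE_ONE_IN) * IMAGE_MB // MB_PER_TB
video_tb_per_day = (tweets_per_day // VIDEO_ONE_IN) * VIDEO_MB // MB_PER_TB

print(write_tps, read_tps, image_tb_per_day, video_tb_per_day)
```

With the exact 86,400 seconds per day, the write rate comes out closer to 1,160 per second, so the rounded figures are slightly optimistic.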





API design

POST /content

Attributes - content, userID


PUT /content

Attributes - contentID, userID, actionID


GET /feed

Attributes - userID, lastAccessTime


POST /follow

Attributes - userID, targetUserID


DELETE /follow

Attributes - userID, targetUserID
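To make the endpoint semantics concrete, here is a minimal in-memory sketch. It is not an HTTP server: the class `InMemoryApi` and its method names are hypothetical, each method only stands in for one endpoint above, and PUT /content is left out because the notes do not define the possible actions.

```python
import time
from dataclasses import dataclass, field


@dataclass
class InMemoryApi:
    """Toy stand-in for the endpoints above; no HTTP, no persistence."""
    tweets: list = field(default_factory=list)     # (timestamp, userID, content)
    following: dict = field(default_factory=dict)  # userID -> set of targetUserIDs

    def post_content(self, user_id, content, now=None):  # POST /content
        if len(content) > 140:
            raise ValueError("tweet exceeds 140 characters")
        timestamp = time.time() if now is None else now
        self.tweets.append((timestamp, user_id, content))
        return len(self.tweets) - 1                       # used as contentID

    def follow(self, user_id, target_user_id):            # POST /follow
        self.following.setdefault(user_id, set()).add(target_user_id)

    def unfollow(self, user_id, target_user_id):          # DELETE /follow
        self.following.get(user_id, set()).discard(target_user_id)

    def get_feed(self, user_id, last_access_time=0.0):    # GET /feed
        followed = self.following.get(user_id, set())
        return [content for timestamp, author, content in self.tweets
                if author in followed and timestamp > last_access_time]
```

Passing lastAccessTime lets the client fetch only tweets newer than its previous read, which keeps repeated feed requests small.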




Database design

User {

userID int (4 bytes)

login Varchar (15)

FirstName Varchar (15)

SecondName Varchar (15)

lastLogin TIMESTAMP (8)

}

Following {

userID int (4)

targetUserID int(4)

}

Tweet {

tweetID (8 bytes)

userID (4 bytes)

tweetText (256 bytes)

imageURL (256 bytes)

videoURL (256 bytes)

tweetDate (8 bytes)

noOfLikes

noOfDislikes

}

TweetAction{

actionID

actionType

tweetID

userID

actionDate

comments

}
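The column sizes in the Tweet table give a rough figure for metadata storage, separate from the media stored in the object store. The sketch below sums them; the 4-byte size for noOfLikes and noOfDislikes is an assumption, since the schema does not state one, and the daily tweet count comes from the capacity estimate.

```python
# Per-row size of the Tweet table, in bytes, taken from the schema above.
TWEET_COLUMN_BYTES = {
    "tweetID": 8,
    "userID": 4,
    "tweetText": 256,
    "imageURL": 256,
    "videoURL": 256,
    "tweetDate": 8,
    "noOfLikes": 4,      # assumed size, not given in the schema
    "noOfDislikes": 4,   # assumed size, not given in the schema
}
TWEETS_PER_DAY = 100_000_000  # 100 M daily active users * 1 tweet per day

row_bytes = sum(TWEET_COLUMN_BYTES.values())
daily_gb = row_bytes * TWEETS_PER_DAY / 10**9  # decimal gigabytes

print(f"{row_bytes} bytes per row, about {daily_gb:.1f} GB of tweet rows per day")
```

Under these assumptions the tweet rows are tiny compared with the image and video storage, so media dominates the storage budget.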






High-level design

PostTweets






Request flows

  1. User Logged In: The entry point when a user is authenticated and ready to post tweets.
  2. Post Tweets: The action the user takes to compose and submit a tweet.
  3. Object Store (Amazon S3): Stores media such as images and videos associated with the tweets.
  4. Tweet Table: Contains the text of the tweets and URLs to any included media.
  5. Fan Out Service: Gathers follower data to pre-compute timelines.
  6. Message Queues: Distributes the tweets to subscribers.
  7. Subscribers: Responsible for caching the new tweets.
  8. Cache with TTL Eviction Policy: Manages stored tweets with a Time-To-Live policy for cache management and eviction.

Here’s how this flow can be represented in a mermaid diagram:

[Diagram: User Logged In -> Post Tweets -> Object Store (Amazon S3) and Tweet Table -> Fan Out Service -> Following Table and Message Queues -> Subscribers -> Cache -> TTL Eviction Policy]

Explanation of the Request Flow Diagram:

  • A[User Logged In] → B[Post Tweets]: The process begins when a user is logged in and posts a tweet.
  • B[Post Tweets] → C[Object Store Amazon S3]: Images or videos associated with the tweet are stored in the Object Store.
  • B[Post Tweets] → D[Tweet Table]: Textual content and URLs linking to stored media are saved in the Tweet Table.
  • D[Tweet Table] → E[Fan Out Service]: The Fan Out service is triggered to handle the new tweet.
  • E[Fan Out Service] → F[Following Table]: The Fan Out service queries the Following table to get relevant follower data.
  • E[Fan Out Service] → G[Message Queues]: The tweets are sent to the Message Queue for later processing.
  • G[Message Queues] → H[Subscribers]: Subscribers take the tweets from the queue.
  • H[Subscribers] → I[Cache]: New tweets are appended to the local cache.
  • I[Cache] → J[TTL Eviction Policy]: The cache operates under a TTL eviction policy to manage data expiration.
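The flow above can be sketched in a few lines. This is a toy model under stated assumptions: plain Python containers stand in for the Tweet table, Following table, message queue, and cache, and the function names are hypothetical. Media upload and TTL eviction are left out.

```python
from collections import defaultdict, deque

# (userID, targetUserID) rows, standing in for the Following table.
following = {("alice", "bob"), ("carol", "bob")}
tweet_table = {}                     # tweetID -> row, stands in for the Tweet table
message_queue = deque()              # stands in for the message queue
timeline_cache = defaultdict(list)   # follower -> tweetIDs, stands in for the cache


def followers_of(author):
    """Fan Out Service step: look up who follows the author."""
    return sorted(user for user, target in following if target == author)


def post_tweet(tweet_id, author, text, media_url=None):
    """Store the tweet, then enqueue one delivery per follower."""
    tweet_table[tweet_id] = {"userID": author, "tweetText": text,
                             "mediaURL": media_url}
    for follower in followers_of(author):
        message_queue.append((follower, tweet_id))


def run_subscriber():
    """Subscriber step: drain the queue and append to cached timelines."""
    while message_queue:
        follower, tweet_id = message_queue.popleft()
        timeline_cache[follower].append(tweet_id)
```

Because the write path only enqueues work, the post request can return before followers' timelines are updated, which is consistent with the eventual consistency requirement.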

This diagram illustrates the complete flow of a user posting a tweet and how it propagates through the system, ensuring efficient caching and retrieval.







Detailed component design

Twitter Tab





Trade offs/Tech choices

Database -> MySQL, because the data requires relations between users; users are sharded by userID

Tweet Table -> NoSQL DB (Cassandra) for a high volume of reads and writes; partition key is tweetID

MessageQueue -> Kafka or Amazon Kinesis

Cache -> Redis to support TTL
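To illustrate what TTL support means for the timeline cache, here is a pure-Python sketch of the behaviour. It does not use Redis; the class `TTLCache` is hypothetical and only mimics the idea that an entry disappears once its time-to-live has passed.

```python
import time


class TTLCache:
    """Minimal cache where each entry expires ttl_seconds after it is set."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock     # injectable so expiry can be tested without sleeping
        self._store = {}       # key -> (expires_at, value)

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]   # lazy eviction: drop the entry on first read after expiry
            return None
        return value
```

A TTL keeps timelines of users who stop logging in from occupying memory indefinitely, at the cost of rebuilding the timeline on their next visit.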




Failure scenarios/bottlenecks

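HotKey problem due to influencer users -> a single tweet must be fanned out to a very large number of follower timelines

Cache wasted on inactive users -> timelines are pre-computed and cached for users who rarely log in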





Future improvements

HotKey problem due to influencer users -> Tweets posted by influencer users are added to their followers' pre-generated timelines.

Use a hybrid option for generating timeline views based on user analytics: fan-out on write for active users and fan-out on read for inactive users.
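A minimal sketch of the hybrid option, assuming user analytics has already produced a set of active users. The names `ACTIVE`, `post_tweet`, and `read_timeline` are hypothetical, and dictionaries stand in for the Following table, Tweet table, and cache.

```python
ACTIVE = {"alice"}                                # assumed output of user analytics
following = {"alice": {"bob"}, "dave": {"bob"}}   # follower -> followed users
tweets_by_author = {}   # author -> tweetIDs, stands in for the Tweet table
timeline_cache = {}     # follower -> tweetIDs, pre-computed timelines


def post_tweet(author, tweet_id):
    """Fan out on write, but only to followers flagged as active."""
    tweets_by_author.setdefault(author, []).append(tweet_id)
    for follower, targets in following.items():
        if author in targets and follower in ACTIVE:
            timeline_cache.setdefault(follower, []).append(tweet_id)


def read_timeline(user):
    """Active users read the cache; inactive users fan out on read."""
    if user in ACTIVE:
        return list(timeline_cache.get(user, []))
    merged = []
    for author in following.get(user, ()):
        merged.extend(tweets_by_author.get(author, []))
    return sorted(merged)
```

Inactive users pay a slower read on the rare occasions they log in, and in exchange no cache space or fan-out work is spent on them at write time.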