System requirements


Functional:

  • Users should be able to create tweets
  • Users should be able to view tweets from other users whom they are following
  • Users should be able to like other users' tweets
  • Users should be able to share tweets (retweeting)
  • Users should be able to add hashtags
  • Users should be able to tag people
  • Users should be able to follow other users



Non-Functional:

  • App should be available
  • App should be reliable
  • App should be scalable. As the number of users grows, the app should be able to handle the additional load.
  • App should implement rate limiting. There should be a request limit for every user so that we can withstand malicious attacks.
  • App should have monitoring and logging to make sure that everything is functioning properly.



Capacity estimation

Number of Users: 100M

Active Users: 1M

write:read ⇒ 1:10


WRITE:

Let’s say we have a character limit of 100

We have 1M active users. Assume each of them posts one tweet per day:

1M * 100 bytes (1 byte per char) ⇒ 100MB

So we store 100MB of new data every day:

for 1 year ⇒ 365 days ⇒ 100MB * 365 ⇒ 36.5GB

for 10 years ⇒ 10 * 36.5GB ⇒ 365 GB

Given this growth, it makes sense to use a NoSQL DB: it scales out easily, and its flexible schema makes it easier to change the tables in the future.


READ:

1M * 10 views ⇒ 10M views per day

10M * 100 bytes ⇒ 1 GB per day
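
The write and read numbers above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes one tweet per active user per day, 1 byte per character, and decimal units (1 GB = 10^9 bytes). The constant names are made up for illustration.

```python
# Back-of-the-envelope capacity estimate using the numbers from this section.
ACTIVE_USERS = 1_000_000       # daily active users
TWEET_BYTES = 100              # 100-character limit, 1 byte per character
TWEETS_PER_USER_PER_DAY = 1    # assumption implied by the write estimate
READS_PER_WRITE = 10           # write:read ratio of 1:10

MB = 10**6
GB = 10**9

# Write path: new data stored per day, per year, and over 10 years.
write_bytes_per_day = ACTIVE_USERS * TWEETS_PER_USER_PER_DAY * TWEET_BYTES
write_bytes_per_year = write_bytes_per_day * 365
write_bytes_10_years = write_bytes_per_year * 10

# Read path: every written tweet is viewed 10 times.
views_per_day = ACTIVE_USERS * TWEETS_PER_USER_PER_DAY * READS_PER_WRITE
read_bytes_per_day = views_per_day * TWEET_BYTES

print(f"writes per day:    {write_bytes_per_day / MB:.0f} MB")
print(f"writes per year:   {write_bytes_per_year / GB:.1f} GB")
print(f"writes, 10 years:  {write_bytes_10_years / GB:.0f} GB")
print(f"reads per day:     {read_bytes_per_day / GB:.0f} GB")
```

Changing TWEETS_PER_USER_PER_DAY or TWEET_BYTES shows how sensitive the storage estimate is to those two assumptions.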


API design

Create a tweet: /api/post/user_id/tweet

Response:

{
  "status": "200 OK",
  "message": "tweet created"
}

Get tweets for the timeline: /api/get/user_id/tweets

Response:

[
  {
    "user_id": "1234",
    "tweet_id": "456",
    "tweet_url": "https://s3.7233abcdef.com"
  },
  {
    "user_id": "4567",
    "tweet_id": "789",
    "tweet_url": "https://s3.7869abcdef.com"
  }
]

Like a tweet: /api/post/user_id/tweet_id/like

Response:

{
  "status": "200 OK",
  "message": "tweet liked"
}


Follow other users: /api/post/user_id/followee_id/follow

Response:

{
  "status": "200 OK",
  "message": "You are now following XYZ"
}
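
To make the four endpoints concrete, here is a minimal in-memory sketch of the handlers behind them. Everything in it is an assumption for illustration: the class and method names are hypothetical, the like handler takes a tweet_id, and the "404 Not Found" response is not part of the design above.

```python
class TweetApi:
    """In-memory stand-in for the API servers; all names are hypothetical."""

    def __init__(self):
        self.tweets = {}      # tweet_id -> tweet record
        self.likes = set()    # (user_id, tweet_id) pairs
        self.follows = set()  # (follower_id, followee_id) pairs

    def create_tweet(self, user_id, tweet_id, tweet_url):
        self.tweets[tweet_id] = {
            "user_id": user_id,
            "tweet_id": tweet_id,
            "tweet_url": tweet_url,
        }
        return {"status": "200 OK", "message": "tweet created"}

    def get_tweets(self, user_id):
        # Timeline: tweets written by the users that user_id follows.
        followees = {fe for (fr, fe) in self.follows if fr == user_id}
        return [t for t in self.tweets.values() if t["user_id"] in followees]

    def like(self, user_id, tweet_id):
        if tweet_id not in self.tweets:
            # Assumed error response; the design above only lists the success case.
            return {"status": "404 Not Found", "message": "tweet not found"}
        self.likes.add((user_id, tweet_id))
        return {"status": "200 OK", "message": "tweet liked"}

    def follow(self, user_id, followee_id):
        self.follows.add((user_id, followee_id))
        return {"status": "200 OK", "message": f"You are now following {followee_id}"}


api = TweetApi()
api.create_tweet("4567", "789", "https://s3.7869abcdef.com")
api.follow("1234", "4567")
print(api.get_tweets("1234"))
```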



Database design

User

  • user_id: primary key
  • nickname
  • email
  • dob
  • first_name
  • last_name
  • gender

Tweet

  • tweet_id: primary key
  • user_id
  • tweet_url
  • number_of_likes
  • tagged_users_ids
  • hashtag_ids: references hash_tag_id in Hashtags

Follows

  • follower_id
  • followee_id

Favorites

  • user_id
  • liked_tweet_id

Hashtags

  • hash_tag_id: Primary key
  • hashtag_content
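
The tables above can also be written down as typed records, which makes the relationships between them easier to see. This is a sketch only: the field types are assumptions (the design does not specify them), dob is assumed to be an ISO date string, and hashtags are represented only through hashtag_ids.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class User:
    user_id: str          # primary key
    nickname: str
    email: str
    dob: str              # assumed to be an ISO date string
    first_name: str
    last_name: str
    gender: str


@dataclass
class Tweet:
    tweet_id: str         # primary key
    user_id: str          # author of the tweet
    tweet_url: str        # S3 location of the tweet data
    number_of_likes: int = 0
    tagged_users_ids: List[str] = field(default_factory=list)
    hashtag_ids: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class Follow:
    follower_id: str      # the user who follows
    followee_id: str      # the user being followed


@dataclass(frozen=True)
class Favorite:
    user_id: str
    liked_tweet_id: str


@dataclass(frozen=True)
class Hashtag:
    hash_tag_id: str      # primary key
    hashtag_content: str


tweet = Tweet(tweet_id="456", user_id="1234", tweet_url="https://s3.7233abcdef.com")
print(tweet)
```

Follow and Favorite are frozen so that they are hashable and can be kept in sets, which keeps a duplicate follow or like from being stored twice.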




High-level design


The user creates a tweet. The API server sends the tweet data to an S3 bucket, where it is stored, and the tweet_url field of the tweet table saves the S3 URL along with the other tweet details.


Now let's say the user wants to view their timeline. The client asks the server for the latest tweets. The server queries the database and finds all the new tweets to be displayed on the timeline. This is called the pull mechanism.
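
A minimal sketch of the pull mechanism, using in-memory data in place of the database. The created_at field and the function name pull_timeline are assumptions; the Tweet table above does not list a timestamp, but some ordering key is needed to find the latest tweets.

```python
from typing import Dict, List, NamedTuple, Set


class Tweet(NamedTuple):
    tweet_id: str
    user_id: str
    created_at: int  # assumed field: Unix timestamp of creation


def pull_timeline(user_id: str,
                  follows: Dict[str, Set[str]],
                  tweets: List[Tweet],
                  limit: int = 20) -> List[Tweet]:
    """Build the timeline at read time by scanning tweets of followed users."""
    followees = follows.get(user_id, set())
    visible = [t for t in tweets if t.user_id in followees]
    visible.sort(key=lambda t: t.created_at, reverse=True)  # newest first
    return visible[:limit]


follows = {"alice": {"bob", "carol"}}
tweets = [
    Tweet("t1", "bob", 100),
    Tweet("t2", "dave", 200),
    Tweet("t3", "carol", 300),
]
print([t.tweet_id for t in pull_timeline("alice", follows, tweets)])
```

All of the work happens on the read path here, which is why this approach gets expensive when reads outnumber writes.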


In reality, we should precompute a timeline feed for each user so that we can display those tweets as soon as the user comes online, while the backend starts building the next batch of the timeline.
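
One way to precompute the feed is to push each new tweet into its followers' feeds at write time (often called fan-out on write). The sketch below is an assumption about how that could look, not a decision made in this design; FeedStore and its methods are hypothetical names, and feeds are capped so they do not grow without bound.

```python
from collections import defaultdict, deque


class FeedStore:
    """Keeps a ready-made, bounded feed per user (hypothetical helper)."""

    def __init__(self, max_len: int = 100):
        self.followers = defaultdict(set)  # followee_id -> set of follower ids
        self.feeds = defaultdict(lambda: deque(maxlen=max_len))

    def follow(self, follower_id: str, followee_id: str) -> None:
        self.followers[followee_id].add(follower_id)

    def publish(self, author_id: str, tweet_id: str) -> None:
        # Fan out on write: copy the tweet id into every follower's feed.
        for follower_id in self.followers[author_id]:
            # appendleft keeps newest first; a full deque drops the oldest entry.
            self.feeds[follower_id].appendleft(tweet_id)

    def timeline(self, user_id: str) -> list:
        # Reading is just returning the precomputed list.
        return list(self.feeds[user_id])


store = FeedStore(max_len=2)
store.follow("alice", "bob")
for tweet_id in ("t1", "t2", "t3"):
    store.publish("bob", tweet_id)
print(store.timeline("alice"))
```

The cost moves from reads to writes: a user with many followers triggers many feed updates per tweet.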


We want to add a load balancer between the users and the API servers to make sure that traffic is evenly distributed. At the same time, we want the load balancer to take care of authentication and rate limiting.


We want separate API servers for read and write requests to distribute the load even further.


Since our read load is higher than our write load, we should definitely have some kind of caching, using a cache such as Memcached or Redis. We can apply the 80-20 rule: 20% of the tweets generate 80% of the traffic, so that 20% is what we keep in the cache.
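
The caching idea can be sketched as a cache-aside read with least-recently-used eviction, so that the hot tweets stay in memory. This is an in-process illustration under stated assumptions, not a Memcached or Redis client: LRUCache, get_tweet, and the dict standing in for the database are all hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Small least-recently-used cache standing in for the real cache tier."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used


def get_tweet(tweet_id, cache: LRUCache, db: dict):
    """Cache-aside read: try the cache first, fall back to the database."""
    tweet = cache.get(tweet_id)
    if tweet is None:
        tweet = db[tweet_id]        # cache miss: read from the database
        cache.put(tweet_id, tweet)  # populate the cache for later reads
    return tweet


db = {"t1": "hello", "t2": "world", "t3": "again"}
cache = LRUCache(capacity=2)
for tweet_id in ("t1", "t2", "t1", "t3"):
    get_tweet(tweet_id, cache, db)
print(cache.get("t1"), cache.get("t2"))
```

In the example, t2 is evicted because t1 was read again before t3 arrived, which is the behavior we want for frequently read tweets.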


We can also add read replicas of the database. Instead of querying the same data from the primary DB, we can direct all read requests to the replicas.
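
A rough sketch of read/write splitting: writes go to the primary and reads rotate across the replicas. The names are hypothetical, and the replicas are updated synchronously only to keep the example short; real replication is usually asynchronous, so replicas can lag behind the primary.

```python
import itertools


class ReplicatedStore:
    """Primary for writes, round-robin replicas for reads (hypothetical)."""

    def __init__(self, replica_count: int = 2):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key, value) -> None:
        self.primary[key] = value
        # Simplification: copy to replicas immediately instead of asynchronously.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Spread reads across replicas; the primary is never read here.
        replica = self.replicas[next(self._next_replica)]
        return replica.get(key)


store = ReplicatedStore(replica_count=2)
store.write("t1", "hello")
print(store.read("t1"), store.read("t1"))
```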






Request flows

The user creates a post and connects to the server to create a tweet. The server stores the tweet data in S3 and stores the tweet details in the tweets table. The S3 path is stored in tweet_url.



Detailed component design

Let's talk about sharding, since we are going to have large sets of data. There are multiple ways to shard our data.

  1. Sharding using user id
    1. Let's say we have 10 shards. We can then compute user_id % 10 and assign the user's data to that shard.
    2. With this technique, though, we can end up with unevenly loaded shards.
  2. Sharding using tweet_id+creation_time
    1. This is better than using user_id, but we need to make sure that the tweet id is unique so that the data is evenly distributed. The downside is that fetching the tweets of a single user still requires querying every shard.
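
The two sharding options can be sketched as two small routing functions. This is an illustration under assumptions: user ids are integers, tweet ids are strings, and shard_by_tweet hashes only the tweet_id (folding creation_time into the key is left out). The function names are hypothetical.

```python
import hashlib

NUM_SHARDS = 10


def shard_by_user(user_id: int) -> int:
    """Option 1: all data of one user lands on shard user_id % 10."""
    return user_id % NUM_SHARDS


def shard_by_tweet(tweet_id: str) -> int:
    """Option 2: hash the tweet id so tweets spread across all shards."""
    digest = hashlib.md5(tweet_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# With option 1, every tweet of user 1234 goes to the same shard.
print("user 1234 ->", shard_by_user(1234))

# With option 2, one user's tweets may be spread over several shards,
# so reading all of them means asking every shard.
user_tweet_ids = ["456", "457", "458", "459"]
print("tweet shards ->", sorted({shard_by_tweet(t) for t in user_tweet_ids}))
```

Option 1 keeps a user's tweets together but lets a very active user overload one shard; option 2 trades that for scatter-gather reads.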

We should definitely talk about data replication. We should have a secondary database, because with only one database that database is our single point of failure. When we create a tweet, we should also update the secondary database, so that if the primary database fails we can promote the secondary to be the new primary.
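
A minimal sketch of the primary/secondary idea, assuming synchronous dual writes and a health flag that some external check flips. The class and attribute names are hypothetical, and the case where both databases are down is not handled.

```python
class Database:
    """Stand-in for one database node."""

    def __init__(self, name: str):
        self.name = name
        self.rows = {}
        self.healthy = True  # assumed to be maintained by a health check


class ReplicatedTweets:
    def __init__(self):
        self.primary = Database("db-a")
        self.secondary = Database("db-b")

    def _failover_if_needed(self) -> None:
        # Promote the secondary when the primary is marked unhealthy.
        if not self.primary.healthy:
            self.primary, self.secondary = self.secondary, self.primary

    def create_tweet(self, tweet_id: str, data: str) -> None:
        self._failover_if_needed()
        self.primary.rows[tweet_id] = data
        if self.secondary.healthy:
            self.secondary.rows[tweet_id] = data  # keep the standby in sync

    def get_tweet(self, tweet_id: str):
        self._failover_if_needed()
        return self.primary.rows.get(tweet_id)


store = ReplicatedTweets()
store.create_tweet("t1", "hello")
store.primary.healthy = False        # simulate a primary failure
print(store.get_tweet("t1"), "served by", store.primary.name)
```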




Trade offs/Tech choices

  1. Instead of a SQL DB, I would prefer a NoSQL DB: the schema is flexible, and the system is easier to scale. Adding more features will require adding fields to existing tables and maybe adding new tables, which is easier with a NoSQL database than with a SQL database.
  2. Instead of relying only on read replicas for read queries, we would first try Redis. If Redis can handle our read traffic, we won't need as many extra DB servers for read requests, so we can limit the number of read replicas and save some maintenance cost.





Failure scenarios/bottlenecks

If a tweet goes viral and lots of users want to access it, we want to make sure that our servers can handle that traffic. Right now, nothing in the design addresses this: if the tweet's data lives on a single server, that server won't be able to handle the load.




Future improvements

I would definitely like to implement a proper rate limiting algorithm so that we can handle bot attacks and make sure that users are not sending an excessive number of read and write requests.
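
One common choice for this is a token bucket; the design above does not pick an algorithm, so this is just one possible sketch. Each user would get a bucket that refills at a fixed rate, and a request is allowed only if a token is available. The class name and the optional now parameter (used to make the example deterministic) are assumptions.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, then `refill_per_second` requests/sec."""

    def __init__(self, capacity: int, refill_per_second: float, now=None):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.updated_at = time.monotonic() if now is None else now

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill based on the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1  # spend one token for this request
            return True
        return False          # over the limit: reject the request


bucket = TokenBucket(capacity=2, refill_per_second=1.0, now=0.0)
print([bucket.allow(now=0.0) for _ in range(3)])  # third request is rejected
print(bucket.allow(now=1.0))                      # one token has been refilled
```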

I would also like to have some kind of analytics, for example to see at what time of day we get peak traffic, when our app was slow, and how our read and write requests are performing.
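
As a small example of the peak-traffic question, the sketch below counts requests per hour of the day from request timestamps. It assumes the timestamps are Unix epoch seconds taken from the request logs and groups them in UTC; peak_hour is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime, timezone


def peak_hour(request_timestamps):
    """Return (hour_of_day, request_count) for the busiest hour, or None."""
    hours = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour
        for ts in request_timestamps
    )
    if not hours:
        return None
    return hours.most_common(1)[0]


# Four requests: one in hour 0, two in hour 1, one in hour 2 (UTC).
print(peak_hour([0, 3600, 3700, 7200]))
```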