System requirements
Functional:
- Ability to compose tweets:
- persist forever
- text only (may contain urls to other multimedia, but will not store that media here)
- 280 char max
- Ability to share tweets
- Ability to track updates of other users (new followee tweets appear in feed)
- Ability to favorite tweets
- Ability to associate a tweet with other tweets (hashtag)
- Ability to view another user's tweets
Non-Functional:
- horizontally scalable - load-balanced microservice architecture
- highly available
- HTTPS
Capacity estimation
- 300 million current users
- Growth of 100 thousand per day (~3 million per month)
- average number of followers per user - 500
- average number of followees per user - 300
- max tweet size is 280 characters
- average 5 tweets per user daily (300 million users * 5 tweets/day = 1.5 billion tweets daily ~ 0.55 trillion / year)
- average hashtags per tweet - 2
- average 10 favorited tweets per user per day (300 million users * 10 favorites/day = 3 billion / day, ~1.1 trillion / year)
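The daily totals above can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses the 300 million user figure from the per-day estimates; the constant and function names are illustrative, and the per-second values are plain averages that do not model peak traffic.

```python
# Back-of-the-envelope traffic estimates from the figures listed above.
USERS = 300_000_000              # user count used in the per-day estimates
TWEETS_PER_USER_PER_DAY = 5
HASHTAGS_PER_TWEET = 2
FAVORITES_PER_USER_PER_DAY = 10
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = USERS * TWEETS_PER_USER_PER_DAY
hashtags_per_day = tweets_per_day * HASHTAGS_PER_TWEET
favorites_per_day = USERS * FAVORITES_PER_USER_PER_DAY


def per_second(per_day: int) -> float:
    """Average rate over a day; peaks are not modeled here."""
    return per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    for name, value in [
        ("tweets", tweets_per_day),
        ("hashtags", hashtags_per_day),
        ("favorites", favorites_per_day),
    ]:
        print(f"{name:10s} {value:>15,} / day  ~{per_second(value):,.0f} / s")
```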
API design
- REST API - returns JSON payload
- user service:
- /login
- POST request with username and password in the Authorization header (Base64-encoded, as in HTTP Basic authentication)
- Returns login token and 200 OK on success (user exists and correct password)
- Returns 401 Unauthorized if invalid login credentials
- /create-user
- PUT request with username and password in the Authorization header (Base64-encoded, as in HTTP Basic authentication)
- Returns 200 OK on success (new user)
- Returns 400 Bad Request on error (e.g. user already exists)
- feed service:
- /get-feed?user-id=<user_id>&date=<YYYYMMDD>&max-count=<max # of most recent tweets>
- GET request with user id
- Returns list of at most |max-count| most recent tweets from followees, starting from date and working backwards, and 200 OK on success
- Returns 401 Unauthorized if not logged in
- Returns 400 Bad Request on error (e.g., the user does not exist)
- /view-user?user-id=<user_id>&date=<YYYYMMDD>&max-count=<max # of most recent tweets>
- GET request with target user id
- Returns list of at most |max-count| most recent tweets from the target user, starting from date and working backwards, and 200 OK on success
- Returns 401 Unauthorized if not logged in
- Returns 400 Bad Request on error (e.g., the user does not exist)
- /get-favorite?user-id=<user_id>
- GET request with user id
- Returns list of the user's favorite tweets and 200 OK on success
- Returns 401 Unauthorized if not logged in
- Returns 400 Bad Request on error (e.g., the user does not exist)
- tweet service:
- /tweet
- PUT request with tweet content in request body
- Returns 200 OK and tweet id on success
- Returns 401 Unauthorized if not logged in
- Returns 400 Bad Request on error (e.g. tweet too large or empty)
- /favorite?user-id=<user_id>&tweet-id=<tweet_id>
- PUT request with tweet id and user id as request parameters
- Returns 200 OK on success
- Returns 401 Unauthorized if not logged in
- Returns 400 Bad Request on error (e.g., user or tweet does not exist)
- /share?user-id=<user_id>&tweet-id=<tweet_id>
- PUT request with tweet id and user id as request parameters
- Returns 200 OK on success
- Returns 401 Unauthorized if not logged in
- Returns 400 Bad Request on error (e.g., user or tweet does not exist)
- follow service:
- /follow?followee=<followee user_id>&follower=<follower user_id>
- PUT request with follower and followee ids
- Returns 200 OK on success (the follow association was added or already existed)
- Returns 401 Unauthorized if not logged in
- Returns 400 Bad Request on error (e.g., either user id does not exist)
- /unfollow?followee=<followee user_id>&follower=<follower user_id>
- PUT request with follower and followee ids
- Returns 200 OK on success (the follow association was removed or did not exist)
- Returns 401 Unauthorized if not logged in
- Returns 400 Bad Request on error (e.g., either user id does not exist)
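The status-code contract listed for /tweet can be sketched as a small handler. This is only an illustration under stated assumptions: `Response`, `handle_tweet`, the `valid_tokens` set, and the UUID-based id are hypothetical names and choices, not part of the design above. Only the 280-character limit and the 200/401/400 outcomes come from the API description.

```python
import uuid
from dataclasses import dataclass
from typing import Optional, Set

MAX_TWEET_CHARS = 280  # limit from the functional requirements


@dataclass
class Response:
    status: int
    body: dict


def handle_tweet(login_token: Optional[str], content: str,
                 valid_tokens: Set[str]) -> Response:
    """Hypothetical handler for PUT /tweet."""
    # 401 when the caller has no valid login token.
    if login_token is None or login_token not in valid_tokens:
        return Response(401, {"error": "not logged in"})
    # 400 when the tweet is empty or exceeds the size limit.
    if not content or len(content) > MAX_TWEET_CHARS:
        return Response(400, {"error": "tweet empty or too large"})
    # Assumption: a 16-character id, chosen to fit the CHAR(16) tweet_id column.
    tweet_id = uuid.uuid4().hex[:16]
    return Response(200, {"tweet_id": tweet_id})
```

The other endpoints follow the same pattern: check the login token first, validate the parameters next, and only then perform the write.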
Database design
- user table:
- schema: <user_id VARCHAR(15), password_hash CHAR(8), create_time TIMESTAMP, last_login TIMESTAMP, email VARCHAR(255)>
- row size ~ 15 + 1 ('\0') + 8 + 8 + 8 + 255 + 1 ('\0') bytes = 296 bytes
- total size ~ 300 million users * 296 bytes = 88.8 GB, roughly 90 GB
- Growing at 100k users / day: 300 million / 100k users / day = 3000 days (roughly 8 years) until the table doubles in size, so no real capacity worries here.
- follow table:
- schema: <followee_id VARCHAR(15), follower_id VARCHAR(15)>
- row size = 32 bytes
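- total size: (500 followers per user (average) + 300 followees per user (average)) * 300 million users * 32 bytes = 7.68 TB. This is an upper bound, since each follow relationship is a single row and summing followers and followees counts every row from both ends; counting each relationship once gives between 300 and 500 rows per user, i.e. 2.88 TB to 4.8 TB.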
- tweet table:
- schema: <tweet_id CHAR(16), user_id VARCHAR(15), time TIMESTAMP, content VARCHAR(280)>
- row size = 16 + 15 + 1 + 8 + 280 + 1 bytes = 321 bytes
- total size ~ 300 million users * 5 tweets / day * 321 bytes ~ 481.5 GB / day; * 365 ~ 176 TB / year
- For scalability, should shard on the user id.
- hashtag table:
- schema: <hashtag VARCHAR(140), tweet_id CHAR(16)>
- row size = 140 + 1 + 16 = 157 bytes
- total size ~ 2 hashtags / tweet * 300 million users * 5 tweets / day = 3 billion hashtags / day * 157 bytes = 471 GB / day, so roughly the same size as the tweet table
- favorite table:
- schema: <tweet_id CHAR(16), user_id VARCHAR(15)>
- row size = 32 bytes
- total size ~ 32 bytes * 300 million users * 10 tweets / day ~ 96GB/day ~ 35TB/year
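The schemas above can be tried out in an in-memory SQLite database. This is a sketch only: the primary keys and NOT NULL constraints are assumptions, SQLite does not enforce the declared VARCHAR/CHAR lengths, and the timestamps are stored as ISO-8601 strings rather than the YYYYMMDD form used by the API. The hypothetical `view_user` function shows the kind of query that /view-user implies.

```python
import sqlite3

# Column names and types follow the schemas above; keys are assumptions.
SCHEMA = """
CREATE TABLE user (
    user_id       VARCHAR(15) PRIMARY KEY,
    password_hash CHAR(8) NOT NULL,
    create_time   TIMESTAMP NOT NULL,
    last_login    TIMESTAMP,
    email         VARCHAR(255) NOT NULL
);
CREATE TABLE follow (
    followee_id VARCHAR(15) NOT NULL,
    follower_id VARCHAR(15) NOT NULL,
    PRIMARY KEY (followee_id, follower_id)
);
CREATE TABLE tweet (
    tweet_id CHAR(16) PRIMARY KEY,
    user_id  VARCHAR(15) NOT NULL,
    time     TIMESTAMP NOT NULL,
    content  VARCHAR(280) NOT NULL
);
CREATE TABLE hashtag (
    hashtag  VARCHAR(140) NOT NULL,
    tweet_id CHAR(16) NOT NULL,
    PRIMARY KEY (hashtag, tweet_id)
);
CREATE TABLE favorite (
    tweet_id CHAR(16) NOT NULL,
    user_id  VARCHAR(15) NOT NULL,
    PRIMARY KEY (tweet_id, user_id)
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the five tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def view_user(conn: sqlite3.Connection, user_id: str,
              before: str, max_count: int) -> list:
    """Most recent tweets by user_id at or before `before`, newest first."""
    return conn.execute(
        "SELECT tweet_id, content FROM tweet "
        "WHERE user_id = ? AND time <= ? "
        "ORDER BY time DESC LIMIT ?",
        (user_id, before, max_count),
    ).fetchall()
```

A production store would also need an index on (user_id, time) for this query to stay cheap.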
High-level design
- API Gateway directs incoming requests to the services
- User Service - handles creation and authentication of users.
- Tweet Service - provides for tweeting, retweeting, and favoriting tweets.
- Follow Service - handles the association/disassociation of one user to another (i.e., follower <-> followee).
- Feed Service - handles retrieving relevant tweets, viewing latest user tweets, and retrieving favorite tweets.
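The gateway's job of directing requests can be pictured as a lookup from request path to owning service. The sketch below uses the endpoint paths from the API design; the `ROUTES` table, the `route` function, and the service identifiers are illustrative names, and a real gateway would also handle authentication, load balancing, and retries.

```python
from urllib.parse import urlsplit

# Endpoint path -> owning service, following the API design above.
ROUTES = {
    "/login": "user_service",
    "/create-user": "user_service",
    "/get-feed": "feed_service",
    "/view-user": "feed_service",
    "/get-favorite": "feed_service",
    "/tweet": "tweet_service",
    "/favorite": "tweet_service",
    "/share": "tweet_service",
    "/follow": "follow_service",
    "/unfollow": "follow_service",
}


def route(url: str) -> str:
    """Return the name of the service that should handle this URL."""
    path = urlsplit(url).path  # drops the query string
    service = ROUTES.get(path)
    if service is None:
        raise LookupError(f"no service registered for {path}")
    return service
```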
Request flows
- User Service - The user table is split into a write master and read replicas. The User Service always uses the master, while other services rely on the read replicas.
- Tweet Service - A message queue lets tweets be persisted asynchronously; the Feed Service is expected to disseminate them to followers. For tweets and shares, the tweet content is written to the Tweet DB Master. For favorites, the request is sent to the Favorite Service, which checks the user db (read) and the tweet db (read) and then writes the favorite record to the Favorite DB Master. Newly created or favorited tweets are sent to all the feed caches.
- Follow Service - The follow record is written to the Follow DB Master, and its read replica is used by the Feed Service. All follow changes are sent to the feed caches.
- Feed Service - The Feed Service first checks the cache and if the cache is up-to-date (configurable TTL), returns immediately with cached data. If the cache contains old data, the feed is updated by querying the tweet read db, the favorite read db, and the follow read db.
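The Feed Service flow described above is essentially a read-through cache with a TTL. The following is a minimal sketch under assumptions: `FeedCache`, its `rebuild` callback (standing in for the queries against the tweet, favorite, and follow read dbs), and the injectable `clock` are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


class FeedCache:
    """Read-through feed cache with a configurable TTL."""

    def __init__(self, ttl_seconds: float,
                 rebuild: Callable[[str], List[str]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._rebuild = rebuild  # stands in for the read-replica queries
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def get_feed(self, user_id: str) -> List[str]:
        now = self._clock()
        entry = self._entries.get(user_id)
        # Fresh entry: return the cached feed immediately.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Missing or stale entry: rebuild from the read dbs and store it.
        feed = self._rebuild(user_id)
        self._entries[user_id] = (now, feed)
        return feed
```

Passing the clock in as a parameter keeps the TTL behaviour testable without sleeping.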
Detailed component design
flowchart TD
client[client] --> api_gateway{API Gateway}
api_gateway --> |/login, /create-user|user_service[User Service]
api_gateway --> |/get-feed, /view-user, /get-favorite|feed_service[Feed Service]
api_gateway --> |/tweet, /favorite, /share|tweet_service[Tweet Service]
api_gateway --> |/follow, /unfollow|follow_service[Follow Service]
user_service --> |/create-user, /login|user_db_master[(User DB Master)]
user_db_master --> user_db_read_replica[(User DB Read)]
tweet_service --> |/tweet, /share|tweet_write_mq{{Tweet MQ}}
tweet_write_mq --> tweet_db_master[(Tweet DB Master)]
tweet_db_master --> tweet_db_read_replica[(Tweet DB Read)]
tweet_service -->|/favorite|favorite_write_mq{{Favorite MQ}}
favorite_write_mq --> favorite_service[Favorite Service]
favorite_service <--> user_db_read_replica
favorite_service <--> tweet_db_read_replica
favorite_service --> favorite_db_master[(Favorite DB Master)]
favorite_db_master --> favorite_db_read_replica[(Favorite DB Read)]
follow_service --> follow_write_mq{{Follow MQ}}
follow_write_mq --> follow_db_master[(Follow DB Master)]
follow_db_master --> follow_db_read_replica[(Follow DB Read)]
feed_service <--> feed_cache
tweet_db_read_replica --> tweet_feed_mq{{Tweet Feed MQ}}
tweet_feed_mq --> feed_cache
follow_db_read_replica --> follow_feed_mq{{Follow Feed MQ}}
follow_feed_mq --> feed_cache
favorite_db_read_replica --> favorite_feed_mq{{Favorite Feed MQ}}
favorite_feed_mq --> feed_cache
feed_cache <--> tweet_db_read_replica
feed_cache <--> follow_db_read_replica
feed_cache <--> favorite_db_read_replica
Trade offs/Tech choices
- Message queues are used extensively for scalability. This comes at the cost of feed latency: feeds could be minutes out of date, depending on cache update requirements.
- User DB, Favorite DB, and Follow DB can all be relational since their row sizes are fairly consistent.
- Tweet DB is non-relational since the tweet content is highly variable in size. Furthermore, some non-relational dbs offer built-in text search, which would help with future tweet search.
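The message-queue trade-off can be illustrated with a small single-process simulation: a write is acknowledged as soon as it is enqueued, but readers only see it after a consumer drains the queue. `QueuedTweetStore` and its methods are hypothetical names, and a real deployment would use a message broker and separate consumer processes instead of an in-memory deque.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class QueuedTweetStore:
    """Writes return once enqueued; reads only see drained (persisted) data."""

    def __init__(self) -> None:
        self._queue: Deque[Tuple[str, str]] = deque()
        self._persisted: List[Tuple[str, str]] = []

    def submit(self, user_id: str, content: str) -> None:
        # Fast path for the writer: enqueue and return immediately.
        self._queue.append((user_id, content))

    def drain(self, max_items: Optional[int] = None) -> int:
        """Consumer step: move queued tweets into the persisted store."""
        moved = 0
        while self._queue and (max_items is None or moved < max_items):
            self._persisted.append(self._queue.popleft())
            moved += 1
        return moved

    def read(self, user_id: str) -> List[str]:
        return [content for uid, content in self._persisted if uid == user_id]

    def backlog(self) -> int:
        """Queued writes that readers cannot see yet."""
        return len(self._queue)
```

The backlog size is the quantity to watch: the longer the queue, the further feeds lag behind accepted writes.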
Failure scenarios/bottlenecks
- Feed Cache - if the cache has many misses (or holds out-of-date entries), more reads must be performed against the favorite, tweet, and follow read dbs. The read dbs may themselves lag behind, since newly written data may not have been replicated yet. In that case the cache is refreshed from stale replicas, considers itself up-to-date, and returns potentially incomplete results (e.g., missing a new followee's tweets).
- As tweets are sharded by user, a famous user with a large number of followers could cause shard imbalances as well as delays in feed processing.
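The shard imbalance caused by a famous user can be made concrete with a hash-based sharding sketch. The shard count, the SHA-256 based `shard_for` function, and the follower numbers in the usage note are assumptions chosen for illustration; the design above only states that tweets are sharded on the user id.

```python
import hashlib
from collections import Counter
from typing import Dict

NUM_SHARDS = 8  # assumption; the design does not fix a shard count


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user id to a shard with a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def fanout_load(followers_by_user: Dict[str, int],
                num_shards: int = NUM_SHARDS) -> Counter:
    """Feed updates generated per shard if every user tweets once."""
    load: Counter = Counter()
    for user_id, followers in followers_by_user.items():
        load[shard_for(user_id, num_shards)] += followers
    return load
```

For example, with 1000 users at 500 followers each plus one account with 10 million followers, the shard holding that account carries far more fan-out work than all the other shards combined.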
Future improvements
- Implement hashtag service
- Implement textual tweet search
- Implement user search
- Implement private/public tweets