System requirements


Functional:


User can tweet (send) up to 140 character message.

User can follow other users.

User can like other users' tweets.

User's home feed will show tweets from the users they are following.

The home feed will show top K popular tweets, based on the number of likes they receive, and the number of the followers the author has.



Non-Functional:


Scalability. It will have to serve a very large population, e.g., 500M DAU.

Response time. User has to see tweets quickly. When user opens home feed, the first 10 tweets should show up within 500ms.

Availability.


Consistency requirement can be a little bit relaxed. It does not require strong consistency like banking transactions. Eventual consistency would suffice. For example, if a user tweets something. If a user in the same geographic region sees the tweet in 1 second, and another user on the other side of the earth sees it after 30 seconds, that would be acceptable.


Security, content moderation, and anti abuse protection are all important, but I will not focus on them in this exercise due to lack of time. I will come back to them if I have time.




Capacity estimation


500M DAU

Each user, on average, tweets twice a day. 1B tweets / day.

Each user, on average, views 100 tweets a day.


At peak, the system should scale up to 20% of DAU - 100M users - interacting with the system.


I will think about storage capacity first.


1B tweets a day. Each tweet is 140 chars. Considering international encoding and metadata, it seems reasonable to assume each message takes up to 512 bytes.


So this data increases by 512GB every day. In two years, it would be 365 TB.


This goes beyond the typical capacity of RDB. So we'll pick a NoSQL DB.


A document based NoSQL DB, e.g., MongoDB, seems like a good candidate. it is more scalable than RDBs. It has schema flexibility, which allows future enhancements of the application.


The data model would look something like:


Tweet document:

  • tweet_ID: primary key
  • created_by: user ID
  • posted_time
  • content
  • medialink: link to picture or video content
  • number_of_likes
  • hashtags: list of hashtag strings used in the tweet.
  • users_mentioned: list of users mentioned in the tweet


User:

  • user_ID: primary key
  • email
  • name
  • nickname
  • DOB
  • gender


Bottleneck analysis:


1 - In this data model, number_of_likes seems to be a challenge to me. If a famous person posts something, and millions of users click "Like" button within one minute, it would overwhelm this document of this database.


One approach to overcome this is to break up this into multiple (let's say 100) sub-counters, and make different DB nodes be responsible for each sub-counter. I will come back to this if I have time.


2 - Another important bottleneck is when a user with millions of followers (e.g. famous people) tweets something, the tweet should show up on million of users' home feed. I will look into this deeper in DB design section.


They should also receive notifications, but I will punt notifications in this particular exercise.


I will consider concurrency in later sections: millions of users viewing the same message concurrently, tweeting messages with the same hash tag, etc.






API design


tweet(user_ID, content)

User represented by user_ID tweets this content.

The content will be parsed by the server side for users and hashtags mentioned.


follow(follower, followed)

User represented by follower user ID will follow user represented by followed user ID.


like(user_ID, tweet_ID)

User represented by the user_ID likes the tweet represented by tweet_ID.


home_feed(user_ID, offset, number)

Returns JSON document containing tweets that should be shown on the user's home feed.

offset represents where in the list of top tweets it should start from. number represents the number of tweets this API requests.

For example, when a user first opens their home feed, it would be from offset 0. It is important for this API to respond quickly, so it would be advisable to keep number relatively small - for instance, 5 or 10 for a mobile app client. After the initial 5 or 10 tweets are shown to the user, the client can request more, potentially in the background, to prepare for more feeds in case user scrolls the home feed.


Error handling would follow HTTP error code. For example, 4XX for client side error (e.g. liking a tweet ID that does not exist), 5XX for server side error (e.g. bug causing some of the APIs to fail). It would be important for client to handle too many requests error (429) correctly. For example, if home_feed() API returns 429, it would indicate the servers are too busy to service this potentially expensive call. Therefore, it would be advisable for the client to want for some time (say seconds) before calling this again.


Future direction: for a more dynamic content update (e.g. home feed), it might be helpful to consider a WebSocket based API for the client to receive real time update from the service. I will not focus on this for now, but may come back if I have time.




Database design


I chose MongoDB as the main database for its scalability and schema flexibility.

These are the data models.


Tweet:

  • tweet_ID: primary key
  • created_by: user ID, index
  • posted_time: index
  • content
  • medialink: link to picture or video content
  • number_of_likes


User:

  • user_ID: primary key
  • email: index
  • name
  • nickname
  • DOB
  • gender: index


Hashtag:

  • hashtag_ID: primary key
  • hashtag: the hashtag string
  • tweets: tweet_IDs who have this hashtag.



I decided to take Hashtag is a separate document from Tweet document because the main functionality we want to implement efficiently is the mapping from hash tag to tweets (e.g. to display tags with this hash tag), instead of the other way around.


I decided to create UserMentioned document, which stores tweet IDs that mention a user, separate from the User document. A user (e.g. a famous person) can be mentioned by many tweets in a short amount of time. If I store the mentions in the User document, we would have to load and save the User document every single time. That seems redundant.


User.number_of_likes would have a scalability challenge. When a famous person tweets, it'd be possible millions of users would press the Like button. This would create the Tweet document a hot spot. To avoid this, I would create a sharded counter service for users with more than, e.g., 1,000 followers. For such users, the sharded counter service would do the counting, using multiple (let's say 100) subcounters, stored in a high performance key-value store like Redis Cache. Once in a while (say every 5 minutes or so), the service would write the number_of_likes in Tweet document.


Follows:

  • follower: user_ID of the user who is following
  • followed: user_ID of the user being followed
  • timestamp


Follows document borrows a concept of normalization from RDB. By having this document separated from User document, we can query the follows relationship in both ways efficiently: (1) given a user, list all the users they are following, and (2) given a user, list all the users who are following them.



High-level design


Client: This represents the user's interface, such as a mobile app or web browser, from which the user interacts with the system.


DNS (Domain Name System): It translates human-readable domain names (like www.example.com) into machine-readable IP addresses.


CDN (Content Delivery Network): This speeds up the delivery of static and dynamic content by caching content at edge locations closer to the user, improving performance and reducing latency.


Rate Limiter: This component restricts the number of requests a user can make in a given time frame, helping to prevent abuse and maintain the service's quality by avoiding overload.


Authentication: This process verifies the identity of a user before they can interact with the system, ensuring that interactions are secure and authorized.


API Gateway: Serves as the entry point for all client requests. It directs API calls to the appropriate services, handles load balancing, and provides an additional layer of security.


Load Balancer: Distributes incoming traffic across multiple servers to ensure no single server becomes overloaded, improving reliability and performance.


Services: These are specialized services that handle different aspects of the system:

  • Post Service: Manages operations related to tweets, like posting new tweets.
  • Fanout Service: Handles the distribution of tweets to the followers of a user, ensuring that all relevant users see the tweet.
  • Notification Service: Manages sending notifications to users about new tweets or interactions.


Post Cache: Temporarily stores recent or popular tweets to speed up retrieval and reduce database load.

(When a user posts a new tweet, it can be written to the Post Cache and the Post DB simultaneously, ensuring that the post is immediately available for other users to view.

When a user requests to view posts, the system first checks the Post Cache. If the requested data is available in the cache, it is served directly to the user, bypassing the database.

If the data is not in the cache, it is retrieved from the Post DB and then stored in the cache for future requests, implementing a typical cache-aside or lazy-loading strategy.)


Post DB MongoDB: The database where tweets are permanently stored. MongoDB is chosen likely for its flexibility and performance with large volumes of data.


Message Queue: Temporarily holds messages to be processed by workers. This helps in managing load and ensuring reliable message processing. The message contains the friend list and the new post ID.


Fanout Workers: These workers take tasks from the message queue and execute them, such as updating user feeds when a new tweet is posted.


News Feed Cache: Caches the news feed data to quickly deliver updates to users without needing to query the database repeatedly. It stores <post_id, user_id>


Graph DB, Amazon Neptune: This specialized database manages complex and highly connected data. In this case, it likely stores relationships like who follows whom, enabling efficient retrieval of friend IDs for the fanout service.


User Cache: Temporarily stores user data, like profile information, to reduce the load on the user database.


User DB: The database where user information is permanently stored, allowing for retrieval and management of user data.


Data should be replicated among data centres, too, for greater fault tolerance.



Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...






Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...






Trade offs/Tech choices

Database choice for this service is an interesting decision.


If I picked RDB for tweets, it would give us strong consistency. However, it would struggle to scale in terms of size (estimated to be 365TB in two years).


If I picked a wide-column database like Cassandra, scalability would even be better than MongoDB. However, the schema would be rigid, making feature enhancements difficult.


MongoDB strikes the right balance - advantage in scalability and performance over RDB, while maintaining advantage such as schema flexibility and normalization than wide column DB.




Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.






Future improvements


If I have time, I will add retweeting.