System requirements


Functional:

We need to support the following functionality:

  • user creates and posts a tweet
  • user starts/stops following another user
  • home feed with tweets from users followed by the user


Registration, authorisation, and notifications are also very important, but we'll leave them out of scope for now.



Non-Functional:

We want the system to

  • have high availability - we're going to achieve this by replicating services and sharding the DB
  • be scalable and handle peak loads - here stateless services will help, since they can be added and removed according to the current load
  • have low latency - low latency for write requests will be achieved by functional partitioning of the DB, and for read requests by caching and leader-follower replication


Capacity estimation

  • 100k DAU
    • peak values can reach x10+
  • 5 tweets per user per day on average
    • may scale along with the DAU (so x10+ also)


Let's assume a tweet is ~70 characters long, so it takes about 140B of storage (~2 bytes per character), and that roughly 1 in 5 tweets includes a photo (~5MB). That leaves 4 text-only tweets and 1 photo tweet per user per day, so we will need:

10 ^ 5 * 4 * 140 = 5.6 * 10 ^ 7 = 56MB of storage per day for text content right now, and up to x20 later on once the DAU base and user activity have grown.

10 ^ 5 * 1 * 5 * 10 ^ 6 = 5 * 10 ^ 11 = 500GB of storage per day for photos, which we can reduce by preprocessing and optimising the original files.
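
The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the constant and function names are made up for illustration, and it assumes 4 text-only tweets plus 1 photo tweet per user per day. The text figure depends on whether a tweet is counted as 140B (~2 bytes per character) or as 70B (1 byte per character), so both are shown.

```python
# Back-of-the-envelope storage estimate (all names here are illustrative).
DAU = 10**5                  # daily active users
TEXT_TWEETS_PER_USER = 4     # assumed text-only tweets per user per day
PHOTO_TWEETS_PER_USER = 1    # assumed photo tweets per user per day (1 in 5)
PHOTO_BYTES = 5 * 10**6      # ~5MB per photo


def daily_storage_bytes(dau, bytes_per_text_tweet):
    """Return (text_bytes, photo_bytes) written per day."""
    text = dau * TEXT_TWEETS_PER_USER * bytes_per_text_tweet
    photos = dau * PHOTO_TWEETS_PER_USER * PHOTO_BYTES
    return text, photos


text_140, photos = daily_storage_bytes(DAU, 140)
text_70, _ = daily_storage_bytes(DAU, 70)

print(text_140 / 10**6, "MB of text per day at 140B per tweet")  # 56.0
print(text_70 / 10**6, "MB of text per day at 70B per tweet")    # 28.0
print(photos / 10**9, "GB of photos per day")                    # 500.0
```

Either way, photos dominate storage by several orders of magnitude, which is why preprocessing them matters far more than the exact size of the text.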




API design

  • POST /api/v1/tweets/new - returns a status code with some metadata about the new tweet, or an identifier that can be used to request the processing status
  • PUT /api/v1/tweets/{tweet_id}/like - returns a status code
  • POST /api/v1/users/{user_id}/follow - returns a status code along with some metadata about the user followed
  • GET /api/v1/tweets/user_feed/{user_id} - returns a paginated collection of tweets for a specific user, ordered according to business logic (e.g. top K popular/newest)
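
To make "paginated collection" more concrete, here is a minimal sketch of how the feed endpoint could slice an already ordered list of tweets. The `Page` type, the `paginate` function, and the offset-style cursor are assumptions for illustration; the design does not prescribe a pagination scheme.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    items: List[dict]
    next_cursor: Optional[int]  # offset of the next page, None when exhausted


def paginate(tweets: List[dict], cursor: int = 0, limit: int = 20) -> Page:
    """Offset-based pagination over a list that is already in feed order."""
    chunk = tweets[cursor:cursor + limit]
    end = cursor + len(chunk)
    return Page(items=chunk, next_cursor=end if end < len(tweets) else None)


# Usage: walk a 5-tweet feed two tweets at a time.
feed = [{"id": i} for i in range(5)]
first = paginate(feed, cursor=0, limit=2)
last = paginate(feed, cursor=4, limit=2)
```

An offset cursor is the simplest option; a cursor based on the last seen tweet's timestamp or id would probably behave better when new tweets arrive between page requests.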


Database design

We'll need the following entities:

  • Users
    • id - unique identifier of the user
    • user_info - some set of data about the user: name, address, etc.
    • register_date - date the account was created
  • Tweets
    • id - unique identifier of the tweet
    • user_id - identifier of the user who created the tweet
    • tweet_metadata - some metadata, like location
    • contents - we can store the content itself, or a reference to another storage (like an S3 bucket for photos)
    • created_at - creation timestamp
  • Follows
    • follower_id
    • followee_id
    • created_at


The heaviest load is expected on the Tweets DB, so we'll partition it by user_id and set up leader-follower replication.
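
As a rough illustration of partitioning by user_id, the sketch below routes every tweet to a shard derived from a stable hash of its author's id. The `Tweet` dataclass mirrors the entity listed above; `shard_for` and `NUM_SHARDS` are hypothetical names, and hash-modulo routing is just one possible scheme (consistent hashing would make resharding cheaper).

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

NUM_SHARDS = 8  # assumed number of Tweets DB partitions


@dataclass
class Tweet:
    id: int
    user_id: int
    contents: str        # text, or a reference to external storage for photos
    created_at: datetime


def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a user_id to a shard index using a hash that is stable across processes."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


tweet = Tweet(id=1, user_id=42, contents="hello", created_at=datetime.now(timezone.utc))
target_shard = shard_for(tweet.user_id)
```

Because the shard depends only on user_id, all tweets of one user land on the same partition, so reading a single user's timeline touches one shard.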



High-level design






Request flows

  1. User posts a tweet
    • a request with the tweet text is sent to the Tweet service
    • the service either adds the tweet to the database, or sends the photo content for processing and returns an identifier that the user can use to check the processing status
    • [Optional] a notification about the new tweet is sent via the Notification service
  2. User follows another user
    • a follow request is sent to the Follow service
    • the relation between the users is updated in the DB
    • [Optional] a notification about the new follower is sent via the Notification service
  3. User likes a tweet
    • a request is sent to the Likes service
    • the likes state is updated in the DB
    • [Optional] a notification about the new like is sent
  4. User requests their feed
    • a request is sent to the Feed service
    • the Feed service constructs a feed and returns it to the user
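
The tweet-posting flow can be sketched as a small in-memory service. This is only an illustration under assumptions: the `TweetService` class, its dictionaries standing in for the database and the photo-processing queue, and the 201/202 status codes are not part of the original design.

```python
import itertools
import uuid
from typing import Optional


class TweetService:
    """In-memory stand-in for the Tweet service (hypothetical)."""

    def __init__(self) -> None:
        self.db = {}           # tweet_id -> tweet record
        self.processing = {}   # processing_id -> processing status
        self._ids = itertools.count(1)

    def post(self, user_id: int, text: str, photo: Optional[bytes] = None) -> dict:
        if photo is None:
            # Text-only tweet: store it right away and return its metadata.
            tweet_id = next(self._ids)
            self.db[tweet_id] = {"user_id": user_id, "text": text}
            return {"status": 201, "tweet_id": tweet_id}
        # Photo tweet: hand off for processing, return an id to poll with.
        processing_id = str(uuid.uuid4())
        self.processing[processing_id] = "pending"
        return {"status": 202, "processing_id": processing_id}

    def processing_status(self, processing_id: str) -> str:
        return self.processing.get(processing_id, "unknown")


service = TweetService()
text_result = service.post(user_id=1, text="hello")
photo_result = service.post(user_id=1, text="look", photo=b"\x89PNG")
```

In this sketch the photo branch only records a pending status; in a real deployment a worker would presumably pick the job up, store the processed file, and then create the tweet record.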




Detailed component design







Trade offs/Tech choices

  • Pull/Push model for feeds - we'll have to decide whether to use a push model or a pull model for building followers' feeds.
    • in the push model we can have a separate DB for storing pre-built feeds and update them whenever a followed user posts a new tweet. This can require a large amount of processing for users with a huge follower base.
    • in the pull model we construct the feed at the moment the user requests it. This puts more load on the DB for every feed request, but we can mitigate it with caching and read-only replicas of the DB, given that we're okay with eventual consistency here.
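
The difference between the two models can be sketched with plain in-memory dictionaries. Everything below is hypothetical (the function names, the `(created_at, tweet_id)` tuples, and the dictionaries standing in for the Tweets DB and the pre-built feeds DB); it only shows where the work happens in each model.

```python
import heapq
from collections import defaultdict

follows = defaultdict(set)          # follower_id -> set of followee_ids
followers = defaultdict(set)        # followee_id -> set of follower_ids
tweets_by_user = defaultdict(list)  # user_id -> [(created_at, tweet_id)]
prebuilt_feeds = defaultdict(list)  # follower_id -> [(created_at, tweet_id)]


def follow(follower_id, followee_id):
    follows[follower_id].add(followee_id)
    followers[followee_id].add(follower_id)


def post_tweet(user_id, tweet_id, created_at):
    tweets_by_user[user_id].append((created_at, tweet_id))
    # Push model: fan out on write, one append per follower.
    # This loop is the expensive part for users with many followers.
    # Note: tweets posted before a follow are not backfilled here.
    for follower_id in followers[user_id]:
        prebuilt_feeds[follower_id].append((created_at, tweet_id))


def feed_pull(user_id, k=10):
    # Pull model: gather the followees' tweets at read time, newest first.
    candidates = (t for followee in follows[user_id] for t in tweets_by_user[followee])
    return [tweet_id for _, tweet_id in heapq.nlargest(k, candidates)]


def feed_push(user_id, k=10):
    # Push model: the feed is already materialised, just take the newest k.
    return [tweet_id for _, tweet_id in heapq.nlargest(k, prebuilt_feeds[user_id])]


follow(1, 2)
follow(1, 3)
post_tweet(2, "a", created_at=1)
post_tweet(3, "b", created_at=2)
post_tweet(2, "c", created_at=3)
```

Here both `feed_pull(1)` and `feed_push(1)` return the same newest-first feed; the push model pays on every write, while the pull model pays on every read, which is where caching and read replicas come in.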


Failure scenarios/bottlenecks





Future improvements

We can later add more capabilities to our system:

  • introduce instant messaging for users
  • add premium features and subscriptions
  • followers scopes