System requirements
Functional:
We need to support the following functionality:
- user creates and posts a tweet
- user starts/stops following another user
- home feed with tweets from users followed by the user
Registration, authorisation and notifications are also very important, but we'll leave them out of scope for now.
Non-Functional:
We want the system to
- have high availability - we're going to achieve this by replicating services and sharding the DB
- be scalable and handle peak loads - here stateless services will help, since they can be added and removed according to the current load
- have low latency - low latency for write requests will be achieved by functional partitioning of the DB, and for read requests by caching and leader-follower replication
Capacity estimation
- 100k DAU
  - peak values can reach x10+
- 5 tweets per user per day on average
  - may scale along with the DAU (so x10+ also)
Let's assume a tweet is ~70 characters long, so it takes about 140B of storage (2 bytes per character), and every now and then (let's say 1/5 of all tweets) users post a photo (~5MB). Then we will need:
10 ^ 5 * 5 * 140B = 7 * 10 ^ 7 B = 70MB of storage per day for text content right now, and up to x20 later on when the DAU base and its activity have grown.
10 ^ 5 * 5 * 10 ^ 6 B = 5 * 10 ^ 11 B = 500GB of storage per day for photos (10 ^ 5 photos per day at ~5MB each), which we can reduce by preprocessing and optimising the original files.
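The estimate can be re-run for other load levels with a small script. This is only a sketch of the arithmetic: the constants are the assumptions stated in this section (100k DAU, 5 tweets per user per day, 140B per tweet, one photo of ~5MB per 5 tweets), and the function name is made up for illustration.

```python
# Back-of-the-envelope storage estimate; all inputs are assumptions from the notes.
DAU = 100_000                 # daily active users
TWEETS_PER_USER_PER_DAY = 5   # average
BYTES_PER_TWEET = 140         # ~70 characters at 2 bytes each
TWEETS_PER_PHOTO = 5          # 1 in 5 tweets carries a photo
BYTES_PER_PHOTO = 5 * 10**6   # ~5MB per original photo


def daily_storage_bytes(dau: int = DAU) -> tuple[int, int]:
    """Return (text_bytes, photo_bytes) stored per day for the given DAU."""
    tweets = dau * TWEETS_PER_USER_PER_DAY
    text_bytes = tweets * BYTES_PER_TWEET
    photo_bytes = (tweets // TWEETS_PER_PHOTO) * BYTES_PER_PHOTO
    return text_bytes, photo_bytes


text_bytes, photo_bytes = daily_storage_bytes()
print(f"text:   {text_bytes / 10**6:.0f} MB/day")    # text:   70 MB/day
print(f"photos: {photo_bytes / 10**9:.0f} GB/day")   # photos: 500 GB/day
```

Calling daily_storage_bytes(10 * DAU) gives the figures for a x10 peak in DAU.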
API design
- POST /api/v1/tweets/new - returns a status code with some metadata about the new tweet, or an identifier that can be used to request the processing status
- PUT /api/v1/tweets/{tweet_id}/like - returns a status code
- POST /api/v1/users/{user_id}/follow - returns a status code along with some metadata about the user followed
- GET /api/v1/tweets/user_feed/{user_id} - returns a paginated collection of tweets for a specific user, selected according to business logic (e.g. top K popular/newest)
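To illustrate what "paginated" could mean for the feed endpoint, here is a minimal sketch using an offset-style cursor. The notes do not specify a pagination scheme, so FeedPage, paginate_feed, the cursor semantics and the default page size are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeedPage:
    tweet_ids: list[int]
    next_cursor: Optional[int]  # offset of the next page, None on the last page


def paginate_feed(tweet_ids: list[int], cursor: int = 0, page_size: int = 20) -> FeedPage:
    """Return one page of an already ordered list of tweet ids."""
    page = tweet_ids[cursor:cursor + page_size]
    next_offset = cursor + len(page)
    next_cursor = next_offset if next_offset < len(tweet_ids) else None
    return FeedPage(page, next_cursor)
```

The client passes next_cursor back on the following request and stops when it is None. An offset cursor is the simplest option; a cursor based on the last seen tweet would behave better when new tweets arrive between requests.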
Database design
We'll need the following entities:
- Users
  - id - unique identifier
  - user_info - some set of data about the user: name, address, etc.
  - register_date - date the account was created
- Tweets
  - id - unique identifier of a tweet
  - user_id - identifier of the user who created the tweet
  - tweet_metadata - some metadata, like location
  - contents - we can store the content itself, or a reference to another storage (like an S3 bucket for photos)
  - created_at - creation timestamp
- Follows
  - follower_id
  - followee_id
  - created_at
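The entities above could be modelled roughly as follows. This is a sketch, not a schema: the field names come from the list, while the Python types and the default values are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now() -> datetime:
    """Current time in UTC, used as the default timestamp."""
    return datetime.now(timezone.utc)


@dataclass
class User:
    id: int
    user_info: dict  # name, address, etc.
    register_date: datetime = field(default_factory=_now)


@dataclass
class Tweet:
    id: int
    user_id: int   # author of the tweet
    contents: str  # tweet text, or a reference to external storage
    tweet_metadata: dict = field(default_factory=dict)  # e.g. location
    created_at: datetime = field(default_factory=_now)


@dataclass(frozen=True)
class Follow:
    follower_id: int  # the user who follows
    followee_id: int  # the user being followed
    created_at: datetime = field(default_factory=_now)
```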
The heaviest load is expected on the Tweets DB, so we'll partition it by user_id and put leader-follower replication in place.
We'll need some indexes as well:
- for tweets based on a publication date
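Partitioning the Tweets DB by user_id needs a stable mapping from a user to a partition. A minimal sketch, assuming simple hash-modulo placement (the notes do not fix a scheme, and shard_for_user and the shard count are illustrative):

```python
import hashlib


def shard_for_user(user_id: int, num_shards: int = 8) -> int:
    """Map a user_id to a shard index in [0, num_shards)."""
    # A cryptographic hash is stable across processes, unlike Python's
    # built-in hash(), which can vary between runs for some types.
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

All tweets of one user land on the same shard, which keeps per-user reads local. The downside of plain modulo placement is that changing num_shards remaps most users; consistent hashing is a common way to reduce that.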
High-level design
Request flows
- User posts a tweet
  - a request with the tweet text is sent to the Tweet service
  - it either adds the tweet to the database, or sends the photo content for processing and returns an identifier that the user can use to check the processing status
  - [Optional] a notification about the new tweet is sent via the Notification service
- User follows another user
  - a follow request is sent to the Follow service
  - the relation between the users is updated in the DB
  - [Optional] a notification about the new follower is sent via the Notification service
- User likes a tweet
  - a request is sent to the Likes service
  - the likes state is updated in the DB
  - [Optional] a notification about the new like is sent
- User requests their feed
  - a request is sent to the Feed service
  - the Feed service constructs a feed and returns it to the user
Detailed component design
- Tweet Service will be relatively simple: it only stores new tweets and sends a notification about each one to the Notification Service
- Follow Service will do the same, but with follow objects
- Like Service will just update the likes counter for a tweet in the likes storage
- Feed Service will be the most complex:
  - in the pull model it will go through all the users that the requester follows and collect the N newest tweets to show
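A minimal sketch of the pull-model feed construction, assuming each user's tweets are already available as a list of (created_at, tweet_id) pairs sorted newest first. The function name and the in-memory dictionaries are illustrative stand-ins for the Follows and Tweets storage.

```python
import heapq
from itertools import islice


def build_feed_pull(user_id, follows, tweets_by_user, n=10):
    """Return the ids of the n newest tweets from the users that user_id follows.

    follows:        user_id -> iterable of followee ids
    tweets_by_user: user_id -> list of (created_at, tweet_id), newest first
    """
    timelines = (tweets_by_user.get(f, []) for f in follows.get(user_id, ()))
    # heapq.merge lazily merges the already sorted timelines, so only
    # about n items are consumed in total.
    merged = heapq.merge(*timelines, key=lambda t: t[0], reverse=True)
    return [tweet_id for _, tweet_id in islice(merged, n)]


follows = {1: [2, 3]}
tweets_by_user = {2: [(30, "c"), (10, "a")], 3: [(20, "b")]}
print(build_feed_pull(1, follows, tweets_by_user, n=2))  # ['c', 'b']
```

The cost of one request grows with the number of followees, which is why caching and read replicas matter for this model.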
Trade offs/Tech choices
- Pull/Push model for feeds - we'll have to decide whether to use a push model or a pull model for building users' feeds.
  - in the push model we can have a separate DB for storing pre-built feeds and have them updated whenever a followed user posts a new tweet. It can require a large amount of processing for users with a huge follower base.
  - in the pull model we'll have to construct a feed at the moment a user requests it. This puts more load on the DB for every feed request, but we can mitigate that with caching and read-only replicas of the DB, given that we're okay with eventual consistency here.
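For contrast with building the feed at request time, a push model (fan-out on write) could look roughly like this. The in-memory dictionaries stand in for the pre-built feeds DB and the Follows table, and FEED_SIZE and the function names are assumptions.

```python
from collections import defaultdict, deque
from itertools import islice

FEED_SIZE = 100  # assumed cap on the length of a pre-built feed

followers = defaultdict(set)  # followee_id -> set of follower ids
feeds = {}                    # user_id -> deque of tweet ids, newest first


def follow(follower_id, followee_id):
    followers[followee_id].add(follower_id)


def publish(author_id, tweet_id):
    """Push a new tweet into every follower's feed; return the number of writes."""
    targets = followers.get(author_id, ())
    for follower_id in targets:
        feed = feeds.setdefault(follower_id, deque(maxlen=FEED_SIZE))
        feed.appendleft(tweet_id)  # the oldest entry falls off when full
    return len(targets)


def read_feed(user_id, n=10):
    """Reading is a cheap lookup because the feed is already built."""
    return list(islice(feeds.get(user_id, ()), n))
```

The return value of publish shows where the cost goes: one post by an author with a million followers means a million feed writes.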
Failure scenarios/bottlenecks
- the main bottleneck in this design would be feed construction, specifically for users with a large follower base
  - we can either optimise for latency by prebuilding feeds with the push model (updating them whenever a followed user posts a new tweet), or optimise for storage by constructing feeds on the fly when a user requests their feed
  - another option would be a hybrid approach that uses both push and pull models, chosen by follower count
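One way the hybrid approach could work: authors below a follower threshold are pushed into pre-built feeds, while tweets from high-follower accounts are pulled and merged in at read time. The threshold value and all names below are assumptions for illustration.

```python
import heapq
from itertools import islice

PUSH_THRESHOLD = 10_000  # assumed cut-off; would be tuned in practice


def uses_push(follower_count, threshold=PUSH_THRESHOLD):
    """Authors with fewer followers than the threshold fan out on write."""
    return follower_count < threshold


def read_hybrid_feed(prebuilt, large_account_timelines, n=10):
    """Merge a pre-built feed with timelines pulled at read time.

    prebuilt:                list of (created_at, tweet_id), newest first
    large_account_timelines: list of such lists, one per followed large account
    """
    merged = heapq.merge(
        prebuilt, *large_account_timelines, key=lambda t: t[0], reverse=True
    )
    return [tweet_id for _, tweet_id in islice(merged, n)]


print(read_hybrid_feed([(5, "a"), (1, "b")], [[(4, "c")], [(9, "d")]], n=3))
# ['d', 'a', 'c']
```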
Future improvements
We can later add more capabilities to our system:
- introduce instant messaging for users
- add premium features and subscriptions
- followers scopes