Codemia | Master System Design Interviews Through Active Practice

My Solution for Design Facebook’s Newsfeed

by celestial_orbit290

System requirements

Functional:

Whenever the user logs into the app, the user's newsfeed will be generated based on the posts from the people, group, and pages that a user follows
A user may have multiple friends, follows multiple pages, groups and users
The user gets new posts on the go, if he is active on the app.
User's news feed may contain images, text or videos.

Non-Functional:

System must generate any user's newsfeed in real-time. Latency should not be more than 2s.
Any post must not take more than 5s to reflect on a user's newsfeed, assuming the user has requested it.

Capacity estimation

Assuming a user has 100 friends and follows 100 pages.

Traffic estimates: Assuming 300M DAU, each user fetching newsfeed 5 times a day.

=> 30,00,00,000x5 = 1,50,00,00,000 = 1.5B requests per day or in 24 hours

=> 1.5B/(60x60x24) = 17500 requests per second

Storage estimates:

Considering 500 posts to be added to each news feed, and each post is 1kb in nature.

=> per user we need 500kb data storage

So for DAU, we need

500 x 300M x 1kb = 150000000000 x 1000

= 150TB

Assuming 1 server can keep 100GB, then we will be needing 1500 machines to keep the top 500 posts in memory for all active users.

API design

getNewsFeed (userId, since_id, max_id, count)

user_id (number): The ID of the user for whom the system will generate the newsfeed.

since_id (number): returns from where the newsfeed must be returned, the last catch point

count (number): Optional; specifies the number of feed items to try and retrieve, for e.g. 200 distinct posts

max_id (number): Optional; returns results up to this id.

Returns: (JSON) Returns a JSON object containing a list of feed items.

Database design

For a newsfeed, we have 3 primary entities

user --> primary key as user_id (number)
entity (user, page or group) --> primary key as entity_id (number)
feed_item (post, image, or video) --> primary key as feedItem_id

user: user_id, user_name, dob, lastLoginTime

entity: entity_id, entity_name, category, creation, description

feed_item: feed_item_id, feed_item_metadata, entity_id, numOfLikes, creationDate

user_feed_item_relation_table

user_entity_relation_table

Since user can have lot of friends and can also belong or follow a group or page, we will have the nXn relations described in a separate table, and keep the relational database.

High-level design

This problem can be divided into 2 parts:

Feed generation: For every new user that logs into the app, we need to generate the newsfeed from the posts of users and entities that a user follows. We can do this in 2 ways again
Prepare the feed for the user asynchronously and maintain it in a user_news_feed table. Whenever user logs in, or opens up the app, we look into the table, and fetch the news feed already calculated for the user.
For the DAU/ frequent users, we can keep this generated feed on cache as well and also generate feed a bit more regularly than others. We can have a ranking system for this.

=> To generate the feed for a user, following steps can be performed

Retrieve IDs of all users and entities that a user follows.
Retrieve latest, most popular and relevant posts for those IDs. These are the potential posts that we can show in that user's newsfeed.
Rank these posts based on the relevance to the user. This represents user's current feed.
Store this feed in the cache and return top posts (say 20) to be rendered on user's feed.
On the front-end, when user reaches the end of her current feed, she can fetch the next 20 posts from the server and so on.

Feed publishing: Whenever user loads her newsfeed page, he has to request and pull feed items from the server. When he reaches the end of his current feed, he can pull more data from the server. For newer items either the server can notify Jane and then she can pull, or the server can push, these new posts.

We will discuss these options in detail later.

At a high level, we will need following components in our Newsfeed service:

1. Web servers: To maintain a connection with the user. This connection will be used to transfer

data between the user and the server.

2. Application server: To execute the workflows of storing new posts in the database servers. We

will also need some application servers to retrieve and to push the newsfeed to the end user.

3. Metadata database and cache: To store the metadata about Users, Pages, and Groups.

4. Posts database and cache: To store metadata about posts and their contents.

5. Video and photo storage, and cache: Blob storage, to store all the media included in the posts.

6. Newsfeed generation service: To gather and rank all the relevant posts for a user to generate

newsfeed and store in the cache. This service will also receive live updates and will add these

newer feed items to any user’s timeline.

7. Feed notification service: To notify the user that there are newer items available for their

newsfeed.

Following is the high-level architecture diagram of our system. User B and C are following User A.

Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...

Detailed component design

Sharding posts and metadata - Since we have a huge number of new posts every day and our read load is extremely high too, we need to distribute our data onto multiple machines such that we can read/write it efficiently.
Sharding feed data - For feed data, which is being stored in memory, we can partition it based on UserID. We can try storing all the data of a user on one server. When storing, we can pass the UserID to our hash function that will map the user to a cache server where we will store the user’s feed objects. Also, for any given user, since we don’t expect to store more than 500 FeedItmeIDs, we will not run into a scenario where feed data for a user doesn’t fit on a single server. To get the feed of a user, we would always have to query only one server.

Trade offs/Tech choices

Initially, we can decide to store 500 feed items per user, but this number can be adjusted later based on the usage pattern. For example, if we assume that one page of a user’s feed has 20 posts and most of the users never browse more than ten pages of their feed, we can decide to store only 200 posts per user. For any user who wants to see more posts (more than what is stored in memory), we can always query backend servers.
There will be a lot of users that don’t login frequently. Here are a few things we can do to handle this;
a more straightforward approach could be, to use a LRU based cache that can remove users from memory that haven’t accessed their newsfeed for a long time
a smarter solution can figure out the login pattern of users to pre-generate their newsfeed, e.g., at what time of the day a user is active and which days of the week does a user access their newsfeed? etc.

Failure scenarios/bottlenecks

We generated the feed once and cached it. However, what about new posts coming from individuals whom the user follows? In the event the user is online, we need a mechanism to evaluate and incorporate these new posts into his feed. To accomplish this, we can implement a process where we periodically (for example, every five minutes) rank and integrate the latest posts into his feed. Subsequently, the user can be alerted to the presence of new items in his feed, which he can then retrieve

Future improvements

For future growth and replication, we can use Consistent Hashing for sharding data.