My Solution for Design Facebook’s Newsfeed with Score: 9/10

by alchemy1135

System requirements


Functional:

  1. User Authentication: Users should be able to log in to their accounts securely to access the Newsfeed.
  2. Timeline Generation: The Newsfeed should display a chronological list of posts, images, videos, and status updates from profiles and pages that the user follows.
  3. Post Interactions: Users should be able to like, comment, and share posts within the Newsfeed.
  4. Content Filtering: Users should have the ability to filter the content they see based on preferences, such as friends, pages, or interests.
  5. Real-Time Updates: The Newsfeed should provide real-time updates without the need for manual refresh.
  6. Personalization: The Newsfeed should be personalized based on user interactions and interests.
  7. Advertisement Integration: The platform should integrate advertisements within the Newsfeed based on user preferences and behaviors.
  8. Privacy Settings: Users should be able to control the visibility of their posts in the Newsfeed.
  9. Multi-Media Support: The Newsfeed should support various types of media content, including images, videos, and GIFs.


Non-Functional:

  1. Security: The system should ensure the confidentiality, integrity, and availability of user data. It should implement robust security measures to protect against unauthorized access, data breaches, and cyber-attacks.
  2. Performance: The Newsfeed should be highly responsive and scalable to accommodate a large number of users and content. It should minimize latency and downtime, even during peak usage periods.
  3. Scalability: The system should be designed to scale horizontally and vertically to handle increasing user traffic and data volume over time. It should support efficient resource allocation and load balancing.
  4. Reliability: The system should be highly reliable, with minimal downtime and data loss. It should implement backup and disaster recovery mechanisms to ensure data integrity and availability.
  5. Compliance: The platform should comply with relevant data protection regulations, such as GDPR (General Data Protection Regulation) and COPPA (Children's Online Privacy Protection Act). It should also adhere to industry standards and best practices for privacy and security.
  6. Performance Monitoring: The system should include monitoring tools to track performance metrics such as response time, throughput, and error rates. It should generate alerts for abnormal behavior and allow for proactive maintenance and optimization.


Capacity estimation

Total get_timeline() requests:

Total get_timeline() requests per day = Daily active users * Requests per user per day

Total get_timeline() requests per day = 500 million * 10 = 5 billion requests per day


Storage required for posts in memory:

Assuming each user's news feed contains about 500 posts, we keep the 100 most recent posts per user's feed in memory on average, and it takes roughly 10KB to store each post's metadata:

Total storage required = Total users * Cached posts per user * Storage per post

Total storage required = 1.5 billion * 100 * 10KB

Total storage required = 1,500 TB ≈ 1.5 PB


Requests per second accessing the Newsfeed:

Requests per second = Total get_timeline() requests per day / (24 hours * 3600 seconds)

Requests per second = 5 billion / (24 * 3600) ≈ 57,870 requests per second


Peak traffic load:

Peak traffic load would occur during peak usage hours, typically during the day. Assuming peak traffic is double the average traffic:

Peak traffic load = Average requests per second * 2

Peak traffic load ≈ 57,870 * 2 ≈ 115,740 requests per second
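
The request-rate arithmetic above can be checked with a few lines (the inputs are the assumed estimates, and doubling for peak load is an assumption):

```python
# Assumed inputs from the estimates above.
DAILY_ACTIVE_USERS = 500_000_000
REQUESTS_PER_USER_PER_DAY = 10
PEAK_MULTIPLIER = 2                                     # assume peak is double the average

requests_per_day = DAILY_ACTIVE_USERS * REQUESTS_PER_USER_PER_DAY
requests_per_second = requests_per_day // (24 * 3600)   # ~57,870
peak_rps = requests_per_second * PEAK_MULTIPLIER        # ~115,740
```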


Media Storage

Let us assume:

  • 1 billion total users, with 500 million daily active users.
  • 60 million photos and 35 million videos are shared on the platform per day.
  • 3 MB as the maximum size of each photo and 150 MB as the maximum size of each video uploaded.
  • On average, each user sends 20 requests (of any type) per day to our service.


Total Storage Required Per Day

Photos: 60 million photos/day * 3 MB = 180 TB/day

Videos: 35 million videos/day * 150 MB = 5,250 TB/day

Total content size = 180 + 5,250 = 5,430 TB/day


The Total Space Required for a Year:

5,430 TB/day * 365 days/year = 1,981,950 TB ≈ 1,982 PB
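
The media-storage arithmetic can likewise be verified in a few lines (inputs are the assumed volumes above):

```python
# Assumed daily media volumes from the estimates above.
PHOTOS_PER_DAY = 60_000_000
VIDEOS_PER_DAY = 35_000_000
PHOTO_SIZE_MB = 3            # assumed max photo size
VIDEO_SIZE_MB = 150          # assumed max video size

photo_tb_per_day = PHOTOS_PER_DAY * PHOTO_SIZE_MB / 1_000_000   # MB -> TB
video_tb_per_day = VIDEOS_PER_DAY * VIDEO_SIZE_MB / 1_000_000
total_tb_per_day = photo_tb_per_day + video_tb_per_day          # 5,430 TB/day
total_pb_per_year = total_tb_per_day * 365 / 1_000              # TB -> PB, ~1,982 PB
```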


Acceptable response time for loading the Newsfeed: users should see their refreshed feed within 5 seconds; this includes fetching the metadata for 100 posts and downloading media for the top 20 posts.



API design

  1. get_news_feed_api
     • Description: Retrieves the personalized Newsfeed content for the logged-in user.
     • Input: Authentication token for user identification, plus optional parameters for filtering or pagination.
     • Output: A list of posts, images, videos, and status updates tailored to the user's preferences and social network connections.
  2. get_explore_page_feed
     • Description: Fetches trending or popular content from across the platform for exploration.
     • Input: Optional parameters such as geographical location, category interests, or trending topics.
     • Output: A curated selection of posts, images, videos, and status updates that are currently trending or popular among users, without personalized filtering.
  3. get_reels_feed
     • Description: Retrieves a feed of short video clips, known as "Reels," for entertainment and discovery.
     • Input: Optional parameters such as geographical location, content categories, or user preferences.
     • Output: A collection of short video clips, typically 15 to 60 seconds long, from various users and creators, presented in a vertically scrollable format for easy viewing and interaction.
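
As a rough illustration, the request/response shape of get_news_feed_api might look like the sketch below. All names, the cursor encoding, and the stubbed dependencies are assumptions for illustration, not the real API:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FeedRequest:
    auth_token: str
    cursor: Optional[str] = None                  # opaque pagination cursor
    limit: int = 20
    filters: dict = field(default_factory=dict)   # e.g. {"source": "friends"}

# Stub dependencies so the sketch is self-contained (hypothetical).
def authenticate(token: str) -> int:
    if not token:
        raise PermissionError("invalid token")
    return hash(token) % 1000                     # fake user_id

FAKE_FEED = {uid: [f"post-{uid}-{n}" for n in range(50)] for uid in range(1000)}

def get_news_feed_api(req: FeedRequest) -> dict:
    user_id = authenticate(req.auth_token)
    start = int(req.cursor) if req.cursor else 0
    posts = FAKE_FEED[user_id][start:start + req.limit]
    next_cursor = str(start + len(posts)) if posts else None
    return {"posts": posts, "next_cursor": next_cursor}
```

The client passes the returned next_cursor back on the following call to page through the feed.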


Database design





Database Choices

  1. Feed Data (Newsfeed, Reels, Explore)
     • Database: NoSQL (document-oriented database such as MongoDB)
     • Reasoning: NoSQL databases offer flexible schemas and horizontal scalability, making them well-suited for storing feed data with varying content types and user preferences.
     • CAP Theorem Focus: Balanced (Availability and Partition Tolerance)
  2. User Data
     • Database: SQL (relational database)
     • Reasoning: SQL databases are suitable for structured data like user profiles, which require ACID (Atomicity, Consistency, Isolation, Durability) properties for data integrity and consistency.
     • CAP Theorem Focus: Consistency Focused
  3. Posts, Likes, Comments, Notifications
     • Database: SQL (relational database)
     • Reasoning: SQL databases are ideal for relational data with complex queries and transactions, making them suitable for storing posts, likes, comments, and notifications associated with users and posts.
     • CAP Theorem Focus: Consistency Focused
  4. Caching (e.g., User Profiles, Feed Data)
     • Database: Redis (in-memory data store)
     • Reasoning: Redis provides high-speed data access and caching capabilities, making it suitable for caching frequently accessed user profiles and feed data to improve performance.
     • CAP Theorem Focus: Availability Focused
  5. Analytics Data (e.g., User Interactions, Engagement Metrics)
     • Database: Hadoop (distributed file system and processing)
     • Reasoning: Hadoop is designed for storing and processing large volumes of unstructured data, making it suitable for analytics tasks such as tracking user interactions and deriving engagement metrics.
     • CAP Theorem Focus: Balanced (Availability and Partition Tolerance)


Data Partitioning

The best partitioning strategy for this problem is Hash Partitioning. Hash Partitioning evenly distributes data across multiple partitions based on a hash function applied to a partitioning key, ensuring uniform data distribution and efficient load balancing, which is crucial for handling large volumes of user-generated content in a social media platform like Facebook. The consistent hashing algorithm can be used for Hash Partitioning, providing flexibility and scalability.

We will use the following columns as partitioning keys:

  • Users Table, Feed Tables: user_id
  • Friends Table: friendship_id (derived from the two user ids)
  • Likes Table: like_id, user_id, post_id
  • Comments Table: comment_id, user_id, post_id
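
The consistent-hashing approach mentioned above can be sketched with a minimal hash ring; the shard names and the virtual-node count are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring for routing partition keys (e.g. user_id)
    to shards. 'replicas' are virtual nodes that smooth out the distribution."""

    def __init__(self, shards, replicas=100):
        self.ring = []                       # sorted list of (hash, shard)
        for shard in shards:
            for r in range(replicas):
                self.ring.append((self._hash(f"{shard}:{r}"), shard))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

    def get_shard(self, key):
        # First virtual node clockwise from the key's hash; wrap around at the end.
        idx = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]
```

Adding or removing a shard only remaps the keys adjacent to its virtual nodes, which is what makes this attractive for rebalancing.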


Sharding 

A horizontal partitioning (sharding) strategy can be applied to tables like Users, Posts, and Friends; distributing data by geographical location can optimize data access for users in different regions.


Read/Write Separation

Read/Write Separation can be beneficial for a platform like Facebook to improve scalability, performance, and reliability. By separating read and write operations, the system can optimize resource allocation, handle heavy read traffic efficiently, and ensure high availability for both read and write operations. This approach also allows for scaling read and write components independently, enabling better utilization of resources and minimizing the impact of write-heavy operations on read performance.


High-level design

Below are the components needed to solve the problem end-to-end for a social media platform like Facebook:

  • User Service: manages user authentication, registration, and profile management. Components: User Database, User Cache, Rate Limiters.
  • Post Service: handles post creation, retrieval, likes, comments, and privacy settings. Components: Post Database, Posts Cache.
  • Feed Service: generates personalized Newsfeed content for each user based on their preferences and social connections. Components: Feed Generation Service, Feed Cache, Newsfeed Ranking Service.
  • Notification Service: manages user notifications for likes, comments, friend requests, etc. Components: Notification Database, Notification Queue, Notification Workers.
  • Friend Service: manages friend connections, friend requests, and friend lists. Components: Friend Database, Friend Cache.
  • Web Servers and Load Balancers: handle incoming HTTP requests, distribute traffic, and ensure high availability.
  • Fanout Service: distributes posts and notifications to the followers/friends of a user. Components: Fanout Workers, Message Queue.
  • Feed Ranking Service: ranks posts in the Newsfeed based on relevance, engagement, and user preferences. Components: Newsfeed Ranking Algorithm, Ranking Service.
  • Analytics Service: tracks user interactions, engagement metrics, and system performance. Components: Analytics Database, Analytics Dashboard.
  • Monitoring and Logging: monitors system health and performance metrics, and logs events for troubleshooting and analysis. Components: Monitoring Tools, Logging Infrastructure.
  • Caching Layer: improves performance by caching frequently accessed data such as user profiles, posts, and feed content. Components: User Cache, Posts Cache, Feed Cache.
  • Database Management System (DBMS): stores and manages structured data such as user profiles, posts, likes, comments, and notifications. Components: User Database, Post Database, Notification Database.




Request flows

Below is a sequence diagram showing what happens when a user requests a refresh of their news feed.




Detailed component design


Feed Generation

Generating the feed on demand when the user requests it might not be feasible, since there would be millions of active users at any hour and the latency would be very high. Instead, we pre-compute the feed for each user and store it in our Newsfeed cache.

We can have dedicated servers continuously generating users' newsfeeds even while those users are offline. So even if a user has not opened the Facebook application, their newsfeed has already been precomputed. If a friend uploads a post, the newsfeed is updated to include it.

We will call this service the Newsfeed service. The Newsfeed service keeps the newsfeed up to date regardless of whether the user is online or offline. However, for inactive users, this service stops generating the latest feed to save resources.

Whenever there is a new post, the following steps are followed:


  1. The Posts service updates its database with the post contents. i.e. the posts metadata is stored in the posts-metadata database and the media contents are stored in the object storage.
  2. Posts service notifies about the new post to Newsfeed service .
  3. Newsfeed service sends a request to the Connections service to obtain followers of the user who made the post.
  4. Connections service fetches all followers of the user_id and serves the result to the Newsfeed service.
  5. The Newsfeed service updates the newsfeed of the followers of the user who had uploaded the post.
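
The five steps above can be sketched with plain in-memory stand-ins for the services (all class and method names are illustrative):

```python
class ConnectionsService:
    def __init__(self):
        self.followers = {}                    # user_id -> set of follower ids

    def get_followers(self, user_id):
        return self.followers.get(user_id, set())

class NewsfeedService:
    def __init__(self, connections):
        self.connections = connections
        self.feed_cache = {}                   # follower_id -> list of post ids

    def on_new_post(self, author_id, post_id):
        # Steps 3-5: fetch the author's followers and prepend the post
        # to each follower's precomputed feed.
        for follower in self.connections.get_followers(author_id):
            self.feed_cache.setdefault(follower, []).insert(0, post_id)

class PostsService:
    def __init__(self, newsfeed):
        self.posts = {}                        # step 1: post metadata store
        self.newsfeed = newsfeed

    def create_post(self, author_id, post_id, content):
        self.posts[post_id] = {"author": author_id, "content": content}
        self.newsfeed.on_new_post(author_id, post_id)   # step 2: notify
```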


We will use a cache for in-memory storage of newsfeeds. Storing the precomputed newsfeed in memory optimizes the time required to serve a get_timeline() request. Thus, we introduce an in-memory Newsfeed cache, updated by the Newsfeed service whenever new posts are added. An LRU eviction policy can be used to invalidate entries: the newsfeeds of users who don't access their feed frequently are evicted and don't occupy space unnecessarily. Redis can be used here as the in-memory cache.
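
The LRU behavior described above can be sketched with a small in-memory cache; in production Redis would play this role, and the names here are illustrative:

```python
from collections import OrderedDict

class NewsfeedCache:
    """In-memory newsfeed cache with LRU eviction: feeds of users who
    rarely request their timeline fall out of the cache first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.feeds = OrderedDict()             # user_id -> list of post ids

    def get_timeline(self, user_id):
        if user_id not in self.feeds:
            return None                        # cache miss -> regenerate feed
        self.feeds.move_to_end(user_id)        # mark as recently used
        return self.feeds[user_id]

    def put(self, user_id, posts):
        self.feeds[user_id] = posts
        self.feeds.move_to_end(user_id)
        if len(self.feeds) > self.capacity:
            self.feeds.popitem(last=False)     # evict least recently used
```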


Keeping different types of caches

  • Short-term cache: a timeline of recent activity is invalidated frequently, because it changes constantly as the user performs actions.
  • Long-term cache: a query cache kept in Memcached. The results of large queries, such as the ranking of all your activities in 2010, can be cached efficiently since they are rarely invalidated.


Publishing Newsfeed:

Let's add another service, the Feedpublisher service, to serve the precomputed newsfeed when a user requests it. The Feedpublisher service fetches the precomputed newsfeed from the Newsfeed cache and returns it to the client.

Whenever a user uploads a new post, we need to update the newsfeed of every connection that user has. As described, the Newsfeed service updates the newsfeed of every connection of the posting user and stores the result in the Newsfeed cache. Let's see how this service works internally.


Let's say A uploads a new post and B is A's friend, i.e. a connection. The Connections service has this connection stored in its database.


Option 1: Pull model / Fan out on load

New posts are added to a connection's newsfeed only when that connection requests the newsfeed.

Here, when A uploads a new post, the newsfeed of A's connections is NOT updated with this new post until they request it. Let's say B is a connection of A.

A's new post is added to B's timeline when B requests their newsfeed. Until then, a list of all posts that need to be merged into B's old cached newsfeed is maintained. Let's call this the Newsfeed buffer.


Problem:

  • New data is not visible until the user requests the feed.
  • Often there is no new update in the newsfeed, and our servers spend resources just to serve an empty response.


Option 2: Push model / Fan out on write


Here the server pushes a newly uploaded post to every connection's newsfeed without waiting for the users (connections) to request it. Thus the precomputed newsfeed is always kept up to date: whenever a new post is uploaded, it is incorporated into the newsfeed of every connection. Let's say we use a Fanout service to fan the post out to every connection once it is uploaded.

Here, when A uploads a new post, the Fanout service pushes this post to the newsfeed of every one of A's connections.

Problem: Some users have millions of followers. Pushing each post to millions of followers keeps the server busy to a great extent.


Option 3: Hybrid model

For users with relatively few followers, we use the push model; pushing to a small number of followers is cheap. For users with millions of followers, we use the pull model: their followers receive the new post whenever they request their newsfeed. This way the server is not tied up fanning a single post out to millions of followers.
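
The hybrid decision can be sketched as follows; the follower-count threshold and all names are assumptions for illustration:

```python
CELEBRITY_THRESHOLD = 100_000   # assumed cutoff; tuned empirically in practice

def publish_post(author_id, post_id, follower_count, fanout_queue, celebrity_posts):
    """Hybrid fanout: push to followers' feeds for normal users,
    record for pull-on-read for users with huge followings."""
    if follower_count < CELEBRITY_THRESHOLD:
        fanout_queue.append((author_id, post_id))              # fan out on write
        return "push"
    celebrity_posts.setdefault(author_id, []).append(post_id)  # fan out on load
    return "pull"

def read_timeline(user_id, precomputed_feed, followed_celebrities, celebrity_posts):
    """Merge the precomputed (pushed) feed with celebrity posts pulled at read time."""
    feed = list(precomputed_feed.get(user_id, []))
    for celeb in followed_celebrities.get(user_id, []):
        feed.extend(celebrity_posts.get(celeb, []))
    return feed
```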


How do we provide an infinite scrolling feed of posts?

One of the main surfaces on social networking platforms is the Feed consisting of an infinite scrolling feed of posts. We can implement this by loading an initial batch of posts and then loading additional batches as the user scrolls down the feed. However, we don’t want the user to wait every time they get to the bottom of the feed (while we load a new batch of posts), so it’s very important for the user experience that we load in new batches before the user hits the end of their current feed.
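
A minimal sketch of cursor-based batching with early prefetch; the offset cursor and the prefetch threshold are simplifying assumptions (a real system would use an opaque post-id cursor):

```python
def fetch_page(feed, cursor, page_size=10):
    """Return the next batch of posts plus the cursor for the following batch."""
    page = feed[cursor:cursor + page_size]
    return page, cursor + len(page)

def should_prefetch(visible_index, loaded_count, threshold=5):
    """Start loading the next batch before the user reaches the end
    of the currently loaded posts."""
    return loaded_count - visible_index <= threshold
```

The client calls should_prefetch as the user scrolls, so the next fetch_page completes before the visible posts run out.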


Feed Ranking

To rank posts in a user's feed effectively, various ranking algorithms can be employed, each utilizing different metadata items to prioritize content based on user preferences, engagement metrics, and relevance. 

  1. Relevance-based Ranking: considers factors such as textual similarity between posts and user interests, post popularity, and recency, using metadata items like post content, user interactions (likes, comments), and timestamps to compute a relevance score for each post.
  2. Collaborative Filtering: leverages user interactions and preferences to recommend posts similar to those liked or interacted with by the user's social connections, using metadata items such as user profiles, post interactions, and social graph data.
  3. Popularity-based Ranking: prioritizes posts by popularity metrics such as the number of likes, comments, and shares, relying on engagement metrics and user interactions to rank posts by perceived popularity.

By combining these and other ranking algorithms, the platform can deliver personalized and engaging content to users, enhancing their overall experience.
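
As a toy illustration of relevance-based ranking, the sketch below combines engagement weights with exponential time decay; the weights and the half-life are assumptions, not production-tuned values:

```python
import time

def relevance_score(post, now=None, half_life_hours=6.0):
    """Toy relevance score: engagement-weighted with exponential time decay.
    Weights (1/2/3) and the 6h half-life are illustrative assumptions."""
    now = now or time.time()
    engagement = post["likes"] + 2 * post["comments"] + 3 * post["shares"]
    age_hours = max(0.0, (now - post["created_at"]) / 3600)
    decay = 0.5 ** (age_hours / half_life_hours)   # halves every half_life_hours
    return engagement * decay

def rank_feed(posts, now=None):
    return sorted(posts, key=lambda p: relevance_score(p, now), reverse=True)
```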


Message Queue and Fanout Services/Workers

Utilizing message queues and fanout services plays a pivotal role in building a scalable and asynchronous event-driven architecture for a social media platform. Message queues act as intermediaries, enabling decoupled communication between various components of the system by storing and routing messages asynchronously. This allows for seamless scaling of services and handling of bursts in traffic by decoupling the production and consumption of events. 

Fanout services facilitate efficient broadcasting of events to multiple consumers, ensuring that real-time updates, such as post creations, likes, and comments, are delivered to relevant users without delay. 

By leveraging message queues and fanout services, the system can achieve high availability, fault tolerance, and responsiveness, enabling it to handle diverse workloads and evolving user interactions effectively while maintaining scalability and reliability.


Caches for storing feed information

Distributed caching mechanisms like Redis provide several benefits for a social media platform. Redis serves as an in-memory data store, offering the high-speed data access and caching capabilities crucial for improving system performance and scalability. By storing frequently accessed data, such as user profiles, posts, and feed content, Redis reduces the latency of read operations and alleviates the load on backend databases.

Redis also supports various data structures and features like replication, partitioning, and clustering, enabling horizontal scaling and fault tolerance. Leveraging Redis for caching can significantly enhance user experience by delivering real-time updates, reducing response times, and ensuring consistent performance even during peak usage periods.


Machine Learning

Introducing machine learning models for content recommendations can significantly enhance user experiences by leveraging data-driven insights to personalize content recommendations tailored to individual user preferences and behavior patterns. By analyzing user interactions, content attributes, and contextual information, machine learning algorithms can predict user interests and preferences, leading to more relevant and engaging content recommendations.

Various algorithms can be used for content recommendation analysis, including:

  • Collaborative Filtering: Collaborative filtering techniques, such as user-based or item-based collaborative filtering, analyze user interactions and similarities between users or items to make personalized recommendations. These algorithms recommend items to users based on the preferences and behaviors of similar users or items.
  • Content-Based Filtering: Content-based filtering methods recommend items similar to those that a user has interacted with in the past. These algorithms analyze the attributes of items and users' historical interactions to identify patterns and recommend content that matches users' preferences.
  • Matrix Factorization: Matrix factorization techniques, such as Singular Value Decomposition (SVD) or Alternating Least Squares (ALS), decompose the user-item interaction matrix into lower-dimensional matrices to capture latent factors representing user preferences and item attributes. These algorithms make personalized recommendations by predicting missing entries in the interaction matrix.
  • Deep Learning Models: Deep learning models, such as neural networks and recurrent neural networks (RNNs), can learn complex patterns and representations from large-scale data for content recommendation tasks. These models can incorporate sequential user behavior data, textual content, and multimedia features to make personalized recommendations.
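
As a minimal illustration of the collaborative-filtering idea above, the sketch below scores unseen posts for a user by the cosine similarity of other users' interaction vectors; the data shapes are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse interaction dicts (item -> weight)."""
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(target, interactions, top_k=3):
    """User-based collaborative filtering: score items the target hasn't seen
    by the similarity-weighted interactions of other users."""
    scores = {}
    for other, items in interactions.items():
        if other == target:
            continue
        sim = cosine(interactions[target], items)
        for item, weight in items.items():
            if item not in interactions[target]:
                scores[item] = scores.get(item, 0.0) + sim * weight
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```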


Trade offs/Tech choices

We've opted for a microservices architecture to improve scalability and enable independent service development, but this choice comes with the overhead of managing inter-service communication and deployment complexity.

Additionally, we've chosen to use a combination of SQL and NoSQL databases to meet the diverse data storage and querying needs of the platform, trading off strict consistency for flexible data modeling and horizontal scalability.

Furthermore, we've decided to implement event-driven architecture using message queues and fanout services to enable real-time updates and notifications, sacrificing simplicity for asynchronous and scalable communication between system components.

Overall, these trade-offs and tech choices aim to optimize the platform's performance, reliability, and development agility while mitigating potential drawbacks through careful architectural design and implementation.


Future improvements


  1. Introduce Personalization Engine: Implementing a sophisticated personalization engine powered by machine learning algorithms would enable more granular user segmentation and personalized content recommendations, thereby enhancing user engagement and satisfaction.
  2. Enhance Real-Time Analytics: Incorporating advanced real-time analytics capabilities using technologies like Apache Flink or Apache Kafka Streams would enable the platform to derive actionable insights from streaming data, facilitating more timely decision-making and proactive user engagement strategies.