My Solution for Design Facebook’s Newsfeed with Score: 8/10
by iridescent_luminous693
System requirements
Functional:
Dynamic Newsfeed:
- Display a dynamic feed of posts (text, images, videos, status updates) from profiles/pages a user follows.
- Include posts from friends, pages, and groups the user is part of.
Content Ranking:
- Rank posts based on relevance, user engagement, and recency using algorithms.
- Include factors like likes, comments, shares, and mutual connections.
Personalization:
- Customize the feed for each user based on preferences, interactions, and past behavior.
- Show content based on user interests and geographic location.
Real-Time Updates:
- Push new posts to the user's feed in real time without requiring a page refresh.
Engagement:
- Enable users to like, comment, share, or react to posts directly from the newsfeed.
- Allow saving posts for later.
Multi-Media Support:
- Display posts with various content types: text, images, videos, live streams, and links.
Pagination or Infinite Scrolling:
- Allow users to scroll infinitely or paginate to load more content as needed.
Ad Integration:
- Include sponsored posts or ads, blended seamlessly into the feed.
Filtering:
- Enable users to filter content by categories like friends, groups, pages, or time.
Search:
- Allow searching for specific posts, hashtags, or topics from the newsfeed.
Non-Functional:
Performance:
- Ensure low latency (< 2 seconds) for loading the initial newsfeed.
- Support fast response times for real-time updates.
Scalability:
- Handle millions of concurrent users globally.
- Scale horizontally to manage increasing data volume and traffic.
Availability:
- Ensure high availability (99.99% uptime SLA).
- Implement fallback mechanisms for partial functionality during outages.
Consistency:
- Provide eventual consistency for feed updates (new posts, likes, comments).
- Ensure consistency between actions (e.g., like counts and user engagement metrics).
Security:
- Authenticate user requests to ensure access only to relevant posts.
- Protect sensitive user data and post visibility based on privacy settings.
Maintainability:
- Modular design to accommodate changes like new post types or ranking algorithms.
- Provide clear logging and monitoring for debugging.
Localization:
- Support multiple languages and local content prioritization based on user location.
Reliability:
- Implement retries and backups for critical features (e.g., post submission or feed ranking).
- Use redundant systems to prevent data loss.
Ad Personalization:
- Integrate personalized ads without degrading performance.
- Ensure adherence to user privacy settings for ad targeting.
Content Moderation:
- Include mechanisms to flag and filter inappropriate or harmful content in real-time.
Capacity estimation
Assumptions
- User Base:
- Total users: 2 billion.
- Daily active users (DAU): 50% (1 billion users).
- Peak concurrent users: 10% of DAU (100 million).
- Posts:
- Average posts per user per day: 5.
- Average size of a post: 1 KB (text), 500 KB (image), 5 MB (video).
- Post distribution: 70% text, 20% image, 10% video.
- Feed Size:
- Average posts fetched per feed: 100.
- Average post payload fetched in a feed (text plus media): 1 MB.
- Engagement:
- 50% of users engage with the feed daily.
- Average interactions per user: 10/day.
Capacity Estimation
Data Storage Requirements
- Daily posts:
- 1 billion users/day × 5 posts/user/day = 5 billion posts/day.
- Daily storage:
- Text: 5 billion × 70% × 1 KB = 3.5 TB/day.
- Image: 5 billion × 20% × 500 KB = 500 TB/day.
- Video: 5 billion × 10% × 5 MB = 2.5 PB/day.
- Total: 3.0035 PB/day (~1.1 Exabytes/year).
Data Transfer and Throughput
- Feed data transfer:
- 1 billion users × 100 MB per feed load (100 posts × 1 MB) = 100 PB/day.
- Engagement data transfer:
- 1 billion users × 10 interactions × 1 KB = 10 TB/day.
Request Throughput
- Feed requests:
- 1 billion/day ÷ 86400 seconds = ~11,574 RPS.
- Engagement requests:
- 10 billion/day ÷ 86400 seconds = ~115,740 RPS.
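A quick Python sketch of the arithmetic above (decimal units, simple average-rate RPS with no peak factor), just to make the estimates reproducible:

```python
# Back-of-the-envelope check of the storage and throughput figures above,
# using the stated assumptions (1B DAU, 5 posts/user/day, 70/20/10 content mix).
DAU = 1_000_000_000
posts_per_day = DAU * 5                          # 5 billion posts/day

KB, MB, PB = 1e3, 1e6, 1e15                      # bytes, decimal units
text_bytes  = posts_per_day * 0.70 * 1 * KB      # ~3.5 TB/day
image_bytes = posts_per_day * 0.20 * 500 * KB    # ~500 TB/day
video_bytes = posts_per_day * 0.10 * 5 * MB      # ~2.5 PB/day
total_pb_per_day = (text_bytes + image_bytes + video_bytes) / PB

feed_rps = DAU / 86_400                          # ~11.5K feed requests/sec
engagement_rps = DAU * 10 / 86_400               # ~115K engagement requests/sec

print(f"storage: {total_pb_per_day:.4f} PB/day, "
      f"feed: {feed_rps:,.0f} RPS, engagement: {engagement_rps:,.0f} RPS")
```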
API design
User and Authentication APIs
- POST /api/users/register
- Description: Register a new user account with email and password.
- POST /api/users/login
- Description: Authenticate a user and generate an access token.
- GET /api/users/{userId}/profile
- Description: Retrieve the profile details of a specific user.
- PUT /api/users/{userId}/preferences
- Description: Update user preferences, such as interests and settings.
Newsfeed APIs
- GET /api/feed
  - Description: Retrieve the personalized newsfeed for the authenticated user (an example request appears after this group of endpoints).
  - Query Parameters: `limit`, `offset`, `filter`.
- GET /api/feed/updates
- Description: Fetch real-time updates (new posts) for the user's feed.
- POST /api/feed/filter
- Description: Apply specific filters to the user's feed (e.g., by friends, groups).
- GET /api/feed/saved
- Description: Retrieve posts the user has saved for later viewing.
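Purely as an illustration of the pagination contract, a client-side call to GET /api/feed might look like the sketch below; the host name, bearer-token auth, and JSON response shape are assumptions, not part of the spec:

```python
import requests

BASE_URL = "https://api.example.com"  # placeholder host


def fetch_feed(token: str, limit: int = 20, offset: int = 0,
               feed_filter: str | None = None) -> list[dict]:
    """Fetch one page of the personalized feed via GET /api/feed."""
    params = {"limit": limit, "offset": offset}
    if feed_filter:
        params["filter"] = feed_filter            # e.g. "friends", "groups"
    resp = requests.get(f"{BASE_URL}/api/feed",
                        params=params,
                        headers={"Authorization": f"Bearer {token}"},
                        timeout=5)
    resp.raise_for_status()
    return resp.json()                            # assumed: a list of post objects

# Usage: first two pages of a friends-only feed
# page1 = fetch_feed(token, limit=20, offset=0, feed_filter="friends")
# page2 = fetch_feed(token, limit=20, offset=20, feed_filter="friends")
```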
Post APIs
- POST /api/posts
- Description: Create a new post with text, images, or videos.
- Request: Multipart data for content and media.
- GET /api/posts/{postId}
- Description: Retrieve the details of a specific post.
- PUT /api/posts/{postId}
- Description: Update the content or metadata of an existing post.
- DELETE /api/posts/{postId}
- Description: Delete a user's post.
Engagement APIs
- POST /api/posts/{postId}/like
- Description: Like a specific post.
- POST /api/posts/{postId}/comment
- Description: Add a comment to a specific post.
- GET /api/posts/{postId}/comments
- Description: Retrieve all comments on a specific post.
- Query Parameters: `limit`, `offset`.
- POST /api/posts/{postId}/share
- Description: Share a specific post to the user's timeline.
- GET /api/posts/{postId}/reactions
- Description: Retrieve all reactions (likes, loves, etc.) for a specific post.
Search APIs
- GET /api/search/posts
- Description: Search for posts by keywords, hashtags, or content type.
- Query Parameters: `query`, `type`, `limit`, `offset`.
- GET /api/search/users
- Description: Search for users by name or username.
- Query Parameters: `query`, `limit`, `offset`.
- GET /api/search/groups
- Description: Search for groups by name or topics.
Ad APIs
- GET /api/ads
- Description: Fetch personalized sponsored posts for the user's feed.
- POST /api/ads/click/{adId}
- Description: Record a click on a specific ad for analytics.
Analytics APIs
- GET /api/analytics/feed
- Description: Retrieve user-specific analytics for feed interactions (e.g., views, clicks).
- GET /api/analytics/post/{postId}
- Description: Retrieve engagement metrics for a specific post.
- GET /api/analytics/user/{userId}
- Description: Retrieve a user's activity summary (e.g., posts, likes, shares).
Database design
Relational Database (MySQL/PostgreSQL)
- Users Table
  - Schema:
    - `user_id` (Primary Key): Unique identifier for the user.
    - `username`: Display name of the user.
    - `email`: Email address of the user.
    - `password_hash`: Hashed password.
    - `profile_picture`: URL to the user's profile picture.
    - `preferences`: JSON object storing user preferences.
    - `created_at`: Timestamp when the user account was created.
  - Technology: MySQL or PostgreSQL.
  - Reason: Relational databases are ideal for structured data with strong consistency, such as user profiles and preferences.
- Posts Table
  - Schema:
    - `post_id` (Primary Key): Unique identifier for the post.
    - `user_id` (Foreign Key): References `Users.user_id`.
    - `content`: Text content of the post.
    - `media_url`: URL(s) for attached images/videos.
    - `visibility`: Post visibility settings.
    - `created_at`: Timestamp of post creation.
  - Technology: MySQL or PostgreSQL.
  - Reason: Posts require structured storage with relational consistency to associate users with their posts.
- Likes Table
  - Schema:
    - `like_id` (Primary Key): Unique identifier for the like.
    - `post_id` (Foreign Key): References `Posts.post_id`.
    - `user_id` (Foreign Key): References `Users.user_id`.
    - `reaction_type`: Type of reaction (e.g., like, love, wow).
    - `created_at`: Timestamp when the reaction was added.
  - Technology: MySQL or PostgreSQL.
  - Reason: High write consistency is needed to track reactions accurately, making relational databases suitable.
- Comments Table
  - Schema:
    - `comment_id` (Primary Key): Unique identifier for the comment.
    - `post_id` (Foreign Key): References `Posts.post_id`.
    - `user_id` (Foreign Key): References `Users.user_id`.
    - `comment_text`: Text of the comment.
    - `created_at`: Timestamp of comment creation.
  - Technology: MySQL or PostgreSQL.
  - Reason: Hierarchical relationships between posts and comments are efficiently managed by relational databases.
- Shares Table
  - Schema:
    - `share_id` (Primary Key): Unique identifier for the share.
    - `post_id` (Foreign Key): References `Posts.post_id`.
    - `user_id` (Foreign Key): References `Users.user_id`.
    - `created_at`: Timestamp when the post was shared.
  - Technology: MySQL or PostgreSQL.
  - Reason: Structured relationships between shares, users, and posts benefit from relational storage.
NoSQL Database (MongoDB/Elasticsearch)
- Feed Collection
  - Schema:
    - `user_id`: ID of the user.
    - `feed`: List of post objects with `post_id` and `relevance_score`.
    - `last_updated`: Timestamp when the feed was last refreshed.
  - Technology: MongoDB.
  - Reason: Flexible schema and high write throughput make MongoDB ideal for storing precomputed feeds.
- Search Index Collection
  - Schema:
    - `post_id`: ID of the post.
    - `keywords`: List of keywords extracted from the post content.
    - `user_id`: ID of the post's creator.
    - `tags`: List of hashtags/topics related to the post.
  - Technology: Elasticsearch.
  - Reason: Optimized for full-text search, Elasticsearch is ideal for high-performance query execution.
- Post Metadata Collection
  - Schema:
    - `post_id`: ID of the post.
    - `views`: Total number of views.
    - `likes_count`: Total number of likes.
    - `comments_count`: Total number of comments.
    - `shares_count`: Total number of shares.
  - Technology: MongoDB.
  - Reason: Handles high-volume updates (views, likes) with scalability for unstructured, frequently changing data.
Blob Storage (AWS S3/Google Cloud Storage)
Media Bucket
- Schema:
  - Directory structure: `/media/{user_id}/{post_id_or_profile}/{file_name}`
  - Stores:
    - Images: Profile pictures, post images, and other static images.
    - Videos: Transcoded video files in multiple resolutions.
  - Examples:
    - Profile Picture: `/media/{user_id}/profile/{file_name}`
    - Post Media: `/media/{user_id}/{post_id}/{file_name}`
- Technology: AWS S3 or Google Cloud Storage.
- Reason:
- Consolidating profile pictures into the media bucket simplifies the storage structure.
- Reduces the need to maintain separate buckets for small files, as S3/Cloud Storage is highly scalable and efficient for mixed storage types.
- Profile pictures are static, and their retrieval can be optimized through caching or CDNs for fast delivery.
Caching (Redis/Memcached)
- Cached Feed Data
  - Schema:
    - Key: `feed:{user_id}`
    - Value: List of precomputed feed posts.
    - Expiry: 5-10 minutes.
  - Technology: Redis.
  - Reason: In-memory caching provides high-speed access to frequently queried data.
- Popular Posts
  - Schema:
    - Key: `popular_posts`
    - Value: List of trending post IDs.
  - Technology: Redis.
  - Reason: Low latency for frequently accessed trending data.
- User Sessions
  - Schema:
    - Key: `session:{user_id}`
    - Value: Authentication token and user profile data.
  - Technology: Redis.
  - Reason: Fast session retrieval for real-time authentication.
Analytics Data Store (Cassandra/Snowflake)
- Engagement Analytics
  - Schema:
    - `post_id`: ID of the post.
    - `views`: Total number of views.
    - `unique_users`: Count of unique users interacting with the post.
    - `avg_time_spent`: Average time spent on the post.
  - Technology: Cassandra.
  - Reason: Cassandra handles high-volume time-series data with scalability.
- User Activity Analytics
  - Schema:
    - `user_id`: ID of the user.
    - `posts_created`: Total posts created by the user.
    - `likes_given`: Total likes given by the user.
    - `comments_made`: Total comments made by the user.
  - Technology: Snowflake.
  - Reason: Snowflake efficiently supports large-scale analytical queries and complex aggregations.
High-level design
1. Client Application
- Purpose: Provides user interfaces for accessing the system via web browsers, mobile apps, or desktop apps.
- Key Actions:
- Users can view feeds, create posts, like or comment, search content, and view analytics.
- Technologies: Frontend frameworks such as ReactJS or Angular for the web, and Swift for iOS apps.
2. Load Balancer
- Purpose: Distributes incoming traffic evenly across multiple instances of the API Gateway to ensure scalability and fault tolerance.
- Technologies: AWS Application Load Balancer, NGINX, or HAProxy.
3. API Gateway
- Purpose: Acts as the single entry point for all client requests, handling routing, authentication, rate limiting, and monitoring.
- Technologies: AWS API Gateway, Kong Gateway, or Apigee.
4. Backend Services
- Feed Service:
- Purpose: Generates and fetches user-specific feeds based on relevance, engagement, and user preferences.
- Database: MongoDB for storing precomputed feeds.
- Key Technologies: Node.js or Python for backend logic.
- Post Service:
- Purpose: Manages posts, including creation, updates, deletions, and associated media files.
- Database: PostgreSQL for structured metadata and AWS S3 for media storage.
- Key Technologies: Django, Spring Boot, or Express.js.
- Engagement Service:
- Purpose: Tracks likes, comments, and shares, and provides engagement metrics.
- Database: PostgreSQL for relational data and Redis for caching frequently accessed metrics.
- Key Technologies: Java with Spring Boot, or Python with Flask.
- Search Service:
- Purpose: Handles full-text and keyword searches for posts, users, and groups.
- Database: Elasticsearch for indexing and querying.
- Key Technologies: Python with FastAPI, or Java with Elasticsearch client.
- User Service:
- Purpose: Manages user profiles, authentication, and preferences.
- Database: PostgreSQL for relational user data.
- Key Technologies: Ruby on Rails, or Node.js.
- Analytics Service:
- Purpose: Tracks and aggregates engagement and view data for generating insights and metrics.
- Databases:
- Apache Kafka for real-time event streaming.
- DynamoDB for aggregated metrics.
- AWS S3 for raw analytics data.
- Key Technologies: Apache Spark, or Flink for processing.
5. Databases
- MongoDB: Stores dynamic, user-specific feeds for the Feed Service.
- PostgreSQL: Stores structured metadata for posts, users, and engagement (likes, comments, shares).
- AWS S3: Stores media files (images, videos) and raw analytics data.
- Redis: Caches frequently accessed engagement metrics for low-latency reads.
- Elasticsearch: Provides fast full-text search and indexing for the Search Service.
- DynamoDB: Stores aggregated analytics metrics for fast, scalable reads.
6. Other Infrastructure Components
- Authentication:
- Manages secure access using JWT or OAuth2 tokens.
- Implemented via API Gateway or a dedicated Auth Service.
- Monitoring and Logging:
- Tools like AWS CloudWatch, Prometheus, and ELK Stack for tracking system health and logs.
- Content Delivery:
- Uses a CDN (e.g., AWS CloudFront) for fast delivery of media files.
Request flows
1. Fetch Personalized Feed
- Objective: Retrieve a user-specific feed based on their preferences, interactions, and content relevance.
- Flow:
- Client requests the feed via the API Gateway.
- The API Gateway routes the request to the Feed Service.
- The Feed Service queries MongoDB for the precomputed feed.
- The feed data is returned to the client for display.
- Key Components: Client → API Gateway → Feed Service → MongoDB.
2. Create New Post
- Objective: Allow users to create posts with text and media (images or videos).
- Flow:
- Client sends a request to create a post via the API Gateway.
- The API Gateway routes the request to the Post Service.
- The Post Service:
- Stores metadata (e.g., title, content) in PostgreSQL.
- Uploads associated media files to AWS S3.
- A success response is sent back to the client.
- Key Components: Client → API Gateway → Post Service → PostgreSQL → AWS S3.
3. Like a Post
- Objective: Allow users to like or react to a post.
- Flow:
- Client sends a like request via the API Gateway.
- The API Gateway routes the request to the Engagement Service.
- The Engagement Service:
- Adds the like record to PostgreSQL.
- Updates the cached like count in Redis for quick future access.
- A success response is sent back to the client.
- Key Components: Client → API Gateway → Engagement Service → PostgreSQL → Redis.
4. Comment on a Post
- Objective: Enable users to comment on posts.
- Flow:
- Client sends a comment request via the API Gateway.
- The API Gateway routes the request to the Engagement Service.
- The Engagement Service inserts the comment data into PostgreSQL.
- A success response is returned to the client.
- Key Components: Client → API Gateway → Engagement Service → PostgreSQL.
5. Search for Posts
- Objective: Allow users to search for posts, users, or tags using keywords.
- Flow:
- Client sends a search query via the API Gateway.
- The API Gateway forwards the query to the Search Service.
- The Search Service queries Elasticsearch for matching results.
- The search results are returned to the client.
- Key Components: Client → API Gateway → Search Service → Elasticsearch.
6. Fetch Post Analytics
- Objective: Provide analytics data (e.g., views, likes, comments) for a specific post.
- Flow:
- Client sends a request for post analytics via the API Gateway.
- The API Gateway routes the request to the Analytics Service.
- The Analytics Service:
- Streams real-time event data from Kafka.
- Queries DynamoDB for aggregated metrics (e.g., total likes, comments).
- Retrieves archived raw data from AWS S3 if needed.
- Analytics data is returned to the client.
- Key Components: Client → API Gateway → Analytics Service → Kafka → DynamoDB → AWS S3.
Detailed component design
1. Feed Service
Purpose
To provide personalized feeds combining precomputed results and real-time updates based on user preferences, interactions, and content relevance.
Detailed Workflow
- Event Triggering:
- User actions (likes, comments, shares, follows) or new post creations generate events.
- These events are pushed into an Apache Kafka event queue.
- Kafka brokers distribute the events to Feed Service consumers for processing.
- Precomputation:
- Periodically, the Feed Service retrieves:
- Posts from users/pages the user follows.
- Interaction metrics (e.g., likes, shares, comments).
- User preferences (e.g., preferred categories, frequently interacted users/pages).
- The Ranking Algorithm assigns each post a score: R = w1·Engagement + w2·Recency + w3·Personalization.
- Posts with the highest scores are stored in MongoDB as a precomputed feed for each user.
- Real-Time Updates:
- New events are processed dynamically and written to Redis using a `priority_queue` structure.
- The queue prioritizes posts with higher relevance scores.
- Feed Retrieval:
- When a user requests their feed:
- The system first queries Redis for cached updates.
- If a cache miss occurs, the system fetches precomputed feed data from MongoDB.
- Real-time updates (Redis) and precomputed data (MongoDB) are merged, ensuring relevance and freshness (a scoring-and-merge sketch follows this workflow).
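A minimal Python sketch of the scoring formula and the merge step above. The weights, post field names, and helper structure are illustrative assumptions, not the production algorithm:

```python
import time

# Illustrative weights for R = w1*Engagement + w2*Recency + w3*Personalization
W_ENGAGEMENT, W_RECENCY, W_PERSONALIZATION = 0.5, 0.3, 0.2


def rank_score(post: dict, user_interests: set[str], now: float | None = None) -> float:
    """Score one post; field names (likes, comments, shares, created_at, topics) are assumptions."""
    now = now or time.time()
    engagement = post["likes"] + 2 * post["comments"] + 3 * post["shares"]
    age_hours = max((now - post["created_at"]) / 3600, 1.0)
    recency = 1.0 / age_hours                       # decays as the post ages
    personalization = len(user_interests & set(post.get("topics", [])))
    return (W_ENGAGEMENT * engagement
            + W_RECENCY * recency
            + W_PERSONALIZATION * personalization)


def build_feed(precomputed: list[dict], realtime: list[dict],
               user_interests: set[str], limit: int = 100) -> list[dict]:
    """Merge real-time (Redis) and precomputed (MongoDB) posts, dedupe, then re-rank."""
    seen: set = set()
    merged: list[dict] = []
    for post in realtime + precomputed:             # real-time entries win on duplicates
        if post["post_id"] not in seen:
            seen.add(post["post_id"])
            merged.append(post)
    merged.sort(key=lambda p: rank_score(p, user_interests), reverse=True)
    return merged[:limit]
```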
Scalability
- MongoDB is sharded by `user_id` for distributed storage and retrieval.
- Redis reduces latency for frequently accessed feeds.
- Kafka ensures high throughput and scalable event processing.
2. Post Service
Purpose
Manages post creation, storage, retrieval, and associated media handling.
Detailed Workflow
- Post Creation:
- Users submit post content and media via the client app.
- Metadata (e.g., text content, creation time, visibility settings) is sent to the Post Service API.
- Metadata is stored in PostgreSQL, ensuring ACID compliance.
- Media Upload:
- Media files (e.g., images, videos) are split into smaller parts using multipart uploads.
- The client uploads each part directly to AWS S3 using pre-signed URLs, reducing server load (a pre-signed URL sketch follows this workflow).
- After all parts are uploaded, S3 assembles them into a complete file and generates a URL.
- Post Retrieval:
- When a user requests a post:
- Metadata is retrieved from PostgreSQL.
- The media URL is fetched from S3.
- Media delivery is optimized using AWS CloudFront (CDN) for caching.
- Post Deletion:
- Metadata is removed from PostgreSQL.
- Corresponding media files are deleted from S3 using their unique `post_id`.
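A minimal sketch of issuing a pre-signed upload URL with boto3. The bucket name and key layout follow the media-bucket convention above; the helper name and expiry are assumptions. Large videos would use `create_multipart_upload` with one pre-signed `upload_part` URL per part; the single-PUT case is shown for brevity:

```python
import uuid

import boto3

s3 = boto3.client("s3")
MEDIA_BUCKET = "newsfeed-media"       # placeholder bucket name


def create_upload_url(user_id: str, post_id: str, file_name: str,
                      content_type: str, expires_s: int = 900) -> dict:
    """Issue a pre-signed PUT URL so the client uploads media directly to S3."""
    key = f"media/{user_id}/{post_id}/{uuid.uuid4()}-{file_name}"
    upload_url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": MEDIA_BUCKET, "Key": key, "ContentType": content_type},
        ExpiresIn=expires_s,
    )
    return {"upload_url": upload_url, "media_key": key}
```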
Scalability
- PostgreSQL read replicas handle read-heavy operations efficiently.
- AWS S3 scales horizontally for unlimited media storage.
- CloudFront reduces latency by caching media close to the user.
3. Engagement Service
Purpose
Tracks and processes interactions such as likes, comments, and shares.
Detailed Workflow
- Like Interaction:
- A user likes a post via the client app.
- The like is recorded in PostgreSQL, associating the `post_id` with the `user_id`.
- Redis updates the like count, storing it in a `hash_map` keyed by `post_id` for fast retrieval (see the sketch after this workflow).
- Comment Interaction:
- Comments are submitted through the client app and stored in PostgreSQL, linked to the post and user.
- Comment threads are cached in Redis for frequently accessed posts.
- If a user requests comments, the system queries Redis first, falling back to PostgreSQL if necessary.
- Share Interaction:
- Shares are recorded in PostgreSQL, linking the `post_id` and `user_id`.
- The Feed Service is notified via Kafka to update the feeds of the sharer's followers.
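A minimal sketch of the like path, writing the reaction row to PostgreSQL and bumping a Redis hash keyed by `post_id`. The connection string, table/column names, and unique-index assumption are illustrative:

```python
import psycopg2
import redis

pg = psycopg2.connect("dbname=newsfeed")          # placeholder connection string
cache = redis.Redis()


def like_post(user_id: int, post_id: int, reaction: str = "like") -> int:
    """Record a reaction in PostgreSQL, then bump the cached count in Redis."""
    with pg, pg.cursor() as cur:                  # commits on success, rolls back on error
        cur.execute(
            """INSERT INTO likes (post_id, user_id, reaction_type)
               VALUES (%s, %s, %s)
               ON CONFLICT DO NOTHING""",         # assumes a unique (post_id, user_id) index
            (post_id, user_id, reaction),
        )
        inserted = cur.rowcount == 1
    if inserted:
        # Hash keyed by post_id, one field per reaction type, as described above.
        cache.hincrby(f"post:{post_id}:reactions", reaction, 1)
    return int(cache.hget(f"post:{post_id}:reactions", reaction) or 0)
```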
Scalability
- Redis reduces latency for engagement metrics such as like counts and comment threads.
- PostgreSQL is partitioned by `post_id` to distribute data efficiently.
4. Search Service
Purpose
Enables users to perform full-text and keyword searches for posts, users, and tags.
Detailed Workflow
- Indexing:
- When a new post is created, its content, tags, and associated metadata are sent to Elasticsearch.
- Elasticsearch creates an inverted index, mapping terms to document IDs for efficient search.
- Search Query:
- The client sends a search query through the API Gateway.
- The query is translated into Elasticsearch Query DSL, specifying fields to search (e.g., `content`, `tags`); a query sketch follows this workflow.
- Elasticsearch matches the query terms to its inverted index and ranks results by relevance.
- Response:
- Results are paginated and sorted before being sent back to the client.
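A minimal sketch of such a query using the official Python client; the index name, field list, and boost values are assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")       # placeholder endpoint


def search_posts(query: str, limit: int = 20, offset: int = 0) -> list[dict]:
    """Full-text search over a posts index; index and field names are assumptions."""
    body = {
        "from": offset,
        "size": limit,
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["content^2", "tags", "keywords"],   # boost the post text
            }
        },
    }
    resp = es.search(index="posts", body=body)
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```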
Scalability
- Elasticsearch clusters handle horizontal scaling, distributing data across nodes.
- Batched indexing ensures high write efficiency.
5. User Service
Purpose
Manages user profiles, authentication, and preferences.
Detailed Workflow
- User Profile Management:
- User profile data (e.g., name, email, preferences) is stored in PostgreSQL.
- Updates to preferences trigger Kafka events to update dependent services like the Feed Service.
- Authentication:
- Users log in with credentials verified by the User Service.
- On successful login, a JWT (JSON Web Token) is issued, containing claims such as `user_id` and roles (see the sketch after this workflow).
- The client uses this token for subsequent authenticated requests.
- Preference Retrieval:
- User preferences are fetched during feed generation to tailor content recommendations.
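A minimal token issue/verify sketch with PyJWT; the secret handling, TTL, and claim names are illustrative assumptions:

```python
import time

import jwt  # PyJWT

SECRET = "change-me"                  # illustrative; would come from a secrets manager
ACCESS_TTL_S = 15 * 60                # short-lived access token (15 minutes)


def issue_token(user_id: int, roles: list[str]) -> str:
    """Issue a signed JWT carrying user_id and roles as claims."""
    now = int(time.time())
    claims = {"sub": str(user_id), "roles": roles, "iat": now, "exp": now + ACCESS_TTL_S}
    return jwt.encode(claims, SECRET, algorithm="HS256")


def verify_token(token: str) -> dict:
    """Validate signature and expiry; raises jwt.InvalidTokenError on failure."""
    return jwt.decode(token, SECRET, algorithms=["HS256"])
```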
Scalability
- PostgreSQL read replicas handle frequent profile and preference lookups.
- Redis caches session data for fast authentication.
6. Analytics Service
Purpose
Tracks and processes engagement and view data for generating real-time and aggregated insights.
Detailed Workflow
- Real-Time Streaming:
- Engagement events (e.g., likes, views) are streamed to Kafka.
- Kafka partitions these events by `post_id` or `user_id`, enabling parallel processing.
- Aggregation:
- Apache Spark processes Kafka streams, computing metrics such as total views, average time spent, and engagement rates (a streaming-aggregation sketch follows this workflow).
- Aggregated metrics are stored in DynamoDB for fast retrieval.
- Archiving:
- Raw events are periodically archived in AWS S3 for batch processing or long-term storage.
- Metrics Retrieval:
- Analytics data is fetched from DynamoDB for dashboards or reports.
- Real-time data can be fetched directly from Spark for the latest metrics.
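A Spark Structured Streaming sketch of the aggregation step, reading engagement events from Kafka and counting them per post over one-minute windows. The broker address, topic name, event schema, and console sink are assumptions; a real job would upsert into DynamoDB via `foreachBatch`:

```python
# Requires the spark-sql-kafka connector on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import LongType, StringType, StructType

spark = SparkSession.builder.appName("engagement-aggregation").getOrCreate()

event_schema = (StructType()
                .add("post_id", StringType())
                .add("event_type", StringType())   # like, view, comment, share
                .add("ts", LongType()))            # event time, epoch millis

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")   # placeholder brokers
          .option("subscribe", "engagement-events")          # placeholder topic
          .load()
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*")
          .withColumn("event_time", (col("ts") / 1000).cast("timestamp")))

# Per-post counts over 1-minute windows; written to the console for the sketch.
counts = (events
          .groupBy(window(col("event_time"), "1 minute"),
                   col("post_id"), col("event_type"))
          .count())

query = counts.writeStream.outputMode("update").format("console").start()
# query.awaitTermination()
```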
Scalability
- Kafka’s partitioning ensures high-throughput event processing.
- DynamoDB scales horizontally for fast reads/writes.
1. Feed Update Mechanism
How Incoming Posts are Prioritized
- Prioritization Strategy:
- New posts are scored based on a real-time ranking algorithm using factors such as:
- Recency: Weight posts created recently higher.
- Engagement: Prioritize posts with higher interaction (likes, comments, shares).
- Relevance: Align with the user’s preferences (e.g., topics, followed accounts).
- Posts are inserted into a priority queue in Redis, sorted by the calculated score (one way to realize this with a sorted set is sketched below).
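One common way to realize a score-ordered queue in Redis is a sorted set; a minimal sketch, with the key pattern, cap, and TTL as illustrative assumptions:

```python
import redis

r = redis.Redis()
FEED_CAP = 500                                    # keep only the top-scored posts per user


def push_post(user_id: int, post_id: int, score: float) -> None:
    """Insert or refresh a post in the user's score-ordered feed queue."""
    key = f"feed:{user_id}"
    r.zadd(key, {str(post_id): score})
    r.zremrangebyrank(key, 0, -(FEED_CAP + 1))    # trim the lowest-scored entries
    r.expire(key, 600)                            # 10-minute TTL, per the caching rules


def top_posts(user_id: int, limit: int = 100) -> list[str]:
    """Read the highest-scored posts first."""
    return [m.decode() for m in r.zrevrange(f"feed:{user_id}", 0, limit - 1)]
```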
Push Notification Strategy
- Real-Time Notifications:
- WebSockets: Maintain persistent connections to push updates instantly to active users.
- FCM/APNs: For mobile apps, use Firebase Cloud Messaging (FCM) or Apple Push Notification Service (APNs) for out-of-app notifications.
- Triggering Notifications:
- Push notifications are triggered via Kafka events when:
- A new post is created by a followed user/page.
- Significant engagement occurs on a user’s post.
Mechanism for Feed Updates:
- New events (e.g., post creation) are pushed into Kafka.
- The Feed Service processes these events to update the priority queue in Redis.
- Active users receive immediate updates via WebSockets or notifications.
- Precomputed feeds in MongoDB are updated periodically to incorporate recent posts.
2. User Preferences Management
Storage and Management
- Storage:
- Preferences are stored in PostgreSQL as a JSONB field for flexibility (e.g., `{"topics": ["sports", "technology"], "language": "en"}`); a short sketch follows this subsection.
- Indexed fields in PostgreSQL allow faster querying for specific preference attributes.
- Real-Time Updates:
- Updates to user preferences trigger a Kafka event, notifying dependent services like the Feed Service to regenerate personalized feeds.
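A minimal psycopg2 sketch of merging and querying the JSONB preferences column; the table/column names and the GIN-index remark are assumptions consistent with the schema above:

```python
import json

import psycopg2

pg = psycopg2.connect("dbname=newsfeed")          # placeholder connection string


def update_preferences(user_id: int, prefs: dict) -> None:
    """Merge new keys into the JSONB preferences column (existing keys are overwritten)."""
    with pg, pg.cursor() as cur:
        cur.execute(
            "UPDATE users SET preferences = preferences || %s::jsonb WHERE user_id = %s",
            (json.dumps(prefs), user_id),
        )
    # A Kafka event would be emitted here so the Feed Service recomputes the feed.


def users_interested_in(topic: str) -> list[int]:
    """Find users whose preferences list a topic; a GIN index on preferences speeds this up."""
    with pg, pg.cursor() as cur:
        cur.execute(
            "SELECT user_id FROM users WHERE preferences->'topics' ? %s",
            (topic,),
        )
        return [row[0] for row in cur.fetchall()]
```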
Utilization in Feed Personalization
- Integration into Ranking:
- The ranking algorithm assigns higher weights to posts aligned with the user’s preferences (e.g., topics, categories, frequently interacted users/pages).
- Dynamic Adaptation:
- Machine learning models adjust weights over time based on evolving user behavior.
- For example, a user engaging more with “technology” posts will see more tech content.
End-to-End Flow:
- Users update preferences via the client application.
- Updates are saved in PostgreSQL and trigger Kafka events.
- The Feed Service recalculates the user’s feed with the updated preferences.
3. Data Partitioning for Scalability
Sharding Strategies
- User Data:
  - Shard by `user_id`: Each shard contains data for a set of users.
  - Rationale: Distributes load evenly across shards for user-specific queries like feed retrieval or engagement metrics.
- Post Data:
  - Shard by `post_id` or creation date: Ensures high-write workloads (e.g., post creation) are evenly distributed.
  - Rationale: Supports horizontal scaling and reduces contention on hot partitions.
- Engagement Data:
  - Shard by `post_id`: Related engagement data (likes, comments, shares) resides together.
  - Rationale: Enables efficient queries for post-level metrics (a simple hash-routing sketch follows this list).
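A minimal sketch of stable hash-based shard routing for `user_id` or `post_id` keys; the shard count and hash choice are illustrative (production systems typically rely on the database's native sharding or consistent hashing):

```python
import hashlib

NUM_SHARDS = 64                                   # illustrative shard count


def shard_for(key: str | int, num_shards: int = NUM_SHARDS) -> int:
    """Map a user_id or post_id to a shard with a stable hash (not Python's hash())."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_shards


# Routing examples:
user_shard = shard_for(42)      # user 42: profile, preferences, and feed queries land here
post_shard = shard_for(9001)    # post 9001: its likes/comments/shares stay co-located
```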
Database Partitioning Tools
- Use MongoDB’s native sharding for user-specific feeds.
- Leverage PostgreSQL table partitioning (e.g., declarative range partitioning by `post_id` or timestamp) for relational data.
Advantages:
- Improves query performance by reducing the amount of data scanned.
- Scales horizontally to support millions of users and posts.
4. Caching Strategy Detail
Utilization of Redis
- Cached Data:
- Frequently accessed data, such as:
- User feeds (priority queue with top-ranked posts).
- Engagement metrics (like counts, comment threads).
- Session data for authenticated users.
- Cache Structure:
- Priority Queue: Maintains top-ranked posts for fast feed retrieval.
- Hash Maps: Store like counts or engagement metrics keyed by `post_id` (a cache-aside sketch with TTL follows below).
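A minimal cache-aside sketch combining the TTL and event-driven rules described here; the key pattern matches the caching schema above, while the MongoDB loader is a hypothetical stand-in:

```python
import json

import redis

r = redis.Redis()
FEED_TTL_S = 600                                  # 5-10 minute TTL, per the caching schema


def load_precomputed_feed(user_id: int) -> list[dict]:
    """Hypothetical stand-in for the MongoDB query that rebuilds the feed on a miss."""
    return []


def get_feed(user_id: int) -> list[dict]:
    """Cache-aside read: try Redis first, rebuild from MongoDB on a cache miss."""
    key = f"feed:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    feed = load_precomputed_feed(user_id)
    r.set(key, json.dumps(feed), ex=FEED_TTL_S)   # time-based invalidation via TTL
    return feed


def invalidate_feed(user_id: int) -> None:
    """Event-driven invalidation: drop the entry when a followed user posts, preferences change, etc."""
    r.delete(f"feed:{user_id}")
```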
Cache Invalidation Rules
- Event-Driven Invalidation:
- Cache entries are invalidated or updated when:
- A new post is added by a followed user.
- Engagement metrics (e.g., like counts) change.
- User preferences are updated, requiring feed recalculations.
- Time-Based Invalidation:
- Use a TTL (Time-to-Live) for cache entries to automatically refresh outdated data.
- Lazy Updates:
- For low-priority data, allow cache misses to trigger updates from the database.
Mitigating Stale Data
- Redis updates occur in real-time for high-priority events (e.g., new posts, likes).
- Backfill mechanisms ensure precomputed feeds in MongoDB are periodically synced with Redis.
Trade offs/Tech choices
1. Database Choices
- Relational (PostgreSQL) for structured data needing ACID compliance.
- NoSQL (MongoDB, DynamoDB) for scalable, high-throughput operations like feeds and analytics.
- Trade-Off: NoSQL sacrifices ACID for scalability; relational databases are less scalable but ensure consistency.
2. Media Storage
- AWS S3 for large media files, with CloudFront (CDN) for caching.
- Trade-Off: S3 introduces initial retrieval latency, mitigated by CDN.
3. Search
- Elasticsearch for full-text and keyword searches.
- Trade-Off: Requires operational overhead but offers fast, scalable search capabilities.
4. Caching
- Redis for in-memory caching of feeds, engagement metrics, and sessions.
- Trade-Off: Requires fallback for cache misses but significantly reduces database load.
5. Event Streaming
- Kafka for real-time event processing and feed updates.
- Trade-Off: Adds complexity but enables scalable, low-latency updates.
6. Analytics
- DynamoDB for aggregated metrics and S3 for raw data storage.
- Trade-Off: DynamoDB has limited query flexibility but scales horizontally for fast reads/writes.
7. Feed Service
- Precomputation with MongoDB and real-time updates with Redis.
- Trade-Off: Slight delay in real-time updates for reduced computation overhead.
8. Authentication
- JWTs for stateless authentication.
- Trade-Off: Revocation is harder than session-based systems but ensures scalability.
9. Push vs. Pull Model for Feed
- Pull Model:
- Used for initial feed requests when a user logs in or refreshes.
- Reduces server-side resource usage during low activity.
- Push Model:
- Used for real-time updates (e.g., new posts, interactions).
- Implemented with WebSockets or SSE for instant delivery.
- Hybrid Approach:
- Combines pull for initial loading and push for updates, ensuring scalability and a responsive user experience.
Failure scenarios/bottlenecks
1. Database Failures
- Scenario: Primary database (PostgreSQL/MongoDB) downtime.
- Impact: Data inconsistency, inability to retrieve posts, feeds, or engagement metrics.
- Mitigation:
- Use read replicas for failover.
- Set up multi-region replication for MongoDB and DynamoDB.
- Implement database backup and restore mechanisms.
2. Cache Inconsistencies
- Scenario: Redis fails or serves stale data.
- Impact: Delays in retrieving likes, comments, or precomputed feeds.
- Mitigation:
- Use cache invalidation policies for updates.
- Fall back to querying the database on cache misses.
- Deploy Redis in a clustered mode with failover support.
3. Event Streaming Failures
- Scenario: Kafka broker downtime or consumer lag.
- Impact: Delayed real-time feed updates and analytics aggregation.
- Mitigation:
- Enable replication of Kafka topics.
- Set up dead letter queues to handle unprocessed events.
- Use consumer lag monitoring to scale consumers dynamically.
4. Search System Failures
- Scenario: Elasticsearch cluster overload or node failure.
- Impact: Inability to process search queries efficiently.
- Mitigation:
- Scale Elasticsearch with auto-scaling clusters.
- Enable snapshot backups for index recovery.
- Use a circuit breaker pattern to fail gracefully during overloads.
5. Media Storage Issues
- Scenario: AWS S3 outage or slow media retrieval.
- Impact: Media files (images, videos) fail to load or cause delays.
- Mitigation:
- Use multi-region replication in S3.
- Rely on CDN (CloudFront) for cached media delivery.
- Set up fallback mechanisms for default media placeholders.
6. High Latency During Peak Traffic
- Scenario: Spikes in traffic overwhelm APIs and databases.
- Impact: Increased response times, service outages.
- Mitigation:
- Use rate limiting and auto-scaling for API Gateway and backend services.
- Leverage Redis caching to reduce database load.
- Implement load balancing with health checks for backend instances.
7. Feed Generation Bottlenecks
- Scenario: Precomputation jobs lag or real-time updates fail.
- Impact: Users see outdated or incomplete feeds.
- Mitigation:
- Use priority queues to process high-priority feed updates first.
- Scale feed services horizontally to handle high volumes.
- Monitor job processing times and reassign lagging tasks dynamically.
8. Authentication Token Exploits
- Scenario: JWTs are compromised or used after expiration.
- Impact: Unauthorized access to user data.
- Mitigation:
- Use short-lived tokens with refresh tokens.
- Employ token revocation lists and monitor suspicious activity.
- Secure tokens using encryption and HTTPS.
9. CDN Failures
- Scenario: CloudFront or another CDN fails to serve cached media.
- Impact: Increased latency as media retrieval falls back to S3.
- Mitigation:
- Use multi-CDN setups to switch between providers.
- Ensure S3 is configured for direct fallback delivery.
10. Analytics Processing Delays
- Scenario: Spark jobs fail or Kafka streams are blocked.
- Impact: Delayed insights and inaccurate analytics data.
- Mitigation:
- Enable checkpointing in Spark for job recovery.
- Scale Kafka partitions and consumer groups dynamically.
- Use real-time monitoring tools like Kafka Manager.
11. API Gateway Overload
- Scenario: API Gateway cannot handle incoming traffic during peaks.
- Impact: Request throttling or dropped connections.
- Mitigation:
- Use auto-scaling and load balancers for API Gateway.
- Enable rate limiting to manage traffic bursts.
- Implement regional gateways for better load distribution.
12. Engagement Metric Inaccuracies
- Scenario: Redis cache corruption or database write failures.
- Impact: Incorrect like/comment counts displayed to users.
- Mitigation:
- Periodically sync Redis cache with the database.
- Use transactional writes in PostgreSQL for engagement metrics.
Future improvements
Database Scalability:
- Add sharding or use distributed SQL (e.g., CockroachDB).
- Multi-region backups for resilience.
Cache Resilience:
- Deploy Redis Sentinel/Cluster for failover and redundancy.
Event Streaming:
- Use Kafka Streams for fault-tolerant processing.
- Implement cross-region replication with MirrorMaker.
Media Delivery:
- Multi-CDN strategy (e.g., CloudFront + Akamai).
- Preload popular media for faster access.
Feed Generation:
- Use ML-based ranking for better personalization.
- Combine batch and real-time processing.
Search Optimization:
- Enable hot and cold indexing in Elasticsearch.
- Cache frequent queries to reduce cluster load.
Analytics:
- Real-time pipelines with Apache Flink or BigQuery.
- Store metrics in time-series databases like TimescaleDB.
Authentication:
- Use OAuth 2.0 with refresh tokens.
- Implement adaptive authentication for anomaly detection.
API Scalability:
- Introduce GraphQL for flexible data fetching.
- Use gRPC for high-performance communication.
Disaster Recovery:
- Set up region-based failover systems.
- Monitor with tools like Datadog and Prometheus.