My Solution for Design Facebook Messenger with Score: 8/10
by iridescent_luminous693
System requirements
Functional:
User Management:
- Allow users to register, log in, and manage profiles.
- Support friend requests and contacts syncing.
Messaging:
- Real-time text messaging between users.
- Support for group chats and broadcast messages.
- Message delivery status (sent, delivered, read receipts).
- Typing indicators for active conversations.
Multimedia Support:
- Share images, videos, voice notes, and documents.
- Preview media files before sending.
Push Notifications:
- Notify users of new messages when the app is inactive.
Search:
- Search for users and conversations.
- Search within chats for specific messages or attachments.
Security:
- End-to-end encryption for messages.
- Multi-factor authentication for login.
Message History:
- Store and retrieve chat history.
- Option to archive, delete, or mute conversations.
Presence Indicators:
- Show online/offline status.
- Indicate last seen or active status.
Cross-Platform Support:
- Synchronize messages across devices (web and mobile).
Settings:
- Allow customization of notifications, privacy settings, and themes.
Non-Functional:
Performance:
- Deliver messages with low latency (< 200ms for real-time delivery).
- Support high throughput during peak traffic.
Scalability:
- Handle millions of active users simultaneously.
- Scale horizontally for increasing user load.
Reliability:
- Ensure message delivery guarantees (at least once).
- High availability with 99.99% uptime.
Consistency:
- Maintain consistency of message delivery across all devices.
Security:
- Use secure protocols (e.g., TLS) for data transmission.
- Encrypt messages with E2EE to prevent unauthorized access.
Fault Tolerance:
- Gracefully handle server or network failures.
- Provide retry mechanisms for undelivered messages.
Usability:
- Provide an intuitive and responsive user interface.
- Ensure accessibility for differently-abled users.
Data Durability:
- Ensure that chat history and multimedia are not lost in case of failures.
- Use redundant storage for critical data.
Low Resource Usage:
- Optimize mobile app performance to minimize battery and data usage.
Compliance:
- Adhere to data privacy laws (e.g., GDPR, CCPA).
- Provide users with data export and account deletion options.
Capacity estimation
1. Message Traffic
- Monthly Active Users (MAU): 50 million (assumed baseline).
- Daily Active Users (DAU): ~50% of MAU = 25 million users.
- Average Messages per User per Day: 50 messages.
- Total Messages per Day: 25 million users × 50 messages = 1.25 billion messages/day
- Peak Traffic:
- 5% of daily messages are sent in the peak hour: 1.25 billion × 0.05 = 62.5 million messages/hour
- Messages per Second (MPS): 62.5 million ÷ 3600 seconds ≈ 17,361 MPS
2. Message Size
- Average Message Size: ~1 KB (text, metadata, delivery receipts).
- Daily Storage Requirement: 1.25 billion messages/day × 1 KB = 1.25 TB/day
- Monthly Storage Requirement: 1.25 TB/day × 30 = 37.5 TB/month
3. Multimedia Traffic
- Percentage of Multimedia Messages: ~20% of total messages.
- Multimedia Uploads per Day: 1.25 billion messages/day × 0.2 = 250 million multimedia files/day
- Average Multimedia File Size: ~500 KB.
- Daily Storage for Multimedia: 250 million files/day × 500 KB = 125 TB/day
- Monthly Storage for Multimedia: 125 TB/day × 30 = 3.75 PB/month
4. Push Notifications
- Push Notifications per Day:
- For 25 million DAU, assuming each receives ~50 notifications/day: 25 million users × 50 notifications = 1.25 billion notifications/day
- Peak Notifications per Second (NPS):
- Assuming 5% are sent in the peak hour: 1.25 billion × 0.05 ÷ 3600 ≈ 17,361 NPS
5. User Metadata
- User Profiles:
- Average profile size: ~1 KB (basic information, preferences, settings).
- Storage for 50 million users: 50 million users × 1 KB = 50 GB
- User Connections:
- Average friends/contacts per user: ~200.
- Total connections: 50 million users × 200 = 10 billion connections
- Average connection metadata size: ~100 bytes.
- Total storage for connections: 10 billion connections × 100 bytes = 1 TB
6. Storage for Message History
- Retention Policy: Store messages for 1 year.
- Annual Storage Requirement:
- Text messages: 37.5 TB/month × 12 months = 450 TB/year
- Multimedia files: 3.75 PB/month × 12 months = 45 PB/year
7. API Requests
- Assumptions:
- 10 API requests per user per session (login, fetch messages, send messages, fetch media).
- Average session duration: 15 minutes.
- API Requests per Day (assuming one session per user per day): 25 million DAU × 10 requests/session = 250 million API requests/day
- Peak API Requests per Second (RPS):
- 5% of requests occur in the peak hour: 250 million × 0.05 ÷ 3600 ≈ 3,472 RPS
8. Bandwidth Requirements
- Message Bandwidth:
- Average message size: 1 KB.
- Peak MPS: 17,361.
- Bandwidth for messages: 17,361 MPS × 1 KB ≈ 17 MB/s (~136 Mbps)
- Multimedia Bandwidth:
- Average multimedia file size: 500 KB.
- Peak multimedia messages per second (20% of total MPS): 17,361 × 0.2 ≈ 3,472 files/second
- Bandwidth for multimedia: 3,472 files/sec × 500 KB ≈ 1.7 GB/s (~13.6 Gbps)
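These figures can be cross-checked with a short script. A minimal sketch, assuming the inputs above (25 M DAU, 50 messages/user/day, 5% peak hour, 1 KB text messages, 20% media share at 500 KB); it keeps one extra digit of precision, so the bandwidth results land slightly above the rounded values quoted above:

```python
# Back-of-the-envelope capacity check for the estimates above.
DAU = 25_000_000          # daily active users
MSGS_PER_USER = 50        # messages per user per day
PEAK_FRACTION = 0.05      # share of daily traffic in the peak hour
TEXT_SIZE_KB = 1          # average text message size
MEDIA_FRACTION = 0.20     # share of messages carrying media
MEDIA_SIZE_KB = 500       # average media file size

daily_msgs = DAU * MSGS_PER_USER                        # 1.25e9 messages/day
peak_mps = daily_msgs * PEAK_FRACTION / 3600            # ~17,361 messages/sec
text_tb_per_day = daily_msgs * TEXT_SIZE_KB / 1e9       # ~1.25 TB/day
media_tb_per_day = daily_msgs * MEDIA_FRACTION * MEDIA_SIZE_KB / 1e9  # ~125 TB/day
msg_bandwidth_mbps = peak_mps * TEXT_SIZE_KB * 8 / 1000               # ~139 Mbps
media_bandwidth_gbps = peak_mps * MEDIA_FRACTION * MEDIA_SIZE_KB * 8 / 1e6  # ~13.9 Gbps

print(f"{daily_msgs:,} msgs/day, {peak_mps:,.0f} peak MPS")
print(f"{text_tb_per_day:.2f} TB/day text, {media_tb_per_day:.0f} TB/day media")
print(f"{msg_bandwidth_mbps:.0f} Mbps messages, {media_bandwidth_gbps:.1f} Gbps media")
```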
API design
User Management APIs
- POST /users/register
- Description: Register a new user with details like name, email, and password.
- POST /users/login
- Description: Authenticate a user and return a JWT for session management.
- POST /users/logout
- Description: Log out the user and invalidate the session token.
- GET /users/{userId}/profile
- Description: Fetch the profile details of a specific user.
- PUT /users/{userId}/profile
- Description: Update the user’s profile information.
Messaging APIs
- POST /messages/send
- Description: Send a message to a user or group.
- GET /messages/conversation/{conversationId}
- Description: Retrieve the message history for a specific conversation.
- POST /messages/read-receipt
- Description: Update the read receipt status for a message.
- POST /messages/typing
- Description: Notify the server when a user is typing in a conversation.
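To make the messaging contract concrete, here is a hypothetical client call for POST /messages/send. The field names (conversation_id, media_ids, client_message_id) are illustrative assumptions, not part of the API list above:

```python
import requests

jwt_token = "eyJ..."  # obtained earlier from POST /users/login

# Hypothetical request shape for POST /messages/send; field names are
# assumptions for illustration, not a fixed contract.
resp = requests.post(
    "https://api.messenger.example.com/messages/send",
    headers={"Authorization": f"Bearer {jwt_token}"},
    json={
        "conversation_id": "c-1234",       # one-to-one or group conversation
        "content": "Are we still on for tonight?",
        "media_ids": [],                   # filled from a prior POST /media/upload
        "client_message_id": "m-temp-01",  # idempotency key so retries don't duplicate
    },
    timeout=5,
)
resp.raise_for_status()
print(resp.json())  # e.g., {"message_id": "...", "status": "sent"}
```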
Group Management APIs
- POST /groups/create
- Description: Create a new group chat.
- POST /groups/{groupId}/add-user
- Description: Add a user to a specific group.
- POST /groups/{groupId}/remove-user
- Description: Remove a user from a specific group.
- GET /groups/{groupId}/details
- Description: Fetch details of a specific group, including members.
- DELETE /groups/{groupId}
- Description: Delete a group chat.
Media Handling APIs
- POST /media/upload
- Description: Upload a media file (image, video, document) to the server.
- GET /media/{mediaId}
- Description: Retrieve a media file by its ID.
- DELETE /media/{mediaId}
- Description: Delete a specific media file.
Search APIs
- GET /search/users
- Description: Search for users by name or email.
- GET /search/messages
- Description: Search within a conversation for specific keywords.
Push Notification APIs
- POST /notifications/subscribe
- Description: Register a device for push notifications.
- DELETE /notifications/unsubscribe
- Description: Unregister a device from receiving push notifications.
Presence and Status APIs
- GET /presence/{userId}
- Description: Fetch the online/offline status of a user.
- POST /presence/update
- Description: Update the user’s presence status (e.g., online, offline, busy).
Administrative APIs
- GET /admin/stats
- Description: Retrieve system statistics like active users, messages sent, and storage usage.
- POST /admin/ban-user
- Description: Ban a specific user from the platform.
Database design
1. User Database
- Purpose: Stores user profile information and preferences for managing authentication, personalization, and interactions.
- Database Details:
- Table: users:
  - user_id (UUID, Primary Key): Unique identifier for each user.
  - name (VARCHAR): Full name of the user.
  - email (VARCHAR, Unique): User’s email address.
  - password_hash (VARCHAR): Hashed password (e.g., bcrypt).
  - profile_picture_url (TEXT): URL to the user’s profile picture.
  - status (ENUM): Current user status (e.g., active, banned).
  - created_at (TIMESTAMP): Account creation timestamp.
- Table: user_preferences:
  - user_id (UUID, Foreign Key): Links to the users table.
  - preferences (JSONB): Stores user-specific settings (e.g., notification preferences, themes).
- Technology Used:
- PostgreSQL.
- Reason:
- ACID compliance ensures consistent user data.
- JSONB allows flexible storage of user preferences.
2. Message Database
- Purpose: Manages the storage and retrieval of messages, including their relationships with users and conversations.
- Database Details:
- Table: messages:
  - message_id (UUID, Primary Key): Unique identifier for each message.
  - conversation_id (UUID, Foreign Key): Links to the conversations table.
  - sender_id (UUID, Foreign Key): Links to the users table.
  - content (TEXT): Message content (plaintext or JSON for rich text).
  - media_url (TEXT, Nullable): URL for attached media.
  - status (ENUM): Message delivery status (e.g., sent, delivered, read).
  - created_at (TIMESTAMP): Timestamp when the message was sent.
- Table: conversations:
  - conversation_id (UUID, Primary Key): Unique identifier for each conversation.
  - type (ENUM): Type of conversation (e.g., one-to-one, group).
  - participants (JSONB): List of user IDs in the conversation.
- Technology Used:
- PostgreSQL.
- Reason:
- Relational database supports efficient message querying and history retrieval.
- JSONB allows flexible participant storage for group chats.
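A minimal DDL sketch for these two tables. It assumes a reachable PostgreSQL instance with placeholder connection parameters, and uses CHECK constraints in place of native ENUM types for brevity:

```python
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS conversations (
    conversation_id UUID PRIMARY KEY,
    type            TEXT NOT NULL CHECK (type IN ('one_to_one', 'group')),
    participants    JSONB NOT NULL
);

CREATE TABLE IF NOT EXISTS messages (
    message_id      UUID PRIMARY KEY,
    conversation_id UUID NOT NULL REFERENCES conversations(conversation_id),
    sender_id       UUID NOT NULL,
    content         TEXT,
    media_url       TEXT,
    status          TEXT NOT NULL DEFAULT 'sent'
                    CHECK (status IN ('sent', 'delivered', 'read')),
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Conversation history is almost always read newest-first.
CREATE INDEX IF NOT EXISTS idx_messages_conversation_time
    ON messages (conversation_id, created_at DESC);
"""

# Connection string is a placeholder for a local dev database.
with psycopg2.connect("dbname=messenger user=app") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```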
3. Media Storage
- Purpose: Stores multimedia files and their associated metadata for seamless sharing in messages.
- Database Details:
- No traditional database table for media storage.
- AWS S3 stores media files, while a metadata table tracks their relationships:
- Table: media_metadata:
  - media_id (UUID, Primary Key): Unique identifier for the media file.
  - message_id (UUID, Foreign Key): Links to the messages table.
  - url (TEXT): Direct link to the media file in S3.
  - media_type (ENUM): Type of media (e.g., image, video, document).
  - size (INT): Size of the media file in bytes.
- Technology Used:
- AWS S3 (for storage) and PostgreSQL (for metadata).
- Reason:
- S3 offers scalable, cost-efficient storage for large multimedia files.
- PostgreSQL ensures quick metadata access for efficient linking with messages.
4. Notification Database
- Purpose: Tracks and manages push notifications for user messages and updates.
- Database Details:
- Table: notifications:
  - notification_id (UUID, Primary Key): Unique identifier for each notification.
  - user_id (UUID, Foreign Key): Links to the users table.
  - message (TEXT): Notification message content.
  - status (ENUM): Status of the notification (e.g., pending, sent, failed).
  - created_at (TIMESTAMP): Timestamp when the notification was created.
  - sent_at (TIMESTAMP, Nullable): Timestamp when the notification was sent.
- Technology Used:
- PostgreSQL.
- Reason:
- PostgreSQL ensures reliable and trackable notifications with delivery status for audits.
5. Search Index
- Purpose: Enables fast and scalable search for users, messages, and conversations.
- Database Details:
- Indexed Fields:
  - User search: name, email (from the users table).
  - Message search: content, conversation_id (from the messages table).
- Indexed data is updated asynchronously, giving near-real-time search.
- Technology Used:
- Elasticsearch.
- Reason:
- Elasticsearch provides optimized full-text and keyword search capabilities.
- Handles large-scale query loads efficiently with horizontal scaling.
6. Session Store
- Purpose: Manages user sessions, ensuring fast authentication and session validation.
- Database Details:
- Table: sessions:
  - session_id (UUID, Primary Key): Unique session identifier.
  - user_id (UUID, Foreign Key): Links to the users table.
  - device (VARCHAR): User’s device information.
  - created_at (TIMESTAMP): Timestamp of session creation.
  - expires_at (TIMESTAMP): Session expiration timestamp.
- Technology Used:
- Redis.
- Reason:
- In-memory database ensures low-latency session retrieval.
- Built-in TTL support automatically handles session expiration.
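A minimal sketch of that pattern with redis-py; the key layout (session:&lt;id&gt;) and the 7-day TTL are assumptions:

```python
import json
import uuid
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
SESSION_TTL_SECONDS = 7 * 24 * 3600  # sessions auto-expire after 7 days

def create_session(user_id: str, device: str) -> str:
    session_id = str(uuid.uuid4())
    payload = json.dumps({"user_id": user_id, "device": device})
    # SETEX stores the value with its TTL atomically; Redis evicts it on expiry.
    r.setex(f"session:{session_id}", SESSION_TTL_SECONDS, payload)
    return session_id

def validate_session(session_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```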
7. Analytics Database
- Purpose: Tracks user engagement, platform usage, and generates aggregated reports.
- Database Details:
- Table: engagement_metrics:
  - metric_id (UUID, Primary Key): Unique identifier for each metric.
  - user_id (UUID, Foreign Key): Links to the users table.
  - event_type (ENUM): Type of engagement (e.g., sent_message, received_message).
  - timestamp (TIMESTAMP): Timestamp of the event.
- Table: usage_statistics:
  - stat_id (UUID, Primary Key): Unique identifier for the statistic.
  - daily_active_users (INT): Count of DAU.
  - messages_sent (INT): Total messages sent per day.
  - media_shared (INT): Total multimedia files shared per day.
  - date (DATE): Date of the statistic.
- Technology Used:
- DynamoDB (for real-time metrics) and PostgreSQL (for aggregated reports).
- Reason:
- DynamoDB’s horizontal scaling supports high-volume real-time event ingestion.
- PostgreSQL allows efficient storage and querying of long-term statistics.
High-level design
1. Client Applications
- Purpose: Provide interfaces for users to interact with the system on web and mobile platforms.
- Functions:
- Send and receive messages.
- Upload and retrieve multimedia files.
- Display notifications and real-time chat updates.
- Allow user management, including profile updates and preferences.
- Interaction: Communicates with the API Gateway for all requests and responses.
2. API Gateway
- Purpose: Acts as the entry point for all client requests.
- Functions:
- Routes requests to appropriate backend services.
- Enforces authentication, rate limiting, and request validation.
- Provides a layer of abstraction to manage backend service scaling.
- Interaction: Handles communication between clients and services like Messaging, Media, and User services.
3. Authentication Service
- Purpose: Manages user authentication and session validation.
- Functions:
- Authenticate users using credentials.
- Issue JWTs for secure session management.
- Validate JWTs for subsequent requests.
- Interaction: Interacts with the User Database and Redis for session management.
4. Messaging Service
- Purpose: Handles the sending, receiving, and storage of messages.
- Functions:
- Store and retrieve messages from the database.
- Maintain real-time updates for message delivery status (sent, delivered, read).
- Support group chats with multiple participants.
- Interaction: Reads and writes to Messages and Conversations tables in PostgreSQL.
5. Media Service
- Purpose: Manages the storage and retrieval of multimedia files.
- Functions:
- Handle file uploads to AWS S3.
- Generate and serve pre-signed URLs for secure media access.
- Link media files with corresponding messages.
- Interaction: Interacts with Media Metadata Table in PostgreSQL and AWS S3 for file storage.
6. Notification Service
- Purpose: Sends real-time notifications to users about messages and updates.
- Functions:
- Push notifications via WebSockets, FCM, or APNs.
- Track notification statuses (pending, sent, failed).
- Trigger notifications based on events (e.g., new messages, friend requests).
- Interaction: Uses the Notifications Table in PostgreSQL to manage and log notifications.
7. Search Service
- Purpose: Provides full-text and keyword search for users, messages, and conversations.
- Functions:
- Index data from the users and messages tables for efficient search.
- Perform search queries based on user inputs.
- Return results ranked by relevance.
- Interaction: Interfaces with Elasticsearch for indexing and querying.
8. User Service
- Purpose: Manages user profiles, preferences, and relationships.
- Functions:
- Store and update user profile data.
- Manage user preferences, including notification settings.
- Provide details about user connections (friends, groups).
- Interaction: Uses Users Table and User Preferences Table in PostgreSQL.
9. Analytics Service
- Purpose: Tracks user engagement and platform metrics for insights and reporting.
- Functions:
- Record real-time events like message sends and media uploads.
- Aggregate data for usage statistics (e.g., DAU, total messages sent).
- Provide dashboards for system administrators.
- Interaction: Stores real-time data in DynamoDB and aggregates long-term statistics in PostgreSQL.
10. Databases
- Purpose: Provide structured and scalable data storage for the system.
- Components:
- PostgreSQL:
- Users, messages, conversations, notifications, media metadata, and preferences.
- Redis:
- Session storage and caching for real-time interactions.
- Elasticsearch:
- Search indexing for messages and user data.
- AWS S3:
- Scalable storage for multimedia files.
- DynamoDB:
- Real-time metrics for analytics.
Request flows
1. Login Request
- Flow:
- The client sends a login request to the API Gateway with user credentials.
- The API Gateway forwards the request to the Authentication Service.
- The Authentication Service queries the User Database to validate the credentials.
- If valid, the User Database returns user details.
- The Authentication Service generates a JWT token and sends it back to the API Gateway.
- The API Gateway returns the JWT to the client, confirming successful login.
- Purpose:
- Authenticate the user and issue a token for secure session management.
- Components Involved: Client, API Gateway, Authentication Service, User Database.
2. Fetch Conversation List
- Flow:
- The client requests the list of conversations from the API Gateway.
- The API Gateway forwards the request to the Messaging Service.
- The Messaging Service queries the Messages Table to fetch the user’s conversations.
- The Messages Table returns the list of conversations to the Messaging Service.
- The Messaging Service sends the conversation list to the API Gateway.
- The API Gateway delivers the list to the client.
- Purpose:
- Retrieve the list of active conversations for the logged-in user.
- Components Involved: Client, API Gateway, Messaging Service, Messages Table.
3. Send a Message
- Flow:
- The client sends a message to the API Gateway.
- The API Gateway forwards the request to the Messaging Service.
- The Messaging Service saves the message to the Messages Table.
- Once saved, the Messaging Service triggers the Notification Service to notify the recipient.
- The Notification Service logs the notification in the Notifications Table.
- The Notification Service sends the notification to the Push Service.
- The Push Service delivers a real-time notification to the recipient client.
- The Messaging Service confirms successful message delivery to the API Gateway, which informs the client.
- Purpose:
- Deliver a message to a recipient while ensuring it is logged and the recipient is notified.
- Components Involved: Client, API Gateway, Messaging Service, Messages Table, Notification Service, Notifications Table, Push Service, Recipient Client.
4. Fetch New Messages
- Flow:
- The recipient client requests new messages from the API Gateway.
- The API Gateway forwards the request to the Messaging Service.
- The Messaging Service queries the Messages Table for new messages.
- The Messages Table returns the new messages to the Messaging Service.
- The Messaging Service sends the messages to the API Gateway.
- The API Gateway delivers the new messages to the client.
- Purpose:
- Allow users to fetch new messages and update their conversation history.
- Components Involved: Recipient Client, API Gateway, Messaging Service, Messages Table.
5. Upload Media
- Flow:
- The client sends a media upload request to the API Gateway.
- The API Gateway forwards the request to the Media Service.
- The Media Service uploads the media file to AWS S3.
- AWS S3 returns the media URL to the Media Service.
- The Media Service stores the metadata (e.g., URL, type, size) in the Media Metadata Table.
- The Media Metadata Table confirms metadata storage.
- The Media Service sends the media URL to the API Gateway, which informs the client.
- Purpose:
- Handle the storage and retrieval of multimedia files for sharing in messages.
- Components Involved: Client, API Gateway, Media Service, AWS S3, Media Metadata Table.
6. Search for Users or Messages
- Flow:
- The client sends a search query to the API Gateway.
- The API Gateway forwards the query to the Search Service.
- The Search Service queries Elasticsearch for indexed user or message data.
- Elasticsearch returns the search results to the Search Service.
- The Search Service sends the results to the API Gateway, which delivers them to the client.
- Purpose:
- Enable users to search for specific messages, users, or conversations quickly.
- Components Involved: Client, API Gateway, Search Service, Elasticsearch.
7. Fetch Analytics Dashboard
- Flow:
- The client requests usage statistics from the API Gateway.
- The API Gateway forwards the request to the Analytics Service.
- The Analytics Service queries DynamoDB for real-time metrics.
- DynamoDB returns the metrics to the Analytics Service.
- The Analytics Service queries PostgreSQL for historical statistics.
- PostgreSQL returns the historical data to the Analytics Service.
- The Analytics Service sends the combined analytics data to the API Gateway, which delivers it to the client.
- Purpose:
- Provide real-time and historical insights into user activity and platform usage.
- Components Involved: Client, API Gateway, Analytics Service, DynamoDB, PostgreSQL.
Detailed component design
1. User Login
When a user attempts to log in, the client application sends a request containing the email and password to the API Gateway over a secure HTTPS connection. The API Gateway forwards this request to the Authentication Service, which processes the login attempt. The Authentication Service hashes the provided password using a secure algorithm like bcrypt and queries the User Database to retrieve the stored hashed password and associated user details. It compares the hashed values to verify the credentials. If the credentials are valid, the Authentication Service generates a JWT (JSON Web Token) containing claims like user_id
and roles (e.g., admin or regular user). This token is signed and returned to the API Gateway.
The API Gateway sends the JWT to the client, establishing a secure session. Simultaneously, the Authentication Service stores session details (user ID, device information, and token expiration) in Redis with a Time-to-Live (TTL) setting for automatic session invalidation. Redis ensures fast lookups for subsequent authentication checks. If the login fails (e.g., incorrect credentials), the client is notified with an appropriate error. This process ensures robust, scalable, and secure session handling across distributed components, with Redis enabling fast validation of active sessions and JWTs reducing the need for repeated database lookups.
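A condensed sketch of the credential check and token issuance described above. The db helper and its fetch_one method are hypothetical, and JWT_SECRET plus the 24-hour expiry are placeholders:

```python
import datetime
import bcrypt
import jwt  # PyJWT

JWT_SECRET = "replace-with-a-real-secret"  # placeholder

def login(email: str, password: str, db) -> str | None:
    """Verify credentials and issue a signed JWT, or return None on failure."""
    # `db.fetch_one` is a hypothetical helper returning (user_id, password_hash, role).
    row = db.fetch_one(
        "SELECT user_id, password_hash, role FROM users WHERE email = %s", (email,)
    )
    if row is None:
        return None
    user_id, password_hash, role = row
    # bcrypt re-hashes the supplied password with the stored salt and compares.
    if not bcrypt.checkpw(password.encode(), password_hash.encode()):
        return None
    claims = {
        "sub": user_id,
        "role": role,
        "exp": datetime.datetime.now(datetime.timezone.utc)
               + datetime.timedelta(hours=24),
    }
    return jwt.encode(claims, JWT_SECRET, algorithm="HS256")
```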
2. Sending a Message
When a user sends a message, the client application sends the message payload (content, recipient ID, optional media links) to the API Gateway, which authenticates the user via the provided JWT. After validation, the message is routed to the Messaging Service, which processes it. The Messaging Service generates a unique message_id
and saves the message to the Messages Table in PostgreSQL, ensuring persistence and integrity. If the message contains multimedia content, the Messaging Service interacts with the Media Service, which uploads the file to AWS S3 using multipart uploads. The Media Service generates a URL for the file and stores metadata (e.g., media_id, URL, file type, size) in the Media Metadata Table.
Once the message is stored, the Messaging Service checks the recipient’s online status via the Notification Service. If the recipient is online, a real-time notification is sent through WebSockets using the Push Service. If offline, the Notification Service logs a notification in the Notifications Table and sends a push notification via Firebase Cloud Messaging (FCM) or Apple Push Notification Service (APNs). The Messaging Service confirms the successful message delivery to the API Gateway, which informs the sender client. This workflow ensures robust handling of message persistence, real-time delivery, and recipient notifications while integrating multimedia storage seamlessly.
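The online/offline branching reduces to a few lines. In this sketch, presence, ws_hub, and push are hypothetical client objects for the Notification and Push services described above:

```python
def deliver_message(message: dict, recipient_id: str, presence, ws_hub, push) -> None:
    """Online recipients get a WebSocket push; offline recipients get FCM/APNs.
    `presence`, `ws_hub`, and `push` are hypothetical service clients."""
    if presence.is_online(recipient_id):
        # Persistent WebSocket connection: deliver in real time.
        ws_hub.send(recipient_id, {"type": "new_message", "message": message})
    else:
        # Log the notification (Notifications Table) and hand off to FCM/APNs.
        push.log_and_send(
            user_id=recipient_id,
            title="New message",
            body=message["content"][:80],  # truncated preview
        )
```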
3. Fetching Conversation History
When a user wants to view their conversation history, the client application sends a request to the API Gateway, which validates the JWT and forwards the request to the Messaging Service. The Messaging Service queries the Conversations Table in PostgreSQL to fetch active conversations for the user, returning details like conversation_id, participants, last message, and unread message count. The service then retrieves the recent messages for each conversation from the Messages Table. For performance optimization, Redis is used to cache frequently accessed conversations, significantly reducing latency for repeat requests.
Once the conversation data is aggregated, the Messaging Service sends the response back to the API Gateway, which forwards it to the client. For active conversations, updates (e.g., new messages or changes in status) are managed via WebSockets for real-time synchronization. This process ensures that users can access their conversation history efficiently, with caching improving scalability and reducing database load for high-traffic queries.
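The caching layer follows a standard cache-aside pattern, sketched below; the db helper, key layout, and 60-second TTL are assumptions:

```python
import json
import redis

r = redis.Redis(decode_responses=True)
CACHE_TTL = 60  # seconds; a short TTL keeps hot conversation lists fresh

def get_conversations(user_id: str, db) -> list:
    """Cache-aside read of a user's conversation list.
    `db.fetch_conversations` is a hypothetical PostgreSQL query helper."""
    key = f"conversations:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)           # cache hit: skip PostgreSQL
    rows = db.fetch_conversations(user_id)  # cache miss: query PostgreSQL
    r.setex(key, CACHE_TTL, json.dumps(rows))
    return rows
```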
4. Uploading Media
When a user uploads a media file (e.g., image or video), the client application sends the file to the API Gateway, which forwards it to the Media Service. The Media Service splits large files into smaller parts using multipart uploads and uploads these chunks to AWS S3. This process ensures efficient handling of large files by retrying failed parts without reuploading the entire file. Once the upload is complete, S3 generates a unique URL for the file, which the Media Service stores in the Media Metadata Table in PostgreSQL alongside metadata like media_id, file type, and size.
The Media Service then returns the media URL to the API Gateway, which forwards it to the client. This URL can be included in a message and shared with other users. The distributed nature of S3 ensures that storage scales seamlessly with the number of files, while the Media Metadata Table maintains efficient linkages between media files and their associated messages.
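With boto3, multipart handling is largely automatic once a TransferConfig is supplied. In this sketch, the bucket name, size thresholds, and URL expiry are placeholder assumptions:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,  # files >8 MB are split into parts
    multipart_chunksize=8 * 1024 * 1024,
    max_concurrency=4,  # parts upload in parallel; failed parts retry individually
)

def upload_media(local_path: str, media_id: str) -> str:
    key = f"media/{media_id}"
    s3.upload_file(local_path, "messenger-media", key, Config=config)
    # The service would persist this reference in media_metadata and typically
    # serve a short-lived pre-signed URL rather than a public one.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "messenger-media", "Key": key},
        ExpiresIn=3600,
    )
```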
5. Search for Messages or Users
When a user initiates a search query (e.g., a keyword or contact name), the client application sends the query to the API Gateway, which forwards it to the Search Service. The Search Service translates the query into Elasticsearch DSL (Domain-Specific Language), which enables precise filtering and ranking of results. Elasticsearch searches indexed fields from the users and messages datasets using inverted indexes that map terms to document IDs for fast lookups.
The Search Service aggregates and sorts results by relevance (e.g., keyword frequency, recent updates) before returning them to the API Gateway, which sends them to the client. Frequently searched terms are cached in Redis to improve response times for common queries. This workflow ensures rapid and scalable search capabilities, with Elasticsearch clusters handling large-scale datasets and Redis reducing latency for repeated searches.
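A plausible Elasticsearch query for searching within one conversation, assuming an index named messages with the fields from the schema section; the DSL shape here is an illustrative sketch, not the service's actual query:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def search_messages(conversation_id: str, keyword: str) -> list:
    resp = es.search(
        index="messages",
        query={
            "bool": {
                "must": [{"match": {"content": keyword}}],          # full-text match
                "filter": [{"term": {"conversation_id": conversation_id}}],
            }
        },
        sort=[{"created_at": {"order": "desc"}}],  # newest results first
        size=20,
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```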
6. Fetching Analytics Dashboard
When an admin requests analytics data, the client application sends the request to the API Gateway, which forwards it to the Analytics Service. The Analytics Service retrieves real-time metrics (e.g., active users, messages sent) from DynamoDB, which is optimized for high-throughput event data. Simultaneously, the service queries PostgreSQL for historical data (e.g., monthly engagement trends, average session duration).
The Analytics Service combines real-time and historical data, applies any required aggregation, and sends the processed results back to the API Gateway. The Gateway delivers the analytics dashboard to the admin client. This approach ensures a seamless blend of real-time insights and long-term trends, with DynamoDB handling high-frequency updates and PostgreSQL providing robust historical analysis.
Trade offs/Tech choices
1. API Gateway for Request Routing
- Choice: Using a dedicated API Gateway for routing client requests.
- Reason:
- Centralized management for rate limiting, authentication, and request routing.
- Simplifies backend service interactions for clients.
- Trade-off:
- Adds a single point of failure. If the gateway is down, all requests fail.
- Mitigation: Deploy redundant API Gateway instances with load balancing to ensure high availability.
2. PostgreSQL for Core Data
- Choice: Using PostgreSQL for user data, messages, conversations, and metadata.
- Reason:
- Relational structure is ideal for managing interconnected data (e.g., users, messages).
- ACID compliance ensures data consistency for critical operations like message storage.
- JSONB fields provide flexibility for unstructured data (e.g., user preferences, group participants).
- Trade-off:
- Performance bottlenecks during high concurrent write operations (e.g., millions of messages).
- Mitigation: Use database sharding (e.g., by conversation_id) and read replicas for scalability, as in the sketch below.
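A minimal sketch of the shard-routing idea: hashing conversation_id keeps an entire conversation on one shard, so history reads never fan out across partitions. The shard count and hash choice are assumptions:

```python
import hashlib

NUM_SHARDS = 16  # illustrative shard count

def shard_for(conversation_id: str) -> int:
    """Map a conversation to a stable shard index via a hash of its ID."""
    digest = hashlib.md5(conversation_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# Usage: pick the connection pool for this conversation's partition, e.g.
# pools[shard_for(conv_id)].execute("INSERT INTO messages ...")
```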
3. Redis for Session Management and Caching
- Choice: Using Redis for session storage and frequently accessed data (e.g., active conversations, notifications).
- Reason:
- In-memory storage offers low-latency reads and writes, improving performance for real-time operations.
- TTL ensures efficient memory management by expiring stale sessions.
- Trade-off:
- Limited by memory size, making it expensive for large-scale caching.
- Mitigation: Use Redis clustering for scalability and offload non-critical cache entries to less expensive solutions.
4. AWS S3 for Media Storage
- Choice: Using AWS S3 for storing multimedia files (e.g., images, videos, documents).
- Reason:
- Cost-effective and infinitely scalable for large media storage.
- Built-in support for multipart uploads ensures efficient handling of large files.
- Trade-off:
- Increased latency when retrieving media compared to in-memory or CDN solutions.
- Mitigation: Use AWS CloudFront (CDN) to cache frequently accessed files closer to users.
5. Elasticsearch for Search
- Choice: Using Elasticsearch for full-text search across users and messages.
- Reason:
- Optimized for fast search queries using inverted indexes.
- Scales horizontally, supporting large-scale datasets and high query loads.
- Trade-off:
- Indexing adds latency to data ingestion, meaning search results may lag behind real-time updates.
- Mitigation: Implement near-real-time indexing and optimize ingestion pipelines for efficiency.
6. DynamoDB for Real-Time Analytics
- Choice: Using DynamoDB to store real-time metrics (e.g., active users, messages sent).
- Reason:
- Handles high write throughput efficiently with auto-scaling capabilities.
- Provides low-latency reads for real-time dashboards.
- Trade-off:
- Querying across multiple dimensions (e.g., by time and user) is limited due to the NoSQL structure.
- Mitigation: Store pre-aggregated metrics for common queries and offload complex analytics to a relational database (PostgreSQL).
7. WebSockets for Real-Time Messaging
- Choice: Using WebSockets for real-time communication between users.
- Reason:
- Persistent, low-latency connections are ideal for delivering messages, typing indicators, and notifications.
- Reduces overhead compared to frequent HTTP polling.
- Trade-off:
- Consumes server resources to maintain persistent connections for active users.
- Mitigation: Use horizontal scaling with WebSocket clusters and load balancing to manage connections efficiently.
8. Kafka for Event Streaming
- Choice: Using Kafka for event-driven communication between services (e.g., triggering notifications, feed updates).
- Reason:
- Supports high-throughput event streaming and ensures message durability.
- Decouples components, enabling asynchronous processing and better scalability.
- Trade-off:
- Complex to manage, requiring careful configuration and monitoring.
- Mitigation: Use managed Kafka services (e.g., Confluent or AWS MSK) to simplify operations.
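For illustration, publishing a message event with kafka-python might look like the following; the topic name and event payload are assumptions:

```python
import json
from kafka import KafkaProducer  # kafka-python

# Publish a message-sent event for downstream consumers (notifications, analytics).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode(),
    acks="all",  # wait for in-sync replicas: durability over latency
)

producer.send(
    "message-events",
    key="conversation-123",  # keying by conversation preserves per-chat ordering
    value={"type": "message_sent", "message_id": "m-456", "recipient": "u-789"},
)
producer.flush()
```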
9. Push Notifications for Offline Users
- Choice: Using Firebase Cloud Messaging (FCM) and Apple Push Notification Service (APNs) for offline user notifications.
- Reason:
- Offloads notification delivery to reliable third-party services.
- Scales globally without requiring additional infrastructure.
- Trade-off:
- Limited customization of delivery mechanisms and dependencies on external providers.
- Mitigation: Combine FCM/APNs with WebSocket notifications for a hybrid model catering to both active and inactive users.
Failure scenarios/bottlenecks
1. API Gateway Failures
- Scenario:
- A failure in the API Gateway (e.g., server crash, overload) prevents routing of client requests to backend services.
- Impact:
- All client interactions (e.g., messaging, login) are disrupted.
- Mitigation:
- Use redundant API Gateway instances behind a load balancer to distribute traffic.
- Implement rate limiting to prevent spikes that could overwhelm the Gateway.
- Monitor Gateway health with tools like AWS CloudWatch or Prometheus and use auto-scaling policies.
2. Database Overload
- Scenario:
- High write workloads (e.g., message bursts during peak hours) overload the PostgreSQL database, causing latency or downtime.
- Impact:
- Delayed message storage and retrieval, affecting real-time user experience.
- Mitigation:
- Implement sharding (e.g., by conversation_id) to distribute writes across multiple database partitions.
- Use read replicas for scaling read-heavy operations like fetching messages or profiles.
- Employ connection pooling (e.g., PgBouncer) to limit resource contention.
3. Redis Memory Exhaustion
- Scenario:
- Large volumes of cached data (e.g., sessions, frequently accessed conversations) exceed available Redis memory.
- Impact:
- Cache misses lead to increased load on the database, degrading performance.
- Mitigation:
- Use TTL (Time-to-Live) policies to evict stale entries automatically.
- Scale Redis clusters horizontally to increase memory capacity.
- Offload less critical cache entries to a less expensive caching layer like Memcached.
4. WebSocket Connection Limits
- Scenario:
- WebSocket servers reach their connection limits during peak hours.
- Impact:
- Users are unable to receive real-time notifications or messages.
- Mitigation:
- Use horizontal scaling with WebSocket clusters and load balancers to distribute connections.
- Implement session stickiness to keep users connected to the same WebSocket server during a session.
- Monitor connection metrics and auto-scale WebSocket servers as needed.
5. Kafka Partition Overload
- Scenario:
- A surge in events (e.g., messages, notifications) causes Kafka partitions to become overloaded.
- Impact:
- Delayed processing of events, leading to latency in notifications and feed updates.
- Mitigation:
- Partition Kafka topics based on user ID or message ID to evenly distribute load.
- Increase the number of partitions for high-traffic topics.
- Use consumer groups to parallelize event processing.
6. AWS S3 Latency
- Scenario:
- Increased retrieval requests for media files from S3 cause higher latency.
- Impact:
- Delays in loading media files (e.g., images, videos) for users.
- Mitigation:
- Use AWS CloudFront (CDN) to cache media files closer to the users’ locations.
- Pre-generate media thumbnails for faster display.
- Implement lifecycle policies to optimize storage costs and reduce latency for frequently accessed media.
7. Elasticsearch Indexing Delays
- Scenario:
- High write volumes delay Elasticsearch indexing, causing outdated search results.
- Impact:
- Users may not find recently sent messages or added contacts in search.
- Mitigation:
- Batch updates to Elasticsearch to optimize indexing performance.
- Use near-real-time indexing to reduce delays.
- Optimize query performance with well-structured mappings and tuned shard configurations.
8. Push Notification Failures
- Scenario:
- A dependency on external providers like FCM or APNs results in downtime or delayed notifications.
- Impact:
- Offline users miss important updates (e.g., new messages, group invites).
- Mitigation:
- Maintain fallback mechanisms (e.g., retry policies) for failed notifications.
- Queue notifications in Redis or Kafka to ensure eventual delivery when the provider is back online.
- Monitor provider health and switch providers dynamically if issues are detected.
9. Real-Time Analytics Bottlenecks
- Scenario:
- High-frequency metric updates (e.g., DAU, message counts) overload DynamoDB or Spark jobs.
- Impact:
- Real-time dashboards become slow or inaccurate.
- Mitigation:
- Use pre-aggregated metrics for common queries (e.g., hourly updates instead of minute-by-minute).
- Optimize DynamoDB partition keys to distribute writes evenly.
- Scale Spark jobs horizontally to handle high-volume data processing.
10. Network Latency or Outages
- Scenario:
- Network issues between clients and servers or between internal services cause delays or failures.
- Impact:
- Real-time messaging and updates become inconsistent or unavailable.
- Mitigation:
- Use distributed server deployments across multiple regions to minimize latency.
- Implement retries with exponential backoff for transient network failures.
- Use circuit breakers to gracefully handle service outages.
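A generic retry helper illustrating the backoff-with-jitter pattern mentioned above; all parameters are illustrative:

```python
import random
import time

def with_retries(call, max_attempts=5, base_delay=0.2):
    """Retry a transient-failure-prone call with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:  # stand-in for whatever transient error applies
            if attempt == max_attempts - 1:
                raise  # give up; let a circuit breaker or alerting take over
            # Double the delay each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            time.sleep(delay)
```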
11. Single Points of Failure
- Scenario:
- A critical component (e.g., API Gateway, Redis, PostgreSQL) fails without redundancy.
- Impact:
- The entire system or major functionalities become unavailable.
- Mitigation:
- Deploy redundant instances and clusters for all critical components.
- Use health checks and failover mechanisms for high availability.
- Regularly test disaster recovery processes to ensure minimal downtime.
Future improvements
API Gateway Resilience: Use multi-region deployments and load balancers to ensure high availability and mitigate single points of failure.
Database Scaling: Implement sharding, read replicas, and auto-scaling clusters to handle peak loads and reduce latency.
Optimized Redis Usage: Introduce tiered caching with TTL policies and cluster scaling to avoid memory exhaustion.
WebSocket Scalability: Use serverless WebSocket architectures and connection multiplexing for efficient resource utilization.
Improved Kafka Processing: Partition rebalancing and consumer optimization for faster and balanced event processing.
Faster Media Access: Use CloudFront for caching and regional replication to minimize AWS S3 latency.
Search Optimization: Implement real-time Elasticsearch indexing and shard tuning to reduce query delays.
Reliable Notifications: Integrate multiple push notification providers with retry mechanisms to ensure delivery.
Analytics Enhancement: Use pre-aggregation and data lakes for faster dashboards and scalable historical analysis.
Fault-Tolerant Networks: Deploy services across multiple regions with DNS failover to handle outages seamlessly.