My Solution for Design Facebook Messenger

by quest3775

Core Functional Requirements

  • Direct and Group Messaging: Users can send messages to individuals or groups, with support for up to 10 participants in each group chat.
  • Real-Time Message Delivery: Ensures immediate delivery of messages within any chat context.
  • Interaction Capabilities: Users can like and comment on messages for enhanced engagement.


Non-Functional Requirements

  • Latency: Minimize message delivery time to ensure a seamless conversation experience.
  • Read Efficiency: Prioritize the speed and efficiency of retrieving messages, given the higher volume of read operations compared to writes.
  • System Availability: Maintain high availability to ensure continuous service operation, even under heavy loads.


Capacity Estimations and Data Management

  • User Base: 1 billion active users.
  • Daily Messaging Volume: 100 messages per user, resulting in 10PB of data generated daily, and totaling approximately 4PB annually.
  • Data Storage Strategy: Messages are stored with careful consideration for partitioning and indexing to optimize for read operations and ensure scalability.


API Endpoints

  • SendMessage: send_message(user_id, chat_id, message_text) allows sending messages.
  • ReadChat: read_chat(user_id, chat_id) for retrieving chat history.
  • AddUserToChat: add_user_to_chat(user_id, chat_id) to include users in chats.
  • CreateChat: create_new_chat(user_id) for initiating new chat sessions.


Database Schema and Design

  • Chat Members Database: Maps users to their chat sessions for quick retrieval, partitioned by user_id for efficiency.
  • Users Database: Stores user credentials and metadata, optimized for infrequent updates.
  • Messages Database: Records all messages, indexed by chat_id and sorted by timestamp to maintain conversation order.


High-Level Architecture and Workflow


The architecture is structured to facilitate real-time messaging, supporting both direct and group chats with high availability, low latency, and scalable performance. Key components include:

  • Client Applications: User interfaces for sending and receiving messages.
  • Load Balancers: Distribute incoming requests evenly across servers to ensure optimal load handling and fault tolerance.
  • Web Servers: Handle client requests, including authentication, message sending, and fetching chat histories.
  • Zookeeper: Coordinates server assignment and manages session state to ensure users are always connected to the appropriate server.
  • Kafka Queues: Act as a message broker for reliable message queuing and distribution, ensuring messages are processed in order and without loss.
  • Flink/Spark Streaming: Process streaming data for real-time message delivery, user status updates, and chat session management.
  • Databases: Store user data, chat memberships, and messages, designed for fast reads and writes with scalability in mind.
  • Cache Layer: Improves read efficiency by caching frequently accessed data, such as recent messages and active chat lists.


Detailed Request Flow


User Connection and Session Establishment

  1. User Login: Upon logging in, the client application connects to the system via a Load Balancer, which routes the request to an available Web Server.
  2. Session Management: The Web Server interacts with Zookeeper to assign the user to a server, establishing a WebSocket connection for real-time communication. Zookeeper tracks active sessions and server loads to facilitate reassignment if needed.


Message Sending Process

  1. Message Creation: A user sends a message through their client application. This message, along with the user's ID and the chat's ID, is transmitted to the Web Server via the established WebSocket connection.
  2. Message Queueing: The Web Server forwards the message to a Kafka queue, specifically partitioned by chat_id to ensure messages for the same chat are processed sequentially.
  3. Real-time Processing: Flink/Spark Streaming consumes the message from Kafka, assigns a unique ID for idempotence, and processes it. This includes updating the Messages Database and preparing the message for distribution to recipients.


Chat Membership and Message Delivery

  1. Chat Membership Update: When a user creates a new chat or adds someone to a chat, the Web Server updates the Chat Members Database and streams this change to Kafka. Flink consumes these updates to maintain an internal mapping of chat-to-user relationships.
  2. Message Distribution: Flink instructs the Load Balancer to route the processed message to the appropriate Web Servers connected to the message recipients. The Load Balancer uses Zookeeper to identify these servers.
  3. Client Notification: The recipient's Web Server pushes the message to the client application over the WebSocket connection, enabling real-time message delivery.


Reading Messages and Chat Histories

  1. Fetching Chat Histories: When a user accesses a chat, the request is sent to the Web Server, which first attempts to retrieve the chat history from the Cache Layer.
  2. Database Query: If the requested data is not in the cache, the Web Server queries the Messages Database. The result is then cached for future requests, optimizing response times for active chats.


Detailed Component Workflow


User Registration and Session Management

  1. Account Registration: Users create accounts through an account service, which securely stores user information in the Users Database. This includes user credentials and metadata.
  2. Connection Establishment: Upon logging in, each user's client initiates a connection request to the system. The Load Balancer, guided by Zookeeper's coordination, directs the user to an appropriate server to establish a WebSocket connection. This connection remains open for real-time bi-directional communication.
  3. Session Resilience: Zookeeper monitors the health and load of servers. If a server becomes unavailable, Zookeeper updates the routing logic in the Load Balancer, redirecting new and disconnected users to alternative servers. This ensures high availability and fault tolerance.


Chat Operations

  1. Chat Creation and Membership Updates: When a user creates a new chat or adds participants to an existing chat, the server records this in the Chat Members Database. Using Change Data Capture (CDC), updates are streamed to a Kafka queue, which Flink consumes to update its internal mappings of user-to-chat relationships.
  2. Scalable Chat Member Management: The partitioning of the Chat Members Database by user_id ensures scalable, fast fetches of user chat memberships. This setup aids in efficiently managing group chats, including adding or removing members.


Message Processing and Delivery

  1. Message Sending: A user sends a message through their established WebSocket connection to the server. The server assigns a unique ID to the message to prevent duplicate processing (idempotence) and sends it to a Kafka queue dedicated to messages, partitioned by chat_id.
  2. Real-Time Processing: Flink processes messages from the Kafka queue, ensuring exactly-once processing semantics. It writes messages to the Messages Database, organized by chat_id and timestamped to maintain conversation order. The use of HBase supports efficient read operations with its B-tree index, ideal for retrieving message sequences.
  3. Message Distribution: Concurrently, Flink signals the Load Balancer (coordinated with Zookeeper) about the new message. The Load Balancer identifies which servers hold connections with the message recipients. Servers then push the message to relevant clients over WebSocket connections, ensuring real-time delivery.
  4. Optimized Read Path: When reading chat histories, clients request data through servers that first attempt to retrieve messages from a highly available Cache Layer. Cache misses trigger fetches from the Messages Database, where data is then cached for future reads, optimizing response times for active conversations.


Database Design Considerations

  • Messages Database Partitioning: Partitioning by chat_id allows for localized storage of chat messages, facilitating quick sequential access for entire chat histories. Sorting by timestamp within partitions maintains the natural order of messages.
  • Chat Members and Users Database: These databases are optimized for frequent reads with partitioning and indexing on user_id, ensuring rapid access to user-specific data. MySQL's use of single-leader replication provides a balance between write consistency and simplicity in managing user data and chat memberships.


Handling Failures and Bottlenecks

  • Load Balancing and Distributed System Coordination: The Load Balancer, in tandem with Zookeeper, dynamically routes traffic and manages server assignment based on current load and server health, enhancing system resilience.
  • Data Replication and Recovery: Kafka and Flink offer built-in mechanisms for data replication and state recovery, ensuring no message loss even in the event of component failures. Kafka's log-based storage and Flink's checkpointing enable the system to recover to a consistent state.
  • Cache Invalidation and Synchronization: The Cache Layer must be meticulously managed to prevent stale reads. Implementing strategies for cache invalidation and synchronization with the primary data store is critical, especially after updates or when serving cache misses.


Future Improvement Areas

  • Enhanced Load Distribution: Investigating more sophisticated load distribution algorithms and server pooling strategies could further improve system scalability and user experience.
  • Database Optimization: Continuous evaluation of database performance, including the potential adoption of newer storage technologies or configurations, could enhance both read and write efficiencies.
  • Fault Tolerance Enhancements: Expanding the use of distributed systems principles, such as additional redundancy layers and more granular service health checks, could bolster the system's resilience against failures.