My Solution for Design a Distributed Messaging System with Score: 8/10
by iridescent_luminous693
System requirements
Functional Requirements
- Message Delivery:
- Guarantee at-least-once, at-most-once, or exactly-once delivery semantics.
- Support for durable and non-durable messages.
- Support message prioritization and ordered delivery where required.
- Messaging Patterns:
- Publish/Subscribe: Enable multiple subscribers to receive messages from a single publisher.
- Request/Reply: Allow requesters to send a message and receive a corresponding reply.
- Point-to-Point: Support direct messaging between producers and consumers.
- Scalability:
- Handle millions of messages per second in a distributed setup.
- Add brokers dynamically to increase capacity.
- Fault Tolerance:
- Ensure no message loss in case of broker or network failures.
- Replicate messages across nodes for durability.
- Message Tracking:
- Provide mechanisms for tracking and logging message delivery.
- Support dead-letter queues for undeliverable messages.
- Security:
- Encrypt messages in transit and at rest.
- Authenticate producers and consumers with access control.
Non-Functional Requirements
- Latency:
- Maintain low latency (<10ms) for message delivery under normal conditions.
- Throughput:
- Support high throughput of >1 million messages per second in peak conditions.
- Availability:
- Ensure 99.99% uptime with no single point of failure.
- Durability:
- Persist messages to prevent data loss in case of broker crashes.
- Flexibility:
- Support various message sizes and types (e.g., JSON, binary).
- Observability:
- Provide real-time monitoring of message flows, broker health, and topic statistics.
Capacity estimation
User Base: 1 million producers and 10 million consumers globally.
Message Throughput:
- Peak: 5 million messages per second.
- Average: 500 million messages per day.
Storage:
- Average message size: 1 KB.
- Total storage: ~500 GB/day before compression, requiring efficient storage for multi-day retention.
Replication:
- 3x replication for durability: ~1.5 TB/day.
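The storage figures above follow directly from the stated inputs (1 KB average message, 500 million messages/day, 3x replication); the snippet below is just an illustrative back-of-envelope check, not part of the original estimate.

```python
# Back-of-envelope storage math using the figures stated above.
avg_msg_size_kb = 1                      # 1 KB average message
messages_per_day = 500_000_000           # average daily volume
replication_factor = 3

raw_gb_per_day = messages_per_day * avg_msg_size_kb / 1_000_000     # KB -> GB
replicated_tb_per_day = raw_gb_per_day * replication_factor / 1_000  # GB -> TB

print(f"Raw: ~{raw_gb_per_day:.0f} GB/day")                # ~500 GB/day
print(f"Replicated: ~{replicated_tb_per_day:.1f} TB/day")  # ~1.5 TB/day
```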
API design
Producer APIs:
- POST /messages: Publish a message to a topic/queue.
- GET /producers/{id}/status: Retrieve the producer's health or connection status.
Consumer APIs:
- GET /messages: Poll for messages from a topic/queue.
- POST /acknowledge: Acknowledge receipt of a message.
Admin APIs:
- POST /topics: Create a new topic.
- DELETE /topics/{id}: Delete a topic.
- GET /topics: List all topics and metadata.
- GET /monitoring: Fetch broker health and statistics.
Monitoring APIs:
- GET /metrics/latency: Monitor message delivery latency.
- GET /metrics/throughput: Retrieve message throughput statistics.
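A minimal sketch of how a producer and consumer might exercise these endpoints over HTTP. The base URL, auth header, and payload field names are assumptions for illustration, not part of the API contract above.

```python
# Illustrative calls against the REST endpoints sketched above.
# The base URL, auth header, and payload/field names are assumptions.
import requests

BASE = "https://broker.example.com"
HEADERS = {"Authorization": "Bearer <token>"}

# Producer: publish a message to the "orders" topic.
resp = requests.post(
    f"{BASE}/messages",
    json={"topic": "orders", "key": "user-42", "payload": {"order_id": 1001}},
    headers=HEADERS,
    timeout=5,
)
resp.raise_for_status()

# Consumer: poll a partition starting at a known offset, then acknowledge.
batch = requests.get(
    f"{BASE}/messages",
    params={"topic": "orders", "partition": 3, "offset": 1500, "max": 100},
    headers=HEADERS,
    timeout=5,
).json()

for msg in batch.get("messages", []):
    print("received", msg)            # application-specific processing goes here

requests.post(
    f"{BASE}/acknowledge",
    json={"topic": "orders", "partition": 3, "offset": batch.get("next_offset")},
    headers=HEADERS,
    timeout=5,
)
```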
Database design
1. Metadata Store
- Schema Details:
```sql
Table: Topics
Columns:
- topic_id (UUID, PK)
- name (VARCHAR)
- partitions (INTEGER)
- replication_factor (INTEGER)
- created_at (TIMESTAMP)
```
- Purpose: Stores metadata about topics, partitions, and replication configurations.
- Tech Used: Zookeeper or Etcd.
- Trade-Off: Highly consistent metadata management but adds operational complexity due to coordination overhead.
2. Message Store
- Schema Details:
```sql
Table: Messages
Columns:
- message_id (UUID, PK)
- topic_id (FK)
- partition_id (INTEGER)
- offset (BIGINT)
- payload (BLOB)
- timestamp (TIMESTAMP)
```
- Purpose: Stores messages with offsets for durable storage and retrieval.
- Tech Used: Apache Kafka (log-based storage) or RocksDB.
- Trade-Off: High write throughput but requires additional disk space for logs and retention policies.
3. Consumer Offset Store
- Schema Details:
```sql
Table: Offsets
Columns:
- consumer_id (UUID, PK)
- topic_id (FK)
- partition_id (INTEGER)
- offset (BIGINT)
- updated_at (TIMESTAMP)
```
- Purpose: Tracks the last processed offset for each consumer.
- Tech Used: PostgreSQL or Kafka Internal Offsets Topic.
- Trade-Off: Guarantees consistency but can create bottlenecks during frequent offset updates.
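If the offset store is PostgreSQL (one of the options above), offset commits are naturally expressed as an upsert. This is only a sketch under that assumption: it presumes a composite unique key on (consumer_id, topic_id, partition_id), which the schema above would need for per-partition tracking, and a generic DB-API cursor (e.g. psycopg2).

```python
# Sketch: committing a consumer offset as an idempotent upsert (PostgreSQL assumed).
# Assumes a UNIQUE constraint on (consumer_id, topic_id, partition_id).
COMMIT_OFFSET_SQL = """
INSERT INTO offsets (consumer_id, topic_id, partition_id, "offset", updated_at)
VALUES (%s, %s, %s, %s, now())
ON CONFLICT (consumer_id, topic_id, partition_id)
DO UPDATE SET "offset" = EXCLUDED."offset", updated_at = now();
"""

def commit_offset(cursor, consumer_id, topic_id, partition_id, offset):
    # Re-running with the same offset leaves the row unchanged, so retries are safe.
    cursor.execute(COMMIT_OFFSET_SQL, (consumer_id, topic_id, partition_id, offset))
```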
4. Dead-Letter Queue (DLQ)
- Schema Details:
```sql
Table: DeadLetterMessages
Columns:
- message_id (UUID, PK)
- topic_id (FK)
- consumer_id (UUID)
- reason (TEXT)
- payload (BLOB)
- timestamp (TIMESTAMP)
```
- Purpose: Stores undeliverable messages for later inspection or retry.
- Tech Used: Redis (short-term) or Cassandra.
- Trade-Off: Fast writes but requires additional storage management for long-term data retention.
High-level design
1. Producer
- Overview: Applications or services that send messages to the messaging system.
- Role:
- Publishes messages to a specific topic or queue.
- Can set metadata like priority, partition keys, or headers for routing.
- Key Features:
- Supports batching of messages to improve throughput.
- Idempotency to prevent duplicate messages.
2. Broker
- Overview: The central component that routes, stores, and delivers messages between producers and consumers.
- Role:
- Manages topics, partitions, and message offsets.
- Handles replication for fault tolerance.
- Key Features:
- Supports message durability by persisting messages to storage.
- Ensures message delivery guarantees (e.g., at-least-once, at-most-once).
3. Consumer
- Overview: Applications or services that receive and process messages.
- Role:
- Subscribes to topics or queues to retrieve messages.
- Acknowledges messages after processing.
- Key Features:
- Supports consumer groups for parallel processing.
- Allows backpressure handling by controlling message flow.
4. Topic/Queue
- Overview: Logical grouping of messages.
- Role:
- Topics support publish/subscribe patterns where multiple consumers can subscribe.
- Queues support point-to-point messaging for a single consumer.
- Key Features:
- Partitioned for parallel processing.
- Configurable retention policies.
5. Metadata Store
- Overview: Tracks information about topics, partitions, and offsets.
- Role:
- Manages configurations like partition mappings and replication settings.
- Ensures consistency across brokers.
- Key Features:
- Enables leader election for partitions.
- Stores consumer offsets for reliable processing.
6. Storage Layer
- Overview: Stores messages durably for configured retention periods.
- Role:
- Provides fault tolerance by persisting messages to disk or distributed storage.
- Key Features:
- Optimized for high-throughput sequential writes.
- Supports configurable retention and compaction.
7. Monitoring and Metrics
- Overview: Tracks the health and performance of the system.
- Role:
- Monitors message latency, throughput, and system health.
- Key Features:
- Alerts on anomalies like broker failures or high latency.
- Provides real-time insights into message delivery.
Request flows
Scenario: Producer sends a message to a topic, and a consumer processes it.
Step 1: Producer Publishes a Message
- Producer Action:
- Sends a message to the broker via the Producer API.
- Specifies the topic, optional partition key, and metadata (e.g., headers).
- Broker Processing:
- Validates the producer’s credentials and topic existence.
- Determines the target partition using the partition key (if provided) or a hashing algorithm.
- Appends the message to the partition log.
- Acknowledges the producer once the message has been replicated to the configured replicas.
Step 2: Broker Stores the Message
- Message Persistence:
- Stores the message in a log-based storage system (e.g., Kafka’s partition log).
- Replicates the message to other brokers for fault tolerance.
- Updates the message offset for the partition.
- Metadata Update:
- Updates the metadata store to reflect the new message offset and partition status.
Step 3: Consumer Polls for Messages
- Consumer Action:
- Sends a polling request to the broker via the Consumer API.
- Specifies the topic, partition, and current offset (or starts from the latest/earliest offset).
- Broker Processing:
- Fetches messages from the requested partition starting at the specified offset.
- Delivers messages to the consumer in batches for efficiency.
Step 4: Consumer Processes Messages
- Message Processing:
- Processes each message in the batch (e.g., updates a database, triggers downstream events).
- Acknowledgment:
- Sends an acknowledgment to the broker after successful processing.
- The broker updates the offset store to mark the messages as processed.
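A simplified poll-process-acknowledge loop covering steps 3-4. The `BrokerClient` object and its `poll`/`acknowledge` methods are hypothetical placeholders, not a specific library's API.

```python
# Sketch of the poll -> process -> acknowledge cycle (steps 3-4).
# `client` is a hypothetical broker client exposing poll() and acknowledge().
import time

def run_consumer(client, topic, partition, start_offset):
    offset = start_offset
    while True:
        batch = client.poll(topic=topic, partition=partition,
                            offset=offset, max_messages=100)
        if not batch:
            time.sleep(0.05)               # brief pause when the partition is idle
            continue
        for msg in batch:
            handle(msg)                    # application-specific processing
        offset = batch[-1].offset + 1      # advance past the last processed message
        client.acknowledge(topic=topic, partition=partition, offset=offset)

def handle(msg):
    print("processed offset", msg.offset)
```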
Step 5: Monitoring and Logging
- Metrics Collection:
- Tracks message throughput, latency, and processing times.
- Updates monitoring dashboards and alerts on anomalies.
- Audit Logging:
- Logs message delivery for debugging and compliance purposes.
Handling Key Scenarios
- Message Loss:
- Messages are replicated to multiple brokers to ensure durability.
- Consumers can replay messages using offsets if processing fails.
- Consumer Failures:
- Consumer groups rebalance and redistribute partitions among remaining active consumers.
- Broker Failures:
- Leader election ensures a new broker takes over partition leadership seamlessly.
- Network Issues:
- Retry mechanisms and exponential backoff handle temporary connectivity issues.
Detailed component design
1. Producer Service
End-to-End Working
The Producer Service is responsible for publishing messages to the messaging system:
- Message Creation:
- The producer application creates a message with optional metadata (e.g., headers, partition keys).
- Metadata helps with routing, prioritization, and delivery guarantees.
- Message Validation:
- Before sending, the producer validates the topic existence and ensures compliance with message size limits.
- Publishing:
- The producer sends the message to a broker handling the target topic.
- The broker determines the partition using a hashing algorithm (if no specific partition is provided).
Communication
- Protocol:
- gRPC or HTTP/2 for efficient, low-latency communication.
- TLS/SSL for encryption to ensure secure message transmission.
- Retries:
- If the broker is unreachable, the producer retries sending with exponential backoff and a cap on retry attempts.
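A small sketch of the retry policy just described: exponential backoff with jitter, capped at a maximum number of attempts. The `send_fn` callable and the exception type are placeholders.

```python
# Retry with exponential backoff and jitter, capped at max_attempts.
import random
import time

def send_with_retries(send_fn, message, max_attempts=5, base_delay=0.1, max_delay=5.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return send_fn(message)                  # send_fn is a placeholder transport call
        except ConnectionError:
            if attempt == max_attempts:
                raise                                # give up once the retry cap is hit
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(delay * random.uniform(0.5, 1.5))   # jitter avoids retry storms
```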
Data Structures & Algorithms
- Partitioning Algorithm:
- Uses consistent hashing to map messages to partitions. This ensures that messages with the same key always go to the same partition, enabling message ordering.
- Message Batching:
- Implements batch sending for higher throughput. Messages are aggregated into batches before transmission, reducing overhead.
- Acknowledgments:
- The producer waits for acknowledgment from the broker before marking the message as sent. This ensures durability.
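A minimal sketch of key-based partition selection with a consistent-hash ring; the virtual-node count and MD5 hash are illustrative choices. The point it demonstrates is the ordering property: the same key always maps to the same partition.

```python
# Sketch: consistent-hash ring mapping message keys to partitions.
import bisect
import hashlib

class ConsistentHashPartitioner:
    def __init__(self, partitions, vnodes=100):
        self.ring = []                                   # sorted (hash, partition) points
        for p in partitions:
            for v in range(vnodes):
                self.ring.append((self._hash(f"{p}#{v}"), p))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def partition_for(self, key):
        # Walk clockwise on the ring to the first point >= hash(key), wrapping around.
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

partitioner = ConsistentHashPartitioner(partitions=[0, 1, 2, 3])
print(partitioner.partition_for("user-42"))   # same key -> same partition every time
```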
Scaling for Peak Traffic
- Horizontal Scaling:
- Producers scale horizontally by deploying more instances to handle increased workloads.
- Load Balancing:
- Producers distribute messages across brokers using DNS-based load balancing or direct broker discovery.
- Buffering:
- In-memory buffers temporarily store messages when broker communication is delayed, ensuring producers are not blocked.
Edge Cases
- Broker Unavailability:
- Messages are queued in the producer's local buffer for retries.
- Message Size Exceeds Limit:
- Split large messages into smaller chunks or reject them with an error response.
- Partition Overload:
- Dynamically redistribute partitions or notify producers to throttle message rates.
2. Broker Service
End-to-End Working
The Broker Service is the central component that routes, stores, and delivers messages:
- Message Reception:
- Receives messages from producers and determines the appropriate partition using the hashing algorithm or metadata.
- Message Storage:
- Appends the message to the partition log, ensuring sequential ordering.
- Replicates the message to other brokers for fault tolerance.
- Consumer Delivery:
- Reads messages from the log and sends them to subscribed consumers.
Communication
- Protocol:
- TCP/IP for reliable delivery between brokers and producers/consumers.
- Raft or Paxos protocols for leader election and replication in distributed setups.
- Event Propagation:
- Uses gRPC for internal communication between brokers.
Data Structures & Algorithms
- Partition Log:
- Each partition is a sequential write-ahead log. Messages are appended at the end, providing fast writes.
- Replication:
- Implements the Raft consensus algorithm to maintain consistency across replicas.
- Leader Election:
- Brokers use leader election (e.g., Raft) to ensure only one broker writes to a partition at a time.
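An in-memory caricature of the partition log described above: strictly append-only, offsets assigned sequentially, reads by offset range. A real broker persists segments to disk; this sketch only shows the data-structure shape.

```python
# Minimal in-memory partition log: append-only with monotonically increasing offsets.
class PartitionLog:
    def __init__(self):
        self._entries = []                 # list index doubles as the offset

    def append(self, payload):
        offset = len(self._entries)
        self._entries.append(payload)      # sequential append, like a write-ahead log
        return offset

    def read(self, start_offset, max_messages=100):
        # Return (offset, payload) pairs starting at start_offset.
        end = min(start_offset + max_messages, len(self._entries))
        return list(enumerate(self._entries[start_offset:end], start=start_offset))

log = PartitionLog()
log.append(b"order created")
log.append(b"order shipped")
print(log.read(0))    # [(0, b'order created'), (1, b'order shipped')]
```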
Scaling for Peak Traffic
- Partitioning:
- Topics are divided into multiple partitions, allowing parallel processing across brokers.
- Broker Scaling:
- Add new brokers dynamically to distribute partitions and reduce load.
- Replication:
- Data is replicated to multiple brokers, enabling load distribution and fault tolerance.
Edge Cases
- Broker Crash:
- A new leader is elected for affected partitions, and consumers switch to the new leader seamlessly.
- Disk Full:
- Retention policies delete old messages to free up space.
- Partition Rebalancing:
- Brokers redistribute partitions when new brokers join or existing ones leave.
3. Consumer Service
End-to-End Working
The Consumer Service retrieves and processes messages:
- Subscription:
- Subscribes to one or more topics and registers with the broker.
- Message Fetching:
- Polls the broker for messages or establishes a push-based connection for real-time delivery.
- Acknowledgment:
- Acknowledges receipt of messages after successful processing.
Communication
- Protocol:
- gRPC or HTTP/2 for low-latency message delivery.
- Consumer Group Coordination:
- Uses Zookeeper or Etcd for managing consumer group metadata and partition assignments.
Data Structures & Algorithms
- Offset Management:
- Stores the last processed offset for each partition in a dedicated offset store.
- Load Balancing:
- Uses a round-robin algorithm to distribute partitions across consumer group members.
- Backpressure Handling:
- Implements flow control to slow down producers if consumers cannot keep up.
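A sketch of the round-robin assignment mentioned above: partitions are dealt out to group members in turn, and rerunning the same function after membership changes is one way to recompute ownership during a rebalance.

```python
# Round-robin assignment of partitions to the members of a consumer group.
def assign_round_robin(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment

# Example: 6 partitions shared by 2 consumers.
print(assign_round_robin(range(6), ["consumer-a", "consumer-b"]))
# {'consumer-a': [0, 2, 4], 'consumer-b': [1, 3, 5]}
```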
Scaling for Peak Traffic
- Consumer Groups:
- Consumers within a group share partitions for parallel processing.
- Dynamic Rebalancing:
- When a consumer joins or leaves, partitions are reassigned dynamically to ensure balanced processing.
- Throttling:
- Brokers throttle message delivery rates to avoid overwhelming consumers.
Edge Cases
- Consumer Crash:
- Unacknowledged messages are redelivered to another consumer.
- Slow Consumers:
- Implement dead-letter queues for undeliverable messages or notify administrators.
- Out-of-Order Processing:
- Ensure ordering by processing messages sequentially within a partition.
4. Metadata Store
End-to-End Working
The Metadata Store manages configuration and state for topics, partitions, and offsets:
- Metadata Updates:
- Tracks topic creation, partition assignments, and consumer group offsets.
- Coordination:
- Facilitates leader election and partition rebalancing among brokers.
Communication
- Protocol:
- Uses Zookeeper or Etcd for high-availability metadata storage.
- Communicates with brokers using RPC or HTTP APIs.
Data Structures & Algorithms
- Metadata Tree:
- Topics, partitions, and offsets are stored in a hierarchical structure for efficient lookups.
- Consensus Algorithms:
- Uses Raft or Paxos for distributed consistency in leader election and metadata updates.
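A toy version of the hierarchical metadata layout (similar in spirit to Zookeeper znodes): nodes are addressed by path so lookups like /topics/orders/partitions/3/leader stay cheap. The paths and values are illustrative only.

```python
# Toy hierarchical metadata tree with path-based get/set (znode-flavoured).
class MetadataTree:
    def __init__(self):
        self._root = {}

    def set(self, path, value):
        node = self._root
        parts = path.strip("/").split("/")
        for part in parts[:-1]:
            node = node.setdefault(part, {})   # create intermediate nodes on demand
        node[parts[-1]] = value

    def get(self, path):
        node = self._root
        for part in path.strip("/").split("/"):
            node = node[part]
        return node

meta = MetadataTree()
meta.set("/topics/orders/partitions/3/leader", "broker-2")
print(meta.get("/topics/orders/partitions/3/leader"))   # broker-2
```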
Scaling for Peak Traffic
- Sharding:
- Metadata is divided into smaller subsets for parallel processing.
- Caching:
- Frequently accessed metadata is cached in memory for faster access.
Edge Cases
- Metadata Corruption:
- Periodic snapshots and backups restore consistency.
- Leader Election Failures:
- Retry mechanisms with exponential backoff mitigate transient network issues.
Trade offs/Tech choices
Partitioned Topic Architecture:
- Choice: Partitioned topics for scalability and parallelism.
- Trade-Off: Sacrifices strict global ordering; ordering is maintained only within partitions.
Replication for Fault Tolerance:
- Choice: Use 3x replication for durability and availability.
- Trade-Off: Increases storage and network overhead but ensures data safety.
Message Acknowledgment:
- Choice: At-least-once delivery for reliability.
- Trade-Off: Potential for duplicate messages, requiring idempotent consumers (see the sketch after this list).
Leader-Based Partition Handling:
- Choice: Use leader-based partition management for high performance.
- Trade-Off: Leader failure can cause temporary unavailability until a new leader is elected.
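Because at-least-once delivery can redeliver messages, consumers must be idempotent, as noted above. A common pattern is to key processing on a message ID and skip duplicates; the sketch below uses an in-memory seen-set for brevity, whereas a real system would persist the deduplication state (for example alongside the business write).

```python
# Idempotent consumer sketch: skip messages whose IDs were already processed.
# An in-memory set is used for brevity; durable storage would be needed in practice.
processed_ids = set()

def handle_once(message_id, payload):
    if message_id in processed_ids:
        return False                      # duplicate delivery, safely ignored
    apply_side_effect(payload)            # e.g. write to a database exactly once
    processed_ids.add(message_id)
    return True

def apply_side_effect(payload):
    print("applied:", payload)

handle_once("msg-1", {"amount": 10})
handle_once("msg-1", {"amount": 10})      # second delivery is a no-op
```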
Failure scenarios/bottlenecks
Broker Failure:
- Scenario: A broker crashes, causing temporary unavailability.
- Mitigation: Use replication and automatic leader election to maintain availability.
Consumer Lag:
- Scenario: Slow consumers cause message accumulation in partitions.
- Mitigation: Implement backpressure handling and dead-letter queues for undeliverable messages.
Partition Overload:
- Scenario: Uneven traffic distribution overloads specific partitions.
- Mitigation: Use consistent hashing and dynamic partition rebalancing.
Message Loss:
- Scenario: Message loss due to incomplete replication during broker crashes.
- Mitigation: Acknowledge producers only after the message has been replicated to the configured set of replicas.
Metadata Store Bottleneck:
- Scenario: High-frequency metadata updates overwhelm the store.
- Mitigation: Cache frequently accessed metadata and shard the metadata store.
Future improvements
Support Exactly-Once Semantics:
- Improvement: Add transactional guarantees for end-to-end exactly-once delivery.
- Mitigation: Use transactional offsets and idempotent producer mechanisms.
Dynamic Load Balancing:
- Improvement: Introduce real-time monitoring and adaptive load balancing for partitions.
- Mitigation: Implement AI-based traffic prediction to adjust resource allocation.
Enhanced Security:
- Improvement: Add per-topic ACLs and encryption for message payloads at rest.
- Mitigation: Regularly audit access logs and rotate encryption keys.
Optimized Storage:
- Improvement: Introduce tiered storage for older, less accessed messages.
- Mitigation: Move older messages to cheaper storage like object stores (e.g., S3) while keeping hot data in high-speed disks.