Dead Letter Queues

Topics Covered

Dead Letter Queues, Poison Pills, Reprocessing, and Backfills in Distributed Messaging

How it works

Poison pills — messages that can never succeed

Message Reprocessing Strategies

Backfills in Messaging Systems

Comparison of Kafka, RabbitMQ, and SQS DLQ Strategies

Kafka — build your own

RabbitMQ — broker-managed with dead-letter exchanges

SQS — fully managed

Best Practices and Common Pitfalls

What happens when a consumer cannot process a message? If it retries forever, the entire queue stalls. If it silently drops the message, data is lost. A Dead Letter Queue (DLQ) is the escape valve: it quarantines messages that repeatedly fail so the main pipeline keeps moving while operators investigate.

Message failing processing, retrying up to max count, then being routed to the dead letter queue

How it works

  1. Consumer receives a message from the main queue
  2. Processing fails (exception, timeout, validation error)
  3. Broker or consumer retries up to a max retry count (typically 3-5)
  4. After exhausting retries, the message is routed to the DLQ instead of being retried again
  5. Operators inspect the DLQ, fix the root cause, and replay messages

Poison pills — messages that can never succeed

A poison pill is a message that will fail on every attempt, no matter how many times you retry. Examples: malformed JSON, missing required fields, a schema version the consumer does not understand, or a reference to a deleted database record.

The danger of poison pills is head-of-line blocking in ordered queues. If the consumer cannot process message N, and the queue guarantees ordering, messages N+1 through N+1000 are stuck behind it. The DLQ breaks this deadlock by removing the poison pill from the main flow.

 
# Pseudocode: consumer with DLQ routing
def consume(message):
    last_error = None
    for attempt in range(MAX_RETRIES):
        try:
            process(message)
            ack(message)
            return
        except TransientError as e:
            last_error = e
            backoff(attempt)  # exponential backoff before the next attempt
        except PermanentError as e:
            last_error = e
            break  # no point retrying
    publish_to_dlq(message, error=last_error)
    ack(message)  # remove from main queue

The ack after DLQ routing is critical. Without it, the broker redelivers the poison pill to the main queue and the cycle repeats.

Key Insight

The most important design decision for a DLQ is distinguishing transient errors (network timeout, database connection refused) from permanent errors (invalid JSON, unknown schema version). Transient errors should retry with exponential backoff. Permanent errors should skip retries entirely and go straight to the DLQ. Without this distinction, you waste retry budget on messages that will never succeed.
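One way to sketch that distinction, assuming illustrative exception names (`TransientError`, `PermanentError`) and an illustrative mapping from raw exceptions to retry decisions:

```python
import json

class TransientError(Exception):
    """Recoverable: worth retrying with backoff (timeouts, refused connections)."""

class PermanentError(Exception):
    """Unrecoverable: retrying cannot help (bad data, unknown schema)."""

def classify(exc: Exception) -> type:
    """Map a raw exception to a retry decision (illustrative mapping)."""
    if isinstance(exc, (TimeoutError, ConnectionRefusedError)):
        return TransientError   # network blips: retry with backoff
    if isinstance(exc, (json.JSONDecodeError, KeyError, UnicodeDecodeError)):
        return PermanentError   # bad data: go straight to the DLQ
    return TransientError       # unknown errors: assume recoverable
```

The exact mapping is domain-specific; the point is that the decision lives in one place instead of being scattered across retry loops.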

Message Reprocessing Strategies

Messages land in a DLQ because something went wrong: bad consumer code, corrupted data, a downstream service outage. Once the root cause is fixed, you need to replay those messages. But replaying is not as simple as "move them back to the main queue." Without safeguards, you risk duplicate processing, ordering violations, and re-triggering the same failure.

Operator fixes bug, replay tool reads DLQ messages, publishes back to main queue, consumer processes successfully

Strategy 1: Replay to main queue. Read messages from the DLQ and publish them back to the original queue or topic; the fixed consumer processes them normally. This is the simplest approach, but it requires the consumer to be idempotent: if the original processing partially succeeded (e.g., the database write committed but the ack failed), replaying creates duplicates unless the consumer checks for existing records.
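A minimal sketch of that idempotency check, assuming a processed-ID store (an in-memory set here; in production this would be a database table or key-value lookup):

```python
processed_ids = set()   # stand-in for a durable processed-message store
side_effects = []       # stand-in for the consumer's real work

def handle(message: dict) -> None:
    """Process a message exactly once, skipping duplicate deliveries."""
    msg_id = message["id"]
    if msg_id in processed_ids:
        return                              # already done: skip silently
    side_effects.append(message["payload"]) # the "real" processing
    processed_ids.add(msg_id)

def replay_from_dlq(dlq_messages: list[dict]) -> None:
    """Push DLQ messages back through the normal handler."""
    for m in dlq_messages:
        handle(m)
```

Replaying a message that was already processed is then a harmless no-op instead of a duplicate write.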

Strategy 2: Replay to a staging queue. Publish DLQ messages to a separate staging queue consumed by a dedicated replay worker. The replay worker applies additional validation or transformation before forwarding to the main queue. This adds a safety layer. You can inspect and filter messages before they re-enter the main flow.
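A replay worker of this kind might look like the following sketch, where the queues are plain in-memory stand-ins and the required-field check is an illustrative validation rule:

```python
from queue import SimpleQueue

REQUIRED_FIELDS = {"id", "payload"}   # illustrative validation rule

def replay_worker(staging: SimpleQueue, main: SimpleQueue, rejected: list) -> None:
    """Drain the staging queue; forward valid messages, hold back invalid ones."""
    while not staging.empty():
        msg = staging.get()
        if isinstance(msg, dict) and REQUIRED_FIELDS <= msg.keys():
            main.put(msg)            # safe to re-enter the main flow
        else:
            rejected.append(msg)     # still broken: keep out of the main queue
```

Messages that fail validation never reach the main queue, so a half-fixed bug cannot re-poison the live pipeline.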

Strategy 3: Manual one-by-one processing. For critical messages (payments, orders), operators review each DLQ message individually, verify the fix, and replay them one at a time with manual confirmation. Slower but safest for high-value transactions.

Key rule: always verify the fix before replaying. If you replay 10,000 messages from a DLQ without fixing the bug that caused them to fail, all 10,000 immediately re-enter the DLQ, wasting time and creating noise.

Common Pitfall

Never replay DLQ messages without verifying the consumer fix first. Deploy the fix, process a small test batch (5-10 messages), confirm they succeed, then replay the full DLQ. Replaying thousands of messages into a still-broken consumer floods the DLQ again and can trigger alerts, on-call escalations, and cascading failures in downstream systems.

Backfills in Messaging Systems

A backfill reprocesses a range of historical messages, typically to fix a bug, populate a new data store, or rebuild a derived view. Unlike DLQ replay (which handles individual failed messages), backfills operate on entire time ranges or offset ranges.

Kafka backfills leverage the log's retention. Reset the consumer group offset to an earlier position and let the consumer reprocess everything from that point forward. The events are already in the log. No one needs to re-emit them.
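In real Kafka this is an offset reset with the `kafka-consumer-groups` tool; the mechanics can be modeled with plain Python structures (a list as the retained log, a dict as the group's committed offsets):

```python
log = ["evt-0", "evt-1", "evt-2", "evt-3"]   # events retained in the log
offsets = {"billing": 4}                      # group already caught up

def reset_offset(group: str, to: int) -> None:
    """The backfill trigger: rewind the group's committed offset."""
    offsets[group] = to

def consume_all(group: str) -> list[str]:
    """Reprocess everything from the group's current offset to the log end."""
    start = offsets[group]
    seen = log[start:]
    offsets[group] = len(log)
    return seen

reset_offset("billing", 0)   # rewind to the beginning of retention
```

The events themselves are never touched; only the group's position in the log changes.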

Kafka consumer group offset reset to earlier position, consumer reprocesses events with idempotency check

Queue-based backfills are harder because acknowledged messages are deleted. You must reconstruct the event stream from an external source (database, data warehouse, backup) and re-publish the messages to the queue. This is slower, error-prone, and requires the original data to be available somewhere outside the broker.
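A sketch of that reconstruction, assuming the events were also written to a relational table (the `events` table and column names are illustrative):

```python
import sqlite3
from queue import SimpleQueue

def backfill_from_db(conn: sqlite3.Connection, queue: SimpleQueue,
                     since: str) -> int:
    """Re-publish historical events from the database into the queue."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE created_at >= ? ORDER BY id",
        (since,),
    )
    count = 0
    for event_id, payload in rows:
        queue.put({"id": event_id, "payload": payload})  # re-emitted message
        count += 1
    return count
```

This only works if the external store actually captured every event; any message that existed solely in the broker is unrecoverable once acknowledged and deleted.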

Backfill safety checklist:

  • Idempotent consumers: the most critical requirement. Backfilled messages may overlap with already-processed data. Without idempotency, you get duplicates.
  • Rate limiting: backfills can produce millions of messages. Flooding the main consumer at full speed may overwhelm downstream systems (database, APIs). Apply a configurable rate limit to the backfill producer.
  • Separate consumer group: use a dedicated backfill consumer group with its own offset tracking so the live consumer group is not affected.
  • Monitoring: track backfill progress (percentage complete, current offset vs target offset) and have a kill switch to stop the backfill if downstream systems degrade.
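The rate-limiting item above can be sketched as a minimal token bucket for the backfill producer (a toy, assuming one process; real deployments would use a shared limiter or broker-side quotas):

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter for a backfill producer (illustrative)."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate              # tokens refilled per second
        self.capacity = burst         # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until one token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)  # wait for refill
```

The producer calls `acquire()` before each publish, so the backfill drains at a configurable rate instead of at full speed.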

Comparison of Kafka, RabbitMQ, and SQS DLQ Strategies

Each messaging system handles DLQs differently. The level of built-in support varies dramatically, from fully managed (SQS) to completely DIY (Kafka).

| Feature | Kafka | RabbitMQ | Amazon SQS |
| --- | --- | --- | --- |
| Native DLQ | No | Yes (dead-letter exchange) | Yes (redrive policy) |
| Retry counting | Application-managed | Broker-tracked (x-delivery-count header) | Broker-tracked (ApproximateReceiveCount) |
| Max retries config | Application code | Queue-level (x-delivery-limit) | Queue-level (maxReceiveCount) |
| DLQ routing | Application publishes to DLQ topic | Broker routes to dead-letter exchange | Broker routes to DLQ after maxReceiveCount |
| Replay from DLQ | Consumer reads DLQ topic, publishes to main topic | Shovel plugin or manual re-publish | Redrive to source (native since 2021) |
| Backfill | Offset reset (events in log) | Must re-emit from external source | Must re-emit from external source |

Kafka — build your own

Kafka has no built-in DLQ. The consumer application must implement retry logic, error classification, and DLQ topic publishing. This gives maximum flexibility but requires more code. The typical pattern is a DLQ topic per source topic (e.g., orders.dlq for orders) with the same partition count.

RabbitMQ — broker-managed with dead-letter exchanges

Configure a dead-letter exchange (DLX) on the queue. When a message exceeds the queue's delivery limit (x-delivery-limit on quorum queues) or is explicitly rejected with requeue=false, the broker routes it to the DLX. The DLX can route to any queue, including a dedicated DLQ. This requires zero application-level DLQ code.
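The queue arguments for this setup, as they would be passed to a client such as pika's `queue_declare` (the argument names are real RabbitMQ arguments; the exchange and queue names are placeholders):

```python
# Dead-lettering configuration for a RabbitMQ queue. The x-* argument names
# are RabbitMQ's; "orders.dlx" and "orders.dlq" are illustrative names.
dlx_arguments = {
    "x-dead-letter-exchange": "orders.dlx",     # where rejected messages go
    "x-dead-letter-routing-key": "orders.dlq",  # routing key used on dead-letter
    "x-delivery-limit": 5,                      # quorum-queue retry cap
}
# With pika (sketch):
# channel.queue_declare(queue="orders", durable=True, arguments=dlx_arguments)
```

Once declared this way, the broker handles retry counting and DLQ routing on its own.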

SQS — fully managed

Set a redrive policy on the source queue specifying the DLQ ARN and maxReceiveCount. After that many failed receives, SQS automatically moves the message to the DLQ. Since 2021, SQS supports "redrive to source," a one-click operation to move DLQ messages back to the original queue.
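The redrive policy is a JSON string attached as a queue attribute; a sketch of its shape (the ARN is a placeholder, and the `set_queue_attributes` call shown in comments is boto3's standard API):

```python
import json

# RedrivePolicy attribute for the source queue. The ARN is a placeholder.
redrive_policy = json.dumps({
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
    "maxReceiveCount": "5",   # move to DLQ after 5 failed receives
})
# With boto3 (sketch):
# sqs.set_queue_attributes(QueueUrl=queue_url,
#                          Attributes={"RedrivePolicy": redrive_policy})
```

After this, SQS tracks receive counts and moves messages to the DLQ with no consumer-side code at all.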

Interview Tip

In interviews, know that Kafka has no native DLQ. You must build it. This is a common follow-up question after discussing Kafka error handling. Show the pattern: catch the exception, classify the error, publish to a .dlq topic with error metadata, commit the offset on the main topic. Bonus points for mentioning that you need to handle the case where the DLQ publish itself fails.
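That bonus case can be sketched as follows: build the DLQ record with error metadata, and spill locally if the DLQ publish itself fails (the field names and local spill list are illustrative, not a standard):

```python
import json
import time
import traceback

def build_dlq_record(message: dict, error: Exception, topic: str,
                     retry_count: int) -> dict:
    """Attach error metadata so operators can diagnose the failure later."""
    return {
        "payload": message,
        "source_topic": topic,
        "error": str(error),
        "stack_trace": "".join(
            traceback.format_exception(type(error), error, error.__traceback__)),
        "retry_count": retry_count,
        "failed_at": time.time(),
    }

def publish_with_fallback(record: dict, publish, spill: list) -> None:
    """If the DLQ publish itself fails, spill locally instead of losing data."""
    try:
        publish(record)
    except Exception:
        spill.append(json.dumps(record))  # last resort: a local durable log
```

The spill target could be a local file or a secondary store; the point is that a broken DLQ path must degrade to something other than silent data loss.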

Best Practices and Common Pitfalls

DLQ implementations look simple on paper but fail in subtle ways in production. Here are the patterns that work and the mistakes that cause incidents.

  • Preserve context. Every message routed to a DLQ should carry metadata: original topic or queue, timestamp, error message, stack trace, retry count, and consumer instance ID. Without this, operators waste hours figuring out why a message failed and where it came from.
  • Set DLQ retention longer than the main queue. If your main topic retains events for 7 days, set the DLQ to 30 days. DLQ messages often sit unresolved for days while the team investigates and deploys fixes; if DLQ retention is too short, messages are deleted before they can be replayed.
  • Alert on DLQ depth, not just consumer errors. Consumer error logs are noisy and easy to miss. A growing DLQ depth is a clear, measurable signal that something is wrong. Alert when depth exceeds zero (or a small threshold) and escalate when depth grows continuously.
  • Never use the DLQ as permanent storage. DLQs are quarantine, not archive. Messages should be investigated, fixed, and replayed (or explicitly discarded with documentation). A DLQ with 100,000 messages from 6 months ago is a liability: it means the team is ignoring failures.
  • Test your DLQ path. Many teams build DLQ routing but never test it. In production, the first DLQ message reveals bugs in the DLQ pipeline itself: serialization failures, missing permissions on the DLQ topic, or the replay tool publishing to the wrong queue. Run chaos tests that intentionally produce poison pills and verify the full cycle: failure, DLQ routing, investigation, fix, and replay.