DLQ and Retry Backoff: The Pattern That Saves Your Queue at 2am

February 27, 2026

Every queue-based system eventually meets a poison message: one payload that fails every time it is processed. The schema is wrong, the downstream service rejects it, an enum value is unhandled, or a NaN crashes the math. Without a plan, that single message can stall everything queued behind it.

Here is why. Kafka, RabbitMQ, SQS, Celery, and TaskIQ differ in their guarantees, but wherever messages are consumed in order (a Kafka partition, an SQS FIFO message group, a RabbitMQ queue with a single consumer), one stuck message blocks everything behind it. Take Kafka. A consumer reads offsets 100, 101, 102 in order. If 101 throws and the consumer retries it in place instead of committing or skipping past it, it processes 101 again, throws again, and repeats. The partition is now stuck. Offsets 102 onward sit behind a single bad record, even though they would all succeed. Only one partition out of forty is wedged, but every customer mapped to that partition stops receiving updates. That is the failure mode people learn at 2am.
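To make the stall concrete, here is a minimal in-memory simulation. It is not real Kafka client code: the `process` handler, the offsets, and the poll budget are all hypothetical, and the loop only mimics a consumer that retries in place and never moves past a failing record.

```python
def process(offset: int) -> None:
    """Hypothetical handler: offset 101 is the poison message."""
    if offset == 101:
        raise ValueError("poison message")


def naive_consume(start: int, end: int, max_polls: int) -> tuple[int, list[int]]:
    """Retry-in-place consumer that never advances past a failing offset.

    Returns the offset the consumer stopped at (end + 1 if it finished)
    and the offsets it processed successfully.
    """
    position = start
    done: list[int] = []
    for _ in range(max_polls):
        if position > end:
            break
        try:
            process(position)
        except ValueError:
            continue  # no commit, no skip: the same offset comes back next poll
        done.append(position)
        position += 1  # "commit" only after success
    return position, done


position, done = naive_consume(start=100, end=105, max_polls=1000)
print(position, done)
```

After a thousand polls the consumer is still sitting on offset 101, and offsets 102 through 105 were never attempted even though the handler would accept all of them.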

The fix is a tiered retry pipeline followed by a Dead Letter Queue.

The pipeline looks like this. The main topic feeds the consumer. On failure the consumer republishes the message to retry-1m, which is read by a worker that waits one minute and retries. Persistent failures escalate to retry-10m, then retry-1h. The attempt count travels with the message, typically in a header, and after N total attempts the message moves to the DLQ. A triage UI or alerting hook watches the DLQ. Once the bug is patched, a replayer reads the DLQ and re-injects the messages into the main topic, ideally protected by an idempotency key so the second processing is safe.
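The routing and replay logic can be sketched without a broker. The sketch below is an assumption-heavy illustration: topics are plain Python lists, delays are not modeled, and names such as `InMemoryPipeline`, `idempotency_key`, `strict_handler`, and the `dlq` topic are hypothetical. It only shows how a failure count picks the next topic and how an idempotency key makes replay safe.

```python
from dataclasses import dataclass
from typing import Callable

RETRY_TIERS = ["retry-1m", "retry-10m", "retry-1h"]  # tier names from the text
DLQ_TOPIC = "dlq"  # hypothetical topic name


@dataclass
class Message:
    idempotency_key: str  # hypothetical field; any stable unique ID works
    payload: dict
    failures: int = 0


def next_topic(failures: int) -> str:
    """Destination after the `failures`-th failed attempt (1-based)."""
    if 1 <= failures <= len(RETRY_TIERS):
        return RETRY_TIERS[failures - 1]
    return DLQ_TOPIC  # N = len(RETRY_TIERS) + 1 total attempts, then quarantine


class InMemoryPipeline:
    """Stand-in for a broker: topics are plain lists and delays are not modeled."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self.handler = handler
        self.topics: dict[str, list[Message]] = {}
        self.applied: set[str] = set()  # idempotency keys already processed

    def consume(self, msg: Message) -> None:
        if msg.idempotency_key in self.applied:
            return  # duplicate or replayed message: the effect already happened
        try:
            self.handler(msg.payload)
        except Exception:
            msg.failures += 1
            self.topics.setdefault(next_topic(msg.failures), []).append(msg)
        else:
            self.applied.add(msg.idempotency_key)

    def drain(self, topic: str) -> None:
        """Feed every message parked in `topic` back through consume()."""
        for msg in self.topics.pop(topic, []):
            self.consume(msg)

    def replay_dlq(self) -> None:
        """After the fix ships: reset failure counts and re-inject DLQ messages."""
        for msg in self.topics.pop(DLQ_TOPIC, []):
            msg.failures = 0
            self.consume(msg)


def strict_handler(payload: dict) -> None:
    """Hypothetical buggy handler: rejects payloads without an amount."""
    if payload.get("amount") is None:
        raise ValueError("amount is required")


pipeline = InMemoryPipeline(strict_handler)
pipeline.consume(Message("order-41", {"amount": 10}))    # succeeds immediately
pipeline.consume(Message("order-42", {"amount": None}))  # poison message
for tier in RETRY_TIERS:
    pipeline.drain(tier)  # each tier retries once, fails, and escalates

parked = [m.idempotency_key for m in pipeline.topics.get(DLQ_TOPIC, [])]
print(parked)
```

In a real deployment the `applied` set would need to live in a durable store shared by all consumers, and the failure count would ride along in a message header rather than on a Python object. Swapping in a fixed handler and calling `replay_dlq()` empties the quarantine, and a second delivery of the same key is skipped.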

The escalating delays matter. A flat retry-every-second loop turns a transient downstream outage into a thundering herd that keeps the dependency from recovering. Stretching the delay from one minute to ten minutes to an hour gives the dependency room to heal while keeping the bad message off your main flow.
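A back-of-the-envelope comparison shows the difference in load. The 30-minute outage and the function names below are assumptions chosen for illustration; the tier delays are the ones from the pipeline above.

```python
from datetime import timedelta

TIER_DELAYS = [timedelta(minutes=1), timedelta(minutes=10), timedelta(hours=1)]


def flat_attempts(outage: timedelta, interval: timedelta) -> int:
    """Retries one failing message generates during an outage at a fixed interval."""
    return outage // interval


def tiered_attempts(outage: timedelta, delays: list[timedelta]) -> int:
    """Retries the same message generates during the outage under tiered delays."""
    elapsed = timedelta(0)
    attempts = 0
    for delay in delays:
        elapsed += delay
        if elapsed > outage:
            break  # the next retry would fire after the outage is over
        attempts += 1
    return attempts


outage = timedelta(minutes=30)
print(flat_attempts(outage, timedelta(seconds=1)))  # 1800 retries per message
print(tiered_attempts(outage, TIER_DELAYS))         # 2 retries per message
```

Multiply by every message that fails during the outage and the flat loop is sending hundreds of times more traffic at a dependency that is already struggling.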

Two practical notes from running this.

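First, a Kafka DLQ is usually a separate topic, not a feature of the broker, so you implement the routing in the consumer. Reuse the original message key when you publish to the DLQ: records with the same key then land in the same DLQ partition and keep their relative order. The partition number matches the source topic only if both topics use the same partition count and partitioner.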
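Here is a minimal sketch of the consumer-side routing step, assuming a simple `Record` type rather than any specific client library. The header names, the `orders.dlq` topic, and the CRC32-based `partition_for` are hypothetical stand-ins; real clients use their own hash functions.

```python
import zlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    topic: str
    key: bytes
    value: bytes
    headers: dict[str, str] = field(default_factory=dict)


def to_dlq(record: Record, error: Exception, dlq_topic: str = "orders.dlq") -> Record:
    """Build the DLQ copy: same key and value, plus headers describing the failure."""
    headers = dict(record.headers)
    headers.update({
        "x-original-topic": record.topic,
        "x-error-type": type(error).__name__,
        "x-error-message": str(error),
    })
    return Record(topic=dlq_topic, key=record.key, value=record.value, headers=headers)


def partition_for(key: bytes, num_partitions: int) -> int:
    """Stand-in for a key-hash partitioner (CRC32 here, purely for illustration)."""
    return zlib.crc32(key) % num_partitions


original = Record(topic="orders", key=b"customer-7", value=b'{"amount": null}')
dead = to_dlq(original, ValueError("amount is required"))
print(dead.topic, dead.key, dead.headers["x-error-type"])
```

Carrying the error type and original topic in headers keeps the payload byte-for-byte intact, which matters when the replayer later pushes it back through the normal consumer.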

Second, alert on DLQ depth and its rate of growth, not just on a non-zero count. A handful of messages a week is normal. A spike of a thousand in an hour usually means a deploy broke a contract, and you should roll back rather than triage by hand.
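One way to encode that rule is to classify a window of depth samples by both absolute depth and growth rate. The thresholds, the `DepthSample` type, and the "ok" / "ticket" / "page" labels below are assumptions to tune for your own traffic, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class DepthSample:
    timestamp: float  # seconds since epoch
    depth: int        # messages in the DLQ at that moment


def growth_per_hour(samples: list[DepthSample]) -> float:
    """Average DLQ growth across the window, in messages per hour."""
    if len(samples) < 2:
        return 0.0
    first, last = samples[0], samples[-1]
    elapsed = last.timestamp - first.timestamp
    if elapsed <= 0:
        return 0.0
    return (last.depth - first.depth) * 3600 / elapsed


def classify(
    samples: list[DepthSample],
    max_depth: int = 50,
    max_growth_per_hour: float = 100.0,
) -> str:
    """Return "page" for a spike, "ticket" for a slow pile-up, otherwise "ok"."""
    if not samples:
        return "ok"
    if growth_per_hour(samples) >= max_growth_per_hour:
        return "page"    # likely a broken deploy: consider rolling back
    if samples[-1].depth >= max_depth:
        return "ticket"  # backlog needs triage but is not an emergency
    return "ok"


quiet = [DepthSample(0, 3), DepthSample(3600, 5)]
spike = [DepthSample(0, 5), DepthSample(3600, 1005)]
print(classify(quiet), classify(spike))
```

Checking the growth rate first means a sudden spike pages someone even if the absolute depth threshold was already crossed by a slow backlog.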

A retry pipeline without a DLQ is a denial-of-service attack against yourself. The DLQ is the pressure valve. The retries buy time. Together they keep the rest of the stream moving while you fix the one message that is actually broken.

Key takeaway

A DLQ is not a trash bin. It is a quarantine lane that pulls poison messages off the hot path so the rest of the queue keeps moving. Without it, every retry you add is more load aimed at your own system.

Originally posted on LinkedIn.