Kafka Durability and acks: Three Settings, Three Different Guarantees

December 22, 2025

Kafka durability is not a boolean. It is a three-way choice on the producer, gated by a second setting on the broker, and the combination is what actually determines whether data survives a failure.

acks=0 is fire-and-forget. The producer never waits for a broker response. Throughput is highest, latency is lowest, and any network blip or broker restart silently drops records. Use this only for telemetry where you do not care about individual events.

acks=1 waits for the partition leader to write to its local log. This is the historical default and it is the setting that bites people. In steady state it looks identical to acks=all. Records get acked, consumers see them, life is good. The problem shows up during a leader transition. If the leader crashes after acking a record but before any follower has fetched it, the new leader has no record of that write. The producer thinks the write succeeded. The data is gone.

acks=all waits for all in-sync replicas. Pair it with min.insync.replicas=2 on the broker side and a topic with replication factor 3. Now a produce request only acks once at least two replicas have the data, which means you tolerate one broker failure with zero data loss. If only one replica is alive, the broker refuses the write with NotEnoughReplicasException, which is the correct behavior. You wanted durability, not availability.

A metrics team I worked with ran acks=1 to maximize throughput. Their pipeline ingested billions of events a day and they were proud of the volume. During a planned rolling restart of the brokers, one partition lost its leader for about four seconds while the controller elected a new one. Producers had open acks=1 requests against the old leader's TCP socket. Some of those requests had been acked by the old leader in its final seconds, but the followers had not fetched yet because they were also restarting in sequence. The new leader took over without those records. Roughly 1.2 million metrics events vanished. There was no error in any log. The producer's metrics counter said 100% success. The consumer just never saw the data.

The fix was acks=all and min.insync.replicas=2. Throughput dropped about 8%. The next rolling restart, six months later, lost zero events.

One piece many people miss. acks=all does not mean "all replicas." It means "all in-sync replicas." If only one replica is currently caught up, acks=all reduces to acks=1. That is what min.insync.replicas=2 guards against: it tells the broker to refuse writes when the ISR shrinks below your safety floor.

Rule of thumb. Logs and metrics that you can lose without anyone noticing: acks=1 is fine. Anything you would have to apologize for losing: acks=all with min.insync.replicas=2.

Key takeaway

acks=1 looks safe in steady state and silently loses data during leader elections. acks=all combined with min.insync.replicas=2 on a RF=3 topic is the only setting that survives a single broker failure with zero loss.

Originally posted on LinkedIn. View original.