Redis Durability: RDB, AOF, and the Lie of 'Just a Cache'

January 12, 2026

Redis is fast because it lives in memory. The cost of that speed is that crashes are interesting. What you lose on a crash depends entirely on which persistence story you configured, and most teams configure one of them by accident.

RDB is point-in-time snapshotting. Redis forks a child process, which walks the keyspace and writes a binary dump to disk whenever a configured save rule fires (a time interval plus a minimum number of changes). Recovery is fast because you slurp one file back into memory. The cost is that everything between the last snapshot and the crash is gone. If you snapshot every five minutes, you can lose up to five minutes of writes.
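To make the snapshot trigger concrete, here is a minimal sketch of how a set of save rules could be evaluated. It is a simplified model, not Redis's actual code: the names SaveRule and snapshot_due are hypothetical, and the model ignores the time the snapshot itself takes to write.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SaveRule:
    """Hypothetical stand-in for one `save <seconds> <changes>` line."""
    seconds: int  # minimum time since the last successful snapshot
    changes: int  # minimum number of writes since that snapshot


def snapshot_due(rules, seconds_since_save, changes_since_save):
    """Return True if any rule has both its time and change thresholds met."""
    return any(
        seconds_since_save >= rule.seconds and changes_since_save >= rule.changes
        for rule in rules
    )


# Assumed example: snapshot every five minutes if at least one key changed.
rules = [SaveRule(seconds=300, changes=1)]
print(snapshot_due(rules, seconds_since_save=120, changes_since_save=5000))
print(snapshot_due(rules, seconds_since_save=300, changes_since_save=5000))
```

In the first call, 5000 writes are sitting in memory only, because the time threshold has not been reached. Those are the writes a crash would take with it.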

AOF is the opposite tradeoff. Every write command is appended to a log file, and on restart Redis replays the log. The durability knob is appendfsync. With appendfsync always, every write does a synchronous fsync, and latency reflects disk speed rather than memory speed. With appendfsync everysec, the fsync happens in a background thread once a second, and a crash can lose up to that second of writes. A common modern setup runs RDB and AOF together, with an RDB preamble inside the AOF so recovery loads a snapshot quickly and replays only the recent tail.
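The two fsync policies can be illustrated with a toy append-only log. This is a sketch of the idea only, not Redis's AOF format or implementation: ToyAppendLog and replay are hypothetical names, commands are assumed to contain no newlines, and the everysec branch syncs inline on the next write rather than in a background thread.

```python
import os
import time


class ToyAppendLog:
    """Toy append-only log with two fsync policies (illustrative only)."""

    def __init__(self, path, policy="everysec"):
        if policy not in ("always", "everysec"):
            raise ValueError("policy must be 'always' or 'everysec'")
        self.policy = policy
        self._file = open(path, "ab")
        self._last_sync = time.monotonic()
        self.fsync_calls = 0  # exposed so the cost of each policy is visible

    def append(self, command):
        self._file.write(command.encode("utf-8") + b"\n")
        if self.policy == "always":
            # Every write pays for a disk sync before returning.
            self._sync()
        elif time.monotonic() - self._last_sync >= 1.0:
            # At most one sync per second; writes since the last sync
            # are the ones a crash could lose.
            self._sync()

    def _sync(self):
        self._file.flush()
        os.fsync(self._file.fileno())
        self._last_sync = time.monotonic()
        self.fsync_calls += 1

    def close(self):
        self._sync()
        self._file.close()


def replay(path):
    """Read the log back in order, as a restart would."""
    with open(path, "rb") as f:
        return [line.decode("utf-8").rstrip("\n") for line in f]
```

With the always policy, fsync_calls grows with every append. With everysec, a burst of writes inside one second triggers no sync at all until the next second boundary or close, which is exactly the window at risk.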

Replication is the third leg, and it is the one most people confuse with durability. Replication protects availability, not writes. The leader streams commands to replicas asynchronously. Sentinel or Cluster promotes a replica when the leader looks dead. Any write that was acknowledged to the client but had not yet reached the promoted replica is gone. WAIT N T blocks until N replicas have acknowledged the preceding writes or T milliseconds have passed, which is closer to a quorum but still not consensus.
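A write followed by WAIT might look like the sketch below. The helper name replicated_set is hypothetical, and the client is assumed to expose redis-py-style set and wait methods, with both commands travelling on the same connection, since WAIT covers the writes previously issued by that connection.

```python
def replicated_set(client, key, value, min_replicas=1, timeout_ms=100):
    """Write a key, then block until replicas acknowledge or the timeout passes.

    `client` is assumed to be a redis-py-style object with `set` and `wait`.
    Returns True only if at least `min_replicas` acknowledged in time.
    A False result does not undo the write on the leader.
    """
    client.set(key, value)
    # WAIT returns the number of replicas that acknowledged the write.
    acked = client.wait(min_replicas, timeout_ms)
    return acked >= min_replicas
```

A False return means the write exists on the leader but is not confirmed elsewhere. The caller has to decide whether to retry, alert, or accept the risk; nothing is rolled back, which is why this is not consensus.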

The production failure I keep seeing involves teams who insist Redis is "just a cache" and turn persistence off with save "". One team ran exactly this setup. A Sentinel-triggered failover promoted a replica that had been restarted that morning for an OS patch and never warmed back up. The promoted node came up with zero keys. Every downstream service that depended on the cache experienced a 100 percent miss rate at the same instant. They all stampeded the primary database, the database fell over within ninety seconds, and the platform was effectively down for forty minutes while the database recovered and the cache slowly refilled.

The fix was unglamorous. RDB was turned back on, with a 15-minute snapshot interval, on every Redis node including the cache instances. The deploy and failover playbooks added a warmup step: a promoted replica had to hit a configured key count, or replay a warmup script, before traffic was routed to it. The cache stayed a cache. The difference is that it now restarts with a memory of itself.
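The key-count half of that warmup step could be sketched as a gate that the failover playbook calls before routing traffic. Everything here is an assumption for illustration: the function name wait_until_warm, the thresholds, and a redis-py-style client exposing dbsize.

```python
import time


def wait_until_warm(client, min_keys, timeout_s=60.0, poll_s=1.0, sleep=time.sleep):
    """Poll the key count until it reaches `min_keys` or the timeout expires.

    `client` is assumed to expose a redis-py-style `dbsize()` method.
    Returns True if the node is warm enough to take traffic, else False.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if client.dbsize() >= min_keys:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)
```

The sleep parameter is injectable so the gate can be exercised in tests without real delays. A False return should block the promotion or trigger the warmup script rather than let an empty node take traffic.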

The cheapest insurance you can buy on Redis is a snapshot file you hope you never need.

Key takeaway

Persistence is not a binary. RDB, AOF, and replication solve different failure modes, and even a pure cache benefits from a cheap snapshot so a cold failover does not stampede the database.

Originally posted on LinkedIn.