Write-Ahead Logging: The Real Boundary Between Committed and Lost

February 12, 2026

A Write-Ahead Log is the contract a database signs with the disk. Every change to a data page is first described as a record and appended to a sequential log file. The page itself is modified in the buffer pool and stays dirty for a while; it may not be written back to disk until the log record describing the change is on stable storage. Only the log record has to be on stable storage before the database can answer "committed" to the client. If the process dies, the kernel panics, or the rack loses power, the log is replayed on restart and the data pages are reconstructed.
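To make the log-then-apply ordering concrete, here is a minimal sketch in Python. It is a toy key-value store, not how any real engine lays out its log: the class name TinyWAL, the JSON-lines record format, and the fsync on every put are all assumptions made for illustration.

```python
import json
import os
import tempfile


class TinyWAL:
    """Toy key-value store: each change is logged and fsynced before it is applied."""

    def __init__(self, path):
        self.path = path
        self.data = {}       # stands in for the buffer pool / data pages
        self._replay()       # rebuild state from the log on startup
        self.log = open(path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.path):
            return
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break    # torn record at the tail: stop replaying here
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        record = json.dumps({"key": key, "value": value})
        self.log.write(record + "\n")  # 1. describe the change in the log
        self.log.flush()               # 2. push Python's buffer to the kernel
        os.fsync(self.log.fileno())    # 3. ask the kernel to flush to the device
        self.data[key] = value         # 4. only now apply it in memory

    def close(self):
        self.log.close()


def crash_and_recover():
    path = os.path.join(tempfile.mkdtemp(), "tiny.wal")
    store = TinyWAL(path)
    store.put("order:1", "paid")
    store.put("order:2", "pending")
    store.close()              # simulate a crash: in-memory state is discarded
    recovered = TinyWAL(path)  # restart: state comes back from the log alone
    state = dict(recovered.data)
    recovered.close()
    return state
```

Calling crash_and_recover() writes two records, throws the in-memory dictionary away, and rebuilds it purely from the log file. A real engine adds checksums, log sequence numbers, and checkpoints on top of this idea.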

The reason this is fast is that the WAL is purely sequential. Storage sustains far more throughput on append-only writes than on the random writes that updating data pages in place would require: orders of magnitude more on a spinning disk, and still a meaningful margin on an SSD. The database trades the scattered page writes a transaction would otherwise force for one small log append, then flushes data pages lazily in batches during a checkpoint.

The actual durability boundary is the fsync on the log file. Without fsync, the write sits in the kernel page cache and write() still reports success. A power loss at that moment loses the write, even though the application got a 200 back. This is why every serious database treats the completed WAL fsync, not the arrival of the COMMIT statement, as the moment a transaction becomes durable.
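The gap between "written" and "durable" is easy to see in code. The sketch below uses Python's os module; durable_append is a hypothetical helper name, and the record contents are made up for the example.

```python
import os
import tempfile


def durable_append(path, payload: bytes) -> int:
    """Append payload and return only after fsync reports success."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(payload)
        while view:                  # os.write may accept fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        # At this point the bytes are only in the kernel page cache.
        os.fsync(fd)                 # blocks until the kernel has flushed the file
        return len(payload)
    finally:
        os.close(fd)


def demo_durable_append():
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    durable_append(path, b"BEGIN 42\n")
    durable_append(path, b"COMMIT 42\n")
    with open(path, "rb") as f:
        return f.read()
```

Dropping the os.fsync call would leave the function looking identical from the caller's side, which is exactly the trap. One caveat: on Linux a newly created file also needs an fsync on its parent directory before the directory entry itself is durable, which this sketch leaves out.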

Group commit is how databases keep this honest under load. If a hundred transactions are ready to commit at the same time, the engine bundles their log records into one write and issues a single fsync. Each transaction pays roughly one hundredth of the flush cost. Throughput rises with concurrency instead of being capped at one commit per disk flush.
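One way to picture group commit is a leader/follower scheme: whichever committing thread gets the flush lock writes and fsyncs everything queued so far, and the threads whose records were covered return without touching the disk. The sketch below assumes that scheme; GroupCommitLog and run_group_commit are hypothetical names, and real engines implement this quite differently.

```python
import os
import tempfile
import threading


class GroupCommitLog:
    """Toy group commit: whoever holds the flush lock flushes everyone queued."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.queue_lock = threading.Lock()  # guards the pending list
        self.flush_lock = threading.Lock()  # one flusher at a time
        self.pending = []                   # (record, done_event) pairs
        self.fsync_calls = 0

    def commit(self, record: bytes) -> None:
        done = threading.Event()
        with self.queue_lock:
            self.pending.append((record, done))
        with self.flush_lock:
            if done.is_set():
                return                      # an earlier leader already flushed us
            with self.queue_lock:
                batch, self.pending = self.pending, []
            data = memoryview(b"".join(rec for rec, _ in batch))
            while data:                     # handle short writes
                data = data[os.write(self.fd, data):]
            os.fsync(self.fd)               # one flush covers the whole batch
            self.fsync_calls += 1
            for _, event in batch:
                event.set()

    def close(self):
        os.close(self.fd)


def run_group_commit(n_threads=100):
    path = os.path.join(tempfile.mkdtemp(), "group.wal")
    log = GroupCommitLog(path)
    threads = [
        threading.Thread(target=log.commit, args=(f"txn {i}\n".encode(),))
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    log.close()
    with open(path, "rb") as f:
        lines = f.read().splitlines()
    return log.fsync_calls, len(lines)
```

run_group_commit() returns the number of fsync calls and the number of records that reached the file. Every record is always present; how far the fsync count drops below the thread count depends on how much the commits actually overlap, so the ratio varies by machine and by run.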

The production failure I keep seeing is the team that sets synchronous_commit = off in Postgres to chase throughput. The setting tells Postgres to return success before the WAL flush completes. Benchmarks look great, write latency drops, the dashboard is green. Then a UPS fails during a thunderstorm, the database restarts cleanly, and 12 seconds of committed transactions are gone. Orders, payments, audit rows, all gone. The API had already returned 200 OK to clients. There is no way to recover what was never persisted. Postgres documents the exposure as up to three times wal_writer_delay, which is 200 ms by default, so a window that long implies the delay had been raised as well.

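The fix is to leave synchronous_commit on for any OLTP workload (and never disable the separate fsync setting, which risks corruption rather than just lost commits), lean on group commit and a larger wal_buffers for throughput, and put the WAL on storage with power-loss protection, such as enterprise NVMe or a controller with a battery-backed write cache, if latency really matters. Durability is the one thing you cannot get back by retrying.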

Key takeaway

Durability is not a property of your database. It is a property of the fsync that flushed your WAL record to stable storage. Everything else is a tuning knob around that single system call.

Originally posted on LinkedIn.