Caching Strategies: Cache-Aside, Write-Through, and Write-Behind

Topics Covered

Cache-Aside (Lazy Loading)

How it works

Why engineers choose it

The downsides

Write-Through Caching

How the flow works

When write-through shines

The costs

Write-Behind Caching (Write-Back)

How the flow works

Why teams use it

The serious trade-offs

When it fits vs when to avoid

Comparing Caching Strategies

Summary table

Consistency guarantees

How to mix and match

Time-to-Live (TTL) Selection for Cache Entries

Factors that drive TTL selection

Risks at the extremes

Advanced TTL strategies

Cache-Aside (Lazy Loading)

Why is cache-aside the default choice for most systems? Because the database stays the source of truth, always. The cache is purely an optimization layer. If Redis dies at 3 AM, your service slows down but doesn't break.

[Figure: Cache-aside pattern showing read path with cache miss and write path with cache invalidation]

How it works

Read path: Application checks cache first. On a hit, return immediately. On a miss, query the database, populate the cache with a TTL, and return.

Write path: Application writes to the database first (source of truth), then deletes or updates the cache entry. The next read repopulates the cache with fresh data.

```python
def get_item(item_id):
    cached = cache.get(item_id)
    if cached is not None:
        return cached  # cache hit

    record = db.query("SELECT * FROM items WHERE id = ?", item_id)
    if record is not None:
        cache.set(item_id, record, ttl=3600)
    return record

def update_item(item_id, new_value):
    db.execute("UPDATE items SET value=? WHERE id=?", (new_value, item_id))
    cache.delete(item_id)  # invalidate; next read reloads
```

Why engineers choose it

  • Simple and flexible: Just get -> miss -> load -> set. Works with Redis, Memcached, or any key-value store.
  • Memory efficient: Only items that are actually accessed end up in cache, naturally approximating your working set.
  • Failure resilient: Cache is expendable. A crash means slower reads, not data loss. The database is always correct.

The downsides

  • First-read penalty: Every key's first access suffers a full DB round-trip. Cold starts after deployments or cache restarts hit hard.
  • Stale data window: Between a DB write and the cache delete/expire, the cache holds the old value. The staleness window equals your TTL or the time until explicit invalidation.
  • Thundering herd: When a popular key expires, hundreds of concurrent requests all miss and stampede the database simultaneously.
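One common mitigation for the thundering herd is to let only one request per key rebuild the entry while concurrent callers wait and then hit the repopulated cache. A minimal sketch, assuming an in-memory dict standing in for Redis/Memcached and a hypothetical `load_from_db` loader:

```python
import threading

cache = {}           # in-memory stand-in for Redis/Memcached
locks = {}           # one lock per hot key
locks_guard = threading.Lock()

def get_with_lock(key, load_from_db):
    value = cache.get(key)
    if value is not None:
        return value                      # fast path: cache hit
    with locks_guard:
        lock = locks.setdefault(key, threading.Lock())
    with lock:
        value = cache.get(key)            # re-check after acquiring the lock
        if value is None:
            value = load_from_db(key)     # only one caller reaches the DB
            cache[key] = value
    return value
```

In production the per-key lock is usually a distributed lock (e.g. a Redis `SET ... NX` key) rather than a process-local `threading.Lock`.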

Interview Tip

In interviews, when asked about caching, lead with cache-aside. It is the safest default and interviewers expect you to know its trade-offs cold: stale data window, thundering herd on TTL expiry, and the first-read penalty. Then layer on write-through or write-behind only when the interviewer pushes on specific bottlenecks.

Write-Through Caching

Write-through flips the consistency trade-off: every write updates both the database and the cache before returning to the caller. The cache is always fresh after a write, so the "stale data window" that plagues cache-aside disappears.

How the flow works

  1. Application receives a write request
  2. Application writes to the database first (source of truth)
  3. If the DB write succeeds, application writes the same value to the cache
  4. Only after both succeed does the application return success to the client

```python
def update_item(item_id, new_value):
    db.execute("UPDATE items SET value=? WHERE id=?", (new_value, item_id))
    cache.set(item_id, new_value, ttl=86400)  # long TTL as safety net
    return True

def get_item(item_id):
    value = cache.get(item_id)
    if value is not None:
        return value  # almost always fresh
    value = db.query("SELECT value FROM items WHERE id=?", (item_id,))
    if value is not None:
        cache.set(item_id, value, ttl=86400)
    return value
```

The read path still uses cache-aside as a fallback (for cold keys or cache restarts), but since every write pre-populates the cache, the miss rate for recently-written data drops to near zero.


When write-through shines

  • Write-then-read patterns: User updates profile, immediately views it. Leaderboard score changes, dashboard refreshes. The cache already has the new value.
  • Cannot tolerate stale cache: Feature flags, authorization rules, pricing data where serving an old value has business consequences.
  • Simpler TTL story: Since writes keep the cache current, TTL serves as a safety net rather than the primary freshness mechanism. You can use longer TTLs (hours or days).

The costs

  • Slower writes: Every write pays DB latency + cache latency instead of just DB latency. On a write-heavy path, this adds up.
  • Cache pollution: Values that are written but never read waste memory. Batch-importing 1M records into the database also fills the cache with 1M entries that may never be accessed.
  • Dual-write failure modes: If the DB write succeeds but the cache write fails, the cache is stale. Most teams treat this as acceptable: the cache entry will be refreshed on the next cache-aside read or when TTL expires. The reverse (cache succeeds, DB fails) is worse, so you must ensure the DB write commits before touching the cache.

Common Pitfall

Never update the cache before the database commits. If the DB write fails or rolls back, the cache holds a phantom value that does not exist in the source of truth. Always sequence as DB-first, cache-second, and treat cache write failures as non-fatal since the next read will self-heal via cache-aside.
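A minimal sketch of that sequencing, with `db`, `cache`, and `log` as hypothetical stand-ins: the DB write commits first, and a cache failure is swallowed because the next cache-aside read self-heals.

```python
def update_item(item_id, new_value, db, cache, log=print):
    # 1. Commit to the source of truth first.
    db.execute("UPDATE items SET value=? WHERE id=?", (new_value, item_id))
    # 2. Touch the cache only after the DB write succeeded.
    try:
        cache.set(item_id, new_value, ttl=86400)
    except Exception as exc:
        # Non-fatal: the next cache-aside read (or TTL expiry) self-heals.
        log(f"cache write failed for item {item_id}: {exc}")
    return True
```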

Write-Behind Caching (Write-Back)

Write-behind inverts the write path entirely: the application writes only to the cache, and a background process flushes changes to the database later, typically in batches. Writes become extremely fast (cache latency only), but the database lags behind the cache until the flush completes.

[Figure: Side-by-side comparison of write-through synchronous path versus write-behind asynchronous queue and flush]

How the flow works

  1. Application writes to the cache and enqueues the write
  2. Cache acknowledges immediately, and the application returns success
  3. A background worker dequeues pending writes and flushes them to the database in batches
  4. Multiple updates to the same key can be coalesced (only the final value is written)

```python
pending_queue = Queue()

def update_item(item_id, new_value):
    cache.set(item_id, new_value)
    pending_queue.enqueue((item_id, new_value))
    return True  # returns before DB write

def flush_to_db():
    batch = pending_queue.dequeue_batch(size=100)
    for item_id, value in batch:
        db.upsert("items", item_id, value)
```
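Step 4's coalescing can be sketched by buffering pending writes in a dict keyed by item id instead of a plain queue, so repeated updates to one key collapse into a single database write (an in-memory sketch; `WriteBehindBuffer` and the `upsert` callback are hypothetical names):

```python
import threading

class WriteBehindBuffer:
    """Coalesces pending writes: only the latest value per key is flushed."""
    def __init__(self):
        self._pending = {}
        self._lock = threading.Lock()

    def write(self, item_id, value):
        with self._lock:
            self._pending[item_id] = value  # later writes overwrite earlier ones

    def flush(self, upsert):
        with self._lock:
            batch, self._pending = self._pending, {}
        for item_id, value in batch.items():
            upsert(item_id, value)          # one DB write per key, not per update
        return len(batch)
```

Note the trade-off: a dict coalesces but loses the arrival order of writes across keys, which matters if the flushed rows have ordering constraints.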

Why teams use it

  • Ultra-fast writes: Write latency equals cache write time (sub-millisecond). The application never waits for disk I/O.
  • Reduced DB load: 10 updates to the same key in 1 second can be coalesced into 1 DB write. Batching further reduces per-write overhead.
  • Resilience to short DB outages: The cache and queue absorb writes while the DB is briefly down. The system stays alive in degraded mode.

The serious trade-offs

  • Data loss risk: If the cache or queue crashes before flushing, pending writes are lost permanently. The database never received them.
  • DB lag: Other systems reading directly from the database see stale data. The cache is the temporary source of truth, not the database.
  • Ordering complexity: Multiple updates to related records must be flushed in the right order, or database constraints (foreign keys, uniqueness) may be violated.
  • Operational overhead: Requires a durable queue, retry logic, dead-letter handling, and monitoring for queue depth and flush lag.

When it fits vs when to avoid

Good fit: high-volume write streams such as IoT telemetry, view counters, analytics events, and metrics, where losing a few writes in a rare crash is acceptable.

Avoid: financial transactions, order processing, inventory adjustments, and anything else where a single lost write has business consequences.

Comparing Caching Strategies

Each pattern makes a different trade-off. Choosing the right one depends on your workload's read/write ratio, consistency requirements, and tolerance for complexity.

Summary table

| Dimension | Cache-Aside | Write-Through | Write-Behind |
| --- | --- | --- | --- |
| Source of truth | Database always | Database always | Cache temporarily |
| Write latency | DB write only | DB + cache write | Cache write only (fast) |
| Read freshness | Stale until TTL/invalidation | Fresh after every write | Fresh in cache, stale in DB |
| Data loss risk | None (cache is expendable) | None (DB commits first) | High (pending writes in memory) |
| Complexity | Low (get/set/delete) | Medium (dual-write ordering) | High (queue, flush, retry) |
| Memory efficiency | High (only accessed keys) | Lower (all written keys) | Lower (all written keys + queue) |

Consistency guarantees

  • Cache-aside: DB is always correct. Cache may be stale for the duration of the TTL or until explicit invalidation. Acceptable when brief staleness does not cause harm.
  • Write-through: Cache and DB are in sync after every write. Strongest immediate consistency. Use when serving a stale value has business consequences.
  • Write-behind: Cache is ahead of DB. DB is eventually consistent. Other DB readers see lagging data. Use only when you can tolerate the lag.

How to mix and match

These patterns are not mutually exclusive:

  • Use cache-aside as your default for most data
  • Add write-through for small, high-value tables (pricing, feature flags, user-critical views) where reads must always be fresh
  • Use write-behind sparingly for high-volume write streams (counters, telemetry, metrics) where DB write throughput is the bottleneck and rare data loss is acceptable

Key Insight

The most common production architecture uses all three patterns simultaneously. User profiles: cache-aside. Feature flags: write-through. Analytics counters: write-behind. The key decision is not which pattern to use, but which pattern to use for which data. Match the pattern to the data's consistency and durability requirements.

Time-to-Live (TTL) Selection for Cache Entries

TTL determines how long a cache entry lives before it expires and must be refreshed from the source of truth. Too short and you churn the cache with constant misses. Too long and you serve stale data. Getting it right requires understanding your data's change frequency and your tolerance for staleness.

[Figure: Cache key TTL expiring followed by thundering herd of concurrent database queries and stale-while-revalidate solution]

Factors that drive TTL selection

  • Data volatility: Stock prices change every second (short TTL or no cache). Country code lists change yearly (TTL of days).
  • Read frequency vs TTL: If a key is read every 5 minutes but TTL is 60 seconds, most accesses are misses. The cache provides no value. TTL should cover the typical inter-access interval.
  • Staleness tolerance: Checkout prices need near-zero staleness (short TTL + invalidation). Trending feed can be 5 minutes old (longer TTL).
  • Cache write policy: With write-through, TTL is a safety net (long is fine). With cache-aside, TTL is the primary staleness control.

Risks at the extremes

TTL too short:

  • Low hit rate: most requests fall through to the database
  • Cache churn: constant insertion/eviction wastes CPU and network
  • Synchronized expiry of many keys causes thundering herd on the database

TTL too long:

  • Users see stale data (wrong prices, outdated profiles, stale feature flags)
  • Dead keys accumulate, wasting cache memory
  • Harder to debug when old data surfaces unexpectedly

Advanced TTL strategies

Jittering: Instead of TTL = 600s for all keys, use TTL = 600 + random(0, 60). This spreads expirations over time, preventing synchronized stampedes on the database.
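Jittering is a one-liner; a sketch with Python's standard `random` module:

```python
import random

def jittered_ttl(base_seconds=600, max_jitter_seconds=60):
    # TTL = base + random(0, jitter): expirations spread across a window
    # instead of landing on the same instant for every key set together.
    return base_seconds + random.randint(0, max_jitter_seconds)
```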

Refresh-ahead: If a key is accessed and its remaining TTL is below a threshold (e.g., under 10 seconds), serve the current value and kick off an async background refresh. Popular keys stay warm without ever experiencing a cold miss.
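A refresh-ahead sketch, assuming an in-memory cache mapping key -> (value, expires_at) and a hypothetical `load_from_db` loader; the background refresh here is a plain thread, though real systems typically use a task queue:

```python
import threading
import time

def get_refresh_ahead(key, cache, load_from_db, ttl=600, threshold=10):
    """cache maps key -> (value, expires_at)."""
    now = time.time()
    entry = cache.get(key)
    if entry is None:
        value = load_from_db(key)          # cold key: ordinary miss
        cache[key] = (value, now + ttl)
        return value
    value, expires_at = entry
    if expires_at - now < threshold:
        # Serve the current value immediately; refresh in the background
        # so this hot key never produces a cold miss.
        def refresh():
            cache[key] = (load_from_db(key), time.time() + ttl)
        threading.Thread(target=refresh, daemon=True).start()
    return value
```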

Stale-while-revalidate: When a key expires, serve the stale value to all readers while exactly one request refreshes the value from the database. This eliminates the thundering herd by ensuring only one DB query per expired key.
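A single-flight sketch of stale-while-revalidate, again assuming an in-memory cache mapping key -> (value, expires_at) (the names are illustrative, not a library API):

```python
import threading
import time

_refreshing = set()            # keys currently being revalidated
_guard = threading.Lock()

def get_swr(key, cache, load_from_db, ttl=600):
    """Expired entries are served stale while exactly one caller refreshes."""
    now = time.time()
    entry = cache.get(key)
    if entry is None:
        value = load_from_db(key)
        cache[key] = (value, now + ttl)
        return value
    value, expires_at = entry
    if now >= expires_at:
        with _guard:
            i_refresh = key not in _refreshing
            if i_refresh:
                _refreshing.add(key)
        if i_refresh:
            try:
                value = load_from_db(key)   # one DB query per expired key
                cache[key] = (value, time.time() + ttl)
            finally:
                with _guard:
                    _refreshing.discard(key)
        # all other callers fall through and return the stale value
    return value
```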

Explicit invalidation + safety TTL: Set a long TTL (24h) but invalidate on every write. The TTL only fires if an invalidation was missed. It is the safety net, not the primary mechanism.

Interview Tip

Start simple: static data gets 1-24 hour TTL, moderately changing data gets 1-10 minutes, highly dynamic data gets seconds or no cache. Monitor hit ratio and staleness, then tune. Add jittering from day one. It costs nothing and prevents stampedes at scale.