Animated explainers on databases.
A SELECT across shards becomes a fanout plan with parallel scans, shuffles, and a coordinator merge. Tail latency is dominated by the slowest shard, and cross-shard joins quietly break linear scaling.
Caching, queueing, replication, partitioning: every system design choice bends to your database. Here is why databases shape everything downstream of them.
Event sourcing turns your database into an append-only log of facts. State becomes a function of history. The cost is schema evolution forever and replay times that grow with your business.
Writes, reads, indexing, transactions, replication, partitioning. The technology stack changes every five years, but the six core database questions never do.
A single Place Order request looks like one write. In production it fans out into a graph of cache invalidations, index updates, events, and audit logs.
OLTP, OLAP, key value, document, time series, columnar, search. Each storage engine is optimized for a specific access pattern. Match the workload first, the brand name second.
Async replication scales reads but lets users see their own writes disappear. Read-your-writes, monotonic reads, session pinning, and the failures that follow.
An inverted index flips the problem: instead of scanning documents for terms, you look up terms and get the documents. That is the whole trick behind millisecond search.
Multi-leader cuts cross-region write latency, then hands you conflicts. LWW silently loses data, app-level merge takes work, and CRDTs only fit certain shapes.
Leader follower replication is the boring default for Postgres, MySQL, and MongoDB. Sync vs async, read scaling, failover, and the lag-driven data loss waiting in production.
Shared schema, schema per tenant, database per tenant. Each model trades operational cost for blast radius. Row-level security alone does not stop the noisy neighbor.
Time-series data is append-only, write-heavy, and queried by time range. That access pattern demands columnar layout, delta-encoded timestamps, and downsampling, not a B-tree.
Embed for one-to-few, bounded, read-together data. Reference for shared or unbounded data. The 16MB document limit and write hotspots decide the rest.
Database choice is really a relationship-shape choice. Joins want relational. Aggregates want document. Traversal wants graph. Forcing the wrong one shows up as recursive CTE hell.
B+tree indexes give O(log n) reads via sorted pages, with leftmost prefix rules for composite keys and covering indexes that skip the heap. Every index also taxes every write, so prune before you add.
Nested loop, hash join, merge join. Each wins in a different shape of data. Reading EXPLAIN tells you which one the planner chose and whether it was right.
How InfluxDB, TimescaleDB, and Prometheus actually store metrics. Time-chunked files, columnar layout, downsampling, retention, and why insert-only is the whole game.
Range, hash, and list partitioning each optimize for different access patterns. The trap is picking a key that distributes writes well but destroys read locality.
Partition pruning lets the planner skip irrelevant partitions, but only when it can see the partition key in the predicate. Wrap the column in a function and the optimization disappears.
The cost-based optimizer picks scan type, join order, and join method from row count estimates. When the estimates are off by five orders of magnitude, it picks nested loops on a million rows and your query runs for 25 minutes.
WAL appends every change to a sequential log before touching data pages. fsync on that log is the durability line. Group commit, synchronous_commit, and battery-backed caches all live on this line.
Read Uncommitted, Read Committed, Repeatable Read, and Serializable each ban a specific anomaly. Postgres Repeatable Read is snapshot isolation and still allows write skew. Knowing which level stops what is the whole game.
B-trees do in-place page updates with random I/O. LSM-trees buffer in a memtable and flush sorted SSTables. The write path is where the two structures diverge.
MVCC keeps old row versions until no transaction can see them. VACUUM reclaims that space. Long transactions, idle-in-transaction sessions, and replication slots pin dead tuples and let tables bloat 10x or more.
K hash functions, one bit array, zero false negatives. Bloom filters answer maybe or definitely not, which is exactly what LSM reads, CDN caches, and crawlers need to skip expensive checks.
Memtable plus WAL plus SSTable levels. The LSM write path is O(1) sequential I/O, while reads climb levels guarded by bloom filters and sparse indexes. Compaction is the price.
MVCC gives every transaction its own snapshot by keeping multiple row versions. Postgres tuples plus VACUUM, MySQL undo log. Readers and writers stop fighting.
Partitioning bounds the working set, indexing accelerates lookups within it. You usually need both, and local vs global indexes is the question that decides whether your partitioned table is actually fast.
Atomicity stops partial commits. Consistency is an app invariant, not a DB guarantee. Isolation stops concurrent transactions from corrupting each other. Durability survives the crash. Most apps lose money at the I, not the A.
Size-tiered compaction trades space for write throughput. Leveled compaction trades write amplification for predictable reads. Universal sits in the middle. Pick by workload, not by default.
An LSM GET could scan every SSTable on every level. It does not, because two structures filter the work: a bloom filter per file and a sparse index inside each file. Both are non-negotiable.
B-trees mutate pages in place. LSM trees append immutable files and merge later. Once you see the disk shape, write amplification and read latency stop being mysterious.
B-trees amplify writes through page rewrites and the WAL. LSM trees amplify through compaction across levels. The math behind each, and why neither is free.
Redis Cluster shards by 16384 hash slots, not by keys. Here is how MOVED and ASK redirects keep clients on the right node during a live resharding, and how one bad client setting can loop traffic forever.
Redis is honest about durability. RDB, AOF, and replication solve different problems and lose different amounts of data. Here is exactly what is gone after a crash under each mode.
RDB snapshots are fast but lossy. AOF with everysec costs a second of writes on crash. Replication protects availability, not data. The mode you choose decides what you lose.
Elasticsearch has two completely different code paths. The index path builds inverted indexes through analyzer chains. The search path fans out, scores, and merges. The refresh interval is the seam between them.