How Metrics, Traces, and Logs Compose Into One Investigation

April 13, 2026

The pager goes off at 2 a.m. The alert says checkout-p99 > 2s for five minutes. You have ninety seconds to decide whether this is a real incident or a known-noisy alarm. The path you take through your observability stack in those ninety seconds is the whole reason logs, metrics, and traces are three things instead of one.

Start with metrics because they are the only pillar cheap enough to query at scale in real time. A metric is a counter, gauge, or histogram, pre-aggregated at a fixed interval. You can ask "what is checkout-p99 over the last hour, by region" and get an answer in milliseconds because the storage already collapsed every request into a bucket. Metrics tell you that something is wrong and how bad, and they do it across the entire fleet at once.
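To make "collapsed every request into a bucket" concrete, here is a minimal sketch of a bucketed latency histogram. The class name, the bucket bounds, and the quantile method are illustrative assumptions, not the API of any particular metrics library. The point is that memory stays fixed at one counter per bucket no matter how many requests arrive, at the price of only knowing which bucket the p99 falls in.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds; real systems choose their own.
BUCKET_BOUNDS = [0.1, 0.25, 0.5, 1.0, 2.0, 5.0, float("inf")]


class LatencyHistogram:
    """Keeps one counter per bucket instead of one record per request."""

    def __init__(self, bounds=BUCKET_BOUNDS):
        self.bounds = list(bounds)
        self.counts = [0] * len(self.bounds)
        self.total = 0

    def observe(self, seconds):
        # Index of the first bucket whose upper bound is >= the observation.
        idx = bisect_left(self.bounds, seconds)
        self.counts[idx] += 1
        self.total += 1

    def quantile_upper_bound(self, q):
        """Return the upper bound of the bucket that contains quantile q."""
        if self.total == 0:
            return None
        target = q * self.total
        running = 0
        for bound, count in zip(self.bounds, self.counts):
            running += count
            if running >= target:
                return bound
        return self.bounds[-1]


hist = LatencyHistogram()
for _ in range(98):
    hist.observe(0.2)   # fast requests
for _ in range(2):
    hist.observe(3.0)   # slow requests

p99_bucket = hist.quantile_upper_bound(0.99)
```

With these made-up observations the p99 lands in the bucket bounded by 5.0 s, which would trip a "p99 > 2s" alert even though the histogram never stored a single individual request.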

Once a metric tells you the fleet is sick, traces tell you which hop is sick. A trace stitches together the spans every service produced for one request, with parent and child relationships preserved. The unique thing it gives you is a time breakdown across the call graph. Checkout is slow because the inventory service spent 1.4 s waiting on a Redis lookup, and Redis was queued behind a slow KEYS * from a misbehaving job. You did not need to read any code to find that. The trace painted it.
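The time breakdown a trace gives you can be sketched in a few lines. The Span fields, the helper names, and the timings below are assumptions made up for illustration, loosely matching the checkout example above, and the sketch assumes child spans do not overlap inside their parent. It computes each span's self time, meaning its own duration minus the time spent in its direct children, and picks the hop where the time actually went.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    name: str
    start_ms: float
    end_ms: float


def self_time_by_span(spans):
    """Return {span_id: duration minus time spent in direct children}.

    Assumes children run sequentially (no overlap) within their parent.
    """
    child_time = {}
    for s in spans:
        if s.parent_id is not None:
            duration = s.end_ms - s.start_ms
            child_time[s.parent_id] = child_time.get(s.parent_id, 0.0) + duration
    return {
        s.span_id: (s.end_ms - s.start_ms) - child_time.get(s.span_id, 0.0)
        for s in spans
    }


def slowest_hop(spans):
    """Return the span with the largest self time."""
    self_times = self_time_by_span(spans)
    return max(spans, key=lambda s: self_times[s.span_id])


# Made-up trace: checkout calls inventory, which waits on Redis for 1.4 s.
trace = [
    Span("a", None, "checkout", "POST /checkout", 0, 1600),
    Span("b", "a", "inventory", "reserve_items", 50, 1550),
    Span("c", "b", "redis", "GET stock", 100, 1500),
]

culprit = slowest_hop(trace)
```

Here checkout and inventory each account for only 100 ms of their own time, while the Redis span owns 1400 ms, so culprit.service is "redis".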

Then you go to logs, because by the time you know which service and which span, you need the exact stack trace, the exact SQL, the exact bytes that the trace cannot store. Logs are the per-event record. They are expensive because each line is real storage, but you only need to read the few hundred lines from the slow service in the affected window.
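Narrowing logs to one service and one time window is what keeps this step affordable. The sketch below assumes JSON-formatted log lines with hypothetical "ts", "service", and "msg" fields and timestamps in ISO 8601 form without a zone suffix; a real log backend would run this filter server-side with its own query language.

```python
import json
from datetime import datetime


def lines_in_window(raw_lines, service, start, end):
    """Yield parsed records for one service with start <= ts <= end."""
    for raw in raw_lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict) or record.get("service") != service:
            continue
        try:
            ts = datetime.fromisoformat(record.get("ts"))
        except (TypeError, ValueError):
            continue  # skip records with a missing or malformed timestamp
        if start <= ts <= end:
            yield record


# Made-up log lines for illustration.
sample = [
    '{"ts": "2026-04-13T02:03:10", "service": "inventory", "msg": "redis GET timed out"}',
    '{"ts": "2026-04-13T02:03:11", "service": "checkout", "msg": "ok"}',
    '{"ts": "2026-04-13T01:00:00", "service": "inventory", "msg": "ok"}',
    "not json at all",
]

hits = list(
    lines_in_window(
        sample,
        service="inventory",
        start=datetime(2026, 4, 13, 2, 0, 0),
        end=datetime(2026, 4, 13, 2, 10, 0),
    )
)
```

Only the first sample line survives the filter: it is the one inventory record inside the ten-minute window.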

This stack betrays you in production in two ways.

First, cardinality explosions in metrics. Someone tags a Prometheus counter with user_id so it is "easier to filter," and because each distinct combination of label values is its own time series, every active user now spawns a new one. The series count goes from 50K to 30M overnight, the TSDB OOMs, and your alerting backend itself becomes the incident. High-cardinality identifiers belong in logs or traces, never in metric labels.

Second, head-based trace sampling at 1%. The keep-or-drop decision is made when the request starts, before anyone knows it will be slow, so the request that triggered the page has a 99% chance of never having been recorded. You open the trace UI and see nothing. Tail-based sampling, where the collector buffers each trace until it completes and keeps any trace whose root span exceeded a threshold, fixes this without paying to store every span.
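The difference between the two sampling strategies comes down to when the decision is made. The following is a minimal sketch, not the configuration of any real collector: the function names, the 2 s threshold, the 1% baseline rate, and the synthetic durations are all assumptions chosen to mirror the example above.

```python
import random


def head_sample(rate, rng):
    """Decide at request start, before the latency is known."""
    return rng.random() < rate


def tail_sample(root_duration_s, threshold_s, baseline_rate, rng):
    """Decide after the trace completes.

    Always keep traces whose root span exceeded the threshold, and keep
    a small random share of the rest as a baseline for comparison.
    """
    if root_duration_s > threshold_s:
        return True
    return rng.random() < baseline_rate


# Synthetic traffic: 9,950 fast requests and 50 slow ones.
rng = random.Random(7)
durations = [0.2] * 9950 + [3.0] * 50

slow_total = sum(1 for d in durations if d > 2.0)
slow_kept_by_tail = sum(
    1 for d in durations if d > 2.0 and tail_sample(d, 2.0, 0.01, rng)
)
slow_kept_by_head = sum(
    1 for d in durations if d > 2.0 and head_sample(0.01, rng)
)
```

Tail sampling keeps all 50 slow traces by construction, while head sampling at 1% keeps each slow trace only by chance, which is why the trace you need at 2 a.m. is usually missing.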

The discipline: pick the pillar that matches the question, and resist the urge to make any one pillar do all three jobs.

Key takeaway

Each pillar answers a different question and has a different cost curve. The skill is moving from metric to trace to log in that order, and not paying twice by stuffing the wrong data into the wrong pillar.

Originally posted on LinkedIn.