Observability Basics for Modern Systems
A monolithic application running on one server is easy to debug. You SSH in, read the log file, check CPU usage, and find the problem. Microservices shatter that simplicity. A single user request might traverse 10 services across 50 containers in 3 data centers. When that request fails, the log file on one server tells you almost nothing.
Observability is the ability to understand what is happening inside your system by examining its external outputs — without deploying new code or adding ad-hoc debugging. If monitoring tells you "the database is slow," observability tells you "the database is slow because tenant X ran a full-table scan on the orders table, which caused lock contention that cascaded into 500ms p99 latency for the checkout service."

Monitoring vs. Observability
Monitoring is a subset of observability. Monitoring asks predefined questions: "Is CPU above 80%? Is error rate above 1%?" You set thresholds, create alerts, and get paged when they fire. This works for known failure modes.
Observability handles the unknown unknowns. When a new failure mode appears — one you have never seen before and have no alert for — observability lets you explore the system's telemetry to diagnose the root cause. You ask questions you did not anticipate when building the dashboards.
The distinction matters because microservices generate novel failure modes constantly. A new deployment in Service A changes a response payload, which causes a deserialization error in Service B, which triggers retries that overwhelm Service C. No predefined alert catches this cascade because no one predicted it. Observability gives you the tools to trace the cascade after it happens.
The Three Telemetry Signals
Observability is built on three complementary data types:
Metrics answer "what is happening right now?" — request rate, error percentage, latency percentiles, CPU utilization. Metrics are numeric time-series data, cheap to store, and ideal for dashboards and alerts.
Logs answer "why did this specific thing happen?" — structured records of individual events with context (request ID, user ID, error message, stack trace). Logs are high-volume and expensive to store but essential for debugging specific incidents.
Traces answer "how did this request flow through the system?" — a trace follows a single request across service boundaries, showing which services were called, in what order, and how long each step took. Traces are the only signal that reveals cross-service dependencies and latency bottlenecks.
Each signal alone is incomplete. Metrics tell you latency is high but not why. Logs tell you an error occurred but not the request path that triggered it. Traces show the request path but not the system-wide impact. Combined, they give you full diagnostic capability.
Why Observability Is Non-Negotiable for Microservices
In a monolith, a stack trace shows the full execution path from request to response. In microservices, the execution path spans multiple processes on multiple machines. Without distributed tracing, you cannot reconstruct the path. Without correlated logs, you cannot see what happened at each step. Without metrics, you cannot detect the problem in the first place.
The operational cost of poor observability is measured in Mean Time to Resolution (MTTR). Teams with strong observability resolve incidents in minutes. Teams without it spend hours reproducing issues, correlating timestamps across services, and guessing at root causes. The difference between "we found the issue in 5 minutes" and "we spent 4 hours debugging" is usually not engineering skill — it is telemetry quality.
The Cost of Instrumenting Later
Teams often treat observability as something to add after the system works. This is a mistake. Retrofitting instrumentation into 30 services requires touching every service's code, coordinating deployments, and migrating from inconsistent log formats to structured logging.
Instrument from day one:
- Add structured logging with trace_id propagation as part of the service template
- Configure OpenTelemetry auto-instrumentation in the deployment pipeline
- Define RED metrics for every service endpoint before launch
- Set up a baseline dashboard before the service takes production traffic
The cost of adding observability to a new service is minimal — a few lines of configuration. The cost of retrofitting it across an existing fleet of services is weeks of engineering time. Services without observability accumulate "debugging debt" that compounds every time an incident requires manual investigation.
Never postpone observability until after launch. A service without instrumentation in production is a black box. When it fails (and it will), you will spend hours guessing at root causes instead of minutes tracing them. Treat observability setup as a launch requirement, not a post-launch improvement.
Each pillar has distinct characteristics, storage costs, and query patterns. Understanding the trade-offs helps you choose what to instrument heavily and what to sample.
Metrics — Numeric Time-Series Data
Metrics are aggregated measurements over time intervals. They answer quantitative questions: "How many requests per second? What is the 99th percentile latency? How much memory is free?"
Four fundamental metric types:
Counter — A monotonically increasing value. Total requests served, total errors, total bytes transferred. You never set a counter; you only increment it. The rate of change (requests per second) is computed at query time.
Gauge — A value that goes up and down. Current CPU usage, memory utilization, active connections, queue depth. A gauge is a snapshot of the current state.
Histogram — A distribution of values across configurable buckets. Request latency measured in buckets (0-10ms, 10-50ms, 50-100ms, 100-500ms, 500ms+). Histograms enable percentile calculations (p50, p95, p99) without storing every individual measurement.
Summary — Like a histogram but computes percentiles client-side. Less flexible for aggregation across instances but more accurate for individual service percentiles.
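On the wire, these four types appear in the text-based exposition format that services expose on /metrics. A hypothetical fragment (the metric names and values are illustrative):

```text
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027345

# TYPE active_connections gauge
active_connections 48

# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.01"} 90012
http_request_duration_seconds_bucket{le="0.05"} 97230
http_request_duration_seconds_bucket{le="0.1"} 97840
http_request_duration_seconds_bucket{le="+Inf"} 98001
http_request_duration_seconds_sum 912.4
http_request_duration_seconds_count 98001
```

Note that histogram buckets are cumulative: each `le` (less-than-or-equal) bucket includes every observation from the smaller buckets, which is what lets the backend compute percentiles at query time.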
Metrics are cheap to store because they are pre-aggregated. A counter that tracks HTTP requests stores one number per time interval regardless of how many requests occurred. This makes metrics ideal for dashboards and real-time alerting at scale.

Logs — Structured Event Records
Logs capture individual events with full context. Unlike metrics (which aggregate), logs preserve the details of each occurrence.
Structured logs use a consistent format (JSON) with standardized fields:
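A hypothetical entry for the declined-payment example below (field names and values are illustrative):

```json
{
  "timestamp": "2024-05-14T10:32:07.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "message": "Payment declined",
  "error_code": "INSUFFICIENT_FUNDS",
  "user_id": "42"
}
```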
Structured logs are machine-parseable. You can filter by service=payment-service AND level=ERROR AND error_code=INSUFFICIENT_FUNDS to find every declined payment. Unstructured logs ("Payment failed for user 42") require regex parsing that breaks when the format changes.
Log correlation ties logs across services. When every service includes the same trace_id in its log entries, you can query "show me all log entries from all services for this specific request." This is the bridge between logs and traces — the trace shows the request flow, and the logs show what happened at each step.
The cost challenge: a busy service generating 10,000 requests per second with 3 log lines per request produces 30,000 log entries per second — 2.6 billion per day. At 1KB per entry, that is 2.6TB per day of log storage. Log sampling (store 10% of successful requests, 100% of errors) reduces cost while preserving diagnostic value.
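The sampling policy above can be sketched as a small decision function. This is a minimal sketch, not any particular library's API; the hash-based approach is an assumption chosen so that the sampling decision is consistent across services:

```python
import hashlib

def should_store_log(level: str, trace_id: str,
                     info_sample_rate: float = 0.10) -> bool:
    """Keep 100% of ERROR/WARN logs; sample INFO logs by trace_id.

    Hashing the trace_id (rather than calling random()) makes the
    decision deterministic: a sampled request keeps ALL of its log
    lines across every service, not a random subset of them.
    """
    if level in ("ERROR", "WARN"):
        return True  # errors and warnings are always stored
    # Map the trace_id onto [0, 1) and compare against the sample rate.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < info_sample_rate
```

Because the decision is a pure function of the trace_id, any service that sees the same request reaches the same verdict without coordination.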
Log Levels and When to Use Each
Consistent log levels across all services prevent noise and enable filtering:
- ERROR — Something broke and requires attention. A payment failed, a database connection was refused, an unhandled exception occurred. Alert-worthy.
- WARN — Something unexpected happened but the system recovered. A retry succeeded, a cache miss fell back to the database, a deprecated API was called. Review in dashboards, do not alert.
- INFO — Normal operational events. A request was handled, a job completed, a config was loaded. The default level for production.
- DEBUG — Detailed execution flow for development. Variable values, branch decisions, intermediate state. Never enable in production unless actively debugging — the volume will overwhelm storage.
The rule: production systems run at INFO level. ERROR and WARN logs are always stored (100%). INFO logs are sampled based on budget. DEBUG is disabled entirely in production and enabled temporarily via a feature flag when investigating specific issues.
Metric Cardinality — The Hidden Cost
Cardinality is the number of unique time series a metric produces. A metric with labels {method, status} might produce 20 combinations (5 methods times 4 status codes). That is manageable.
But adding a user_id label to the same metric explodes cardinality. If you have 1 million users, that single label creates 1 million time series — multiplied by every method and status combination. Prometheus stores each time series independently, so high cardinality consumes memory, slows queries, and can crash the server.
Rules for label safety:
- Bounded labels are safe: HTTP method (GET, POST, PUT, DELETE), status code class (2xx, 3xx, 4xx, 5xx), service name, region, environment
- Unbounded labels are dangerous: user_id, request_id, email address, IP address, order_id, URL path with variable segments
- URL paths need normalization: /api/users/123 and /api/users/456 should both map to /api/users/:id. Without normalization, every unique user ID creates a new time series
If you need per-user breakdown, use logs (which handle high cardinality natively) or traces (which attach user_id as a span attribute). Reserve metrics for aggregate, bounded dimensions only.
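The path-normalization rule above can be sketched with a couple of regular expressions. This is a minimal sketch under assumed patterns; in practice, most web frameworks expose the route template (e.g. /api/users/:id) directly, which is more reliable than rewriting paths:

```python
import re

# Matches a UUID path segment, e.g. /550e8400-e29b-41d4-a716-446655440000
UUID_SEGMENT = re.compile(
    r"/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=/|$)"
)
# Matches a purely numeric path segment, e.g. /123
NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")

def normalize_path(path: str) -> str:
    """Collapse variable path segments so metric labels stay bounded."""
    path = UUID_SEGMENT.sub("/:id", path)
    path = NUMERIC_SEGMENT.sub("/:id", path)
    return path
```

With this in place, /api/users/123 and /api/users/456 both produce the single label value /api/users/:id instead of one time series per user.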
Traces — Distributed Request Paths
A trace follows a single request from entry point to completion across all services it touches. Each step in the trace is a span with a start time, duration, and metadata.
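Rendered as a waterfall, a hypothetical checkout trace might look like this (service names and the bar layout are illustrative; the timings follow the example discussed in this section):

```text
checkout-service   GET /checkout      [############################] 450ms
├─ inventory-svc   check_stock        [###]                           40ms
├─ payment-service charge                 [########################] 385ms
│  └─ stripe-api   POST /v1/charges          [####################]  330ms
└─ notification    enqueue_email                                [#]   10ms
```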
This waterfall view immediately shows that the payment service (385ms) dominates the total request time (450ms), and within the payment service, the Stripe API call (330ms) is the bottleneck.
Context propagation is the mechanism that makes tracing work across services. When Service A calls Service B, it passes trace context (trace ID, span ID, sampling flags) in HTTP headers or gRPC metadata. Service B creates a child span linked to Service A's span. This chaining produces the tree structure above.
The standard context propagation format is W3C Trace Context, which uses two headers:
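For example (the values are illustrative):

```text
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                │
             │  trace-id (16 bytes, hex)         parent span-id   trace-flags
             version                             (8 bytes, hex)   (01 = sampled)

tracestate:  congo=t61rcWkgMzE
```

The traceparent header carries the IDs needed to link spans into one trace; the optional tracestate header carries vendor-specific key/value pairs alongside it.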
Without context propagation, spans from different services are orphaned — you see individual service timings but cannot reconstruct the request flow.
Span Attributes — Making Traces Searchable
A span without attributes tells you "the payment service took 300ms." A span with attributes tells you "the payment service processed order #789 for user_42, charged $129.99 via Stripe, and succeeded on the second retry after a timeout."
Essential span attributes:
- Business context: order_id, user_id, product_id, tenant_id
- Operation details: HTTP method, URL path, database query type, queue name
- Outcome: status code, error message, retry count
- Infrastructure: pod name, node, availability zone
Unlike metric labels, span attributes can have high cardinality because traces are individual records, not aggregated time series. Attaching user_id to a span is safe and useful. Attaching user_id to a metric label is dangerous.
Set attributes as early as possible in the span lifecycle. If a span errors before you set attributes, you lose the debugging context. A common pattern is setting business attributes immediately after parsing the request, and outcome attributes in a deferred block that runs on both success and failure.
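The ordering pattern can be sketched as follows. This is a minimal sketch: `span_attributes` stands in for a real span's attribute map, and `charge` stands in for the actual payment call — both names are assumptions for illustration:

```python
def handle_payment(order, span_attributes, charge):
    # 1. Business context immediately after parsing the request: if the
    #    charge below raises, the span still carries order_id/user_id.
    span_attributes["order_id"] = order["order_id"]
    span_attributes["user_id"] = order["user_id"]
    outcome = "error"
    try:
        charge(order)
        outcome = "success"
    finally:
        # 2. Outcome attributes in a deferred block that runs on both
        #    the success path and the failure path.
        span_attributes["outcome"] = outcome
```

If the attribute assignments were moved after the charge call instead, a thrown exception would leave the span with no business context at all — exactly when you need it most.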
The three pillars are not independent data stores you query separately. The power comes from correlation. A metric alert fires (error rate spike), you filter traces for the affected time window and endpoint, find a slow trace, then pull the logs for that trace ID. The trace_id is the join key that connects all three pillars into a single diagnostic workflow.
Collecting every possible metric creates dashboard sprawl and alert fatigue. RED and USE are frameworks that focus your instrumentation on the metrics that actually matter for diagnosing problems.

RED — For Services (User-Facing)
RED focuses on the service from the caller's perspective:
Rate — Requests per second. How much traffic is this service handling? A sudden drop in rate might mean an upstream service stopped calling, or a load balancer misconfiguration is routing traffic elsewhere.
Errors — Failed requests per second (or error percentage). What fraction of requests are failing? Distinguish between client errors (4xx, the caller's fault) and server errors (5xx, your fault). Alert on 5xx rate, not 4xx.
Duration — Latency distribution (p50, p95, p99). How long do requests take? The p50 shows typical experience, the p99 shows worst-case. A p99 that is 10x the p50 indicates a bimodal distribution — most requests are fast, but some hit a slow path.
RED answers "are users having a good experience with this service?" If rate is stable, errors are low, and duration is within SLA, the service is healthy from the user's perspective.
USE — For Resources (Infrastructure)
USE focuses on physical and logical resources:
Utilization — What percentage of the resource capacity is consumed? CPU at 85%, disk at 70%, connection pool at 90%.
Saturation — How much excess work is queued? Queue depth, wait time, number of threads blocked waiting for a lock. A resource can be 50% utilized but 100% saturated if all incoming work is serialized through a single lock.
Errors — Resource-level errors. Disk I/O errors, network packet drops, out-of-memory kills. These are different from application errors — they indicate hardware or kernel-level problems.
USE answers "is this resource the bottleneck?" Apply USE to every resource: CPU, memory, disk, network, connection pools, thread pools, database connections.
Concrete Metric Names
When implementing RED and USE, use consistent naming. Prometheus conventions:
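One plausible scheme following those conventions (the exact names are illustrative, not prescribed):

```text
# RED (per service endpoint)
http_requests_total{service, method, status}       # counter   -> Rate, Errors
http_request_duration_seconds{service, endpoint}   # histogram -> Duration

# USE (per resource)
node_cpu_seconds_total{mode}                       # counter   -> Utilization
db_connections_active / db_connections_max         # gauges    -> Utilization (ratio)
db_connection_wait_queue_depth                     # gauge     -> Saturation
disk_io_errors_total                               # counter   -> Errors
```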
Utilization is computed as a ratio at query time: db_connections_active / db_connections_max. This is more flexible than storing a pre-computed percentage because the max can change dynamically (e.g., after scaling connection pool size).
Common Anti-Patterns
Vanity metrics — tracking total requests served "all time" without rate calculation. A counter that says "10 million requests" is meaningless. The rate of change (requests per second) is what matters for capacity planning and incident detection.
Alert on every USE dimension — alerting when CPU exceeds 80%, AND memory exceeds 70%, AND disk exceeds 60%. Each alert fires independently, creating an alert storm when a single event (like a batch job) temporarily elevates all resources. Instead, alert on the RED symptom (latency or error rate) and use USE metrics only for diagnosis.
Missing saturation metrics — tracking utilization without saturation. A thread pool at 50% utilization looks healthy, but if every incoming request waits 2 seconds for a thread, the pool is fully saturated. Always pair utilization with queue depth or wait time.
Combining RED and USE
RED and USE are complementary. RED tells you "the checkout service has high latency." USE tells you "the database server's CPU is at 98% utilization." Together: "the checkout service is slow because the database is CPU-bound."
The diagnostic flow:
- RED alert fires: checkout service p99 latency above 500ms
- Check RED metrics for checkout's dependencies: payment service RED looks normal, database calls show high duration
- Check USE metrics for the database: CPU utilization at 95%, saturation (connection queue depth) is growing
- Root cause: a heavy batch job is consuming database CPU, starving the checkout service's queries
A second example showing the reverse direction:
- USE alert fires: connection pool utilization at 100% on the API servers
- Check RED metrics for the API service: rate is normal, errors are increasing (connection timeout), duration p99 is climbing
- Check RED metrics for downstream services: the recommendation service has high duration (3 seconds average, normally 200ms)
- Check USE for the recommendation service host: memory utilization at 99%, saturation shows frequent garbage collection pauses
- Root cause: a memory leak in the recommendation service is causing GC pauses that slow responses, which holds open connections on the API server, exhausting the pool
This cascading investigation — RED to USE to RED to USE — is the standard diagnostic pattern. Each framework narrows the search space until you reach the root cause. The key discipline is following the data rather than guessing.
In interviews, state that you use RED for services and USE for resources. This shows you understand the difference between application-level health (RED) and infrastructure-level health (USE). Most candidates only mention one or describe ad-hoc metric selection.
Concepts need tooling. The observability ecosystem has converged around a few dominant stacks, but the architectural pattern is the same regardless of vendor: instrument, collect, store, query, alert.
The Metrics Stack: Prometheus + Grafana
Prometheus is a pull-based metrics system. It scrapes HTTP endpoints (usually /metrics) on your services at regular intervals (typically every 15 seconds). Your service exposes metrics in Prometheus format; Prometheus stores them in a local time-series database.
Grafana queries Prometheus using PromQL and renders dashboards. A typical RED dashboard for a service:
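The three RED panels map to three PromQL queries. A sketch, assuming conventionally named metrics (http_requests_total as a counter with a status label, http_request_duration_seconds as a histogram — the names are illustrative):

```promql
# Rate: requests per second over the last 5 minutes
sum(rate(http_requests_total{service="checkout"}[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
  / sum(rate(http_requests_total{service="checkout"}[5m]))

# Duration: p99 latency computed from histogram buckets
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))
```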
Prometheus is designed for reliability, not long-term storage. It stores 15-30 days of data locally. For long-term retention, use Thanos or Cortex, which provide a Prometheus-compatible API backed by object storage (S3).
Thanos works by adding a sidecar to each Prometheus instance that uploads blocks to object storage. A Thanos Query component provides a unified PromQL interface across all Prometheus instances and historical data. This architecture keeps Prometheus simple (local storage only) while enabling months or years of metric retention.
Cortex takes the opposite approach: services push metrics to Cortex, which handles storage, replication, and multi-tenancy. Cortex is more complex to operate than the Thanos sidecar model but provides a centralized, horizontally scalable metrics backend.
Either way, long-term retention matters because capacity planning and SLO reporting require historical data. You cannot calculate "99.9% availability over 30 days" if Prometheus only retains 15 days of data.
The Logging Stack: Structured Logging + Aggregation
Two dominant architectures:
ELK (Elasticsearch, Logstash, Kibana) — Logstash collects and parses logs, Elasticsearch indexes them for full-text search, Kibana provides the query UI. Powerful but operationally heavy — Elasticsearch clusters require significant memory and careful index management.
Loki + Grafana — Loki indexes only log metadata (labels like service, level, pod) and stores log content in object storage. Queries filter by labels first, then grep the content. Much cheaper than ELK because it avoids full-text indexing, but queries on log content are slower.
Both stacks benefit from structured logging. If your logs are JSON with consistent fields, parsing is trivial. If your logs are unstructured text, every format change requires updating parse rules.
Regardless of stack, enforce a standard set of log fields across all services:
- timestamp in ISO 8601 format (with timezone)
- level (ERROR, WARN, INFO, DEBUG)
- service (the service name emitting the log)
- trace_id and span_id (for correlation with traces)
- message (human-readable description)
- error_code (machine-parseable error identifier, if applicable)
Standardized fields enable cross-service queries without knowing each service's internal log format. You can query "show all ERROR logs with trace_id X across all services" in a single search.
The Tracing Stack: OpenTelemetry + Jaeger
OpenTelemetry (OTel) is the vendor-neutral standard for instrumentation. OTel provides SDKs for every major language that auto-instrument HTTP clients, database drivers, and messaging libraries. One instrumentation, multiple backends.
Jaeger stores and visualizes traces. It provides the waterfall view that shows span timing, and supports querying traces by service, operation, duration, or tag values.
The OTel Collector acts as a pipeline between your services and backends. Services send telemetry to the collector, which can batch, filter, sample, and route data to multiple backends simultaneously (send metrics to Prometheus, logs to Loki, traces to Jaeger).
Auto-Instrumentation vs. Manual Instrumentation
OTel provides auto-instrumentation libraries that hook into common frameworks and libraries without code changes. In Java, a single -javaagent JVM flag instruments every HTTP call, database query, and gRPC request automatically. In Python, opentelemetry-instrument wraps WSGI/ASGI servers to capture every incoming request.
Auto-instrumentation covers the "infrastructure" spans — HTTP client/server, database, message queue. But it cannot capture business context. It does not know that this request is processing order #789, or that the user is a premium-tier customer.
Manual instrumentation adds business context:
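A sketch of what this looks like, using a minimal stand-in span so the example is self-contained; in the real OpenTelemetry API the span comes from tracer.start_as_current_span, but the shape of the calls is similar:

```python
from contextlib import contextmanager

class Span:
    """Minimal stand-in for an OpenTelemetry span (the real span object
    exposes a similar set_attribute method)."""
    def __init__(self, name):
        self.name = name
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value

@contextmanager
def start_span(name, collected):
    # Stand-in for tracer.start_as_current_span(name); `collected`
    # plays the role of the trace exporter.
    span = Span(name)
    collected.append(span)
    yield span

def process_checkout(order, collected_spans):
    # Auto-instrumentation already wraps the incoming HTTP request in a
    # server span; this manual span adds the business context that
    # auto-instrumentation cannot know about.
    with start_span("process_checkout", collected_spans) as span:
        span.set_attribute("order_id", order["id"])
        span.set_attribute("user_tier", order["user_tier"])
        span.set_attribute("cart_value_usd", order["total"])
        return "ok"
```

Now a trace for order #789 is searchable by order_id or user tier, which no amount of auto-instrumented HTTP and database spans could provide.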
The best practice is to start with auto-instrumentation for baseline visibility, then add manual spans for critical business operations where you need granular timing and context.
Alerting Rules in Prometheus
Prometheus alerting rules evaluate PromQL expressions at regular intervals and fire alerts when conditions are met:
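A sketch of a symptom-based rule in the Prometheus rule-file format (the expression follows the checkout latency example used in this section; group name, labels, and annotations are illustrative):

```yaml
groups:
  - name: checkout-alerts
    rules:
      - alert: CheckoutHighLatency
        # p99 latency for the checkout service, derived from histogram buckets
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le)
          ) > 0.5
        # The condition must hold for 5 minutes before the alert fires
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Checkout p99 latency above 500ms for 5 minutes"
```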
The for clause prevents alerting on transient spikes — the condition must hold for the specified duration before the alert fires. This is critical for reducing false positives from brief deployment hiccups or garbage collection pauses.

Tools are half the story. How you use them determines whether you can find and fix issues in minutes or hours.
Define SLOs Before Building Dashboards
A Service Level Objective (SLO) is a target for user-visible behavior: "99.9% of checkout requests complete within 500ms." An SLI (Service Level Indicator) is the metric that measures it: the percentage of checkout requests with latency under 500ms over a rolling 30-day window.
SLOs focus your observability. Instead of dashboarding every metric, instrument the SLIs that map to user experience. If the SLO for checkout latency is met, the system is healthy from the user's perspective — regardless of how noisy the internal metrics look. Alert on SLO burn rate (how fast you are consuming your error budget), not on individual metric thresholds.
Alert on Symptoms, Not Causes
Bad alert: "CPU is above 80%." CPU at 80% might be perfectly fine if latency is normal. Good alert: "p99 latency for the checkout service is above 500ms for 5 minutes." This is a symptom that users are experiencing. The CPU metric helps diagnose the cause after the symptom alert fires.
Symptom-based alerting dramatically reduces alert fatigue. A single "latency is high" alert replaces dozens of cause-based alerts (CPU, memory, disk, connections, queue depth) that may or may not affect users. When the symptom alert fires, the engineer investigates using dashboards and traces rather than being bombarded with a cascade of cause alerts.
Propagate Context Everywhere
Every service must propagate trace context (trace_id, span_id) on every outbound call — HTTP, gRPC, message queues, async jobs. If one service in a chain of 10 drops the context, the trace breaks and you lose visibility into everything downstream.
Include trace_id in every log entry. This is the single most valuable observability improvement you can make. It turns disconnected log files into a correlated timeline of a single request. Most logging frameworks support MDC (Mapped Diagnostic Context) or equivalent, which automatically includes the trace_id in every log line for the duration of a request.
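In Python's standard logging module, the MDC-style pattern can be sketched with a logging.Filter. A minimal sketch: in a real service the middleware that sets the contextvar, and ideally the trace_id itself, would come from your tracing library rather than being managed by hand:

```python
import logging
from contextvars import ContextVar

# Set by request middleware (assumption: something upstream extracts the
# trace_id from the incoming traceparent header at the start of each request).
current_trace_id: ContextVar[str] = ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Enrich every log record with the active trace_id."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"level": "%(levelname)s", "trace_id": "%(trace_id)s", '
    '"message": "%(message)s"}'
))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("payment authorized")  # this line now carries the trace_id
```

Because the filter runs on every record the handler processes, no individual log call needs to mention the trace_id, which is what makes the correlation reliable.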
Design Dashboards for Diagnosis, Not Decoration
A dashboard should answer a specific question in under 10 seconds. The top-level dashboard shows RED metrics for each service. Clicking a service opens its detail dashboard with latency breakdown by endpoint, error breakdown by status code, and dependency health. Clicking a dependency shows USE metrics for that resource.
This drill-down structure matches the diagnostic workflow: detect (RED overview), localize (service detail), diagnose (resource USE). Every panel should have a purpose. If a panel has not been looked at during the last 3 incidents, remove it.
Sample Intelligently
At high volume, storing 100% of traces and logs is cost-prohibitive. Head-based sampling (decide at the entry point whether to sample this request) is simple but misses interesting requests. Tail-based sampling (decide after the request completes, based on duration or error status) captures every slow or failed request regardless of sampling rate. Tail-based sampling requires buffering all spans until the trace completes, which adds memory overhead to the collector, but ensures you never miss an interesting trace.
The single highest-impact observability improvement is adding trace_id to every log entry. This one change turns hours of cross-service log correlation into a single query. It costs almost nothing to implement (one middleware change per service) and immediately improves MTTR for every incident.