How to Get the Number of Running Pods in Prometheus

Introduction

Counting running pods in Prometheus is a common Kubernetes monitoring task for capacity planning and alerting. The most reliable source is usually kube-state-metrics, which exposes pod phase metrics. This guide shows practical PromQL queries for running pod counts at cluster, namespace, and workload scope.

Base Metric for Pod Phase

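The standard metric is kube_pod_status_phase, a gauge from kube-state-metrics with labels including namespace, pod, and phase. It has one series per pod for each possible phase: Pending, Running, Succeeded, Failed, and Unknown.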

To count running pods cluster-wide:

promql
sum(kube_pod_status_phase{phase="Running"} == 1)

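This works because each series has the value 1 for the pod's current phase and 0 for every other phase. After the == 1 filter, every remaining series contributes exactly one running pod.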
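The small Python model below shows why the 0/1 encoding makes the count come out right. It illustrates the metric's shape and is not how Prometheus evaluates queries. The sample pods are made up, and the phase list assumes the five standard Kubernetes pod phases.

```python
# Toy model of kube_pod_status_phase: one series per (pod, phase) pair,
# valued 1 for the pod's current phase and 0 for every other phase.
# The namespaces and pod names are made-up sample data.
PHASES = ["Pending", "Running", "Succeeded", "Failed", "Unknown"]


def phase_series(pods):
    """Expand {(namespace, pod): current_phase} into (labels, value) samples."""
    samples = []
    for (namespace, pod), current in pods.items():
        for phase in PHASES:
            labels = {"namespace": namespace, "pod": pod, "phase": phase}
            samples.append((labels, 1.0 if phase == current else 0.0))
    return samples


def running_pod_count(samples):
    """Mirror sum(kube_pod_status_phase{phase="Running"} == 1)."""
    # The selector keeps phase="Running"; "== 1" drops the 0-valued series.
    matched = [v for labels, v in samples if labels["phase"] == "Running" and v == 1]
    # Note: PromQL returns an empty result, not 0, when nothing matches.
    return sum(matched)


pods = {
    ("payments", "api-1"): "Running",
    ("payments", "api-2"): "Running",
    ("batch", "report-1"): "Succeeded",
    ("batch", "report-2"): "Pending",
}
samples = phase_series(pods)
print(len(samples))                # 20 (4 pods x 5 phases)
print(running_pod_count(samples))  # 2.0
```

Each pod contributes five series, but only one of them is ever 1. Summing the filtered Running series therefore counts pods, not series.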

Count Running Pods by Namespace

Break down counts per namespace for operational dashboards.

promql
sum by (namespace) (kube_pod_status_phase{phase="Running"} == 1)

This query is useful for multi-team clusters where ownership follows namespace boundaries.
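You may want these per-namespace counts in a script, for example for a report or a capacity check. A minimal sketch can send the query above to Prometheus' HTTP API endpoint /api/v1/query. The server address is a hypothetical placeholder. The sample payload only imitates the documented response shape, so the parsing can be shown without a live server.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical server address; replace with your own Prometheus URL.
PROMETHEUS_URL = "http://localhost:9090"
QUERY = 'sum by (namespace) (kube_pod_status_phase{phase="Running"} == 1)'


def fetch_instant_query(base_url, query, timeout=10.0):
    """Run an instant query against /api/v1/query and return the decoded JSON."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def counts_by_label(payload, label):
    """Turn an instant-vector response into {label_value: count}."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    counts = {}
    for series in payload["data"]["result"]:
        # Each sample is a [unix_timestamp, "value_as_string"] pair.
        _, value = series["value"]
        counts[series["metric"].get(label, "")] = int(float(value))
    return counts


# Sample payload imitating the documented response shape (values are made up).
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"namespace": "payments"}, "value": [1700000000.0, "3"]},
            {"metric": {"namespace": "batch"}, "value": [1700000000.0, "7"]},
        ],
    },
}
print(counts_by_label(sample, "namespace"))  # {'payments': 3, 'batch': 7}
```

Against a live server, counts_by_label(fetch_instant_query(PROMETHEUS_URL, QUERY), "namespace") returns a dictionary of the same shape. Namespaces with no running pods are absent from it rather than present with 0.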

Count Running Pods for Specific Workloads

Depending on which metrics are available, either join against pod labels or rely on pod naming conventions. The query below joins with kube_pod_labels on namespace and pod:

promql
sum by (namespace) (
  kube_pod_status_phase{phase="Running"} == 1
  and on (namespace, pod)
  kube_pod_labels{label_app="payments"}
)

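Adjust label keys to match your cluster's labeling standards. In kube-state-metrics v2 and later, kube_pod_labels only carries pod labels that are allowlisted with the --metric-labels-allowlist flag (for example, pods=[app]). Without that flag, label_app is absent and the query returns nothing.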
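The and on (namespace, pod) clause acts as a filter. It keeps a running-phase series from the left side only when some kube_pod_labels series has the same namespace and pod, and the values still come from the left side. The Python sketch below mimics that matching on made-up series. The namespace shop and the pod names are hypothetical.

```python
from collections import defaultdict


def and_on(left, right, on):
    """Mirror PromQL `left and on (<on>) right`.

    Keeps left-hand (labels, value) samples whose values for the `on` labels
    match at least one right-hand sample. A missing label counts as "".
    """
    keys = {tuple(labels.get(k, "") for k in on) for labels, _ in right}
    return [(labels, v) for labels, v in left
            if tuple(labels.get(k, "") for k in on) in keys]


def sum_by(samples, by):
    """Mirror sum by (<by>) (...), keyed by a tuple of label values."""
    totals = defaultdict(float)
    for labels, v in samples:
        totals[tuple(labels.get(k, "") for k in by)] += v
    return dict(totals)


# Left side: kube_pod_status_phase{phase="Running"} == 1
running = [
    ({"namespace": "shop", "pod": "payments-1", "phase": "Running"}, 1.0),
    ({"namespace": "shop", "pod": "payments-2", "phase": "Running"}, 1.0),
    ({"namespace": "shop", "pod": "cart-1", "phase": "Running"}, 1.0),
]
# Right side: kube_pod_labels{label_app="payments"} (its value is always 1)
payments_labels = [
    ({"namespace": "shop", "pod": "payments-1", "label_app": "payments"}, 1.0),
    ({"namespace": "shop", "pod": "payments-2", "label_app": "payments"}, 1.0),
    ({"namespace": "shop", "pod": "payments-3", "label_app": "payments"}, 1.0),  # not running
]
print(sum_by(and_on(running, payments_labels, ["namespace", "pod"]), ["namespace"]))
# {('shop',): 2.0}
```

If the right side is empty, for example because label_app was never exposed, the intersection is empty too. A missing label therefore shows up as no data rather than as zero.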

Exclude Completed Job Pods

Pods created by Jobs finish in the Succeeded or Failed phase. Filtering strictly on phase="Running", as the queries above do, therefore already excludes completed Job pods. For a broader availability view, pair the running count with Pending and Failed counts.

promql
sum(kube_pod_status_phase{phase="Pending"} == 1)
sum(kube_pod_status_phase{phase="Failed"} == 1)

Together, these counts show rollout health in a way the running count alone cannot.
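Instead of running one query per phase, the single query sum by (phase) (kube_pod_status_phase == 1) returns one series per phase. The Python sketch below mirrors that grouping on made-up pods to show what the result looks like. It models the metric as described earlier: one 0/1 series per pod and phase.

```python
from collections import Counter

PHASES = ["Pending", "Running", "Succeeded", "Failed", "Unknown"]

# Made-up pods and their current phases.
current_phase = {
    "api-1": "Running",
    "api-2": "Running",
    "api-3": "Pending",
    "migrate-1": "Failed",
}
# kube_pod_status_phase-style samples: 1 for the current phase, 0 otherwise.
samples = [
    ({"pod": pod, "phase": phase}, 1.0 if phase == current else 0.0)
    for pod, current in current_phase.items()
    for phase in PHASES
]


def counts_by_phase(samples):
    """Mirror sum by (phase) (kube_pod_status_phase == 1)."""
    totals = Counter()
    for labels, value in samples:
        if value == 1:  # "== 1" keeps only each pod's current phase
            totals[labels["phase"]] += value
    # Phases with no pods produce no entry, just as PromQL returns no series.
    return dict(totals)


print(counts_by_phase(samples))
# {'Running': 2.0, 'Pending': 1.0, 'Failed': 1.0}
```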

Use in Grafana Panels

In Grafana, create:

  • A Stat panel for the total running pod count.
  • A time series panel for per-namespace trends.
  • An alert rule, surfaced in an Alert list panel, for sudden drops in critical namespaces.

Use dashboard variables for namespace or app labels, for example namespace=~"$namespace" in the label matcher, to make panels reusable across environments.

Alert Rule Example

Define an alert that fires when the running pod count drops below an expected threshold.

yaml
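groups:
  - name: pod-health
    rules:
      - alert: LowRunningPodsPayments
        # "or on() vector(0)" makes the expression return 0 when no payments
        # pods are running at all; without it the sum returns no series and
        # the rule could never fire during a complete outage.
        expr: |
          (
            sum by (namespace) (
              kube_pod_status_phase{phase="Running"} == 1
              and on (namespace, pod) kube_pod_labels{label_app="payments"}
            )
            or on() vector(0)
          ) < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Running pod count below threshold for payments"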

Tune the threshold and the for duration to how sensitive each workload is to lost capacity.
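The for: 5m clause is what keeps short rollout dips from paging anyone. The Python sketch below is a simplified model of that behavior, not Prometheus code. It assumes a 1m evaluation interval and ignores other rule options.

```python
from datetime import timedelta


def first_firing_offset(evaluations, hold):
    """Simplified model of an alerting rule's `for` clause.

    `evaluations` is a list of (offset, condition_true) pairs in evaluation
    order. The alert becomes pending when the condition first holds and
    fires once it has held for at least `hold` across consecutive
    evaluations; any evaluation where it is false resets it.
    Returns the offset at which it fires, or None.
    """
    active_since = None
    for offset, condition in evaluations:
        if not condition:
            active_since = None  # back to inactive
            continue
        if active_since is None:
            active_since = offset  # pending starts here
        if offset - active_since >= hold:
            return offset  # firing
    return None


minute = timedelta(minutes=1)
# Assumed 1m evaluation interval over 10 minutes.
rollout_dip = [(i * minute, i < 3) for i in range(10)]  # below threshold for 3 evaluations
outage = [(i * minute, True) for i in range(10)]         # below threshold throughout

print(first_firing_offset(rollout_dip, 5 * minute))  # None
print(first_firing_offset(outage, 5 * minute))       # 0:05:00
```

A three-minute dip during a rollout never reaches the five-minute hold. A sustained drop fires five minutes after it starts.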

Compare Desired Versus Running Pods

Running count alone does not show whether workloads are underprovisioned. Pair running pods with desired replica metrics for better alerting.

promql
sum(kube_deployment_spec_replicas{namespace="payments"})
-
sum(kube_deployment_status_replicas_available{namespace="payments"})

If this gap stays positive for a sustained period, a rollout may be stuck or pods may be failing to schedule.
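Because the query sums across all deployments, a positive result says some deployment is short of replicas but not which one. Dropping the sum() calls and subtracting the two metrics directly yields one gap per deployment. This assumes both series carry identical labels, which is typically the case when both come from the same kube-state-metrics target. The Python sketch below contrasts the two approaches on hypothetical deployments.

```python
def replica_gaps(desired, available):
    """Per-deployment desired minus available replicas.

    Mirrors subtracting the two metrics without sum(): PromQL matches
    series one-to-one on identical label sets, so a deployment missing
    from either side produces no result.
    """
    return {name: desired[name] - available[name]
            for name in desired if name in available}


# Hypothetical deployments in the payments namespace.
desired = {"api": 3, "worker": 2}
available = {"api": 1, "worker": 2}

print(sum(desired.values()) - sum(available.values()))  # 2
print(replica_gaps(desired, available))                 # {'api': 2, 'worker': 0}
```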

Track Running Pod Trends

Use a subquery with avg_over_time to see how pod counts behave over a window, such as the last hour:

promql
avg_over_time(sum by (namespace) (kube_pod_status_phase{phase="Running"} == 1)[1h:])

Trend views help distinguish normal deployment fluctuations from persistent capacity problems.
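A one-hour subquery average barely moves during a short dip but falls clearly during a sustained drop. That is the distinction the trend view is meant to surface. The Python sketch below shows it on made-up counts, assuming a 60-second resolution. With [1h:], the actual resolution is your global evaluation interval.

```python
def avg_over_window(points, end, window):
    """Mean of (timestamp, value) points with end - window < timestamp <= end.

    Approximates avg_over_time(<expr>[window:step]) once the inner expression
    has been evaluated at each step. Returns None for an empty window,
    where PromQL would return no result.
    """
    in_window = [v for t, v in points if end - window < t <= end]
    if not in_window:
        return None
    return sum(in_window) / len(in_window)


HOUR = 3600
# Running pod counts at an assumed 60 s resolution over one hour (made-up values).
steps = range(60, HOUR + 1, 60)                                  # 60 evaluation timestamps
brief_dip = [(t, 7 if 1200 < t <= 1380 else 10) for t in steps]  # 3-minute rollout dip
persistent = [(t, 7 if t > 1800 else 10) for t in steps]         # drop for the last 30 minutes

print(avg_over_window(brief_dip, HOUR, HOUR))   # 9.85
print(avg_over_window(persistent, HOUR, HOUR))  # 8.5
```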

Multi-Cluster Dashboards

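If you collect metrics from several clusters, for example through federation, include the cluster label in aggregations. This assumes each cluster's series carry a cluster label, typically set through the external_labels section of each Prometheus server's configuration.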

promql
sum by (cluster, namespace) (kube_pod_status_phase{phase="Running"} == 1)

Without the cluster dimension, counts from every environment are added together, which can hide a problem localized to one cluster.
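The Python sketch below shows how a namespace-only aggregation can hide a shortfall. Two hypothetical clusters, prod-eu and prod-us, are each expected to run 3 payments pods. One has scaled up while the other has lost most of its pods, so the combined total still looks normal.

```python
from collections import defaultdict


def sum_by(samples, by):
    """Mirror sum by (<by>) (...), keyed by a tuple of label values."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[tuple(labels.get(k, "") for k in by)] += value
    return dict(totals)


# One series per running pod (value 1). prod-eu has scaled up to 5 pods,
# prod-us is down to 1, so the combined total still matches the expected 6.
samples = [
    ({"cluster": "prod-eu", "namespace": "payments", "pod": f"api-{i}"}, 1.0)
    for i in range(5)
] + [({"cluster": "prod-us", "namespace": "payments", "pod": "api-0"}, 1.0)]

print(sum_by(samples, ["namespace"]))             # {('payments',): 6.0}
print(sum_by(samples, ["cluster", "namespace"]))
# {('prod-eu', 'payments'): 5.0, ('prod-us', 'payments'): 1.0}
```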

Namespace Alert Tuning

Define separate thresholds per namespace based on workload criticality. Core platform namespaces usually need stricter availability alerts than batch-processing namespaces with elastic scheduling behavior.
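One way to express per-namespace thresholds is a lookup table checked against the per-namespace counts, for example in a script that reads the query results. The sketch below is a minimal illustration with hypothetical namespace names and minimums. Inside Prometheus, the same idea could instead be written as separate alert rules per namespace or tier.

```python
def namespaces_below_threshold(counts, thresholds):
    """Return namespaces whose running pod count is below their own threshold.

    `counts` is {namespace: running_pods} as returned by the per-namespace
    query. A namespace with no running pods has no series at all, so it is
    treated as 0 here rather than silently skipped.
    """
    return sorted(ns for ns, minimum in thresholds.items()
                  if counts.get(ns, 0) < minimum)


# Hypothetical per-namespace minimums: strict for platform namespaces,
# none for elastic batch work.
thresholds = {"platform-core": 5, "payments": 3, "batch": 0}
counts = {"platform-core": 4, "batch": 12}  # payments returned no series

print(namespaces_below_threshold(counts, thresholds))  # ['payments', 'platform-core']
```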

Add deployment events to dashboards, for example as Grafana annotations, so that pod-count drops can be correlated with rollouts, node maintenance, or autoscaler actions.

Review pod count alerts after each major scaling-policy change.

Common Pitfalls

A common pitfall is querying metrics that are unavailable because kube-state-metrics is not deployed or not scraped. In that case the queries return empty results rather than zero.

Another is counting pods in every phase, not just Running, which overestimates active workload capacity.

Inconsistent pod labels across teams or workloads also cause workload-specific queries to return partial counts.

A final mistake is alerting on instantaneous dips without a for duration, which produces noisy alerts during routine rollouts.

Summary

  • Use kube_pod_status_phase to count running pods reliably.
  • Aggregate by namespace for multi-team operational visibility.
  • Filter by pod labels for workload-specific counts.
  • Pair running count with pending and failed metrics for context.
  • Add duration-based alerts to avoid rollout noise.
