Kubernetes calico node CrashLoopBackOff

Introduction

When calico-node is stuck in CrashLoopBackOff, the issue is usually at the node networking layer, not at the application layer. Calico is part of the cluster's network fabric, so the fastest path is to inspect logs and host prerequisites first instead of treating it like a normal crashing workload pod.

What calico-node Does

calico-node runs as a DaemonSet and is responsible for critical networking work on each node, including:

  • route programming
  • pod network interface setup
  • network policy enforcement
  • BGP or VXLAN behavior depending on cluster mode

If it keeps restarting, the cluster can look partially alive while pod-to-pod traffic or policy enforcement is quietly broken.

Start with Pod Status and Logs

First identify which nodes are affected and what the process is actually complaining about.

bash
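# List calico-node pods with their nodes (operator-based installs typically use the calico-system namespace)
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
# Read the logs of the previous (crashed) container instance on an affected pod
kubectl -n kube-system logs <calico-pod-name> -c calico-node --previous --tail=200
# Check events, exit codes, and probe failures
kubectl -n kube-system describe pod <calico-pod-name>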

The logs usually point to a specific class of failure, such as:

  • missing kernel modules
  • iptables backend mismatch
  • host mount or permission problems
  • datastore or API connectivity errors

Those messages are far more useful than repeatedly deleting the pod and waiting for it to crash again.
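
To decide which pod's logs to read first, it can help to list only the pods that are actually crash-looping, together with their nodes. The following is a minimal sketch, assuming the kube-system namespace and the k8s-app=calico-node label used above; the function names filter_crashlooping and list_crashlooping_calico are hypothetical helpers, not part of kubectl or Calico.

```bash
#!/usr/bin/env bash

# Reads rows of "NAME NODE RESTARTS REASON" on stdin and prints
# "NAME NODE RESTARTS" for rows whose waiting reason is CrashLoopBackOff.
filter_crashlooping() {
  awk '$4 == "CrashLoopBackOff" { print $1, $2, $3 }'
}

# Queries the cluster and passes the rows through the filter.
# Assumption: the main container is the first entry in containerStatuses.
list_crashlooping_calico() {
  kubectl -n kube-system get pods -l k8s-app=calico-node --no-headers \
    -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,RESTARTS:.status.containerStatuses[0].restartCount,REASON:.status.containerStatuses[0].state.waiting.reason' \
    | filter_crashlooping
}
```

Running list_crashlooping_calico against a cluster prints one line per crash-looping pod. A pod that happens to be mid-restart may briefly show a different reason, so an empty result does not by itself mean every node is healthy.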

Check Host Networking Prerequisites

Many Calico failures are really host failures. After identifying an affected node, inspect the machine directly.

bash
lsmod | grep -E 'ip_tables|ip6_tables|xt_set'
sysctl net.ipv4.ip_forward
iptables -V
nft --version
ls /etc/cni/net.d

One common problem is inconsistent iptables behavior across nodes, especially when some hosts use the legacy iptables backend and others use the nf_tables backend. Recent iptables releases report the active backend in parentheses in the output of iptables -V.
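
To compare the iptables backend across nodes, the version string can be reduced to a single word. This is a small sketch, assuming iptables 1.8 or newer, which prints the backend in parentheses; older releases print only the version, which the hypothetical helper iptables_backend reports as "unknown".

```bash
#!/usr/bin/env bash

# Maps the output of `iptables -V` to the backend name.
# Example inputs: "iptables v1.8.7 (nf_tables)", "iptables v1.8.7 (legacy)".
iptables_backend() {
  case "$1" in
    *"(nf_tables)"*) echo "nf_tables" ;;
    *"(legacy)"*)    echo "legacy" ;;
    *)               echo "unknown" ;;   # e.g. pre-1.8 output with no backend tag
  esac
}
```

On a node, iptables_backend "$(iptables -V)" gives a value that is easy to collect from every host and compare; any node that differs from the rest is a candidate for the mismatch described above.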

Match the Fix to the Actual Error

The CrashLoopBackOff state is just the symptom. The remediation depends on what the logs show. Typical fixes include:

  • loading missing kernel modules
  • standardizing iptables mode across nodes
  • repairing broken CNI files
  • fixing hostPath permissions
  • using a Calico version compatible with the node kernel and Kubernetes version
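
For the kernel module case, one way to see what is missing is to compare lsmod output against a list of expected modules. The sketch below uses the module names from the earlier lsmod check; missing_modules is a hypothetical helper, and the list of modules a given Calico version needs is an assumption to confirm against its documentation. Modules compiled into the kernel do not appear in lsmod, so a reported name is a hint, not proof.

```bash
#!/usr/bin/env bash

# Reads `lsmod` output on stdin and prints each module named in the
# arguments that is not listed as loaded.
missing_modules() {
  local loaded required="$*"
  loaded="$(awk 'NR > 1 { print $1 }')"   # skip the lsmod header line
  for m in $required; do
    # -x: match the whole line, -F: treat the name as a fixed string
    printf '%s\n' "$loaded" | grep -qxF "$m" || echo "$m"
  done
}
```

A typical call on the node would be lsmod | missing_modules ip_tables ip6_tables xt_set, followed by modprobe for each reported name if the module is available for the running kernel.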

After fixing the host or manifest issue, let the DaemonSet recreate the pod:

bash
kubectl -n kube-system delete pod <calico-pod-name>

That restart should come after the root cause is addressed, not instead of diagnosis.

Validate Traffic After Recovery

A Running Calico pod is a good sign, but it is not proof that networking is fully healthy. Validate actual traffic.

bash
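kubectl run -it --rm netcheck --image=busybox:1.36 --restart=Never -- sh
# inside the pod
nslookup kubernetes.default
# the API Service listens on HTTPS (443); any HTTP response, even 401 or 403, shows the path works
wget -qO- --no-check-certificate https://kubernetes.default.svc/version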

If possible, also test pod-to-pod communication across different nodes. Some Calico problems recover only partially, and you do not want to stop at green pod status alone.
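
A cross-node check can be sketched by pinning two short-lived pods to different nodes and pinging one from the other. This is only an illustration under stated assumptions: the pod names nettarget and netprobe and the helpers node_override and cross_node_ping are hypothetical, the pods are created in the current namespace, and ICMP must be allowed by any network policy in place. Setting spec.nodeName bypasses the scheduler, so use it only for a brief test.

```bash
#!/usr/bin/env bash

# Builds a --overrides JSON document that pins a pod to one node.
node_override() {
  printf '{"apiVersion":"v1","spec":{"nodeName":"%s"}}' "$1"
}

# Pings a pod on node $1 from a pod on node $2.
cross_node_ping() {
  local target_node="$1" probe_node="$2" ip
  # Long-running target pod on the first node
  kubectl run nettarget --image=busybox:1.36 --restart=Never \
    --overrides="$(node_override "$target_node")" -- sleep 600
  kubectl wait --for=condition=Ready pod/nettarget --timeout=60s
  ip="$(kubectl get pod nettarget -o jsonpath='{.status.podIP}')"
  # One-shot probe pod on the second node; removed when ping exits
  kubectl run netprobe --rm -i --image=busybox:1.36 --restart=Never \
    --overrides="$(node_override "$probe_node")" -- ping -c 3 "$ip"
  kubectl delete pod nettarget --wait=false
}
```

Called as cross_node_ping <node-a> <node-b>, with one recovered node and one known-good node, this exercises the inter-node path that a single in-pod DNS lookup does not cover.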

Compare a Healthy Node with a Broken Node

One of the fastest debugging tricks is to compare one working node with one failing node. Differences in kernel modules, iptables mode, CNI files, OS image version, or cloud-init behavior often become obvious only when you put the machines side by side.

That comparison is frequently more useful than staring at the Kubernetes manifest alone, because Calico depends heavily on host-level networking state.
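
The side-by-side comparison can be made repeatable by saving a small "fingerprint" of each node and diffing the two files. The following is a rough sketch; node_fingerprint and compare_fingerprints are hypothetical helpers, and the facts collected are just the ones checked earlier in this article, not an exhaustive list.

```bash
#!/usr/bin/env bash

# Prints host facts that commonly differ between a healthy and a broken node.
# Missing tools are reported instead of aborting the function.
node_fingerprint() {
  echo "kernel: $(uname -r)"
  echo "iptables: $(iptables -V 2>/dev/null || echo 'not found')"
  echo "ip_forward: $(sysctl -n net.ipv4.ip_forward 2>/dev/null || echo 'unknown')"
  echo "modules: $(lsmod 2>/dev/null | awk 'NR > 1 { print $1 }' \
    | grep -E '^(ip_tables|ip6_tables|xt_set)$' | sort | tr '\n' ' ')"
  echo "cni: $(ls /etc/cni/net.d 2>/dev/null | sort | tr '\n' ' ')"
}

# Shows the lines that differ between two saved fingerprints.
# diff exits non-zero when files differ, so the status is ignored.
compare_fingerprints() {
  diff "$1" "$2" || true
}
```

For example, run node_fingerprint > good.txt on the working node and node_fingerprint > bad.txt on the failing one, copy both files to one machine, and call compare_fingerprints good.txt bad.txt.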

Common Pitfalls

  • Repeatedly deleting the crashing Calico pod without reading the logs or checking the host.
  • Assuming the problem is fixed as soon as the pod reaches Running state.
  • Comparing only Kubernetes manifests and forgetting the node-level environment differences.
  • Treating Calico like an ordinary application workload instead of a networking component tied closely to the host.
  • Making network component upgrades casually without checking kernel and CNI compatibility.

Summary

  • calico-node CrashLoopBackOff is usually a node networking or host compatibility problem.
  • Start with logs, then inspect the affected node directly.
  • Check kernel modules, iptables behavior, CNI files, and datastore or API connectivity.
  • Apply the remediation that matches the observed error instead of guessing.
  • Verify real cluster traffic after recovery, not just pod status.
