Kubernetes calico node CrashLoopBackOff

Introduction

When calico-node is stuck in CrashLoopBackOff, the issue is usually at the node networking layer, not at the application layer. Calico is part of the cluster's network fabric, so the fastest path is to inspect logs and host prerequisites first instead of treating it like a normal crashing workload pod.

What `calico-node` Does

calico-node runs as a DaemonSet and is responsible for critical networking work on each node, including:

route programming
pod network interface setup
network policy enforcement
BGP or VXLAN behavior depending on cluster mode

If it keeps restarting, the cluster can look partially alive while pod-to-pod traffic or policy enforcement is quietly broken.
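
One way to spot that partially alive state is to compare how many calico-node pods the DaemonSet wants with how many are ready. The sketch below assumes the DaemonSet is named calico-node and lives in kube-system (operator-based installs use calico-system instead). The helper name `unready_count` is made up for illustration.

```bash
# unready_count DESIRED READY
# Prints how many calico-node pods are scheduled but not yet ready.
# (Hypothetical helper; it only does the subtraction.)
unready_count() {
  local desired=$1 ready=$2
  echo $(( desired - ready ))
}

# Usage against a live cluster (namespace is an assumption, adjust as needed):
#   set -- $(kubectl -n kube-system get ds calico-node \
#       -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}')
#   unready_count "$1" "$2"
```

Any result above zero means at least one node is running without a healthy Calico agent.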

Start with Pod Status and Logs

First identify which nodes are affected and what the process is actually complaining about.

bash

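# Operator-based installs run calico-node in calico-system instead of kube-system
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
# Read logs from the affected pod, including the container instance that crashed last
kubectl -n kube-system logs <calico-pod-name> --all-containers --tail=200
kubectl -n kube-system logs <calico-pod-name> -c calico-node --previous --tail=200
kubectl -n kube-system describe pod <calico-pod-name>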

The logs usually point toward a real class of failure such as:

missing kernel modules
iptables backend mismatch
host mount or permission problems
datastore or API connectivity errors

Those messages are far more useful than repeatedly deleting the pod and waiting for it to crash again.
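
Long logs can be cut down to the lines that most likely explain the crash. The filter below is only a sketch: the keyword list is a generic guess at failure wording, not Calico's official log vocabulary, and the function name is illustrative.

```bash
# failure_lines: read log text on stdin and print only lines that look like failures.
# The keywords are a generic assumption; extend them to match what your logs show.
failure_lines() {
  grep -iE 'error|fatal|panic|failed|denied|refused|timeout' || true
}

# Usage (pod name is a placeholder):
#   kubectl -n kube-system logs <calico-pod-name> -c calico-node --previous \
#     | failure_lines | tail -n 20
```

Treat the output as a starting point and read the surrounding lines in the full log before deciding on a fix.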

Check Host Networking Prerequisites

Many Calico failures are really host failures. After identifying an affected node, inspect the machine directly.

bash

lsmod | grep -E 'ip_tables|ip6_tables|xt_set'
sysctl net.ipv4.ip_forward
iptables -V
nft --version
ls /etc/cni/net.d

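One common problem is inconsistent iptables behavior across nodes, especially when some hosts use the legacy iptables backend and others use the nftables-based one. On iptables 1.8 and later, iptables -V prints the active backend in parentheses, either (legacy) or (nf_tables).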
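
A small helper can reduce the `iptables -V` output to a single word, which makes it easy to compare nodes in a script. This is a sketch: the function name is illustrative, and it assumes iptables 1.8 or later, which appends (legacy) or (nf_tables) to the version string.

```bash
# iptables_backend VERSION_STRING
# Prints nf_tables, legacy, or unknown based on the text in parentheses
# at the end of the iptables version output.
iptables_backend() {
  case "$1" in
    *"(nf_tables)"*) echo nf_tables ;;
    *"(legacy)"*)    echo legacy ;;
    *)               echo unknown ;;
  esac
}

# Usage on a node:
#   iptables_backend "$(iptables -V)"
```

Running it on every node and looking for a mix of answers is a quick way to confirm or rule out a backend mismatch.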

Match the Fix to the Actual Error

The CrashLoopBackOff state is just the symptom. The remediation depends on what the logs show. Typical fixes include:

loading missing kernel modules
standardizing iptables mode across nodes
repairing broken CNI files
fixing hostPath permissions
using a Calico version compatible with the node kernel and Kubernetes version
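
For the kernel module case, the sketch below lists which of the modules checked earlier are absent from `lsmod` output. The function name is illustrative. A module compiled directly into the kernel does not appear in `lsmod`, so treat the result as a hint, not as proof.

```bash
# missing_modules MODULE...
# Reads `lsmod` output on stdin and prints each named module that is not listed.
missing_modules() {
  local loaded mod
  loaded=$(awk 'NR > 1 { print $1 }')   # first column, skipping the header row
  for mod in "$@"; do
    printf '%s\n' "$loaded" | grep -qxF "$mod" || echo "$mod"
  done
}

# Usage on a node:
#   lsmod | missing_modules ip_tables ip6_tables xt_set
#   # then, as root, for each module printed: modprobe <module>
```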

After fixing the host or manifest issue, let the DaemonSet recreate the pod:

bash

kubectl -n kube-system delete pod <calico-pod-name>

That restart should come after the root cause is addressed, not instead of diagnosis.

Validate Traffic After Recovery

A Running Calico pod is a good sign, but it is not proof that networking is fully healthy. Validate actual traffic.

bash

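kubectl run -it --rm netcheck --image=busybox:1.36 --restart=Never -- sh
# inside the pod
nslookup kubernetes.default
# The API server Service listens on HTTPS port 443; even a 401 or 403 reply shows the path works
wget -qO- --no-check-certificate https://kubernetes.default.svc/version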

If possible, also test pod-to-pod communication across different nodes. Some Calico problems recover only partially, and you do not want to stop at green pod status alone.
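
A cross-node test needs two pods that run on different nodes. The sketch below picks such a pair from a pod listing. The function name and the three-column input format are assumptions made for this example, and it expects ordinary pods with their own pod IPs.

```bash
# cross_node_pair: read "POD NODE IP" lines on stdin and print "SRC_POD DST_IP"
# for the first pod and the first later pod that runs on a different node.
# Returns non-zero when every listed pod is on the same node.
cross_node_pair() {
  awk '
    NR == 1    { pod = $1; node = $2; next }
    $2 != node { print pod, $3; found = 1; exit }
    END        { if (!found) exit 1 }
  '
}

# Usage (namespace and the presence of ping in the source image are assumptions):
#   set -- $(kubectl -n default get pods --no-headers \
#       -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName,IP:.status.podIP \
#       | cross_node_pair)
#   kubectl -n default exec "$1" -- ping -c 3 "$2"
```

If network policy blocks ICMP in your cluster, swap the ping for a TCP request to a port the destination pod actually serves.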

Compare a Healthy Node with a Broken Node

One of the fastest debugging tricks is to compare one working node with one failing node. Differences in kernel modules, iptables mode, CNI files, OS image version, or cloud-init behavior often become obvious only when you put the machines side by side.

That comparison is frequently more useful than staring at the Kubernetes manifest alone, because Calico depends heavily on host-level networking state.
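
The comparison is easier when both nodes report the same facts in the same order. The sketch below collects a few of the items mentioned above and diffs two saved reports. Both function names and the key=value layout are choices made for this example, not a Calico tool.

```bash
# node_facts: print a few host-level facts as stable key=value lines.
# Run it on each node and save the output, e.g. node_facts > healthy.txt
node_facts() {
  echo "kernel=$(uname -r)"
  echo "ip_forward=$(sysctl -n net.ipv4.ip_forward 2>/dev/null)"
  echo "iptables=$(iptables -V 2>/dev/null)"
  echo "cni_files=$(ls /etc/cni/net.d 2>/dev/null | tr '\n' ' ')"
}

# fact_diff HEALTHY_FILE BROKEN_FILE
# Prints only the differing lines ('<' from the healthy node, '>' from the
# broken one) and prints nothing when the two reports match.
fact_diff() {
  diff "$1" "$2" | grep '^[<>]' || true
}

# Usage after copying both reports to one machine:
#   fact_diff healthy.txt broken.txt
```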

Common Pitfalls

Repeatedly deleting the crashing Calico pod without reading the logs or checking the host.
Assuming the problem is fixed as soon as the pod reaches Running state.
Comparing only Kubernetes manifests and forgetting the node-level environment differences.
Treating Calico like an ordinary application workload instead of a networking component tied closely to the host.
Upgrading networking components without first checking kernel and CNI compatibility.

Summary

calico-node CrashLoopBackOff is usually a node networking or host compatibility problem.
Start with logs, then inspect the affected node directly.
Check kernel modules, iptables behavior, CNI files, and datastore or API connectivity.
Apply the remediation that matches the observed error instead of guessing.
Verify real cluster traffic after recovery, not just pod status.
