Kubernetes - kube-system pods in master node keep restarting after worker node joins
Introduction
If kube-system pods on the control-plane node keep restarting after a worker joins, the cluster usually has a configuration mismatch rather than a random failure. New node admission can trigger networking, DNS, or resource pressure paths that were previously idle. A structured investigation quickly narrows the fault domain.
First Triage Commands
Start by checking restart patterns and event reasons.
Look for repeated probe failures, image pull errors, or crash loops tied to specific components such as CoreDNS, kube-proxy, or CNI agents.
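As a rough illustration of this triage step, the sketch below ranks kube-system containers by restart count and shows the most relevant reason for each. It assumes the input is the JSON produced by `kubectl get pods -n kube-system -o json`; the function name `summarize_restarts` and the `min_restarts` parameter are hypothetical helpers, not part of any Kubernetes tooling.

```python
import json


def summarize_restarts(pod_list_json: str, min_restarts: int = 1) -> list:
    """Return (pod, container, restarts, reason) tuples, highest restart count first.

    Assumes pod_list_json is the output of:
        kubectl get pods -n kube-system -o json
    """
    rows = []
    for pod in json.loads(pod_list_json).get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            restarts = cs.get("restartCount", 0)
            if restarts < min_restarts:
                continue
            # Prefer the current waiting reason (for example CrashLoopBackOff),
            # then fall back to why the previous container instance terminated.
            waiting = cs.get("state", {}).get("waiting") or {}
            terminated = cs.get("lastState", {}).get("terminated") or {}
            reason = waiting.get("reason") or terminated.get("reason") or "Unknown"
            rows.append((pod_name, cs["name"], restarts, reason))
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

The ranked list is only a starting point. Pair it with `kubectl describe pod` on the top offenders and with `kubectl get events -n kube-system --sort-by=.lastTimestamp` to see what happened just before each restart.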
Validate Node Readiness and Runtime Settings
When a worker joins, mismatched runtime settings can destabilize control-plane add-ons.
Check alignment of:
- cgroup driver between kubelet and container runtime
- Kubernetes minor versions across nodes
- container runtime health and disk pressure
A mismatch can generate cascading restarts in networking and DNS pods.
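Part of this alignment check can be sketched against the node objects returned by `kubectl get nodes -o json`. The helper below groups nodes by kubelet minor version and runtime type, and lists unhealthy node conditions. The function names and the report layout are illustrative assumptions. The cgroup driver is not covered here: it is assumed to be compared separately on each node, for example the `cgroupDriver` field of the kubelet configuration against the runtime's own setting.

```python
import re
from collections import defaultdict


def minor_version(kubelet_version: str) -> int:
    """Extract the minor number from a version string such as 'v1.29.3'."""
    match = re.match(r"v?(\d+)\.(\d+)", kubelet_version)
    if not match:
        raise ValueError(f"unrecognized version: {kubelet_version!r}")
    return int(match.group(2))


def node_alignment_report(nodes: list) -> dict:
    """Summarize version, runtime, and condition differences across nodes.

    `nodes` is assumed to be the "items" list from `kubectl get nodes -o json`.
    """
    by_minor = defaultdict(list)
    by_runtime = defaultdict(list)
    unhealthy = []
    for node in nodes:
        name = node["metadata"]["name"]
        info = node["status"]["nodeInfo"]
        by_minor[minor_version(info["kubeletVersion"])].append(name)
        # 'containerd://1.7.13' -> 'containerd'
        by_runtime[info["containerRuntimeVersion"].split("://")[0]].append(name)
        for cond in node["status"].get("conditions", []):
            # Ready should be "True"; pressure-style conditions should not be.
            if cond["type"] == "Ready":
                bad = cond["status"] != "True"
            else:
                bad = cond["status"] == "True"
            if bad:
                unhealthy.append((name, cond["type"]))
    return {
        "minor_versions": dict(by_minor),
        "runtimes": dict(by_runtime),
        "unhealthy": unhealthy,
    }
```

More than one key under `minor_versions` or `runtimes` does not prove a fault on its own, but it tells you which nodes to compare first.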
CNI and Cluster Networking Checks
Many restart storms start in networking. Confirm that CNI pods are healthy on all nodes and that pod CIDR settings are consistent.
If worker node routes are wrong or CNI config files differ, control-plane components may lose service connectivity and restart.
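The pod CIDR consistency check can be sketched with Python's standard `ipaddress` module. The example assumes you have collected each node's `spec.podCIDR` and know the cluster-wide pod CIDR. The function name and the sample addresses are hypothetical. Some CNI plugins manage their own address pools and may not use `spec.podCIDR` at all, so treat the result as a hint rather than a verdict.

```python
import ipaddress
from itertools import combinations


def check_pod_cidrs(cluster_cidr: str, node_cidrs: dict) -> list:
    """Return human-readable problems found in per-node pod CIDR assignments.

    node_cidrs maps node name -> podCIDR string (or None if unassigned).
    """
    problems = []
    cluster = ipaddress.ip_network(cluster_cidr)
    parsed = {}
    for node, cidr in node_cidrs.items():
        if not cidr:
            problems.append(f"{node}: no podCIDR assigned")
            continue
        net = ipaddress.ip_network(cidr)
        parsed[node] = net
        # subnet_of raises TypeError across IP versions, so compare versions first.
        if net.version != cluster.version or not net.subnet_of(cluster):
            problems.append(f"{node}: {net} is outside cluster CIDR {cluster}")
    for (node_a, net_a), (node_b, net_b) in combinations(parsed.items(), 2):
        if net_a.version == net_b.version and net_a.overlaps(net_b):
            problems.append(
                f"{node_a} and {node_b}: overlapping podCIDRs {net_a} and {net_b}"
            )
    return problems
```

An empty list means the assignments are at least internally consistent. Routes and on-disk CNI configuration still need to be compared on the nodes themselves.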
DNS and API Connectivity
CoreDNS often restarts when it cannot reach the API server or upstream dependencies.
Also verify control-plane node firewall and security group rules. Worker join procedures sometimes modify networking policies that unintentionally affect internal service traffic.
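A quick way to separate "blocked by a firewall" from "application-level failure" is a plain TCP connect test. The sketch below uses only the standard library. The function names are hypothetical, and the endpoints you pass in (for example the API server address on port 6443, which is a common default) are assumptions to replace with your cluster's actual values.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def check_endpoints(endpoints: dict, timeout: float = 3.0) -> dict:
    """Map each label to a reachability result.

    endpoints maps a label -> (host, port), for example
    {"apiserver": ("192.0.2.10", 6443)} with a placeholder address.
    """
    return {
        label: tcp_reachable(host, port, timeout)
        for label, (host, port) in endpoints.items()
    }
```

A successful connect only shows that the port is open from where the script runs. It says nothing about TLS or authentication, so run it from the affected node and read the result alongside the CoreDNS logs.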
Resource Pressure and Evictions
A new node does not move existing pods, but it adds DaemonSet pods, kubelet heartbeats, and API watches, and that extra load can expose tight resource limits on control-plane services.
Review requests and limits for critical system pods. If memory limits are too low, pods may enter restart loops under transient load.
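To make the limit review concrete, the sketch below flags containers whose memory limit falls under a chosen floor. It assumes pod objects from `kubectl get pods -n kube-system -o json`. The parser handles only a common subset of Kubernetes quantity suffixes, and both the function names and any threshold you pass in are illustrative, not recommended values.

```python
# Binary and decimal suffixes handled by this simplified parser.
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '170Mi' or '1Gi' to bytes."""
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _UNITS[suffix])
    return int(float(quantity))  # plain byte count


def containers_below(pods: list, min_bytes: int) -> list:
    """Return (pod, container, limit) for memory limits under min_bytes.

    Containers without a memory limit are skipped, since they have no cap.
    """
    flagged = []
    for pod in pods:
        for container in pod["spec"]["containers"]:
            limit = container.get("resources", {}).get("limits", {}).get("memory")
            if limit is not None and parse_memory(limit) < min_bytes:
                flagged.append((pod["metadata"]["name"], container["name"], limit))
    return flagged
```

For example, `containers_below(pods, parse_memory("128Mi"))` lists every container capped below 128 MiB. Cross-check the flagged containers against any `OOMKilled` termination reasons before raising limits.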
Recovery Approach
A practical recovery sequence:
- cordon the new worker node
- stabilize networking and kube-system pods
- validate component logs are clean
- uncordon and monitor restarts
Do not repeatedly delete system pods without root cause analysis, or the issue will recur.
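The final "uncordon and monitor restarts" step can be sketched as a comparison of two snapshots: one taken while the worker is cordoned (`kubectl cordon <node>`) and one taken some time after `kubectl uncordon <node>`. The helper names below are hypothetical, and the input is assumed to be the "items" list from `kubectl get pods -n kube-system -o json`.

```python
def restart_counts(pods: list) -> dict:
    """Map (pod name, container name) -> restartCount for a pod snapshot."""
    counts = {}
    for pod in pods:
        for cs in pod.get("status", {}).get("containerStatuses", []):
            counts[(pod["metadata"]["name"], cs["name"])] = cs.get("restartCount", 0)
    return counts


def new_restarts(before: dict, after: dict) -> dict:
    """Return containers whose restart count grew between two snapshots.

    Containers that only appear in `after` are compared against zero.
    """
    return {
        key: count - before.get(key, 0)
        for key, count in after.items()
        if count > before.get(key, 0)
    }
```

If `new_restarts` is empty after a reasonable observation window, the worker is probably safe to keep. If counts climb again, cordon the node and return to the component logs.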
Certificates and Time Synchronization
Control-plane restart loops can also come from TLS validation problems that appear when a new node joins and starts API communication. Verify node clocks are synchronized and check certificate expiration windows.
If clocks drift significantly, certificate validation and token checks can fail intermittently. Also inspect kubelet logs on both control-plane and worker nodes for repeated auth or TLS handshake errors.
Treat time sync as a baseline dependency. Even correctly configured networking cannot stabilize the cluster if time and certificates are inconsistent.
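Both checks reduce to simple arithmetic once the raw values are collected. The sketch below assumes you have a certificate's `notAfter` string (for example from `openssl x509 -enddate -noout -in <cert>`) and a Unix timestamp gathered from each node with `date +%s`. The function names and the 5-second tolerance are illustrative assumptions, not Kubernetes defaults.

```python
import ssl


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter time.

    not_after uses the OpenSSL text form, e.g. 'Jan  5 09:34:43 2018 GMT'.
    A negative result means the certificate has already expired.
    """
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def clock_skew(node_epochs: dict, reference: float, tolerance: float = 5.0) -> dict:
    """Return node -> offset in seconds for nodes outside the tolerance.

    node_epochs maps node name -> epoch seconds sampled on that node.
    Sampling nodes one after another adds error, so keep the tolerance loose.
    """
    return {
        node: epoch - reference
        for node, epoch in node_epochs.items()
        if abs(epoch - reference) > tolerance
    }
```

In practice, pass `time.time()` from a trusted host as `now` and `reference`. A large offset on the new worker, or a certificate close to expiry, is worth fixing before investigating networking further.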
Common Pitfalls
- Focusing only on the restarting pod instead of cluster-wide events.
- Ignoring version skew between worker and control-plane components.
- Overlooking CNI plugin logs and route state.
- Treating CoreDNS failures as standalone DNS issues when API connectivity is the root cause.
- Restarting components repeatedly without preserving logs for diagnosis.
Summary
- kube-system restart loops after node join usually indicate configuration or networking mismatch.
- Start with events, pod descriptions, and node health checks.
- Validate cgroup driver, runtime, CNI, and version alignment.
- Check DNS and API server connectivity from system components.
- Stabilize first, then reintroduce the new worker with monitoring.

