After uninstalling calico, new pods are stuck in container creating state

Introduction

New pods getting stuck in the "ContainerCreating" state after Calico is uninstalled is a common stumbling block for Kubernetes users. Calico is a popular networking solution for Kubernetes that provides network policy enforcement, IP address management, and pod connectivity. If it is uninstalled without a replacement, or without cleaning up what it leaves behind, the cluster can end up in a state where new pods are assigned to a node but never get a working network, so their containers never start.

Understanding Calico's Role in Kubernetes

To understand this issue, it helps to review the role Calico plays:

Network Policy Enforcement: Calico lets users define and enforce network policies for pods. For routing, it can use BGP to distribute pod routes so that nodes in the cluster can reach each other's pods.
IP Address Management (IPAM): Calico's IPAM module allocates pod IP addresses within the cluster.
Network Plugins: Calico serves as a Container Network Interface (CNI) plugin, providing network capabilities and managing pod communication within and outside the cluster.

When Calico is uninstalled without careful planning, these features become inactive, leaving the network misconfigured.

Common Causes of Pods Stuck in ContainerCreating State

CNI Plugin Removal: The kubelet, through the container runtime, relies on the CNI plugin to set up each pod's network namespace and allocate its network resources. Without Calico, either no network configuration is present at all, or a residual Calico CNI configuration conflicts with whatever is installed next.
IPAM Module Issues: The IPAM configuration managed by Calico may leave behind residual settings, causing IP allocation errors that prevent pods from starting.
Configuration Residuals: Deleting Calico might not clean up all configurations or CRDs (Custom Resource Definitions), leaving the system with incomplete or incompatible settings.
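
To illustrate how a residual CNI configuration can conflict with a newly installed plugin, here is a minimal sketch. It assumes the commonly documented behavior that, when several config files sit in the CNI directory, the one that sorts first by name is the one that gets loaded. `active_cni_config` is a hypothetical helper name, and the file names mentioned afterwards are only examples.

```shell
#!/usr/bin/env bash
# Sketch: print the CNI config file that would be picked from a directory,
# ASSUMING "first file in lexicographic order wins".
# active_cni_config is a hypothetical helper, not part of any CNI tooling.
active_cni_config() {
  local dir="${1:-/etc/cni/net.d}"
  # Only files with these extensions are treated as CNI network configs here.
  find "$dir" -maxdepth 1 -type f \
    \( -name '*.conf' -o -name '*.conflist' -o -name '*.json' \) 2>/dev/null \
    | LC_ALL=C sort \
    | head -n 1
}
```

Run on a node as `active_cni_config /etc/cni/net.d`. Under the stated assumption, a leftover file such as `10-calico.conflist` would sort ahead of a new `10-flannel.conflist`, so the runtime would still try to invoke the removed Calico plugin.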

Steps to Diagnose the Issue

Check Pod Status: Use `kubectl get pods` to identify the pods that are stuck. Follow up with `kubectl describe pod <pod-name>` to get detailed information.
Review Events and Logs: The Events section of the `kubectl describe pod` output gives the reason the pod is stuck. Use `kubectl get events --sort-by=.metadata.creationTimestamp` to list cluster events in chronological order, with the most recent last.
Check Networking: On the affected node, inspect interfaces and addresses with `ip addr`, and confirm the node's name with `hostname`, to rule out link and name-resolution problems.
Verify CNI Configuration: Check the contents of `/etc/cni/net.d/` for stale or residual CNI configuration files that might have been left behind after Calico was removed.
Evaluate Node Networking State: Use `kubectl describe node <node-name>` to inspect the node's network configuration and status.
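
The first and fourth checks above can be scripted. The following is a rough sketch rather than a definitive tool: `stuck_pods` and `calico_cni_leftovers` are hypothetical helper names, and the column positions assume the default output of `kubectl get pods -A --no-headers` (namespace, name, ready, status, and so on).

```shell
#!/usr/bin/env bash
# stuck_pods (hypothetical helper): reads `kubectl get pods -A --no-headers`
# output on stdin and prints namespace/name for every pod whose STATUS
# column (4th field) is ContainerCreating.
stuck_pods() {
  awk '$4 == "ContainerCreating" { print $1 "/" $2 }'
}

# calico_cni_leftovers (hypothetical helper): lists files in a CNI config
# directory whose contents mention "calico" (case-insensitive).
# Prints nothing if the directory is missing or has no matches.
calico_cni_leftovers() {
  local dir="${1:-/etc/cni/net.d}"
  [ -d "$dir" ] || return 0
  grep -lis 'calico' "$dir"/* 2>/dev/null || true
}
```

Typical usage would be `kubectl get pods -A --no-headers | stuck_pods` from a machine with cluster access, and `calico_cni_leftovers` run directly on each affected node. The content match is deliberately loose, so review each reported file before deciding it is stale.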

Resolution Steps

Once you've identified the issue via diagnosis:

Validate and Clean CNI Configurations: Ensure that the `/etc/cni/net.d/` directory only contains valid JSON configuration files for the active CNI plugin.
Reinstall Calico or Alternative CNI: Kubernetes needs a CNI plugin:
- Reinstall Calico using the official documentation and apply the correct configurations.
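- Alternatively, choose another CNI such as Flannel, Cilium, or Weave, according to your security and networking requirements.
Reset Pod Networking:
- Delete the stuck pods using `kubectl delete pod <pod-name> --grace-period=0 --force`. Pods managed by a controller such as a Deployment are recreated automatically; standalone pods must be created again from their manifests.
- Check whether the recreated pods start and receive an IP address.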
Remove CRDs or Unnecessary Configurations: If residual Custom Resource Definitions remain, clean them using: