Kubernetes cluster in Azure AKS: upgrade to 1.24.9 in failed state with pods facing intermittent DNS issues
Overview of AKS Upgrades and Common Issues
Azure Kubernetes Service (AKS) provides managed Kubernetes clusters, abstracting away much of the complexity of operating a container orchestration environment. Regular upgrades are essential for maintaining security, integrating new features, and ensuring compatibility. However, upgrades occasionally go wrong. One reported example is the upgrade to AKS version 1.24.9, after which some clusters end up in a failed state and pods experience intermittent DNS problems.
Background of the Issue
When a Kubernetes cluster on AKS is upgraded to version 1.24.9, some users have reported that pods intermittently face DNS resolution issues. This typically manifests as:
- Pods being unable to resolve external or internal DNS names.
- Degraded application functionality due to failed service name resolution.
- Network timeouts or errors logged in the application.
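To tell an intermittent failure from a hard outage, it can help to resolve the same name repeatedly and count how often it fails. The following is a minimal Python sketch under stated assumptions, not a definitive tool: the names probe_dns and resolve_once, the attempt count, and the delay are all chosen here for illustration.

```python
import socket
import time
from typing import Callable, Dict


def resolve_once(hostname: str) -> None:
    """Resolve hostname with the system resolver; raises socket.gaierror on failure."""
    socket.getaddrinfo(hostname, None)


def probe_dns(
    hostname: str,
    attempts: int = 20,
    delay_s: float = 0.5,
    resolver: Callable[[str], None] = resolve_once,
) -> Dict[str, float]:
    """Resolve hostname repeatedly and summarize failures and worst-case latency."""
    failures = 0
    slowest_s = 0.0
    for i in range(attempts):
        start = time.monotonic()
        try:
            resolver(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            failures += 1
        slowest_s = max(slowest_s, time.monotonic() - start)
        # Pause between attempts, but not after the last one.
        if delay_s and i < attempts - 1:
            time.sleep(delay_s)
    return {
        "attempts": attempts,
        "failures": failures,
        "failure_rate": failures / attempts if attempts else 0.0,
        "slowest_s": slowest_s,
    }
```

Run from inside an affected pod, a call such as probe_dns("kubernetes.default.svc.cluster.local") exercises internal resolution (assuming the default cluster.local domain), while an external hostname exercises the upstream path. A failure rate strictly between 0 and 1 points to an intermittent problem rather than a complete outage.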
Technical Explanation
Root Cause Analysis
The intermittent DNS failures usually trace back to DNS configuration, cluster networking, or resource pressure on the DNS pods after the upgrade:
- CoreDNS Misconfiguration: If CoreDNS, the DNS service running within the cluster, is incorrectly configured or incompatible with the new version, DNS resolution failures can occur. For instance, deprecated plugins or incompatible settings from previous versions may lead to failures.
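- CNI Plugin Issues: If networking changes in the new Kubernetes version conflict with the existing CNI configuration, sporadic connectivity problems can follow, including dropped traffic between pods and CoreDNS. AKS clusters normally run kubenet or Azure CNI; Flannel is relevant only if it was installed on a bring-your-own-CNI cluster.
- Pod Resource Limits: If the CoreDNS pods lack appropriate CPU or memory requests and limits, they can be throttled or evicted under load, which results in intermittent DNS failures.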
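The deprecated-plugin cause can be checked mechanically by scanning the Corefile for plugin names that newer CoreDNS releases no longer accept. The sketch below is illustrative only: the function name and the default plugin set are assumptions made here. The set contains just proxy, which CoreDNS removed in favor of forward, and should be extended after checking the release notes for the CoreDNS version the cluster actually runs.

```python
from typing import Iterable, List, Tuple

# Assumption for this sketch: a deliberately small default set.
# `proxy` was removed from CoreDNS (replaced by `forward`); add others as needed.
REMOVED_PLUGINS = frozenset({"proxy"})


def find_removed_plugins(
    corefile: str, removed: Iterable[str] = REMOVED_PLUGINS
) -> List[Tuple[int, str]]:
    """Return (line_number, plugin) for removed plugins used in a server block."""
    removed = set(removed)
    findings = []
    depth = 0  # brace nesting: 0 = top level, 1 = inside a server block
    for lineno, raw in enumerate(corefile.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        first_token = line.split()[0]
        # Plugin directives sit directly inside a server block (depth 1);
        # deeper lines are options belonging to a plugin.
        if depth == 1 and first_token in removed:
            findings.append((lineno, first_token))
        depth += line.count("{") - line.count("}")
    return findings
```

The Corefile text can be obtained with kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'. On AKS, user customizations normally live in the separate coredns-custom ConfigMap, so those entries are worth scanning as well.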
Examples
Consider a debug session where a pod is experiencing DNS resolution failure:
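- Verify that the upgrade did not leave any nodes running an older, incompatible version, for example by comparing the VERSION column in the output of kubectl get nodes -o wide.
- Check the CoreDNS configuration for deprecated plugins or incorrect settings, for example with kubectl -n kube-system get configmap coredns -o yaml.
- Confirm that the CNI plugin is correctly configured and compatible. If the cluster uses Flannel, check that its pods are running on every node, for example with kubectl get pods --all-namespaces -o wide | grep -i flannel.
- Ensure adequate resource requests and limits for CoreDNS, for example by inspecting kubectl -n kube-system describe deployment coredns.
- Validate the VNet (Virtual Network) settings and inter-node connectivity, and make sure that network security rules do not block DNS traffic (UDP and TCP port 53) between nodes.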
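The node-version and CoreDNS resource checks above lend themselves to a small script that works on the JSON printed by kubectl. This is a hedged sketch rather than an official tool: the function names and the REQUIRED_FIELDS policy are assumptions made for illustration, and the input is expected to be the parsed output of kubectl get nodes -o json and kubectl -n kube-system get deployment coredns -o json.

```python
from typing import Any, Dict, List

# Assumption for this sketch: the resource fields every container should set.
REQUIRED_FIELDS = ("requests.cpu", "requests.memory", "limits.memory")


def nodes_not_on_version(node_list: Dict[str, Any], target: str) -> List[str]:
    """Return 'name (version)' for nodes whose kubelet version differs from target."""
    stale = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        version = node["status"]["nodeInfo"]["kubeletVersion"]
        # kubelet reports versions such as "v1.24.9"; ignore the leading "v".
        if version.lstrip("v") != target.lstrip("v"):
            stale.append(f"{name} ({version})")
    return stale


def missing_resource_fields(deployment: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map container name to the required resource fields it does not set."""
    report = {}
    for container in deployment["spec"]["template"]["spec"]["containers"]:
        resources = container.get("resources") or {}
        missing = []
        for field in REQUIRED_FIELDS:
            section, key = field.split(".")
            if key not in (resources.get(section) or {}):
                missing.append(field)
        if missing:
            report[container["name"]] = missing
    return report
```

Both functions take plain dictionaries, so the kubectl output can be saved to a file and loaded with json.load before being passed in. An empty result from each function means that every node reports the target version and every CoreDNS container sets the listed fields.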
Best Practices
- Staging Environment Testing: Always test upgrades in a lower-tier environment that closely mimics production before upgrading production clusters.
- Automated Monitoring: Implement automated alerts to monitor DNS resolution time and error rates to catch issues early.
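- Regular Backups: AKS manages etcd as part of the control plane and does not expose it to users, so keep regular backups of cluster resources and persistent data instead (for example with a tool such as Velero) so that workloads can be restored if an upgrade goes wrong.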
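The automated-monitoring practice above can be sketched as a sliding-window error-rate check that is fed one result per DNS lookup. The class name, window size, and threshold below are assumptions for illustration; a production setup would more likely rely on the cluster's existing metrics and alerting stack.

```python
from collections import deque


class DnsErrorRateAlert:
    """Signal when the DNS failure rate over the last `window` lookups is too high."""

    def __init__(self, window: int = 50, threshold: float = 0.05) -> None:
        self._results = deque(maxlen=window)  # oldest results drop off automatically
        self.threshold = threshold

    def record(self, ok: bool) -> bool:
        """Record one lookup result; return True if the alert should fire."""
        self._results.append(ok)
        # Wait for a full window so a single early failure does not fire the alert.
        if len(self._results) < self._results.maxlen:
            return False
        failure_rate = self._results.count(False) / len(self._results)
        return failure_rate > self.threshold
```

Requiring a full window before alerting trades a short warm-up delay for fewer false alarms, which suits failures that are intermittent by nature.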

