Docker container live migration in Kubernetes
Introduction
Kubernetes does not natively provide true live migration of a running container in the virtual-machine sense. Its model is different: instead of moving one live container with all memory and open connections intact, Kubernetes usually replaces pods, reschedules workloads, and relies on replicas, readiness, and externalized state to keep applications available.
What people usually mean by "live migration"
In a VM platform, live migration means moving a running instance from one host to another while preserving in-memory state with minimal interruption. Containers in Kubernetes are not managed that way by default.
A pod is treated as replaceable. If a node becomes unsuitable, Kubernetes generally creates a replacement pod elsewhere and terminates the old one. That works well for stateless services, but it is not the same as transferring a process image mid-flight.
So the short answer is: standard Kubernetes supports rescheduling, not transparent live migration.
Why Kubernetes favors replacement over migration
Kubernetes assumes applications should survive pod restarts and relocations through design:
- multiple replicas behind a Service
- external databases or volumes for persistent state
- readiness probes so traffic only reaches healthy pods
- rolling updates for controlled replacement
That philosophy removes the need for live migration in many cases. If one pod disappears and another becomes ready quickly, the service stays available even though no single process was migrated.
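A typical deployment runs several replicas from one pod template, gates traffic on a readiness probe, and exposes the pods through a Service.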
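As a minimal sketch of such a setup, the manifests might be written as follows. The application name `web`, the image, the port, and the `/healthz` probe path are all hypothetical placeholders, not values from any real system.

```yaml
# Hypothetical example: name, image, port, and probe path are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                    # several replicas, so one pod can be replaced safely
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1          # replace at most one pod at a time
      maxSurge: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example.com/web:1.0   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:              # traffic reaches the pod only after this passes
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web                     # routes to whichever ready pods carry this label
  ports:
    - port: 80
      targetPort: 8080
```

The Service selects pods by label rather than by identity, which is why a replacement pod on another node can take over traffic without any process being migrated.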
With several replicas, one pod can be drained and replaced without the service disappearing.
What is possible with checkpoint and restore
Experimental container checkpoint and restore techniques do exist, built on tools such as CRIU (Checkpoint/Restore In Userspace). At a low level, they can sometimes freeze a running process, write its state to disk, and restore it later. However, this is not the normal Kubernetes operational model, and it has serious limitations around kernel support, networking, runtime compatibility, and workload type.
Even where checkpointing exists, it is better understood as specialized runtime support than as a standard "move any pod anywhere" Kubernetes feature.
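To illustrate how specialized this support is, Kubernetes exposes checkpointing only as a feature-gated kubelet API intended for forensic analysis, not as a scheduler-driven migration. A sketch of the kubelet configuration that turns the gate on, assuming a cluster version where it is not already enabled by default and a container runtime that supports checkpointing:

```yaml
# Sketch only: enables the kubelet checkpoint API on one node.
# Assumes the container runtime on that node supports checkpointing.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ContainerCheckpoint: true
# With the gate on, the kubelet accepts an authenticated request of the form
#   POST /checkpoint/{namespace}/{pod}/{container}
# and writes a checkpoint archive to local disk on that node.
# Kubernetes does not schedule or restore that archive on another node for you.
```

Producing the archive is the only step Kubernetes handles; moving it elsewhere and restoring it are manual and runtime-specific, which is why this is not a live-migration workflow.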
For most production teams, designing for restartability is more realistic than designing around live migration.
Practical alternatives that solve the real availability problem
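If your real goal is "avoid downtime during node maintenance," use the Kubernetes features meant for that outcome: drain nodes with `kubectl drain` so pods are evicted gracefully, and protect each workload with a PodDisruptionBudget that caps how many of its pods a voluntary disruption may take down at once.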
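A minimal sketch of a PodDisruptionBudget, assuming pods that carry the hypothetical label `app: web` and run with three replicas:

```yaml
# Hypothetical example: the name and label are assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2                # evictions are refused if fewer than 2 pods would remain
  selector:
    matchLabels:
      app: web
# During maintenance, a drain such as
#   kubectl drain <node-name> --ignore-daemonsets
# evicts pods through the Eviction API, which honors this budget.
```

The budget applies to voluntary disruptions such as drains. It does not protect against a node failing outright, so it complements replicas rather than replacing them.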
Combined with multiple replicas and proper readiness checks, this keeps enough pods available during voluntary disruptions.
For stateful applications, the usual answer is:
- keep state outside the container process when possible
- use PersistentVolumeClaims when disk-backed state is required
- handle reconnects and leader election explicitly
That gives you resilience that survives normal scheduling events instead of depending on fragile migration behavior.
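For the disk-backed case, one common pattern is a StatefulSet with a volume claim template, so each replica gets its own PersistentVolumeClaim that outlives any individual pod. The sketch below assumes a hypothetical application named `db`, a placeholder image and mount path, a headless Service named `db` defined elsewhere, and a default StorageClass in the cluster:

```yaml
# Hypothetical example: names, image, mount path, and size are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db                # headless Service assumed to exist separately
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      terminationGracePeriodSeconds: 30   # time to flush and close cleanly on eviction
      containers:
        - name: db
          image: example.com/db:1.0       # placeholder image
          volumeMounts:
            - name: data
              mountPath: /var/lib/db
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

When a pod such as `db-0` is rescheduled, the replacement reuses the same claim, so on-disk state survives even though in-memory state does not. Reconnect handling and leader election still have to be implemented by the application.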
Common pitfalls
The biggest mistake is assuming Kubernetes can move a running container with open sockets and memory state the same way a hypervisor can move a VM. That is not the default contract of the platform.
Another mistake is asking for live migration when the real problem is insufficient redundancy. If one replica going down causes downtime, the issue is usually service design or deployment strategy, not missing migration support.
Teams also underestimate how much runtime and kernel detail checkpoint-and-restore depends on. Even when it works in a lab, it may not generalize cleanly to mixed production workloads.
Finally, do not store critical transient state only inside one pod if high availability matters. Kubernetes is strongest when pods are replaceable.
Summary
- Kubernetes does not natively do true live migration of running containers.
- Its normal approach is replacement and rescheduling, not moving in-memory process state.
- High availability usually comes from replicas, readiness probes, Services, and externalized state.
- Checkpoint and restore exists in specialized scenarios but is not a standard general-purpose migration workflow.
- Design workloads to tolerate restarts instead of depending on container live migration.

