diff --git a/docs/en/solutions/virt_launcher_Pods_Remain_in_Completed_State_After_Live_Migration.md b/docs/en/solutions/virt_launcher_Pods_Remain_in_Completed_State_After_Live_Migration.md new file mode 100644 index 0000000..2c35976 --- /dev/null +++ b/docs/en/solutions/virt_launcher_Pods_Remain_in_Completed_State_After_Live_Migration.md @@ -0,0 +1,122 @@ +--- +kind: + - Information +products: + - Alauda Container Platform +ProductsVersion: + - 4.1.0,4.2.x +--- +## Overview + +After several successive live migrations of a VirtualMachine, the cluster accumulates multiple `virt-launcher--` pods in `Completed` state alongside the single `Running` pod that currently hosts the VM: + +```text +virt-launcher-fedora-skilled-bonobo-dljkv 0/1 Completed 0 12m +virt-launcher-fedora-skilled-bonobo-k6t92 1/1 Running 0 11s +virt-launcher-fedora-skilled-bonobo-lx2th 0/1 Completed 0 54s +virt-launcher-fedora-skilled-bonobo-qdkmk 0/1 Completed 0 84s +virt-launcher-fedora-skilled-bonobo-qk5qt 0/1 Completed 0 33s +``` + +This is the expected behaviour of the current KubeVirt implementation — not a failure mode — but it can look alarming on a busy VM page and has one real operational consequence around vTPM PVCs. + +## Root Cause + +Each live migration spins up a **target** `virt-launcher` pod before the source hands off the running VM. Once the handoff completes, the source pod terminates gracefully and enters `Completed`. KubeVirt intentionally does **not** garbage-collect the source pod: operators may need the source's logs to diagnose a migration that later turns out to have degraded the workload, and the Completed pod holds a reference to any PVCs the source had mounted. + +KubeVirt does garbage-collect the `VirtualMachineInstanceMigration` objects (keeping only the most recent few), and the Completed pods are all torn down automatically when the VMI itself is deleted (VM stop). A VM that migrates N times during its lifetime will therefore retain N-1 Completed pods until it is stopped. + +The one side-effect worth acting on is **vTPM**. A vTPM persistent-state PVC is declared `ReadWriteOnce` because the vTPM state must be owned by a single live QEMU process at a time. While a Completed source pod still exists, it still holds a `VolumeAttachment` reference, and the PVC cannot finalise a `Terminating` state. This is visible as a PVC that appears stuck deleting. + +## Resolution + +Decide between leaving the pods alone (recommended for steady-state clusters) and automating a cleanup. + +1. **Leave them be for short-lived VMs.** A VM that is destined to be stopped anyway will cleanse itself when stopped; no action is needed. + +2. **Schedule a cleanup CronJob for long-lived VMs** that migrate frequently (HPA-driven workloads, GPU partners with heavy NUMA rebalancing). Keep the Job RBAC narrow — only the verbs needed to delete pods in the VM namespaces: + + ```yaml + apiVersion: batch/v1 + kind: CronJob + metadata: + name: virt-launcher-completed-gc + namespace: virt-launcher-gc + spec: + schedule: "17 */6 * * *" # every 6 hours at :17 + concurrencyPolicy: Forbid + successfulJobsHistoryLimit: 1 + jobTemplate: + spec: + backoffLimit: 0 + ttlSecondsAfterFinished: 600 + template: + spec: + serviceAccountName: virt-launcher-gc + restartPolicy: Never + containers: + - name: gc + image: bitnami/kubectl:1.33 + command: + - /bin/sh + - -ec + - | + kubectl get pod -A \ + -l kubevirt.io=virt-launcher \ + --field-selector status.phase=Succeeded \ + -o json \ + | jq -r '.items[] + | select((now - ((.status.containerStatuses[]?.state.terminated.finishedAt // .metadata.creationTimestamp) | fromdateiso8601)) > 3600) + | "\(.metadata.namespace) \(.metadata.name)"' \ + | while read ns name; do + echo "deleting $ns/$name" + kubectl -n "$ns" delete pod "$name" --ignore-not-found + done + ``` + + The filter above only deletes pods that have been in `Succeeded` for more than an hour, which leaves a grace window for investigating a recently-migrated VM. + +3. **Unstick a Terminating vTPM PVC** by deleting the Completed source pod that still references it. Confirm the running pod is healthy first: + + ```bash + NS=; VM= + kubectl -n "$NS" get pod -l kubevirt.io=virt-launcher,vm.kubevirt.io/name="$VM" -o wide + kubectl -n "$NS" get pvc | grep -i tpm + kubectl -n "$NS" delete pod + ``` + + If the PVC still refuses to finalise, inspect `kubectl -n get volumeattachment` for stale attachments and follow up on the CSI driver side. + +4. **Scope the cleanup carefully.** The filter must match *only* Completed `virt-launcher` pods. A broader `--field-selector status.phase=Succeeded` applied across all namespaces will also delete Job pods and other Succeeded workloads that the cluster is still using as audit references. + +## Diagnostic Steps + +Count Completed virt-launcher pods per VM: + +```bash +kubectl get pod -A -l kubevirt.io=virt-launcher \ + --field-selector status.phase=Succeeded \ + -o json \ +| jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.labels["vm.kubevirt.io/name"])"' \ +| sort | uniq -c | sort -rn | head +``` + +Identify Completed pods older than an hour (cleanup candidates): + +```bash +kubectl get pod -A -l kubevirt.io=virt-launcher \ + --field-selector status.phase=Succeeded \ + -o json \ +| jq -r '.items[] + | select((now - (.metadata.creationTimestamp | fromdateiso8601)) > 3600) + | "\(.metadata.namespace)\t\(.metadata.name)"' +``` + +Confirm a stuck PVC is being held by a Completed source pod, not by the Running one: + +```bash +NS=; PVC= +kubectl -n "$NS" describe pvc "$PVC" | grep -i used-by -A5 +``` + +If the `Used By` line lists more than one pod, the extras are the Completed source pods; remove them in order from oldest to newest. If `Used By` lists only the currently-Running pod and the PVC still fails to finalise, the issue is not the Completed-pod behaviour described here — investigate the CSI driver's detach path.