From 291cb6820159dd6a35ab47e3cbd3f449741d2d3d Mon Sep 17 00:00:00 2001 From: Komh Date: Wed, 22 Apr 2026 15:09:29 +0800 Subject: [PATCH] [virtualization] VM Live Migration Fails After Virtualization Upgrade on Emerald Rapids Hosts --- ...ization_Upgrade_on_Emerald_Rapids_Hosts.md | 64 +++++++++++++++++++ 1 file changed, 64 insertions(+) create mode 100644 docs/en/solutions/VM_Live_Migration_Fails_After_Virtualization_Upgrade_on_Emerald_Rapids_Hosts.md diff --git a/docs/en/solutions/VM_Live_Migration_Fails_After_Virtualization_Upgrade_on_Emerald_Rapids_Hosts.md b/docs/en/solutions/VM_Live_Migration_Fails_After_Virtualization_Upgrade_on_Emerald_Rapids_Hosts.md new file mode 100644 index 00000000..b18db8af --- /dev/null +++ b/docs/en/solutions/VM_Live_Migration_Fails_After_Virtualization_Upgrade_on_Emerald_Rapids_Hosts.md @@ -0,0 +1,64 @@ +--- +kind: + - Troubleshooting +products: + - Alauda Container Platform +ProductsVersion: + - 4.1.0,4.2.x +--- +## Issue + +After upgrading the ACP Virtualization Operator, live migration fails for VMs that had been live-migrated at least once before the upgrade: + +- The target `virt-launcher` pod stays in `Pending` because no node satisfies its scheduling requirements. +- Newly created or cold-restarted VMs migrate successfully. +- The problem is observed on worker nodes running on Intel Emerald Rapids CPUs. + +## Root Cause + +Older `libvirt` builds do not have an exact match for Emerald Rapids and report `SapphireRapids` as the `host-model`. While that was the reported model, the KubeVirt `node-labeller` applied the label `cpu-model-migration.node.kubevirt.io/SapphireRapids=true` to every node, and any VM that live-migrated during that period picked up a matching `nodeSelector` on its `virt-launcher` pod. + +A newer `libvirt` shipped with the updated operator adds a proper entry for Emerald Rapids, which it reports as `SierraForest`. Once the upgrade is applied and the node-labeller re-runs, the `SapphireRapids` label is removed and replaced with `SierraForest`. VMs that still carry the `SapphireRapids` nodeSelector on their `virt-launcher` pod no longer match any node in the cluster, and migration targets fail to schedule. + +In addition, the affected build reports the `SapphireRapids` CPU model as `usable=no` in `domcapabilities`, so even a workaround that re-pins the old label on a single node cannot satisfy new migrations reliably. + +## Resolution + +There are two workarounds. Pick the one that fits the maintenance window for the affected VMs. + +**Option A — cold-restart the affected VMs (preferred).** Stopping and starting the VM creates a fresh `virt-launcher` pod without the stale `SapphireRapids` nodeSelector. Subsequent live migrations work as expected. + +**Option B — re-apply the missing node label (no VM restart).** For each worker node on which the affected VMs must be able to run: + +```bash +# Tell the node-labeller to skip this node so our manual label survives +kubectl annotate node node-labeller.kubevirt.io/skip-node=true + +# Re-apply the label the stuck VMs still reference +kubectl label node cpu-model-migration.node.kubevirt.io/SapphireRapids=true +``` + +Apply Option B only after the operator upgrade has completed. If the node-labeller had been disabled prior to upgrade, re-enable it first, let it re-label the nodes, and then apply the annotate-plus-label step. + +Option A is preferred once a maintenance window is available, because Option B pins a CPU model that the host no longer reports as usable. + +## Diagnostic Steps + +Inspect a failing VM's `virt-launcher` pod for the stale nodeSelector: + +```bash +kubectl get pod -o yaml | yq '.spec.nodeSelector' +# cpu-model-migration.node.kubevirt.io/SapphireRapids: "true" +``` + +Compare node labels — post-upgrade, the `SapphireRapids` label is gone and `SierraForest` is present instead: + +```bash +kubectl get node --show-labels | grep cpu-model-migration +``` + +Check `domcapabilities` from the affected VM's `virt-launcher` pod to confirm the CPU model reported by `libvirt`. A fresh post-upgrade pod reports `SierraForest`, while any pod carrying the old model predates the upgrade: + +```bash +kubectl exec -- virsh domcapabilities | grep -A1 'model' +```