ClusterPolicy Status Fluctuates from 'Ready' to 'NotReady' and Back to 'Ready' During GPU Operator Upgrade in Multi-GPU Node Clusters #1567
Description
During a GPU Operator upgrade in a multi-GPU node cluster, we observed that the ClusterPolicy status temporarily fluctuates as follows:
- Starts as NotReady
- Switches to Ready (while upgrade is partially complete)
- Then flips back to NotReady
- Finally stabilises at Ready once the upgrade is complete on all GPU nodes
This might be because the nvidia-driver-daemonset upgrades one node at a time, so new driver pods can come up Ready on one node while older pods on other nodes are still terminating. A minimal way to observe this is shown below.
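For reference, this is roughly how we watched the flapping, following the driver daemonset rollout and the ClusterPolicy status side by side (the `gpu-operator` namespace and the `app=nvidia-driver-daemonset` label are assumptions based on a default install; adjust for your deployment):

```sh
# Terminal 1: follow driver pods as the daemonset upgrades node by node
# (namespace and pod label assume a default GPU Operator install)
kubectl -n gpu-operator get pods -l app=nvidia-driver-daemonset -o wide -w

# Terminal 2: watch the ClusterPolicy status flip between ready/notReady
kubectl get clusterpolicy.nvidia.com -w
```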
Attaching the ClusterPolicy status observed over the course of the upgrade, from `kubectl get clusterpolicy.nvidia.com`:
| NAME | STATUS | AGE |
|---|---|---|
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | ready | 2025-04-10T15:39:23Z |
| cluster-policy | ready | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | notReady | 2025-04-10T15:39:23Z |
| cluster-policy | ready | 2025-04-10T15:39:23Z |
| cluster-policy | ready | 2025-04-10T15:39:23Z |
ClusterPolicy driver/upgrade settings:

```yaml
driver:
  enabled: true
  upgradePolicy:
    autoUpgrade: true
    maxParallelUpgrades: 1
    maxUnavailable: 25%
    waitForCompletion:
      timeoutSeconds: 0
      podSelector: ""
    gpuPodDeletion:
      force: false
      timeoutSeconds: 300
      deleteEmptyDir: false
```
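With `maxParallelUpgrades: 1`, the upgrade controller processes one GPU node at a time, which is consistent with the flapping above. One way we tracked per-node progress is via the upgrade-state node label (a sketch; it assumes the controller sets `nvidia.com/gpu-driver-upgrade-state`, which may differ across GPU Operator versions):

```sh
# List each node's driver upgrade state as labelled by the upgrade
# controller (label name is an assumption; verify against your version)
kubectl get nodes -L nvidia.com/gpu-driver-upgrade-state
```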
❓ Questions:
- Is this fluctuation in ClusterPolicy status (between Ready and NotReady) expected behaviour during the upgrade process?
- Shouldn't the ClusterPolicy ideally remain NotReady until the upgrade is fully completed across all GPU nodes?
- Is there any recommended way to monitor or control the ClusterPolicy readiness more accurately during upgrades in multi-GPU node environments? (The readiness gate we use today is sketched below.)
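For context, this is roughly how we gate on readiness today; because the status flips to ready mid-upgrade, it can return before all nodes are done (a minimal sketch, assuming the ClusterPolicy is named `cluster-policy` and that `.status.state` backs the STATUS column; verify with `kubectl get clusterpolicy.nvidia.com -o yaml`):

```sh
# Wait until the ClusterPolicy reports ready. With the flapping above,
# this can succeed while the driver upgrade is only partially complete.
kubectl wait clusterpolicy.nvidia.com/cluster-policy \
  --for=jsonpath='{.status.state}'=ready \
  --timeout=30m
```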