
ClusterPolicy Status Fluctuates from 'Ready' to 'NotReady' and Back to 'Ready' During GPU Operator Upgrade in Multi-GPU Node Clusters #1567

@moditanisha22

Description

During a GPU Operator upgrade in a multi-GPU node cluster, we observed that the ClusterPolicy status temporarily fluctuates as follows:

  • Starts as NotReady
  • Switches to Ready (while the upgrade is only partially complete)
  • Then flips back to NotReady
  • Finally stabilises at Ready once the upgrade is complete on all GPU nodes

This might be because the nvidia-driver-daemonset upgrades one node at a time, during which new pods may start while the older ones are still terminating.

Attaching the ClusterPolicy status as reported by `kubectl get clusterpolicy.nvidia.com`:

NAME STATUS AGE
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy ready 2025-04-10T15:39:23Z
cluster-policy ready 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy ready 2025-04-10T15:39:23Z
cluster-policy ready 2025-04-10T15:39:23Z
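As a rough sanity check, the flapping visible in the output above can be quantified by counting Ready/NotReady transitions in the watched status stream. This is a minimal sketch; the `count_flaps` helper and the sample list are illustrative, not part of the GPU Operator:

```python
def count_flaps(statuses):
    """Count state transitions (e.g. notReady -> ready) in an ordered
    sequence of ClusterPolicy status samples."""
    return sum(1 for prev, cur in zip(statuses, statuses[1:]) if prev != cur)

# Condensed version of the watch output above: notReady, a brief ready
# window, notReady again, then ready at the end.
observed = ["notReady"] * 19 + ["ready"] * 2 + ["notReady"] * 10 + ["ready"] * 2
print(count_flaps(observed))  # → 3
```

A value greater than 1 for a single upgrade indicates the status flapped rather than transitioning once from NotReady to Ready.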

ClusterPolicy:

driver:
  enabled: true
  upgradePolicy:
    autoUpgrade: true
    maxParallelUpgrades: 1
    maxUnavailable: 25%
    waitForCompletion:
      timeoutSeconds: 0
      podSelector: ""
    gpuPodDeletion:
      force: false
      timeoutSeconds: 300
      deleteEmptyDir: false

❓Questions:

  • Is this fluctuation in ClusterPolicy status (between Ready and NotReady) expected behaviour during the upgrade process?

  • Shouldn't the ClusterPolicy ideally remain in NotReady until the upgrade is fully completed across all GPU nodes?

  • Is there any recommended way to monitor or control the ClusterPolicy readiness more accurately during upgrades in multi-GPU node environments?

Labels: bug