
ClusterPolicy Status Fluctuates from 'Ready' to 'NotReady' and Back to 'Ready' During GPU Operator Upgrade in Multi-GPU Node Clusters #1567

@moditanisha22

Description

During a GPU Operator upgrade in a multi-GPU node cluster, we observed that the ClusterPolicy status temporarily fluctuates as follows:

  • Starts as NotReady
  • Switches to Ready (while the upgrade is only partially complete)
  • Then flips back to NotReady
  • Finally stabilises at Ready once the upgrade is complete on all GPU nodes

This might be because the nvidia-driver-daemonset upgrades one node at a time, during which new pods may start while the older ones are still terminating.

Attaching the ClusterPolicy status as reported by `kubectl get clusterpolicy.nvidia.com`:

NAME STATUS AGE
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy ready 2025-04-10T15:39:23Z
cluster-policy ready 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy notReady 2025-04-10T15:39:23Z
cluster-policy ready 2025-04-10T15:39:23Z
cluster-policy ready 2025-04-10T15:39:23Z
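As a rough sanity check, the flapping visible in the output above can be quantified by counting Ready/NotReady transitions in the watched status stream. This is a minimal sketch; the `count_flaps` helper and the sample list are illustrative, not part of the GPU Operator:

```python
def count_flaps(statuses):
    """Count state transitions (e.g. notReady -> ready) in an ordered
    sequence of ClusterPolicy status samples."""
    return sum(1 for prev, cur in zip(statuses, statuses[1:]) if prev != cur)

# Condensed version of the watch output above: notReady, a brief ready
# window, notReady again, then ready at the end.
observed = ["notReady"] * 19 + ["ready"] * 2 + ["notReady"] * 10 + ["ready"] * 2
print(count_flaps(observed))  # → 3
```

A value greater than 1 for a single upgrade indicates the status flapped rather than transitioning once from NotReady to Ready.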

ClusterPolicy:

driver:
  enabled: true
  upgradePolicy:
    autoUpgrade: true
    maxParallelUpgrades: 1
    maxUnavailable: 25%
    waitForCompletion:
      timeoutSeconds: 0
      podSelector: ""
    gpuPodDeletion:
      force: false
      timeoutSeconds: 300
      deleteEmptyDir: false

❓Questions:

  • Is this fluctuation in ClusterPolicy status (between Ready and NotReady) expected behaviour during the upgrade process?

  • Shouldn't the ClusterPolicy ideally remain in NotReady until the upgrade is fully completed across all GPU nodes?

  • Is there any recommended way to monitor or control the ClusterPolicy readiness more accurately during upgrades in multi-GPU node environments?

Labels: bug