nvidia-device-plugin-validator fails after node reboot (with MIG enabled) #403

@dasantonym

System

Running on bare-metal

  • Ubuntu 20.04.4
  • Kubernetes v1.24.3
  • Containerd 1.6.7
  • GPU-Operator v1.11.1

Setup

GPU-Operator is installed with:

helm install --wait --debug --generate-name --create-namespace \
      nvidia/gpu-operator \
      -n gpu-operator \
      --set migManager.config.name=mig-config \
      --set mig.strategy=mixed \
      --set driver.enabled=false \
      --set toolkit.enabled=false

MIG-Config:

apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false
      node-standard:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "2g.20gb": 3
        - devices: [1]
          mig-enabled: true
          mig-devices:
            "2g.20gb": 3
        - devices: [2]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2
        - devices: [3]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2
        - devices: [4,5,6,7]
          mig-enabled: false
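
As a sanity check on the node-standard layout above, here is a minimal sketch that counts how many compute slices each entry requests. It assumes GPUs with 7 compute slices per device (as on an A100 80GB, which the profile names suggest but this issue does not state) and that the leading "Ng" in a profile name is the number of compute slices per instance. The helper names are hypothetical and not part of the GPU Operator or mig-parted; real MIG placement rules are stricter than a simple sum.

```python
import re

# Assumption: 7 compute slices per GPU (e.g. A100 80GB); not stated in the issue.
COMPUTE_SLICES_PER_GPU = 7


def compute_slices(profile: str) -> int:
    """Return the compute-slice count encoded in a profile name like '2g.20gb'."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", profile)
    if match is None:
        raise ValueError(f"unrecognized MIG profile name: {profile!r}")
    return int(match.group(1))


def slices_requested(entry: dict) -> int:
    """Sum the compute slices one config entry requests per GPU."""
    if not entry.get("mig-enabled", False):
        return 0
    devices = entry.get("mig-devices", {})
    return sum(compute_slices(name) * count for name, count in devices.items())


# The node-standard entries from the ConfigMap, written out as Python data.
node_standard = [
    {"devices": [0], "mig-enabled": True, "mig-devices": {"2g.20gb": 3}},
    {"devices": [1], "mig-enabled": True, "mig-devices": {"2g.20gb": 3}},
    {"devices": [2], "mig-enabled": True, "mig-devices": {"3g.40gb": 2}},
    {"devices": [3], "mig-enabled": True, "mig-devices": {"3g.40gb": 2}},
    {"devices": [4, 5, 6, 7], "mig-enabled": False},
]

for entry in node_standard:
    used = slices_requested(entry)
    fits = used <= COMPUTE_SLICES_PER_GPU
    print(f"devices {entry['devices']}: {used} compute slices, fits: {fits}")
```

Under these assumptions every entry stays within the per-GPU slice budget, so the layout itself does not look like the cause of the validator failure.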

Issue

I am running a cluster with multiple GPU nodes; some of them use MIG, others do not. As long as every node has its MIG config set to all-disabled, everything is fine. As soon as I set one node to a mixed MIG config, the nvidia-device-plugin-validator fails with the message:

spec: failed to generate spec: lstat /run/nvidia/driver/dev/nvidiactl: no such file or directory

Once I switch the MIG config back to all-disabled, the validation succeeds again.

Edit: To clarify further, the validator fails with the wrong driver root value only on the node where I activate MIG. The validator pods on the other nodes are unaffected, even after further node restarts, as long as MIG remains off.
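
To narrow down which driver root is actually valid on the affected node, a small sketch like the following could be run on the host. It checks where the `nvidiactl` device node exists, using `lstat` semantics like the failing call in the error message. The path `/run/nvidia/driver` comes from that error; treating `/` as the other candidate is an assumption based on the install using `driver.enabled=false`, which suggests the driver is preinstalled on the host. The function name is hypothetical.

```python
import os

# "/run/nvidia/driver" is taken from the error message above.
# "/" is an assumed alternative for a driver preinstalled on the host.
CANDIDATE_ROOTS = ("/run/nvidia/driver", "/")


def find_driver_root(candidates=CANDIDATE_ROOTS, device="dev/nvidiactl"):
    """Return the first root under which the device node exists, else None."""
    for root in candidates:
        # lexists() uses lstat, so it does not follow symlinks.
        if os.path.lexists(os.path.join(root, device)):
            return root
    return None


print("driver root with nvidiactl:", find_driver_root())
```

If this reports `/` on the MIG-enabled node while the validator still looks under `/run/nvidia/driver`, that would be consistent with the wrong driver root value described above.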
