nvidia-device-plugin-validator fails after node reboot (with MIG enabled) #403
Description
System
Running on bare-metal
- Ubuntu 20.04.4
- Kubernetes v1.24.3
- Containerd 1.6.7
- GPU-Operator v1.11.1
Setup
GPU-Operator is installed with:
helm install --wait --debug --generate-name --create-namespace \
  nvidia/gpu-operator \
  -n gpu-operator \
  --set migManager.config.name=mig-config \
  --set mig.strategy=mixed \
  --set driver.enabled=false \
  --set toolkit.enabled=false
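For reference, the same settings can be kept in a values file instead of repeating the `--set` flags (a sketch assuming the v1.11.1 chart's value layout; the file name is arbitrary):

```yaml
# values.yaml (hypothetical file name) -- equivalent to the --set flags above
migManager:
  config:
    name: mig-config   # ConfigMap holding the MIG profiles
mig:
  strategy: mixed      # some GPUs MIG-enabled, others not
driver:
  enabled: false       # driver is preinstalled on the host
toolkit:
  enabled: false       # container toolkit is preinstalled on the host
```

This would be passed with `helm install ... -f values.yaml` in place of the flags.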
MIG-Config:
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
      - devices: all
        mig-enabled: false
      node-standard:
      - devices: [0]
        mig-enabled: true
        mig-devices:
          "2g.20gb": 3
      - devices: [1]
        mig-enabled: true
        mig-devices:
          "2g.20gb": 3
      - devices: [2]
        mig-enabled: true
        mig-devices:
          "3g.40gb": 2
      - devices: [3]
        mig-enabled: true
        mig-devices:
          "3g.40gb": 2
      - devices: [4,5,6,7]
        mig-enabled: false
Issue
I am running a cluster with multiple GPU nodes; some of them use MIG, others do not. As long as all nodes have their MIG config set to all-disabled, everything is fine. But as soon as I set one node to a mixed MIG config, the nvidia-device-plugin-validator fails with the message:
spec: failed to generate spec: lstat /run/nvidia/driver/dev/nvidiactl: no such file or directory
Once I switch the MIG config back to all-disabled, validation succeeds again.
Edit: To clarify further: the validator fails with the wrong driver-root path only on the node where I activate MIG. The other validator pods are unaffected, even after further node restarts, as long as MIG remains off.
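As a side note, the profile counts in the node-standard config above do fit within a single GPU's MIG budget, so the failure is not a capacity problem. A minimal sketch that checks this (a hypothetical helper, not part of the GPU Operator; the slice costs are NVIDIA's published values for an A100 80GB):

```python
# Hypothetical checker: does a requested set of MIG profiles fit on one
# A100 80GB GPU? Costs are (compute slices, memory slices) per profile.
PROFILE_COST = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}

MAX_COMPUTE, MAX_MEMORY = 7, 8  # per-GPU slice budget on an A100

def fits(mig_devices: dict) -> bool:
    """Return True if the requested profile counts fit on one GPU."""
    compute = sum(PROFILE_COST[p][0] * n for p, n in mig_devices.items())
    memory = sum(PROFILE_COST[p][1] * n for p, n in mig_devices.items())
    return compute <= MAX_COMPUTE and memory <= MAX_MEMORY

# The per-device requests from the node-standard config above:
print(fits({"2g.20gb": 3}))  # devices 0-1: 6/7 compute, 6/8 memory -> True
print(fits({"3g.40gb": 2}))  # devices 2-3: 6/7 compute, 8/8 memory -> True
```

Both layouts fit, which points at the validator's driver-root assumption (`/run/nvidia/driver` vs. the host-installed driver) rather than the MIG geometry itself.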