Running the GPU Operator on a kind cluster #662
Description
1. Issue or feature description
When following the quickstart I end up with this error from kubectl describe po -n gpu-operator gpu-feature-discovery-6tk4h:
Warning FailedCreatePodSandBox 0s (x5 over 49s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
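This error comes from containerd inside the kind node, not from Docker on the host: the GPU Operator creates a RuntimeClass named nvidia, but with toolkit.enabled=false nothing registers an nvidia runtime handler in the node's containerd config. A minimal check, sketched as a helper (the container name bionic-gpt-cluster-control-plane is an assumption based on kind's default node naming):

```shell
# Sketch: report whether a containerd config, piped on stdin, declares an
# "nvidia" runtime. Intended usage against the kind node (assumed name):
#   docker exec bionic-gpt-cluster-control-plane \
#     cat /etc/containerd/config.toml | has_nvidia_runtime
has_nvidia_runtime() {
  if grep -q 'runtimes\.nvidia'; then
    echo yes
  else
    echo no
  fi
}
```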
2. Steps to reproduce the issue
#!/bin/bash
kind delete cluster --name bionic-gpt-cluster
kind create cluster --name bionic-gpt-cluster --config=kind-config.yaml
kind export kubeconfig --name bionic-gpt-cluster
# kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.3/nvidia-device-plugin.yml
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false
3. Information to attach (optional if deemed irrelevant)
with my kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  # If we don't do this, then we can't connect on linux
  apiServerAddress: "0.0.0.0"
kubeadmConfigPatchesJSON6902:
  - group: kubeadm.k8s.io
    version: v1beta3
    kind: ClusterConfiguration
    patch: |
      - op: add
        path: /apiServer/certSANs/-
        value: host.docker.internal
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
        protocol: TCP
      - containerPort: 443
        hostPort: 443
        protocol: TCP
containerdConfigPatches:
  - |-
    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = "/etc/containerd/certs.d"
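Note that nothing in this config touches containerd inside the node, which is where the "no runtime for nvidia" error originates. Purely as a sketch (not a verified fix): declaring the runtime in the node's containerd would look like the patch below, but it only helps if /usr/bin/nvidia-container-runtime actually exists inside the node image — by default it does not; the operator's toolkit container is what normally installs and registers it, and that is disabled in the helm command above.

```yaml
# Sketch only: register an "nvidia" runtime handler in the kind node's
# containerd. Requires /usr/bin/nvidia-container-runtime to be present
# inside the node image, which the stock kind image does not provide.
containerdConfigPatches:
  - |-
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
      runtime_type = "io.containerd.runc.v2"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
        BinaryName = "/usr/bin/nvidia-container-runtime"
```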
Common error checking:
- [x] The output of nvidia-smi -a on your host and docker run --rm nvidia/cuda:12.3.1-devel-centos7 nvidia-smi:
==========
== CUDA ==
==========
CUDA Version 12.3.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
Sun Jan 21 20:24:49 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 0% 41C P8 8W / 220W | 100MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
- Your docker configuration file (/etc/docker/daemon.json):
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
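For what it's worth, this daemon.json only governs Docker on the host: it makes the kind node container itself run under the nvidia runtime, but pods inside the node are created by the node's containerd, which never reads this file. A small helper to confirm the host-side default (sketch, using Docker's standard --format template syntax):

```shell
# Print the host Docker daemon's default runtime ("nvidia" with the
# daemon.json above). This affects how the kind node container is created,
# not how pods inside the node are created.
host_default_runtime() {
  docker info --format '{{.DefaultRuntime}}'
}
```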
and /etc/containerd/config.toml
disabled_plugins = ["cri"]
version = 1

[plugins]
  [plugins.cri]
    [plugins.cri.containerd]
      default_runtime_name = "nvidia"
      [plugins.cri.containerd.runtimes]
        [plugins.cri.containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins.cri.containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            Runtime = "/usr/bin/nvidia-container-runtime"
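Also worth noting: disabled_plugins = ["cri"] at the top means this host-level containerd never serves Kubernetes at all (that is the stock Docker-bundled containerd setting), so the nvidia runtime entries here have no effect on the kind cluster either; each kind node carries its own /etc/containerd/config.toml. A grep-based sanity check, as a sketch:

```shell
# Sketch: report whether a containerd config file has the CRI plugin
# disabled, in which case its runtime entries are irrelevant to any
# Kubernetes workload.
cri_disabled() {
  if grep -q 'disabled_plugins.*"cri"' "$1"; then
    echo yes
  else
    echo no
  fi
}
```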
- The k8s-device-plugin container logs
I0121 20:28:50.870066 1 main.go:154] Starting FS watcher.
I0121 20:28:50.870195 1 main.go:161] Starting OS watcher.
I0121 20:28:50.870674 1 main.go:176] Starting Plugins.
I0121 20:28:50.870703 1 main.go:234] Loading configuration.
I0121 20:28:50.870918 1 main.go:242] Updating config with default resource matching patterns.
I0121 20:28:50.871290 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0121 20:28:50.871307 1 main.go:256] Retreiving plugins.
W0121 20:28:50.871782 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0121 20:28:50.871846 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0121 20:28:50.871896 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0121 20:28:50.871903 1 factory.go:115] Incompatible platform detected
E0121 20:28:50.871909 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0121 20:28:50.871914 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0121 20:28:50.871920 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0121 20:28:50.871925 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0121 20:28:50.871934 1 main.go:287] No devices found. Waiting indefinitely.
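The NVML errors above are consistent with the sandbox failure: no driver libraries ever made it into the kind node. The /dev/null -> /var/run/nvidia-container-devices/all mount in kind-config.yaml is a request for exactly that injection, but as far as I understand it only works when the host's NVIDIA runtime is told to honor volume-mount device requests; that switch lives in the host's /etc/nvidia-container-runtime/config.toml (sketch below — worth checking whether it is set):

```toml
# /etc/nvidia-container-runtime/config.toml (host) — sketch.
# Allows GPU devices to be requested via the
# /var/run/nvidia-container-devices/<id> volume-mount convention that the
# extraMounts entry in kind-config.yaml relies on.
accept-nvidia-visible-devices-as-volume-mounts = true
```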
- The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
sudo journalctl -r -u kubelet
-- No entries --
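The empty journal is expected: with kind, kubelet runs inside the node container, so the host's journalctl has no kubelet unit. A sketch of fetching the real logs (the node name default is an assumption from kind's naming convention):

```shell
# Read kubelet logs from inside a kind node container instead of the host.
kind_kubelet_logs() {
  node="${1:-bionic-gpt-cluster-control-plane}"
  docker exec "$node" journalctl -r -u kubelet
}
```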
Additional information that might help better understand your environment and reproduce the bug:
- Docker version from docker version
docker version
Client: Docker Engine - Community
 Version:           25.0.0
 API version:       1.44
 Go version:        go1.21.6
 Git commit:        e758fe5
 Built:             Thu Jan 18 17:09:59 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          25.0.0
  API version:      1.44 (minimum version 1.24)
  Go version:       go1.21.6
  Git commit:       615dfdf
  Built:            Thu Jan 18 17:09:59 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.27
  GitCommit:        a1496014c916f9e62104b33d1bb5bd03b0858e59
 nvidia:
  Version:          1.1.11
  GitCommit:        v1.1.11-0-g4bccb38
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
- Docker command, image and tag used
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.3/nvidia-device-plugin.yml
and the helm install below fails as well:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false
- Kernel version from uname -a
uname -a
Linux saruman 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux
- Any relevant kernel output lines from dmesg
none that I see?
sudo dmesg |grep -i nvidia
[ 2.829492] nvidia: loading out-of-tree module taints kernel.
[ 2.829501] nvidia: module license 'NVIDIA' taints kernel.
[ 2.846803] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 2.961803] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[ 2.962598] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 3.011519] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 525.147.05 Wed Oct 25 20:27:35 UTC 2023
[ 3.017901] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input8
[ 3.139762] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 525.147.05 Wed Oct 25 20:21:31 UTC 2023
[ 3.246519] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 3.246521] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[ 3.288796] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input9
[ 3.288989] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input10
[ 3.328821] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input11
[ 4.018783] audit: type=1400 audit(1705866938.070:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=774 comm="apparmor_parser"
[ 4.019493] audit: type=1400 audit(1705866938.070:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=774 comm="apparmor_parser"
[ 1754.666104] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 1754.677753] nvidia-uvm: Loaded the UVM driver, major device number 237.
- NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
dpkg -l |grep -i nvidia
ii firmware-nvidia-gsp 525.147.05-4~deb12u1 amd64 NVIDIA GSP firmware
ii glx-alternative-nvidia 1.2.2 amd64 allows the selection of NVIDIA as GLX provider
ii libcuda1:amd64 525.147.05-4~deb12u1 amd64 NVIDIA CUDA Driver Library
ii libegl-nvidia0:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary EGL library
ii libgl1-nvidia-glvnd-glx:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary OpenGL/GLX library (GLVND variant)
ii libgles-nvidia1:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary OpenGL|ES 1.x library
ii libgles-nvidia2:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary OpenGL|ES 2.x library
ii libglx-nvidia0:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary GLX library
ii libnvcuvid1:amd64 525.147.05-4~deb12u1 amd64 NVIDIA CUDA Video Decoder runtime library
ii libnvidia-allocator1:amd64 525.147.05-4~deb12u1 amd64 NVIDIA allocator runtime library
ii libnvidia-cfg1:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-container-tools 1.14.3-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.14.3-1 amd64 NVIDIA container runtime library
ii libnvidia-egl-gbm1:amd64 1.1.0-2 amd64 GBM EGL external platform library for NVIDIA
ii libnvidia-egl-wayland1:amd64 1:1.1.10-1 amd64 Wayland EGL External Platform library -- shared library
ii libnvidia-eglcore:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary EGL core libraries
ii libnvidia-encode1:amd64 525.147.05-4~deb12u1 amd64 NVENC Video Encoding runtime library
ii libnvidia-glcore:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary OpenGL/GLX core libraries
ii libnvidia-glvkspirv:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary Vulkan Spir-V compiler library
ii libnvidia-ml1:amd64 525.147.05-4~deb12u1 amd64 NVIDIA Management Library (NVML) runtime library
ii libnvidia-ptxjitcompiler1:amd64 525.147.05-4~deb12u1 amd64 NVIDIA PTX JIT Compiler library
ii libnvidia-rtcore:amd64 525.147.05-4~deb12u1 amd64 NVIDIA binary Vulkan ray tracing (rtcore) library
ii nvidia-alternative 525.147.05-4~deb12u1 amd64 allows the selection of NVIDIA as GLX provider
ii nvidia-container-toolkit 1.14.3-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.14.3-1 amd64 NVIDIA Container Toolkit Base
ii nvidia-driver 525.147.05-4~deb12u1 amd64 NVIDIA metapackage
ii nvidia-driver-bin 525.147.05-4~deb12u1 amd64 NVIDIA driver support binaries
ii nvidia-driver-libs:amd64 525.147.05-4~deb12u1 amd64 NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries)
ii nvidia-egl-common 525.147.05-4~deb12u1 amd64 NVIDIA binary EGL driver - common files
ii nvidia-egl-icd:amd64 525.147.05-4~deb12u1 amd64 NVIDIA EGL installable client driver (ICD)
ii nvidia-installer-cleanup 20220217+3~deb12u1 amd64 cleanup after driver installation with the nvidia-installer
ii nvidia-kernel-common 20220217+3~deb12u1 amd64 NVIDIA binary kernel module support files
ii nvidia-kernel-dkms 525.147.05-4~deb12u1 amd64 NVIDIA binary kernel module DKMS source
ii nvidia-kernel-support 525.147.05-4~deb12u1 amd64 NVIDIA binary kernel module support files
ii nvidia-legacy-check 525.147.05-4~deb12u1 amd64 check for NVIDIA GPUs requiring a legacy driver
ii nvidia-modprobe 535.54.03-1~deb12u1 amd64 utility to load NVIDIA kernel modules and create device nodes
ii nvidia-persistenced 525.85.05-1 amd64 daemon to maintain persistent software state in the NVIDIA driver
ii nvidia-settings 525.125.06-1~deb12u1 amd64 tool for configuring the NVIDIA graphics driver
ii nvidia-smi 525.147.05-4~deb12u1 amd64 NVIDIA System Management Interface
ii nvidia-support 20220217+3~deb12u1 amd64 NVIDIA binary graphics driver support files
ii nvidia-vdpau-driver:amd64 525.147.05-4~deb12u1 amd64 Video Decode and Presentation API for Unix - NVIDIA driver
ii nvidia-vulkan-common 525.147.05-4~deb12u1 amd64 NVIDIA Vulkan driver - common files
ii nvidia-vulkan-icd:amd64 525.147.05-4~deb12u1 amd64 NVIDIA Vulkan installable client driver (ICD)
ii xserver-xorg-video-nvidia 525.147.05-4~deb12u1 amd64 NVIDIA binary Xorg driver
- NVIDIA container library version from nvidia-container-cli -V
nvidia-container-cli -V
cli-version: 1.14.3
lib-version: 1.14.3
build date: 2023-10-19T11:32+00:00
build revision: 1eb5a30a6ad0415550a9df632ac8832bf7e2bbba
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
- NVIDIA container library logs (see troubleshooting)
The troubleshooting page linked above no longer exists.
sudo journalctl -u nvidia-container-toolkit
-- No entries --
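For completeness, once the runtime issue is sorted out, a quick way to confirm the device plugin actually registered the GPU (sketch; assumes kubectl points at the kind cluster):

```shell
# Print the nvidia.com/gpu allocatable count reported by a node
# (prints nothing while the device plugin sees no devices).
gpu_allocatable() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}
```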