
Running the GPU Operator on a kind cluster #662

@joshuacox

Description

1. Issue or feature description

When following the quickstart, I end up with this error from kubectl describe po -n gpu-operator gpu-feature-discovery-6tk4h:

Warning FailedCreatePodSandBox 0s (x5 over 49s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
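The error refers to the containerd running *inside* the kind node container, not the host's docker or containerd. A minimal sketch of the diagnosis (the temp file below is a hypothetical stand-in for the node's /etc/containerd/config.toml; on a real cluster the grep would run via docker exec against the node named &lt;cluster-name&gt;-control-plane):

```shell
# Stand-in for the kind node's containerd config: kind's stock node image
# ships no "nvidia" runtime entry, which is what the sandbox error means.
# Real check: docker exec bionic-gpt-cluster-control-plane \
#               grep -A2 runtimes.nvidia /etc/containerd/config.toml
cat > /tmp/node-containerd.toml <<'EOF'
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "runc"
EOF

# Same lookup containerd performs when the pod sandbox requests runtime "nvidia"
grep -q 'runtimes.nvidia' /tmp/node-containerd.toml \
  && echo "nvidia runtime configured" \
  || echo "no runtime for nvidia"   # mirrors the kubelet error above
```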

2. Steps to reproduce the issue

#!/bin/bash
kind delete cluster --name bionic-gpt-cluster
kind create cluster --name bionic-gpt-cluster --config=kind-config.yaml
kind export kubeconfig --name bionic-gpt-cluster
# kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.3/nvidia-device-plugin.yml
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
      nvidia/gpu-operator \
      --set driver.enabled=false \
      --set toolkit.enabled=false

3. Information to attach (optional if deemed irrelevant)

with my kind-config.yaml:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  # If we don't do this, then we can't connect on linux
  apiServerAddress: "0.0.0.0"
kubeadmConfigPatchesJSON6902:
- group: kubeadm.k8s.io
  version: v1beta3
  kind: ClusterConfiguration
  patch: |
    - op: add
      path: /apiServer/certSANs/-
      value: host.docker.internal
nodes:
- role: control-plane
  extraMounts:
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
containerdConfigPatches:
- |-
  [plugins."io.containerd.grpc.v1.cri".registry]
    config_path = "/etc/containerd/certs.d"
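The containerdConfigPatches section above only sets a registry config path; it never registers an "nvidia" runtime inside the node. A sketch of the kind of patch that would (an assumption on my part, not a verified fix: the TOML paths follow the containerd v2 CRI schema, and this only helps if /usr/bin/nvidia-container-runtime actually exists inside the node image, which stock kind images do not ship):

```shell
# Write the candidate patch to a temp file so it can be inspected before
# merging it into kind-config.yaml's containerdConfigPatches list.
cat > /tmp/kind-nvidia-patch.yaml <<'EOF'
containerdConfigPatches:
- |-
  [plugins."io.containerd.grpc.v1.cri".containerd]
    default_runtime_name = "nvidia"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
EOF

# Sanity-check the fragment mentions the runtime we need
grep -c 'nvidia' /tmp/kind-nvidia-patch.yaml
```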

Common error checking:

  • [x] The output of nvidia-smi -a on your host

and of docker run --rm nvidia/cuda:12.3.1-devel-centos7 nvidia-smi:

==========
== CUDA ==
==========

CUDA Version 12.3.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Sun Jan 21 20:24:49 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
|  0%   41C    P8     8W / 220W |    100MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
  • Your docker configuration file (e.g: /etc/docker/daemon.json)

and /etc/docker/daemon.json

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

and /etc/containerd/config.toml

disabled_plugins = ["cri"]
version = 1

[plugins]

  [plugins.cri]

    [plugins.cri.containerd]
      default_runtime_name = "nvidia"

      [plugins.cri.containerd.runtimes]

        [plugins.cri.containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins.cri.containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            Runtime = "/usr/bin/nvidia-container-runtime"
  • The k8s-device-plugin container logs
I0121 20:28:50.870066       1 main.go:154] Starting FS watcher.
I0121 20:28:50.870195       1 main.go:161] Starting OS watcher.
I0121 20:28:50.870674       1 main.go:176] Starting Plugins.
I0121 20:28:50.870703       1 main.go:234] Loading configuration.
I0121 20:28:50.870918       1 main.go:242] Updating config with default resource matching patterns.
I0121 20:28:50.871290       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0121 20:28:50.871307       1 main.go:256] Retreiving plugins.
W0121 20:28:50.871782       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0121 20:28:50.871846       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0121 20:28:50.871896       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0121 20:28:50.871903       1 factory.go:115] Incompatible platform detected
E0121 20:28:50.871909       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0121 20:28:50.871914       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0121 20:28:50.871920       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0121 20:28:50.871925       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0121 20:28:50.871934       1 main.go:287] No devices found. Waiting indefinitely.
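The log lines above show the plugin probing for NVML, then for Tegra, then giving up. A rough sketch of that detection sequence (paths taken from the log messages themselves, not from the plugin source; the library path checked first is an assumed common location):

```shell
# Approximation of the platform detection seen in the factory.go log lines:
# 1) can libnvidia-ml.so.1 be found?  2) is this a Tegra SoC?  3) otherwise
# "incompatible platform" -- which is what happens inside a kind node that
# has no NVIDIA driver libraries mounted into it.
check_platform() {
  if [ -e /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 ] \
     || ldconfig -p 2>/dev/null | grep -q libnvidia-ml.so.1; then
    echo nvml
  elif [ -e /sys/devices/soc0/family ]; then
    echo tegra
  else
    echo incompatible
  fi
}
check_platform
```

Inside the kind node this prints "incompatible", matching the "Incompatible platform detected" error, because the driver libraries live on the host and nothing has injected them into the node container.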

  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
sudo journalctl -r -u kubelet
-- No entries --

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
docker version
Client: Docker Engine - Community
 Version:           25.0.0
 API version:       1.44
 Go version:        go1.21.6
 Git commit:        e758fe5
 Built:             Thu Jan 18 17:09:59 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          25.0.0
  API version:      1.44 (minimum version 1.24)
  Go version:       go1.21.6
  Git commit:       615dfdf
  Built:            Thu Jan 18 17:09:59 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.27
  GitCommit:        a1496014c916f9e62104b33d1bb5bd03b0858e59
 nvidia:
  Version:          1.1.11
  GitCommit:        v1.1.11-0-g4bccb38
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
  • Docker command, image and tag used
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.3/nvidia-device-plugin.yml

and the helm install below fails as well:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
      nvidia/gpu-operator \
      --set driver.enabled=false \
      --set toolkit.enabled=false
  • Kernel version from uname -a

uname -a
Linux saruman 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux

  • Any relevant kernel output lines from dmesg

None that I can see:

sudo dmesg |grep -i nvidia
[    2.829492] nvidia: loading out-of-tree module taints kernel.
[    2.829501] nvidia: module license 'NVIDIA' taints kernel.
[    2.846803] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    2.961803] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[    2.962598] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    3.011519] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.147.05  Wed Oct 25 20:27:35 UTC 2023
[    3.017901] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input8
[    3.139762] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.147.05  Wed Oct 25 20:21:31 UTC 2023
[    3.246519] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    3.246521] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[    3.288796] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input9
[    3.288989] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input10
[    3.328821] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input11
[    4.018783] audit: type=1400 audit(1705866938.070:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=774 comm="apparmor_parser"
[    4.019493] audit: type=1400 audit(1705866938.070:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=774 comm="apparmor_parser"
[ 1754.666104] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 1754.677753] nvidia-uvm: Loaded the UVM driver, major device number 237.
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
dpkg -l |grep -i nvidia 
ii  firmware-nvidia-gsp                     525.147.05-4~deb12u1                    amd64        NVIDIA GSP firmware
ii  glx-alternative-nvidia                  1.2.2                                   amd64        allows the selection of NVIDIA as GLX provider
ii  libcuda1:amd64                          525.147.05-4~deb12u1                    amd64        NVIDIA CUDA Driver Library
ii  libegl-nvidia0:amd64                    525.147.05-4~deb12u1                    amd64        NVIDIA binary EGL library
ii  libgl1-nvidia-glvnd-glx:amd64           525.147.05-4~deb12u1                    amd64        NVIDIA binary OpenGL/GLX library (GLVND variant)
ii  libgles-nvidia1:amd64                   525.147.05-4~deb12u1                    amd64        NVIDIA binary OpenGL|ES 1.x library
ii  libgles-nvidia2:amd64                   525.147.05-4~deb12u1                    amd64        NVIDIA binary OpenGL|ES 2.x library
ii  libglx-nvidia0:amd64                    525.147.05-4~deb12u1                    amd64        NVIDIA binary GLX library
ii  libnvcuvid1:amd64                       525.147.05-4~deb12u1                    amd64        NVIDIA CUDA Video Decoder runtime library
ii  libnvidia-allocator1:amd64              525.147.05-4~deb12u1                    amd64        NVIDIA allocator runtime library
ii  libnvidia-cfg1:amd64                    525.147.05-4~deb12u1                    amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-container-tools               1.14.3-1                                amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64              1.14.3-1                                amd64        NVIDIA container runtime library
ii  libnvidia-egl-gbm1:amd64                1.1.0-2                                 amd64        GBM EGL external platform library for NVIDIA
ii  libnvidia-egl-wayland1:amd64            1:1.1.10-1                              amd64        Wayland EGL External Platform library -- shared library
ii  libnvidia-eglcore:amd64                 525.147.05-4~deb12u1                    amd64        NVIDIA binary EGL core libraries
ii  libnvidia-encode1:amd64                 525.147.05-4~deb12u1                    amd64        NVENC Video Encoding runtime library
ii  libnvidia-glcore:amd64                  525.147.05-4~deb12u1                    amd64        NVIDIA binary OpenGL/GLX core libraries
ii  libnvidia-glvkspirv:amd64               525.147.05-4~deb12u1                    amd64        NVIDIA binary Vulkan Spir-V compiler library
ii  libnvidia-ml1:amd64                     525.147.05-4~deb12u1                    amd64        NVIDIA Management Library (NVML) runtime library
ii  libnvidia-ptxjitcompiler1:amd64         525.147.05-4~deb12u1                    amd64        NVIDIA PTX JIT Compiler library
ii  libnvidia-rtcore:amd64                  525.147.05-4~deb12u1                    amd64        NVIDIA binary Vulkan ray tracing (rtcore) library
ii  nvidia-alternative                      525.147.05-4~deb12u1                    amd64        allows the selection of NVIDIA as GLX provider
ii  nvidia-container-toolkit                1.14.3-1                                amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base           1.14.3-1                                amd64        NVIDIA Container Toolkit Base
ii  nvidia-driver                           525.147.05-4~deb12u1                    amd64        NVIDIA metapackage
ii  nvidia-driver-bin                       525.147.05-4~deb12u1                    amd64        NVIDIA driver support binaries
ii  nvidia-driver-libs:amd64                525.147.05-4~deb12u1                    amd64        NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries)
ii  nvidia-egl-common                       525.147.05-4~deb12u1                    amd64        NVIDIA binary EGL driver - common files
ii  nvidia-egl-icd:amd64                    525.147.05-4~deb12u1                    amd64        NVIDIA EGL installable client driver (ICD)
ii  nvidia-installer-cleanup                20220217+3~deb12u1                      amd64        cleanup after driver installation with the nvidia-installer
ii  nvidia-kernel-common                    20220217+3~deb12u1                      amd64        NVIDIA binary kernel module support files
ii  nvidia-kernel-dkms                      525.147.05-4~deb12u1                    amd64        NVIDIA binary kernel module DKMS source
ii  nvidia-kernel-support                   525.147.05-4~deb12u1                    amd64        NVIDIA binary kernel module support files
ii  nvidia-legacy-check                     525.147.05-4~deb12u1                    amd64        check for NVIDIA GPUs requiring a legacy driver
ii  nvidia-modprobe                         535.54.03-1~deb12u1                     amd64        utility to load NVIDIA kernel modules and create device nodes
ii  nvidia-persistenced                     525.85.05-1                             amd64        daemon to maintain persistent software state in the NVIDIA driver
ii  nvidia-settings                         525.125.06-1~deb12u1                    amd64        tool for configuring the NVIDIA graphics driver
ii  nvidia-smi                              525.147.05-4~deb12u1                    amd64        NVIDIA System Management Interface
ii  nvidia-support                          20220217+3~deb12u1                      amd64        NVIDIA binary graphics driver support files
ii  nvidia-vdpau-driver:amd64               525.147.05-4~deb12u1                    amd64        Video Decode and Presentation API for Unix - NVIDIA driver
ii  nvidia-vulkan-common                    525.147.05-4~deb12u1                    amd64        NVIDIA Vulkan driver - common files
ii  nvidia-vulkan-icd:amd64                 525.147.05-4~deb12u1                    amd64        NVIDIA Vulkan installable client driver (ICD)
ii  xserver-xorg-video-nvidia               525.147.05-4~deb12u1                    amd64        NVIDIA binary Xorg driver
  • NVIDIA container library version from nvidia-container-cli -V

nvidia-container-cli -V
cli-version: 1.14.3
lib-version: 1.14.3
build date: 2023-10-19T11:32+00:00
build revision: 1eb5a30a6ad0415550a9df632ac8832bf7e2bbba
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

the above page no longer exists.

sudo journalctl -u nvidia-container-toolkit
-- No entries --

Labels: bug, needs-triage