🐻 GPU Resource Configuration

# Configuring GPU Resources for Scheduling in a k8s Cluster

## Install nvidia-docker2

Set up the package repository and GPG key:

```shell
# Detect the distribution (e.g. ubuntu20.04), import NVIDIA's GPG key,
# and register the libnvidia-container apt repository signed by that key.
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```

Update the package list, then install nvidia-docker2:

```shell
sudo apt-get update
sudo apt-get install -y nvidia-docker2
```

Configure NVIDIA as the Docker runtime in `/etc/docker/daemon.json`:

```json
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia",
    "exec-opts": ["native.cgroupdriver=systemd"]
}
```

Here, `runtimes` defines the available runtimes, in this case one named `nvidia`, and `default-runtime` makes that `nvidia` runtime the default.

The last entry, `"exec-opts": ["native.cgroupdriver=systemd"]`, sets Docker's cgroup driver to systemd. Docker defaults to the cgroupfs cgroup driver, while the kubelet is typically configured to use systemd. If the two differ, containers fail to start, so both sides must use the same driver. Restart Docker after editing the file so the change takes effect.

## Install the GPU driver

Check the GPU information:

```shell
sudo ubuntu-drivers devices
```

If the GPU and the recommended driver are not listed, install ubuntu-drivers-common first:

```shell
sudo apt update
sudo apt-get install ubuntu-drivers-common
```

Then check the GPU information again. The output looks like this:

```shell
== /sys/devices/pci021:21/1221:12:01.1/0000:02:00.0 ==
modalias : pci:v123123123123123123123123123123123123123123123123
vendor   : NVIDIA Corporation
model    : GP102 [GeForce GTX 1080 Ti]
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-510 - distro non-free
driver   : nvidia-driver-510-server - distro non-free
driver   : nvidia-driver-470 - distro non-free
driver   : nvidia-driver-515 - distro non-free recommended
driver   : nvidia-driver-418-server - distro non-free
driver   : nvidia-driver-515-server - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-390 - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
```

The recommended driver is nvidia-driver-515, so install it:
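```shell
sudo apt-get install nvidia-driver-515
```

The installation takes a few minutes. When it finishes, reboot the system.

Run `nvidia-smi` to confirm that the driver was installed successfully.

If no GPU information is shown and the command instead fails with `NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running`, first consider whether the system's kernel version has changed since the driver was installed. In that case, rebuild the kernel module with DKMS.

First, look up the version of the NVIDIA driver sources installed on the machine. They appear under `/usr/src` as `nvidia-<version>`:

```shell
ls /usr/src | grep nvidia
```

Then build and install the module:

```shell
sudo apt-get install dkms
# Replace <version> with the version number found in the previous step.
sudo dkms install -m nvidia -v <version>
```

The value after `-v` is the full version number found in the previous step.

## Install the k8s GPU device plugin

Download the corresponding yml file:

```shell
wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.0/nvidia-device-plugin.yml
```

Install it with kubectl, then list the pods in `kube-system`:

```shell
kubectl create -f nvidia-device-plugin.yml
kubectl get pods -n kube-system
```

After a few minutes the installation completes. Check the plugin's log to verify it. The pod name suffix (`pmdcs` here) differs per cluster, so use the name shown by `kubectl get pods`:

```shell
kubectl logs nvidia-device-plugin-daemonset-pmdcs -n kube-system
```

>i I did this part in WSL2 with `minikube` installed (see `https://minikube.sigs.k8s.io/docs/start/`). `minikube` is a tool that runs a local Kubernetes cluster for testing.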
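
Once the device plugin is running, one way to confirm that GPUs are actually schedulable is to run a throwaway Pod that requests one. The following is a minimal sketch, not part of the original setup: the Pod name `gpu-test`, the file name `gpu-test-pod.yaml`, and the image tag `nvidia/cuda:11.6.2-base-ubuntu20.04` are assumptions chosen for illustration. `nvidia.com/gpu` is the resource name that the NVIDIA device plugin advertises on each GPU node.

```shell
# Write a test Pod manifest that requests one GPU and runs nvidia-smi.
# The Pod name, file name, and image tag are illustrative assumptions.
cat > gpu-test-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.6.2-base-ubuntu20.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# The cluster steps only run where kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  # Nodes with a working plugin list nvidia.com/gpu under Capacity/Allocatable.
  kubectl describe nodes | grep 'nvidia.com/gpu'
  # Submit the Pod and show its status.
  kubectl apply -f gpu-test-pod.yaml
  kubectl get pod gpu-test
fi

echo "manifest written to gpu-test-pod.yaml"
```

When the Pod reaches `Completed`, `kubectl logs gpu-test` should print the same table that `nvidia-smi` shows on the host. If the Pod stays `Pending`, the node is most likely not advertising `nvidia.com/gpu`, and the device plugin log is the place to look. Remove the Pod afterwards with `kubectl delete pod gpu-test`.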