🐻 GPU Resource Configuration
# Configuring GPU Resources for Scheduling in a k8s Cluster
## Install nvidia-docker2
Set up the package repository and GPG key:
```shell
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```
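The `distribution` variable in the command above is built from the `ID` and `VERSION_ID` fields of `/etc/os-release` (for example `ubuntu20.04`). As a minimal sketch, the same value can be printed on its own to confirm which repository list will be requested; `distribution_id` is a hypothetical helper name used only for illustration.

```shell
# distribution_id is a hypothetical helper, not part of any NVIDIA tooling.
# It prints ID and VERSION_ID from an os-release style file, joined together.
distribution_id() {
    # Source the file in a subshell so its variables do not leak into the caller
    ( . "${1:-/etc/os-release}" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}
```

Calling `distribution_id` with no argument reads `/etc/os-release`.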
Update the package list, then install nvidia-docker2:
```shell
sudo apt-get update
sudo apt-get install -y nvidia-docker2
```
Configure Docker to use the NVIDIA runtime by editing `/etc/docker/daemon.json`:
```json
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia",
  "exec-opts": ["native.cgroupdriver=systemd"]
}
```
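Here, `runtimes` defines the available runtimes; this configuration defines one named `nvidia`. `default-runtime` makes the newly defined `nvidia` runtime the default.
The last entry, `"exec-opts": ["native.cgroupdriver=systemd"]`, sets Docker's cgroup driver to systemd. Docker and the kubelet must use the same cgroup driver: if one uses cgroupfs and the other systemd, containers fail to start. Setting Docker to systemd matches a kubelet that is configured for systemd. Restart Docker (`sudo systemctl restart docker`) for the change to take effect.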
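Before restarting Docker, it can help to confirm that the file really contains both settings. The following is a minimal sketch, assuming the file lives at `/etc/docker/daemon.json`; `check_daemon_config` is a hypothetical helper, and it only does a text match rather than full JSON parsing.

```shell
# check_daemon_config is a hypothetical helper, not part of Docker.
# It checks that a daemon.json file selects the nvidia runtime by default
# and sets the systemd cgroup driver. This is a text match, not a JSON parser.
check_daemon_config() {
    local file="${1:-/etc/docker/daemon.json}"
    [ -r "$file" ] || { echo "cannot read $file"; return 2; }
    # Strip whitespace so the patterns match regardless of indentation
    local compact
    compact=$(tr -d ' \t\n' < "$file")
    case "$compact" in
        *'"default-runtime":"nvidia"'*) ;;
        *) echo "default-runtime is not nvidia"; return 1 ;;
    esac
    case "$compact" in
        *'native.cgroupdriver=systemd'*) ;;
        *) echo "cgroup driver is not systemd"; return 1 ;;
    esac
    echo "ok"
}
```

Once Docker has been restarted, `docker info` should list `nvidia` as the default runtime and `systemd` as the cgroup driver.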
## Install the GPU driver
List the GPU and the drivers available for it:
```shell
sudo ubuntu-drivers devices
```
If the GPU and the recommended driver are not shown, install `ubuntu-drivers-common` first:
```shell
sudo apt update
sudo apt-get install ubuntu-drivers-common
```
Then list the GPU information again. The output looks like this:
```shell
== /sys/devices/pci021:21/1221:12:01.1/0000:02:00.0 ==
modalias : pci:v123123123123123123123123123123123123123123123123
vendor : NVIDIA Corporation
model : GP102 [GeForce GTX 1080 Ti]
driver : nvidia-driver-450-server - distro non-free
driver : nvidia-driver-510 - distro non-free
driver : nvidia-driver-510-server - distro non-free
driver : nvidia-driver-470 - distro non-free
driver : nvidia-driver-515 - distro non-free recommended
driver : nvidia-driver-418-server - distro non-free
driver : nvidia-driver-515-server - distro non-free
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-390 - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin
```
The recommended driver is `nvidia-driver-515`, so install that one:
```shell
sudo apt-get install nvidia-driver-515
```
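Rather than reading the recommended package off the listing by eye, it can be picked out of the `ubuntu-drivers devices` output with a small filter. This is a sketch that assumes the output format shown above; `recommended_driver` is a hypothetical helper name.

```shell
# recommended_driver is a hypothetical helper: it reads `ubuntu-drivers devices`
# output on stdin and prints the first package marked "recommended".
recommended_driver() {
    # Driver lines look like: "driver : <package> - distro non-free recommended"
    awk '$1 == "driver" && /recommended/ { print $3; exit }'
}
```

For example, `ubuntu-drivers devices | recommended_driver` would print `nvidia-driver-515` for the listing above, and the result can be passed to `sudo apt-get install`.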
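The installation takes a few minutes. Once it finishes, reboot the system.
Run `nvidia-smi` to confirm that the driver was installed successfully.
If `nvidia-smi` does not show the GPU and instead reports `NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running`, first consider whether the kernel version has changed (for example after a system update), leaving the NVIDIA kernel module unbuilt for the running kernel. In that case, rebuild the kernel module with DKMS.
First look up the version of the installed NVIDIA driver sources (the directory is named `nvidia-<version>`):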
```shell
ls /usr/src | grep nvidia
```
Then build and install the module:
```shell
sudo apt-get install dkms
# Replace <driver-version> with the version found in the previous step
sudo dkms install -m nvidia -v <driver-version>
```
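The value after `-v` is the full driver version found in the previous step, that is, the part of the directory name after `nvidia-`. It is not the kernel version.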
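The version to pass to `-v` can be extracted from the `/usr/src` listing instead of being typed by hand. The sketch below assumes the driver sources are in a directory named `nvidia-<version>`; `nvidia_dkms_version` is a hypothetical helper name.

```shell
# nvidia_dkms_version is a hypothetical helper: it reads directory names
# (one per line, e.g. from `ls /usr/src`) and prints the version suffix of
# the last entry named nvidia-<version>.
nvidia_dkms_version() {
    sed -n 's/^nvidia-\([0-9][0-9.]*\)$/\1/p' | tail -n 1
}
```

With that in place, `sudo dkms install -m nvidia -v "$(ls /usr/src | nvidia_dkms_version)"` would rebuild the module for the installed driver sources.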
## Install the NVIDIA device plugin for k8s
Download the corresponding yml file:
```shell
wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.0/nvidia-device-plugin.yml
```
Install it with `kubectl` and check that the plugin pod appears:
```shell
kubectl create -f nvidia-device-plugin.yml
kubectl get pods -n kube-system
```
After a few minutes the installation completes. Check the plugin's log to verify it:
```shell
# The pod name suffix differs per cluster; use the name shown by `kubectl get pods`
kubectl logs nvidia-device-plugin-daemonset-pmdcs -n kube-system
```
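Once the plugin is running, a pod can request a GPU through the `nvidia.com/gpu` resource limit. The sketch below prints a minimal test pod manifest. `gpu_test_pod` and the pod name `gpu-test` are hypothetical, and the default image is the CUDA vectorAdd sample referenced in the NVIDIA device plugin README, which is an assumption here; substitute any CUDA image you have access to.

```shell
# gpu_test_pod is a hypothetical helper that prints a pod manifest requesting
# one GPU. The default image is an assumption (the CUDA vectorAdd sample from
# the NVIDIA device plugin README); pass another image as the first argument.
gpu_test_pod() {
    local image="${1:-nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2}"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-vectoradd
      image: ${image}
      resources:
        limits:
          nvidia.com/gpu: 1   # request one GPU from the device plugin
EOF
}
```

The manifest can be applied with `gpu_test_pod | kubectl apply -f -` and inspected with `kubectl logs gpu-test`. Running `kubectl describe nodes | grep nvidia.com/gpu` shows whether the node advertises any GPUs at all.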
>i I did this part in WSL2, where I installed `minikube` (installation guide: `https://minikube.sigs.k8s.io/docs/start/`). `minikube` is a tool that runs a local Kubernetes cluster.