# Startup errors -- GPU

# Unknown runtime specified nvidia.

docker: Error response from daemon: Unknown runtime specified nvidia.
See 'docker run --help'.

Edit (or create) /etc/docker/daemon.json (requires administrator privileges) and add the following content:

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
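
Before restarting Docker, it may help to confirm that the binary referenced by "path" actually exists on the host:

$ which nvidia-container-runtime
$ ls -l /usr/bin/nvidia-container-runtime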

Restart Docker:

$ sudo systemctl daemon-reload
$ sudo systemctl restart docker
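
After the restart, Docker should list nvidia among its registered runtimes; you can check with:

$ docker info | grep -i runtime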

# Installing nvidia-container-runtime (offline)

1. Download

libnvidia-container (package download page)

The NVIDIA container toolkit packages are currently served from the https://nvidia.github.io/libnvidia-container repository. The individual package files can be browsed on the gh-pages branch of the GitHub libnvidia-container repository (for offline download it is recommended to pick stable releases from the stable folder).

Pick the entry that matches your system, look at the contents of its repo file, then follow that path back under stable to locate and download the RPMs you need.
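
As a sketch (assuming a CentOS 7 x86_64 host; the path below is only an example and the repository layout may have changed, so check what actually exists on the gh-pages branch):

curl -sL https://nvidia.github.io/libnvidia-container/centos7/libnvidia-container.repo

The baseurl inside that repo file points at the directory under stable from which the individual RPMs can then be downloaded with wget or a browser.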

The core libnvidia-container packages are:

libnvidia-container-tools
libnvidia-container1
nvidia-container-runtime
nvidia-container-toolkit
nvidia-container-toolkit-base
nvidia-docker2

Download them to the server and install manually:

rpm -Uvh *.rpm --nodeps --force

On Ubuntu, run instead:

dpkg -i --force-overwrite *.deb
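
After installation you can confirm that the packages are present (rpm -qa on RHEL-family systems, dpkg -l on Ubuntu):

rpm -qa | grep -i nvidia-container
dpkg -l | grep -i nvidia-container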

2. Modify the configuration file

vi /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
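
If daemon.json already contains other settings, merge these keys into the existing file rather than overwriting it. A quick syntax check before restarting can save a failed Docker start (assuming python3 is available):

python3 -m json.tool /etc/docker/daemon.json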

3. Verify

Check with:

nvidia-container-runtime -v
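
You can also check that the container CLI can see the driver and the GPUs:

nvidia-container-cli info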

Restart Docker for the changes to take effect:

systemctl restart docker
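
Because the configuration above sets "default-runtime": "nvidia", you can confirm it was picked up; the output should now report nvidia:

docker info | grep -i 'default runtime'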

Run a container to check:

docker run --rm --runtime=nvidia nvidia/cuda nvidia-smi
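
Note that the nvidia/cuda image no longer publishes a latest tag, so the untagged command above may fail to pull on newer setups; in that case use an explicit tag, for example:

docker run --rm --runtime=nvidia nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi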

Alternatively, after entering a container, run nvidia-smi to check that the GPUs are visible.

# nvidia-container-cli: initialization error: nvml error: insufficient permissions: unknown

Problem: when starting a container with NVIDIA Docker, the GPUs fail to mount.

Error message:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: insufficient permissions: unknown.

This looks like a permissions problem.

Reference: https://github.com/NVIDIA/nvidia-docker/issues/1547

Solution:

Open the '/etc/nvidia-container-runtime/config.toml' file, uncomment the user = "root:video" line, and change it to user = "root:root".
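
A minimal non-interactive way to make the same edit (a sketch, assuming the line is present exactly as #user = "root:video"):

sudo sed -i -E 's|^#?user = "root:video"|user = "root:root"|' /etc/nvidia-container-runtime/config.toml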

This is most likely a user-group issue.

Check with the following command:

ll /dev/nvidia*

The output looks like this:

crw-rw---- 1 root vglusers     195,   0 Aug 17 09:55 /dev/nvidia0
crw-rw---- 1 root vglusers     195,   1 Aug 17 09:55 /dev/nvidia1
crw-rw---- 1 root vglusers     195,   2 Aug 17 09:55 /dev/nvidia2
crw-rw---- 1 root vglusers     195,   3 Aug 17 09:55 /dev/nvidia3
crw-rw---- 1 root vglusers     195, 255 Aug 17 09:55 /dev/nvidiactl
crw-rw---- 1 root vglusers     195, 254 Jan 11 11:58 /dev/nvidia-modeset
crw-rw-rw- 1 root root         235,   0 Aug 17 09:55 /dev/nvidia-uvm
crw-rw-rw- 1 root root         235,   1 Aug 17 09:55 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0
drwxr-xr-x  2 root huangqinlong     80 Aug 17 12:15 ./
drwxr-xr-x 22 root root           4420 Jan 11 11:58 ../
cr--------  1 root root         238, 1 Aug 17 12:15 nvidia-cap1
cr--r--r--  1 root root         238, 2 Aug 17 12:15 nvidia-cap2

So the container runtime must be configured with the same user/group in order to get access to the GPU devices.
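
If you would rather not run the hook as root:root, the same reasoning suggests matching the group that actually owns the devices (vglusers in the listing above). A sketch, with the group name taken from that output:

getent group vglusers
sudo sed -i -E 's|^#?user = .*|user = "root:vglusers"|' /etc/nvidia-container-runtime/config.toml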