Q: NVIDIA-SMI cannot communicate with the NVIDIA driver

Server Fault user

Asked 2018-12-04 15:36:40

2 answers, 13.7K views, 0 followers, 3 votes

Problem description

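I am trying to set up a CentOS 7 GPU instance (NVIDIA Tesla K80) on Google Cloud to run CUDA workloads.

Unfortunately, I cannot seem to install and configure the drivers properly.

Here is what happens when I try to interact with nvidia-smi (the NVIDIA System Management Interface):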

# nvidia-smi -pm 1
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

The same operation fails with the more recent approach, nvidia-persistenced:

# nvidia-persistenced
nvidia-persistenced failed to initialize. Check syslog for more details.

I get the following error in syslog (read with the journalctl command):

Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.

Indeed, no NVIDIA device files exist:

# ll /dev/nvidia*
ls: cannot access /dev/nvidia*: No such file or directory

However, here is proof that the GPU is correctly attached to the instance:

# lshw -numeric -C display
  *-display UNCLAIMED       
       description: 3D controller
       product: GK210GL [Tesla K80] [10DE:102D]
       vendor: NVIDIA Corporation [10DE]
       physical id: 4
       bus info: pci@0000:00:04.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: msi pm cap_list
       configuration: latency=0
       resources: iomemory:40-3f iomemory:80-7f memory:fc000000-fcffffff memory:400000000-7ffffffff memory:800000000-801ffffff ioport:c000(size=128)
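The UNCLAIMED marker in the lshw output above means that no kernel driver is bound to the device. As a rough way to see which driver modules are loaded, here is a minimal sketch that scans a /proc/modules-style listing for the nvidia and nouveau modules. The function names module_loaded and driver_summary are made up for this illustration and are not part of any NVIDIA tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, written for illustration only.

# Report whether a kernel module appears in a /proc/modules-style listing.
# Usage: module_loaded <module> [listing-file]   (defaults to /proc/modules)
module_loaded() {
    local module="$1" listing="${2:-/proc/modules}"
    # Each line of /proc/modules starts with the module name followed by a space,
    # so "nvidia " does not match "nvidia_uvm ".
    grep -q "^${module} " "$listing"
}

# Print the state of the two drivers that can claim an NVIDIA GPU.
# Usage: driver_summary [listing-file]
driver_summary() {
    local listing="${1:-/proc/modules}" m
    for m in nvidia nouveau; do
        if module_loaded "$m" "$listing"; then
            echo "$m: loaded"
        else
            echo "$m: not loaded"
        fi
    done
}
```

Calling driver_summary with no argument reads the live /proc/modules of the machine it runs on.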

Installation process I followed

I created the CentOS 7 instance, following this section of the Google Cloud documentation:

gcloud compute instances create test-gpu-drivers \
    --machine-type n1-standard-2 \
    --boot-disk-size 250GB \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --image-family centos-7 --image-project centos-cloud \
    --maintenance-policy TERMINATE

The driver installation process I then followed is inspired by the Google documentation, but uses the latest versions:

gcloud compute ssh test-gpu-drivers
sudo su
yum -y update

# Reboot for kernel update to be taken into account
reboot

gcloud compute ssh test-gpu-drivers
sudo su

# Install nvidia drivers repository, found here: https://www.nvidia.com/Download/index.aspx?lang=en-us
curl -J -O http://us.download.nvidia.com/tesla/410.72/nvidia-diag-driver-local-repo-rhel7-410.72-1.0-1.x86_64.rpm
yum -y install ./nvidia-diag-driver-local-repo-rhel7-410.72-1.0-1.x86_64.rpm

# Install CUDA repository, found here: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=CentOS&target_version=7&target_type=rpmlocal
curl -J -O https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-10.0.130-1.x86_64.rpm
yum -y install ./cuda-repo-rhel7-10.0.130-1.x86_64.rpm

# Install CUDA & drivers & dependencies
yum clean all
yum -y install cuda

nvidia-smi -pm 1

reboot

gcloud compute ssh test-gpu-drivers
sudo su
nvidia-smi -pm 1

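The full log is here.

(I also tried the exact GCE driver installation script, without upgrading the versions, but had no luck either.)

Environment

Distribution release
# cat /etc/*-release | head -n 1
CentOS Linux release 7.6.1810 (Core)

Kernel release
# uname -r
3.10.0-957.1.3.el7.x86_64

I can get it to work on Ubuntu!

To analyze the problem, I decided to try the same thing on Ubuntu 18.04 (LTS). This time, I had no problems.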

Instance creation:

gcloud compute instances create gpu-ubuntu-1804 \
    --machine-type n1-standard-2 \
    --boot-disk-size 250GB \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --image-family ubuntu-1804-lts --image-project ubuntu-os-cloud \
    --maintenance-policy TERMINATE

Installation process:

gcloud compute ssh gpu-ubuntu-1804
sudo su
apt update
apt -y upgrade
reboot

gcloud compute ssh gpu-ubuntu-1804
sudo su
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
apt -y install ./cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
rm cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt-get update
apt-get -y install cuda
nvidia-smi -pm 1

The full installation log is available here.

Test:

# nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:00:04.0.
All done.
# ll /dev/nvidia*
crw-rw-rw- 1 root root 241,   0 Dec  4 14:01 /dev/nvidia-uvm
crw-rw-rw- 1 root root 195,   0 Dec  4 14:01 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Dec  4 14:01 /dev/nvidiactl
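The three device nodes in the listing above are what this working single-GPU Ubuntu instance exposes. A minimal sketch, assuming one GPU (hence nvidia0) and the node names taken from that listing, that reports which of them are missing; the function name missing_nvidia_nodes is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper, written for illustration only.

# Print the name of every expected NVIDIA device node that is absent.
# Assumes a single GPU, so only nvidia0 is checked among the per-GPU nodes.
# Usage: missing_nvidia_nodes [dev-dir]   (defaults to /dev)
missing_nvidia_nodes() {
    local dev="${1:-/dev}" node
    for node in nvidia0 nvidiactl nvidia-uvm; do
        [ -e "$dev/$node" ] || echo "$node"
    done
}
```

An empty result means all three nodes are present; on the CentOS instance described earlier, all three names would be printed.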

One thing I noticed is that on Ubuntu, installing the nvidia-dkms package triggers several steps that I did not see on CentOS:

Setting up nvidia-dkms-410 (410.79-0ubuntu1) ...
update-initramfs: deferring update (trigger activated)

A modprobe blacklist file has been created at /etc/modprobe.d to prevent Nouveau
from loading. This can be reverted by deleting the following file:
/etc/modprobe.d/nvidia-graphics-drivers.conf

A new initrd image has also been created. To revert, please regenerate your
initrd by running the following command after deleting the modprobe.d file:
`/usr/sbin/initramfs -u`

*****************************************************************************
*** Reboot your computer and verify that the NVIDIA graphics driver can   ***
*** be loaded.                                                            ***
*****************************************************************************

Loading new nvidia-410.79 DKMS files...
Building for 4.15.0-1025-gcp
Building for architecture x86_64
Building initial module for 4.15.0-1025-gcp
Generating a 2048 bit RSA private key
.............................................................................................................+++
..........+++
writing new private key to '/var/lib/shim-signed/mok/MOK.priv'
-----
EFI variables are not supported on this system
/sys/firmware/efi/efivars not found, aborting.
Done.

nvidia:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/4.15.0-1025-gcp/updates/dkms/

nvidia-modeset.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/4.15.0-1025-gcp/updates/dkms/

nvidia-drm.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/4.15.0-1025-gcp/updates/dkms/

nvidia-uvm.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/4.15.0-1025-gcp/updates/dkms/

depmod...

DKMS: install completed.
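The log above shows DKMS installing the built modules under /lib/modules/4.15.0-1025-gcp/updates/dkms/. A rough way to check whether any nvidia module file exists for a given kernel is to search that kernel's module tree, as in the sketch below. It searches the whole tree rather than one subdirectory, since the install location can differ between distributions; the function name find_nvidia_modules is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper, written for illustration only.

# Print the paths of nvidia kernel module files installed for a kernel release.
# Usage: find_nvidia_modules [kernel-release] [modules-root]
#        (defaults: the running kernel from `uname -r`, and /lib/modules)
find_nvidia_modules() {
    local release="${1:-$(uname -r)}" root="${2:-/lib/modules}"
    # Module files may be compressed (for example .ko.xz), hence the trailing wildcard.
    find "$root/$release" -type f -name 'nvidia*.ko*' 2>/dev/null | sort
}
```

An empty result for the running kernel would suggest that the module was never built or installed for it, which is worth ruling out after a kernel update.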

Environment

Distribution release
root@gpu-ubuntu-1804:/home/elouan_keryell-even# cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.1 LTS"
NAME="Ubuntu"
VERSION="18.04.1 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.1 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL=…

Kernel release
root@gpu-ubuntu-1804:/home/elouan_keryell-even# uname -r
4.15.0-1025-gcp

Question

Does anyone know what is going wrong with my NVIDIA driver installation on CentOS 7?

Tags: nvidia, centos, centos7, google-cloud-platform, google-compute-engine

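2 Answers

Server Fault user

Answered 2018-12-19 10:21:15

There are two problems:

CentOS uses the open-source nouveau driver by default, which is incompatible with the NVIDIA driver and must be disabled.
The drivers from the NVIDIA repo do not seem to work, because the nvidia dkms module is required.

To fix this:

Install some required packages: yum install kernel-devel epel-release
Edit /etc/default/grub and add nouveau.modeset=0 to GRUB_CMDLINE_LINUX
Regenerate the grub configuration to apply the change: grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
Reboot for the change to take effect.
Then install this driver directly: http://fr.download.nvidia.com/tesla/410.79/NVIDIA-Linux-x86_64-410.79.run

After that, nvidia-smi should work.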
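The edit to /etc/default/grub described above can be scripted. The following is a minimal sketch, not the answerer's own method: it appends an argument to the GRUB_CMDLINE_LINUX line only if it is not already there. The function name add_kernel_arg is hypothetical, and the sketch assumes the value is wrapped in double quotes on a single line and that the argument contains no "/" or "&" characters.

```shell
#!/usr/bin/env bash
# Hypothetical helper, written for illustration only.

# Append a kernel argument to GRUB_CMDLINE_LINUX unless it is already present.
# Assumes the value is double-quoted on one line and that the argument
# contains no "/" or "&" (both are special in the sed expression below).
# Usage: add_kernel_arg <arg> [file]   (defaults to /etc/default/grub)
add_kernel_arg() {
    local arg="$1" file="${2:-/etc/default/grub}" tmp
    # Nothing to do if the argument already appears on the target line.
    if grep '^GRUB_CMDLINE_LINUX=' "$file" | grep -qF -- "$arg"; then
        return 0
    fi
    tmp="$(mktemp)"
    # Insert the argument just before the closing double quote, then write the
    # result back in place so the file keeps its ownership and permissions.
    sed "s/^\(GRUB_CMDLINE_LINUX=\".*\)\"/\1 ${arg}\"/" "$file" > "$tmp" \
        && cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Running add_kernel_arg nouveau.modeset=0 as root would then be followed by the grub2-mkconfig step. Note that /boot/efi/EFI/centos/grub.cfg is the UEFI location; on a CentOS 7 system that boots in BIOS mode the generated file is /boot/grub2/grub.cfg, so it is worth checking which one applies to the instance.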

Votes: 2

Server Fault user

Answered 2018-12-05 17:26:50

This issue has been reported to Google and is being investigated here.

Votes: 0

Original page content provided by Server Fault.

Original link:

https://serverfault.com/questions/942844
