NVidia driver stopped working on an AWS EC2 instance with Ubuntu 16.04 and a Tesla K80 GPU

Stack Overflow user
Asked 2019-03-20 13:24:28
Answers: 5, Views: 9.5K, Followers: 0, Votes: 6

I have been using an AWS EC2 instance with a Tesla K80 GPU to run TensorFlow code. I installed CUDA 9.0 and cuDNN 7.1.4, and I use TF 1.12, all on Ubuntu 16.04.

Everything worked fine until yesterday, but today the NVidia driver seems to have stopped working for some reason:

ubuntu@ip-10-0-0-13:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I checked the drivers:

ubuntu@ip-10-0-0-13:~$ dpkg -l | grep nvidia
rc  nvidia-367                              367.48-0ubuntu1                            amd64        NVIDIA binary driver - version 367.48
ii  nvidia-396                              396.37-0ubuntu1                            amd64        NVIDIA binary driver - version 396.37
ii  nvidia-396-dev                          396.37-0ubuntu1                            amd64        NVIDIA binary Xorg driver development files
ii  nvidia-machine-learning-repo-ubuntu1604 1.0.0-1                                    amd64        nvidia-machine-learning repository configuration files
ii  nvidia-modprobe                         396.37-0ubuntu1                            amd64        Load the NVIDIA kernel driver and create device files
rc  nvidia-opencl-icd-367                   367.48-0ubuntu1                            amd64        NVIDIA OpenCL ICD
ii  nvidia-opencl-icd-396                   396.37-0ubuntu1                            amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                            0.8.2                                      amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                         396.37-0ubuntu1                            amd64        Tool for configuring the NVIDIA graphics driver

There now seem to be two different versions. Could that be the problem? (Though I don't see why everything would have worked until now.)
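
For reading the listing above: the first column of `dpkg -l` is the package state. `ii` marks an installed package, while `rc` marks one that was removed with only its configuration files left behind, so the 367.48 entries are leftovers rather than a second active driver. A minimal sketch for filtering such output down to installed packages (the helper name `installed_packages` is hypothetical, not part of dpkg):

```shell
# Print "name version" for every package that dpkg reports as fully installed.
# Reads `dpkg -l` style text on stdin.
#   ii = installed
#   rc = removed, configuration files remain
installed_packages() {
    awk '$1 == "ii" { print $2, $3 }'
}

# Example (not executed here):
#   dpkg -l | grep nvidia | installed_packages
```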

After looking at this thread, I checked the kernel, and it is clearly different from the one mentioned in the thread:

ubuntu@ip-10-0-0-13:~$ uname -a
Linux ip-10-0-0-13 4.4.0-143-generic #169-Ubuntu SMP Thu Feb 7 07:56:38 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Has anyone run into this problem and knows how to fix it? Thanks in advance for your help!

Edit:

When I tried to upgrade the driver using @Dehydrated_Mud's method, I got the following error:

ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.

And the contents of the log file:

nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Thu Mar 21 10:56:46 2019
installer version: 384.183

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

nvidia-installer command line:
    ./nvidia-installer
    --no-drm
    --disable-nouveau
    --dkms
    --silent
    --install-libglvnd

Using built-in stream user interface
-> Detected 4 CPUs online; setting concurrency level to 4.
-> Installing NVIDIA driver version 384.183.
-> The NVIDIA driver appears to have been installed previously using a different installer. To prevent potential conflicts, it is recommended either to update the existing installation using the same mechanism by which it was originally installed, or to uninstall the existing installation before installing this driver.

Please review the message provided by the maintainer of this alternate installation method and decide how to proceed:

The package that is already installed is named nvidia-396.

You can upgrade the driver by running:
`apt-get install nvidia-396 nvidia-modprobe nvidia-settings`

You can remove nvidia-396, and all related packages, by running:
`apt-get remove --purge nvidia-396 nvidia-modprobe nvidia-settings`

This package is maintained by NVIDIA (cudatools@nvidia.com).


(Answer: Abort installation)
ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.

Running apt-cache search nvidia | grep -P '^nvidia-[0-9]+\s' gives:

nvidia-331 - Transitional package for nvidia-331
nvidia-346 - Transitional package for nvidia-346
nvidia-304 - NVIDIA legacy binary driver - version 304.135
nvidia-340 - NVIDIA binary driver - version 340.107
nvidia-361 - Transitional package for nvidia-367
nvidia-352 - Transitional package for nvidia-375
nvidia-367 - Transitional package for nvidia-387
nvidia-375 - Transitional package for nvidia-418
nvidia-387 - NVIDIA binary driver - version 387.26
nvidia-418 - NVIDIA binary driver - version 418.39
nvidia-384 - NVIDIA binary driver - version 384.183
nvidia-390 - NVIDIA binary driver - version 390.116
nvidia-410 - NVIDIA binary driver - version 410.104
nvidia-396 - NVIDIA binary driver - version 396.82

5 Answers

Stack Overflow user

Accepted answer

Posted 2019-03-20 14:37:37

I solved this by updating to the latest Nvidia driver. Use:

nvcc --version

to get the CUDA toolkit version number. For CUDA 9.0 the latest driver is 384.183; for CUDA 10.0 it is 410.104.

Then run:

wget http://us.download.nvidia.com/tesla/384.183/NVIDIA-Linux-x86_64-384.183.run

to download the driver.

Then run:

sudo sh ./NVIDIA-Linux-x86_64-384.183.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd

to install the driver.

Run:

nvidia-smi

to check whether the problem is resolved.
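
The version lookup can be sketched as two small shell helpers. This is only an illustration under stated assumptions: the mapping covers just the two CUDA releases named in this answer, the helper names `cuda_release_from_nvcc` and `driver_for_cuda` are hypothetical, and the parser assumes the usual `release X.Y` line in the output of `nvcc --version`.

```shell
# Extract the CUDA release (e.g. "9.0") from `nvcc --version` output on stdin.
# Assumes a line of the form "Cuda compilation tools, release 9.0, V9.0.176".
cuda_release_from_nvcc() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Map a CUDA release to the driver version quoted in this answer.
# Only the two releases mentioned above are covered; anything else fails.
driver_for_cuda() {
    case "$1" in
        9.0)  echo "384.183" ;;
        10.0) echo "410.104" ;;
        *)    return 1 ;;
    esac
}

# Example (not executed here):
#   release=$(nvcc --version | cuda_release_from_nvcc)
#   version=$(driver_for_cuda "$release") && echo "driver: $version"
```

Keeping the lookup separate from the download and install commands lets the mapping be checked without touching the system.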

Votes: 12

Stack Overflow user

Posted 2020-08-21 11:41:22

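Although reinstalling the driver gets it working again, that does not address the root cause. I observed the same problem on Ubuntu, and reinstalling the driver was only a workaround until the day it broke again. The cause of these spontaneous Nvidia CUDA driver failures is Ubuntu's automatic security updates. When an update rebuilds the kernel, it breaks the CUDA driver, and nvidia-smi can no longer communicate with the driver. A simple solution is to disable automatic security updates: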

sudo apt -y remove unattended-upgrades
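
To check whether a kernel update is what broke the driver, one option is to look for an NVIDIA module built for the running kernel. The sketch below is an assumption-laden illustration, not an official check: it assumes kernel modules live under `/lib/modules/<kernel>` and that the module file name starts with `nvidia`; `has_nvidia_module` is a hypothetical helper.

```shell
# Succeed if a kernel module file whose name starts with "nvidia" exists
# for the given kernel release.
#   $1 = kernel release, e.g. the output of `uname -r`
#   $2 = modules root (optional, defaults to /lib/modules)
has_nvidia_module() {
    kernel="$1"
    root="${2:-/lib/modules}"
    [ -n "$(find "$root/$kernel" -name 'nvidia*.ko*' 2>/dev/null | head -n 1)" ]
}

# Example (not executed here):
#   if has_nvidia_module "$(uname -r)"; then
#       echo "NVIDIA module present for the running kernel"
#   else
#       echo "no NVIDIA module for $(uname -r); the driver needs a rebuild"
#   fi
```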

Votes: 4

Stack Overflow user

Posted 2019-03-27 23:35:29

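#!/bin/bash
# Download and install the given NVIDIA Tesla driver version.

set -x

version=$1
#version=410.79
#version=410.104

# Stop early if no version was passed; the URL below would be malformed.
if [ -z "$version" ]; then
    echo "usage: $0 <driver-version>" >&2
    exit 1
fi

# Skip the install step if the download fails.
wget "http://us.download.nvidia.com/tesla/${version}/NVIDIA-Linux-x86_64-${version}.run" || exit 1
sudo sh "./NVIDIA-Linux-x86_64-${version}.run" --no-drm --disable-nouveau --dkms --silent --install-libglvnd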

  1. Save the above as something like install.sh.
  2. sh install.sh 410.104
  3. sudo modprobe nvidia

The GPU should come back right away; check with nvidia-smi.

Votes: 0
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/55261785
