CUDA: the driver version upgrades correctly, but the runtime does not

Ask Ubuntu user
Asked on 2020-10-27 04:21:09
Answers: 1, Views: 1.6K, Followers: 0, Votes: 0

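I am trying to install CUDA 11.1, both the runtime API and the driver for my GPUs.

I am running Ubuntu 18.04 on x86_64. I have been trying to upgrade my CUDA runtime to 11.1, but have not managed to. The driver has been updated, but the runtime API has not.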

nvidia-smi

shows that I have been upgraded to 11.0, but

nvcc -V

shows version 10.0.130 installed for the runtime API.
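The two commands report different things: the "CUDA Version" in the nvidia-smi header is the newest CUDA version the installed driver supports, while nvcc -V reports the version of the toolkit compiler that the shell finds on PATH. A minimal sketch, assuming only sed and the output formats shown in this question, that extracts both numbers so they can be compared side by side (the function names are hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical helpers: read tool output on stdin, print one version number.

# "CUDA Version: X.Y" in the nvidia-smi header = newest CUDA the driver supports.
driver_cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# "release X.Y," in nvcc -V = version of the toolkit compiler found on PATH.
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p' | head -n 1
}

# Usage on a live system:
#   nvidia-smi | driver_cuda_version
#   nvcc -V    | nvcc_release
```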

Following the instructions at https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html, I will go through the commands in the order they are listed in the guide.

Section 2. Pre-installation Actions

lspci | grep -i nvidia gives:

19:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
19:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
19:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
19:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
1a:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
1a:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
1a:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
1a:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
67:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
67:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
67:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
67:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
68:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
68:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
68:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
68:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)

uname -m && cat /etc/*release gives:

x86_64
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.3 LTS"
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

gcc --version gives:

gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

uname -r gives:

5.4.0-51-generic

sudo apt-get install linux-headers-$(uname -r) gives:

Reading package lists... Done
Building dependency tree       
Reading state information... Done
linux-headers-5.4.0-51-generic is already the newest version (5.4.0-51.56~18.04.1).
linux-headers-5.4.0-51-generic set to manually installed.
The following packages were automatically installed and are no longer required:
  dkms libaccinj64-10.0 libatomic1:i386 libboost-python1.65.1 libbsd0:i386 libc-ares2 libcublas10.0 libcudnn7 libcufft10.0 libcufftw10.0 libcuinj64-10.0 libcupti-dev libcupti-doc libcupti10.0 libcurand10.0
  libcusolver10.0 libcusparse10.0 libdrm-amdgpu1:i386 libdrm-intel1:i386 libdrm-nouveau2:i386 libdrm-radeon1:i386 libdrm2:i386 libedit2:i386 libelf1:i386 libexpat1:i386 libffi6:i386 libgflags2.2 libgl1:i386
  libgl1-mesa-dri:i386 libglapi-mesa:i386 libglvnd0:i386 libglx-mesa0:i386 libglx0:i386 libgoogle-glog0v5 libgrpc7 libjs-sphinxdoc libleveldb1v5 libllvm10:i386 liblmdb0 libnppc10.0 libnppial10.0 libnppicc10.0
  libnppicom10.0 libnppidei10.0 libnppif10.0 libnppig10.0 libnppim10.0 libnppist10.0 libnppisu10.0 libnppitc10.0 libnpps10.0 libnvblas10.0 libnvgraph10.0 libnvidia-cfg1-450 libnvidia-common-450
  libnvidia-compute-450:i386 libnvidia-decode-450 libnvidia-decode-450:i386 libnvidia-encode-450 libnvidia-encode-450:i386 libnvidia-extra-450 libnvidia-extra-450:i386 libnvidia-fbc1-450 libnvidia-fbc1-450:i386
  libnvidia-gl-450 libnvidia-gl-450:i386 libnvidia-ifr1-450 libnvidia-ifr1-450:i386 libnvrtc10.0 libnvtoolsext1 libnvvm3 libpciaccess0:i386 libprotobuf18 libprotoc18 libsensors4:i386 libsleef3 libstdc++6:i386
  libthrust-dev libvdpau-dev libx11-6:i386 libx11-xcb1:i386 libxau6:i386 libxcb-dri2-0:i386 libxcb-dri3-0:i386 libxcb-glx0:i386 libxcb-present0:i386 libxcb-sync1:i386 libxcb1:i386 libxdamage1:i386 libxdmcp6:i386
  libxext6:i386 libxfixes3:i386 libxnvctrl0 libxshmfence1:i386 libxxf86vm1:i386 pkg-config protobuf-compiler python-absl python-astor python-cffi python-configparser python-future python-gast python-grpcio
  python-leveldb python-networkx python-pasta python-ply python-protobuf python-pycparser python-pywt python-skimage python-skimage-lib python-termcolor python-typing python-wrapt python3-absl python3-astor
  python3-cffi python3-future python3-gast python3-grpcio python3-leveldb python3-markdown python3-networkx python3-pasta python3-ply python3-pycparser python3-pyinotify python3-pywt python3-skimage python3-skimage-lib
  python3-tensorflow-serving python3-termcolor python3-werkzeug python3-wrapt screen-resolution-extra xserver-xorg-video-nvidia-450
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 179 not upgraded.

Section 2.7. Handle Conflicting Installation Methods

I ran the following commands:

sudo /usr/bin/nvidia-uninstall
sudo apt-get --purge remove cuda*  
sudo apt-get --purge remove nvidia*  
sudo apt-get --purge remove libcuda*  

I looked for

sudo /usr/local/cuda-X.Y/bin/uninstall_cuda_X.Y.pl

but there is no file with that name in bin, so I don't think the previous CUDA was installed with a runfile.
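A minimal sketch for checking this more systematically, assuming toolkits live under /usr/local/cuda-X.Y as in the guide. It looks for both uninstaller names that appear in this question: uninstall_cuda_X.Y.pl (the guide's pattern, used by older toolkits) and cuda-uninstaller (the name the 11.0 runfile reports later on). The function name is hypothetical; empty output is consistent with the toolkit having come from packages rather than a runfile.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list runfile uninstallers under a CUDA install prefix.
find_cuda_uninstallers() {
  local prefix="${1:-/usr/local}"   # assumption: the guide's default prefix
  local f
  # Unmatched globs stay literal, so the -e test filters them out.
  for f in "$prefix"/cuda-*/bin/uninstall_cuda_*.pl \
           "$prefix"/cuda-*/bin/cuda-uninstaller; do
    if [ -e "$f" ]; then
      printf '%s\n' "$f"
    fi
  done
}

# Usage:
#   find_cuda_uninstallers    # checks /usr/local/cuda-*/bin
```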

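I checked nvidia-smi and nvcc -V, and both times the command was not found, whereas before both had been found. When I ran the installer, I kept getting a warning that there was a previous installation:

Existing package manager installation of the driver found. It is strongly recommended that you remove this before continuing.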

So I tried several other ways of removing the CUDA installation:

sudo apt-get --purge remove cuda-11.0
sudo apt-get --purge remove cuda-11.1 
sudo apt-get --purge remove cuda-10.0 
sudo apt-get purge nvidia*
sudo apt-get remove --purge cuda-* libcuda* nvidia* 
sudo rm /etc/apt/sources.list.d/cuda*
sudo apt remove --autoremove nvidia-cuda-toolkit
sudo dpkg -l | grep nvidia
sudo apt purge cuda
sudo apt purge -y nvidia
sudo apt remove -y nvidia-*
sudo rm /etc/apt/sources.list.d/cuda*
sudo apt autoremove -y && sudo apt autoclean -y
sudo rm -rf /usr/local/cuda*
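After purging, one way to confirm nothing is left is to filter dpkg -l for these package families. A minimal sketch (the function name and the list of name prefixes are assumptions based on the packages mentioned above): it reads dpkg -l output on stdin and prints packages in state ii (installed) or rc (removed, configuration files still present).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<state> <package>" for leftover GPU packages.
leftover_gpu_packages() {
  awk '$1 ~ /^(ii|rc)$/ &&
       $2 ~ /^(cuda|libcuda|nvidia|libnvidia|xserver-xorg-video-nvidia)/ {
         print $1, $2
       }'
}

# Usage:
#   dpkg -l | leftover_gpu_packages
```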

Section 6. Runfile Installation

6.3. Disabling Nouveau

I ran the following command

touch /etc/modprobe.d/blacklist-nouveau.conf

and added these two lines

blacklist nouveau
options nouveau modeset=0

to that file. Then I ran

sudo update-initramfs -u

which produced

update-initramfs: Generating /boot/initrd.img-5.4.0-52-generic

Then I checked whether lsmod | grep nouveau printed anything, and it did not.
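Note that touch only creates an empty file; the two blacklist lines still have to be written into it, which needs root. A minimal sketch of the whole step, with a hypothetical helper that prints the file contents so they can be piped through sudo tee; update-initramfs -u then regenerates the initramfs as in the guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the two lines the guide puts in
# /etc/modprobe.d/blacklist-nouveau.conf.
nouveau_blacklist_conf() {
  printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0'
}

# Usage (needs root; `touch` alone would leave the file empty):
#   nouveau_blacklist_conf | sudo tee /etc/modprobe.d/blacklist-nouveau.conf >/dev/null
#   sudo update-initramfs -u     # rebuild the initramfs so it includes the blacklist
#   lsmod | grep nouveau         # after a reboot, this should print nothing
```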

Then I tried this installation:

https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal

which gave these commands:

wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda_11.1.0_455.23.05_linux.run
sudo sh cuda_11.1.0_455.23.05_linux.run

I downloaded the installer and ran sudo sh cuda_11.1.0_455.23.05_linux.run.

This produced the following message:

 Installation failed. See log at /var/log/cuda-installer.log for details.

I opened that file; here are its contents:

[INFO]: Driver not installed.
[INFO]: Checking compiler version...
[INFO]: gcc location: /usr/bin/gcc

[INFO]: gcc version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)

[INFO]: Initializing menu
[INFO]: Setup complete
[INFO]: Components to install:
[INFO]: Driver
[INFO]: 455.23.05
[INFO]: Executing NVIDIA-Linux-x86_64-455.23.05.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd  2>&1
[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed.
[ERROR]: Install of 455.23.05 failed, quitting

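So it looks like the installation failed at the driver step. I don't know what caused this error, since 11.0 had previously been installed on these GPUs.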
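The runfile log only records that the bundled driver installer (NVIDIA-Linux-x86_64-455.23.05.run) exited with an error; the driver installer keeps its own, more detailed log. A minimal sketch for pulling out the relevant lines, assuming that log is at /var/log/nvidia-installer.log (the driver installer's default location). The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reading the two installer logs.

# Print only the [ERROR] lines of the CUDA runfile log.
cuda_installer_errors() {
  local log="${1:-/var/log/cuda-installer.log}"
  grep '^\[ERROR\]' "$log"
}

# Print the last N lines (default 40) of the driver installer's own log.
driver_installer_tail() {
  local log="${1:-/var/log/nvidia-installer.log}"   # assumption: default path
  [ -r "$log" ] && tail -n "${2:-40}" "$log"
}

# Usage (run as root if the logs are not world-readable):
#   cuda_installer_errors
#   driver_installer_tail
```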

Then I tried installing via the deb (local) package:

https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=deblocal

which gave these commands:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda

The last command appears to produce an error; the other commands seem to run fine. Here is the output of the last command, sudo apt-get -y install cuda:

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 cuda : Depends: cuda-11-1 (>= 11.1.0) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

While troubleshooting the driver installation, I found that sudo apt install nvidia-450-dev might work, so I tried it, and it succeeded.

nvidia-smi

shows the following:

Mon Oct 26 18:27:49 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:19:00.0 Off |                  N/A |
| 22%   31C    P8     1W / 250W |      6MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:1A:00.0 Off |                  N/A |
| 22%   35C    P8     4W / 250W |      6MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:67:00.0 Off |                  N/A |
| 22%   37C    P8     6W / 250W |      6MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:68:00.0 Off |                  N/A |
| 22%   39C    P8     1W / 250W |     26MiB / 11016MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1314      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1314      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      1314      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      1314      G   /usr/lib/xorg/Xorg                  9MiB |
|    3   N/A  N/A      1653      G   /usr/bin/gnome-shell               14MiB |
+-----------------------------------------------------------------------------+

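However, the driver reports CUDA Version 11.0, not 11.1.

So I tried installing an older version of CUDA, 11.0, instead of 11.1.

That only worked for the driver, not for the runtime API.

Running nvcc -V gives "bash: /usr/bin/nvcc: No such file or directory".

Then I tried installing the 11.0 toolkit, because the runtime API version should be lower than or equal to the CUDA version of the driver.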
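A minimal sketch of the rule stated above, assuming GNU sort -V for version ordering (the function names are hypothetical). For reference, the file name cuda_11.1.0_455.23.05_linux.run indicates that the 11.1 runfile bundles driver 455.23.05, which is newer than the 450.66 driver that nvidia-smi reported.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: compare a toolkit version with the CUDA version
# that nvidia-smi reports for the installed driver.

# True when version $1 <= version $2 (GNU sort -V orders version strings).
version_le() {
  [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# $1 = "CUDA Version" from nvidia-smi, $2 = toolkit version to install.
toolkit_fits_driver() {
  local driver_cuda="$1" toolkit="$2"
  if version_le "$toolkit" "$driver_cuda"; then
    echo "ok: toolkit $toolkit <= driver CUDA $driver_cuda"
  else
    echo "too new: toolkit $toolkit > driver CUDA $driver_cuda"
    return 1
  fi
}

# Usage with the numbers from this question:
#   toolkit_fits_driver 11.0 11.0   # 11.0 toolkit on the 450.66 driver
#   toolkit_fits_driver 11.0 11.1   # 11.1 toolkit would need a newer driver
```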

From

https://developer.nvidia.com/cuda-11.0-download-archive

I chose this installer: https://developer.nvidia.com/cuda-11.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal

which gave the following commands:

wget http://developer.download.nvidia.com/compute/cuda/11.0.2/local_installers/cuda_11.0.2_450.51.05_linux.run
sudo sh cuda_11.0.2_450.51.05_linux.run

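After downloading the installer, I ran sudo sh cuda_11.0.2_450.51.05_linux.run.

It first warned me that a previous installation exists, probably from the driver install. I chose to continue, since I would install only the toolkit and not the driver, and selected everything except the driver.
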
CUDA Installer
- [ ] Driver
     [ ] 450.51.05
+ [X] CUDA Toolkit 11.0
  [X] CUDA Samples 11.0
  [X] CUDA Demo Suite 11.0
  [X] CUDA Documentation 11.0
  Options
  Install

After the installation finished, I received this message:

===========
= Summary =
===========

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-11.0/
Samples:  Installed in /home/santosh/, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-11.0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-11.0/lib64, or, add /usr/local/cuda-11.0/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.0/bin

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-11.0/doc/pdf for detailed information on setting up CUDA.
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least .00 is required for CUDA 11.0 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run --silent --driver

Logfile is /var/log/cuda-installer.log

I added /usr/local/cuda-11.0/bin to PATH and set LD_LIBRARY_PATH to /usr/local/cuda-11.0/lib64.
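An export typed in one terminal only affects that shell and the processes it starts; new terminals begin without it. A minimal sketch of making the change persistent, assuming bash reads ~/.bashrc for interactive shells. The ${VAR:+:${VAR}} form appends the old value only when it is non-empty, so no empty PATH entry is created. These lines would go at the end of ~/.bashrc:

```shell
# Lines to append to ~/.bashrc (assumption: bash is the interactive shell).
# Put the 11.0 toolkit ahead of anything already on PATH, e.g. /usr/bin.
export PATH=/usr/local/cuda-11.0/bin${PATH:+:${PATH}}
# Same pattern for the runtime libraries.
export LD_LIBRARY_PATH=/usr/local/cuda-11.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```

After opening a new terminal, command -v nvcc should then print /usr/local/cuda-11.0/bin/nvcc.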

Then I tried the post-installation instructions here: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#power9-setup

systemctl status nvidia-persistenced gives "Unit nvidia-persistenced.service could not be found."

sudo systemctl enable nvidia-persistenced gives

The unit files have no installation config (WantedBy, RequiredBy, Also, Alias
settings in the [Install] section, and DefaultInstance for template units).
This means they are not meant to be enabled using systemctl.
Possible reasons for having this kind of units are:
1) A unit may be statically enabled by being symlinked from another unit's
   .wants/ or .requires/ directory.
2) A unit's purpose may be to act as a helper for some other unit which has
   a requirement dependency on it.
3) A unit may be started when needed via activation (socket, path, timer,
   D-Bus, udev, scripted systemctl call, ...).
4) In case of template units, the unit is meant to be enabled with some
   instance name specified.

I was able to follow the udev rules instructions without problems; I ran the following commands:

sudo cp /lib/udev/rules.d/40-vm-hotadd.rules /etc/udev/rules.d
sudo sed -i '/SUBSYSTEM=="memory", ACTION=="add"/d' /etc/udev/rules.d/40-vm-hotadd.rules
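The linked section (#power9-setup) is written for IBM POWER9 systems, which report ppc64le from uname -m, whereas this machine reports x86_64 (see the uname -m output above). A minimal sketch of a guard for those steps; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical guard: only run the POWER9-specific post-install steps
# (nvidia-persistenced, memory hot-add udev rule) on ppc64le machines.
needs_power9_setup() {
  local arch="${1:-$(uname -m)}"
  [ "$arch" = "ppc64le" ]
}

if needs_power9_setup; then
  echo "ppc64le: the POWER9 setup steps apply"
else
  echo "$(uname -m): skipping the POWER9 setup steps"
fi
```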

I ran nvcc -V just to check whether the installation had worked at all. This time I got the message:

Command 'nvcc' not found, but can be installed with:

sudo apt install nvidia-cuda-toolkit

So I ran that command, and it appeared to install without problems. When I ran nvcc -V again, I got:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

This is the same CUDA version I started with.
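When several nvcc binaries exist, the shell runs the first one found on PATH. The nvidia-cuda-toolkit package installed here appears to provide /usr/bin/nvcc (release 10.0), while the 11.0 runfile put its copy in /usr/local/cuda-11.0/bin, so nvcc -V reports 10.0 unless /usr/local/cuda-11.0/bin comes earlier on PATH in the current shell. Bash also caches command locations, which is a common cause of messages like "bash: /usr/bin/nvcc: No such file or directory" after a binary has been removed; hash -r clears that cache. A minimal sketch with hypothetical helper names:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for seeing which nvcc the shell will run.

# Print every executable named $1 (default nvcc) in PATH order.
all_on_path() {
  type -ap "${1:-nvcc}"
}

# Print the one that would actually run, after clearing bash's cache of
# command locations (a stale entry can point at a file that no longer exists).
first_on_path() {
  hash -r
  command -v "${1:-nvcc}"
}

# Usage:
#   all_on_path nvcc     # might list /usr/bin/nvcc and /usr/local/cuda-11.0/bin/nvcc
#   first_on_path nvcc   # the nvcc whose version `nvcc -V` reports
```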

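Looking at this post

https://forums.developer.nvidia.com/t/cuda-10-installation-problems-on-ubuntu-18-04/68615

which says: "Follow the instructions in the linux install guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html. Get your installer from http://www.nvidia.com/getcuda. Since you already have the wrong driver installed, read the linux install guide carefully. If you don't follow it carefully, it will cause more trouble."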

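it seems that the alternative ways of installing the driver and toolkit (sudo apt install nvidia-450-dev and sudo apt install nvidia-cuda-toolkit) are not recommended, and that the guide should be followed strictly.

However, I did follow the instructions, and the driver installation failed. Installing the driver does not seem to be impossible, since the alternative command worked to some extent, but the error log gives me no insight into how to install it the official way.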

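1 Answer

Ask Ubuntu user

Answered on 2020-10-27 18:45:04

I solved the problem. The hardware came with its own CUDA installation files that I was not aware of. Once those were blocked, the installation worked perfectly.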

Votes: 1

Original content provided by Ask Ubuntu. Original link:

https://askubuntu.com/questions/1287290
