首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >Tensorflow无法使用所有可见的GPU

Tensorflow无法使用所有可见的GPU
EN

Stack Overflow用户
提问于 2018-03-29 18:52:00
回答 1查看 883关注 0票数 1

我有一个带有8个GPU的集群,我想在它上运行一个python脚本。我知道脚本很好,因为它运行在一个GPU集群上。但是,当试图在这8 gpu集群上运行时,我会收到以下错误消息:

代码语言:javascript
复制
to use: AVX2 AVX512F FMA
2018-03-29 18:42:51.800702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:3d:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:52.347624: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:3e:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:52.882324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:60:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:53.591909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:61:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:54.149671: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 4 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:b1:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:54.715701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 5 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:b2:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:55.286011: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 6 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:da:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:55.874676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 7 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:db:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:55.929779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1227] Device peer to peer matrix
2018-03-29 18:42:55.930506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1233] DMA: 0 1 2 3 4 5 6 7
2018-03-29 18:42:55.930524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 0:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 1:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 2:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930550: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 3:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 4:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 5:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 6:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 7:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2018-03-29 18:43:00.106517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10415 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:3d:00.0, compute capability: 6.1)
2018-03-29 18:43:00.572522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10415 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:3e:00.0, compute capability: 6.1)
2018-03-29 18:43:01.039866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10415 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:60:00.0, compute capability: 6.1)
2018-03-29 18:43:01.512332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10415 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:61:00.0, compute capability: 6.1)
2018-03-29 18:43:02.036327: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 10415 MB memory) -> physical GPU (device: 4, name: GeForce GTX 1080 Ti, pci bus id: 0000:b1:00.0, compute capability: 6.1)
2018-03-29 18:43:02.679167: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 10415 MB memory) -> physical GPU (device: 5, name: GeForce GTX 1080 Ti, pci bus id: 0000:b2:00.0, compute capability: 6.1) 
killed

它只是温和地说,killed和我不知道为什么会发生这个错误。我尝试使用以下命令指定两个GPU:

代码语言:javascript
复制
CUDA_VISIBLE_DEVICES=0,1 python3 my_script.py

但是打印了以下错误:

代码语言:javascript
复制
Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10415 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:3e:00.0, compute capability: 6.1)
2018-03-29 18:47:46.208490: E tensorflow/stream_executor/cuda/cuda_dnn.cc:378] Loaded runtime CuDNN library: 7102 (compatibility version 7100) but source was compiled with 7004 (compatibility version 7000).  If using a binary install, upgrade your CuDNN library to match.  If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
2018-03-29 18:47:46.210296: F tensorflow/core/kernels/conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
Aborted (core dumped)

我使用以下命令安装tensorflow-gpu:

代码语言:javascript
复制
pip3 install tensorflow-gpu 
pip3 install --upgrade tensorflow-gpu

这可能与“激活”tensorflow有关吗?我不知道如何在集群上这样做,因为我不确定这是否考虑到了虚拟环境

EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2018-04-02 14:10:16

您需要降低cuDNN版本的级别。我用7.0.5解决了这个问题。

下载cuDNN v7.0.5 (2017年12月5日),for CUDA 9.0

cuDNN v7.0.5Library for Linux下载cuDNN文件。

(关于Ubuntu 16)

在此之前,您需要删除所有cuDNN文件:

代码语言:javascript
复制
sudo rm -rf /usr/local/cuda/include/cudnn.h
sudo rm -rf /usr/local/cuda/lib64/libcudnn*

现在从下载的文件中提取新的cuDNN:

代码语言:javascript
复制
tar xvzf cudnn-9.0-linux-x64-v7.tgz

将新文件移动到cuda目录:

代码语言:javascript
复制
sudo cp -P cuda/include/cudnn.h /usr/local/cuda/include    
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64

设置此文件的权限:

代码语言:javascript
复制
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
票数 2
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/49563179

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档