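I have a cluster with 8 GPUs and I want to run a Python script on it. I know the script is fine because it runs on a single-GPU cluster. However, when I try to run it on this 8-GPU cluster, I get the following output: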
to use: AVX2 AVX512F FMA
2018-03-29 18:42:51.800702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:3d:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:52.347624: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:3e:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:52.882324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:60:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:53.591909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:61:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:54.149671: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 4 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:b1:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:54.715701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 5 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:b2:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:55.286011: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 6 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:da:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:55.874676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 7 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:db:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:55.929779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1227] Device peer to peer matrix
2018-03-29 18:42:55.930506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1233] DMA: 0 1 2 3 4 5 6 7
2018-03-29 18:42:55.930524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 0: Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 1: Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 2: Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930550: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 3: Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 4: Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 5: Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 6: Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 7: Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2018-03-29 18:43:00.106517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10415 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:3d:00.0, compute capability: 6.1)
2018-03-29 18:43:00.572522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10415 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:3e:00.0, compute capability: 6.1)
2018-03-29 18:43:01.039866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10415 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:60:00.0, compute capability: 6.1)
2018-03-29 18:43:01.512332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10415 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:61:00.0, compute capability: 6.1)
2018-03-29 18:43:02.036327: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 10415 MB memory) -> physical GPU (device: 4, name: GeForce GTX 1080 Ti, pci bus id: 0000:b1:00.0, compute capability: 6.1)
2018-03-29 18:43:02.679167: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 10415 MB memory) -> physical GPU (device: 5, name: GeForce GTX 1080 Ti, pci bus id: 0000:b2:00.0, compute capability: 6.1)
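killed
It just says killed, and I don't know why this happens. I tried restricting the run to two GPUs with the following command:
CUDA_VISIBLE_DEVICES=0,1 python3 my_script.py
But that printed the following error: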
Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10415 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:3e:00.0, compute capability: 6.1)
2018-03-29 18:47:46.208490: E tensorflow/stream_executor/cuda/cuda_dnn.cc:378] Loaded runtime CuDNN library: 7102 (compatibility version 7100) but source was compiled with 7004 (compatibility version 7000). If using a binary install, upgrade your CuDNN library to match. If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
2018-03-29 18:47:46.210296: F tensorflow/core/kernels/conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
Aborted (core dumped)
I installed tensorflow-gpu with the following commands:
pip3 install tensorflow-gpu
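pip3 install --upgrade tensorflow-gpu
Could this have something to do with "activating" TensorFlow? I don't know how to do that on the cluster, because I'm not sure whether this involves a virtual environment.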
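Answered 2018-04-02 14:10:16
You need to downgrade cuDNN so that it matches the version TensorFlow was compiled against (7.0.x, according to the error log). I solved this problem with 7.0.5.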
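To see why a downgrade helps, the version integers in the error message can be decoded. The sketch below assumes the cuDNN 7.x encoding (major * 1000 + minor * 100 + patch) and assumes that the "compatibility version" is simply the version with the patch digits zeroed, which is what the pairs 7102/7100 and 7004/7000 in the log suggest. The function names are hypothetical helpers, not part of TensorFlow or cuDNN.

```python
def decode_cudnn_version(version):
    """Split a cuDNN 7.x version integer into (major, minor, patch).

    Assumes the encoding major * 1000 + minor * 100 + patch.
    """
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def compatibility_version(version):
    """Zero the patch digits (assumption inferred from the log: 7102 -> 7100)."""
    return version // 100 * 100


def is_compatible(loaded, compiled):
    """True when both versions share the same major and minor numbers."""
    return compatibility_version(loaded) == compatibility_version(compiled)


if __name__ == "__main__":
    loaded, compiled = 7102, 7004  # values taken from the error message
    print(decode_cudnn_version(loaded))     # (7, 1, 2)
    print(decode_cudnn_version(compiled))   # (7, 0, 4)
    print(is_compatible(loaded, compiled))  # False: 7.1.x runtime vs 7.0.x build
    print(is_compatible(7005, compiled))    # True: 7.0.5 runtime vs 7.0.x build
```

Under these assumptions, the installed 7.1.2 runtime fails the check against a 7.0.4 build, while 7.0.5 passes it.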
Download cuDNN v7.0.5 (Dec 5, 2017), for CUDA 9.0.
From that entry, download the "cuDNN v7.0.5 Library for Linux" archive.
(These steps were done on Ubuntu 16.)
Before installing it, remove all existing cuDNN files:
sudo rm -rf /usr/local/cuda/include/cudnn.h
sudo rm -rf /usr/local/cuda/lib64/libcudnn*
Now extract the new cuDNN from the downloaded archive:
tar xvzf cudnn-9.0-linux-x64-v7.tgz
Copy the new files into the CUDA directory:
sudo cp -P cuda/include/cudnn.h /usr/local/cuda/include
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64
Set read permissions on these files:
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
https://stackoverflow.com/questions/49563179
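After copying the files, one way to confirm which cuDNN version is now in place is to read the version macros from the header. This is a minimal sketch, assuming the cuDNN 7.x layout in which cudnn.h defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL, and assuming the install path used in the steps above. The function names are hypothetical.

```python
import re
from pathlib import Path


def parse_cudnn_header(text):
    """Return (major, minor, patch) from the CUDNN_* defines, or None if any is missing."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match lines such as "#define CUDNN_MAJOR 7".
        match = re.search(r"^\s*#\s*define\s+" + name + r"\s+(\d+)", text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


def installed_cudnn_version(header_path="/usr/local/cuda/include/cudnn.h"):
    """Read the header at header_path; return None if the file does not exist."""
    path = Path(header_path)
    if not path.is_file():
        return None
    return parse_cudnn_header(path.read_text(errors="replace"))


if __name__ == "__main__":
    # Prints a tuple such as (7, 0, 5), or None if the header was not found.
    print(installed_cudnn_version())
```

If the printed major and minor numbers match the version TensorFlow reports it was compiled with, the mismatch error from the question should no longer apply.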