我有一个安装了Ubuntu20.04的系统,因此为Tensorflow获得CUDA和cudnn的正确组合似乎有点棘手。我试过CUDA11,但是我不能让cudnn工作,所以我通过sudo apt install nvidia-cuda-toolkit和相应的cudnn (7.6.5) (some helpful answers)安装了CUDA10.1。现在,当我安装Tensorflow-gpu 2时,我可以很容易地检查它是否使用gpu作为:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU'))) 这给出了正确的输出2。但是,我需要使用Tensorflow-gpu-1.15。然后,我根据answer in this SO post尝试了以下方法
import tensorflow as tf
with tf.device('/gpu:0'):
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
with tf.Session() as sess:
print (sess.run(c))它给出了以下输出:
2020-07-11 14:05:53.181428: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-11 14:05:53.183404: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
2020-07-11 14:05:53.183598: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-11 14:05:53.185222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:02:00.0
2020-07-11 14:05:53.185548: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.185790: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.186015: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.186237: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.186459: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.186578: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.186594: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-11 14:05:53.186601: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-07-11 14:05:53.187652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-11 14:05:53.187669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]
Traceback (most recent call last):
File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
self._extend_graph()
File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation MatMul: {{node MatMul}} was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. Make sure the device specification refers to a valid device.
[[MatMul]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation MatMul: node MatMul (defined at /home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. Make sure the device specification refers to a valid device.
[[MatMul]]我不能理解这个问题,我是否需要一个不同的CUDA版本(10.0可能是因为缺少库),或者我是否应该在tf.device中更改设备名称,我不确定是否可以在ubuntu20上安装CUDA10.0,因此可能需要安装较旧的ubuntu版本?
发布于 2020-07-11 18:27:20
我也遇到过类似的问题,让CUDA正常工作。我的解决方案是降级到Ubuntu 18.04,并确保我拥有测试构建配置中列出的正确的gcc、CUDA和tensorflow组合:
我的解决方案的原始记录记录在这个StackOverflow问题中:
发布于 2020-07-11 17:24:47
您在使用libcublas, libcurand, etc等Cuda库时遇到了问题。尝试正确设置LD_LIBRARY_PATH并检查/usr/lib/cuda/include:/usr/lib/cuda/lib64包含的目录。
https://stackoverflow.com/questions/62847255
复制相似问题