Q: TensorFlow while loop runs slowly on the GPU?

Stack Overflow user

Asked 2018-06-20 22:47:44

2 answers · 2.1K views · 0 followers · 5 votes

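For some unknown reason, the following code runs considerably slower on the GPU than on the CPU (about 5.1 seconds versus 3.4 seconds; see the timings in the comments). Can anyone explain why?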

import time
import tensorflow as tf

with tf.device('/device:GPU:0'):  # gpu takes: 5.132448434829712 seconds
    # with tf.device('/cpu:0'): # cpu takes: 3.440524101257324 seconds
    i = tf.constant(0)
    while_condition = lambda i: tf.less(i, 2 ** 20)
    a = tf.fill([16, 16], 1.1)
    b = tf.fill([16, 16], 2.2)
    def body(i):
        res = tf.matmul(a, b)
        # increment i
        add = tf.add(i, 1)

        return (add,)


    ini_matmul = tf.matmul(a, b)

    # do the loop:
    loop = tf.while_loop(while_condition, body, [i])

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(ini_matmul)  # force the GPU to initialise anything it needs.

    t0 = time.time()
    sess.run(loop)

    t1 = time.time()
    print(t1 - t0)
# no explicit sess.close() needed: the with block closes the session

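Note: typically the GPU run takes about 5 seconds and the CPU run about 3 seconds, while a CPU version written with NumPy takes only 1.5 seconds. Hardware: the TensorFlow code was run on Google Colab; the NumPy code was run locally on an Intel Core i5-7267U.

NumPy version: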

import numpy as np
import time

i = 0
a = np.full([16,16],1.1)
b = np.full([16,16],2.2)

t0 = time.time()

while i < 2**20:
    a.dot(b)
    i += 1

t1 = time.time()

print(t1-t0)

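Update

This is starting to make more sense to me, because scaling up the matrices did not really help. Below are the updated code and the timings it produced (run on a Titan Xp card and an Intel i7 CPU). Essentially, TensorFlow runs much slower.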

import time
import tensorflow as tf

dimension = 11
repeat = 2**10
use_gpu = False
# Device: /device:GPU:0, Dimension 11, Repeat: 1024, Time cost: 0.00457597 seconds.
# Device: /cpu:0, Dimension 11, Repeat: 1024, Time cost: 0.00353599 seconds.

dev_name = '/device:GPU:0' if use_gpu else '/cpu:0'

with tf.device(dev_name):  
    i = tf.constant(0)
    while_condition = lambda i: tf.less(i, repeat)
    a = tf.constant(1.1, shape=[2**dimension, 2**dimension])
    b = tf.constant(2.2, shape=[2**dimension, 2**dimension])
    def body(i):
        res = tf.matmul(a, b)
        add = tf.add(i, 1)
        return (add,)
    ini_matmul = tf.matmul(a, b)
    # do the loop:
    loop = tf.while_loop(while_condition, body, [i])

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(ini_matmul)  # force initialisation.

    t0 = time.time()
    sess.run(loop)
    t1 = time.time()
    print('Device: {dev}, Dimension {dim:d}, Repeat: {r:d}, Time cost: {t:.8f} seconds.'.format(
        dev = dev_name,
        dim = dimension, r = repeat,
        t = t1 - t0
    ))

tensorflow

2 Answers

Stack Overflow user

Answered 2018-06-26 00:11:30

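In the end, I found that the matmul op is never executed by TensorFlow, because it is an isolated node in the graph: its result is not returned from the loop body and nothing that is fetched depends on it.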
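To make this concrete, here is a minimal sketch of one way to keep the product in the loop's data flow: carry it as a second loop variable, so the fetched result depends on every matmul. The helper names `build_loop` and `run_once`, the `tensorflow.compat.v1` import (assumed to be available, i.e. TF 1.14+ or 2.x), and the choice of filling `b` with `1.0 / size` so that the chained product stays bounded are all illustrative assumptions, not part of the original post.

```python
import time

# Assumption: the v1 compatibility API is available (TF 1.14+ or TF 2.x).
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()


def build_loop(repeat, size=16):
    """Build a while_loop whose matmul result is part of the loop state."""
    a = tf.fill([size, size], 1.1)
    # Illustrative choice: entries of 1/size keep acc @ b close to 1.1,
    # so the chained product neither overflows nor underflows.
    b = tf.fill([size, size], 1.0 / size)
    i0 = tf.constant(0)

    def cond(i, acc):
        return tf.less(i, repeat)

    def body(i, acc):
        # The product is returned as loop state, so each iteration's
        # matmul feeds the next one and is no longer an isolated node.
        return tf.add(i, 1), tf.matmul(acc, b)

    return tf.while_loop(cond, body, [i0, a])


def run_once(repeat, size=16):
    """Run the loop in a fresh graph; return (counter, matrix, seconds)."""
    graph = tf.Graph()
    with graph.as_default():
        loop = build_loop(repeat, size)
    with tf.Session(graph=graph) as sess:
        t0 = time.time()
        final_i, final_acc = sess.run(loop)  # fetch both loop variables
        elapsed = time.time() - t0
    return final_i, final_acc, elapsed


if __name__ == "__main__":
    final_i, _, elapsed = run_once(2 ** 10)
    print(final_i, elapsed)
```

Timings from a loop built this way are not comparable with the figures in the question, since here the multiplication actually runs on every iteration.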

Votes: 2

Stack Overflow user

Answered 2018-06-21 06:49:44

This is an interesting question.

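The relative slowdown of GPU versus CPU execution that you see in the TensorFlow snippet is almost certainly due to GPU memory allocation overhead. In short, cudaMalloc is slower than malloc. The speedup in the requested op (here, matmul) offsets this allocation slowdown if and only if the speedup exceeds the difference in allocation time. For matmul this always holds when the matrices are large; it does not hold when they are small, as in your example. To test this hypothesis, iteratively increase the size of the multiplicands and record the CPU and GPU run times; if memory allocation is indeed the problem, the two should converge and then cross over.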
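The experiment suggested above could be sketched roughly as follows. The names `time_loop` and `sweep`, the `tensorflow.compat.v1` import, and the default sizes are assumptions made for illustration. The matmul result is carried through the loop and a scalar checksum of it is fetched, so the multiplications cannot be pruned and only one number is copied back to the host. With `allow_soft_placement=True`, a missing GPU silently falls back to the CPU, so check the device placement before reading anything into the numbers.

```python
import time

# Assumption: the v1 compatibility API is available (TF 1.14+ or TF 2.x).
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()


def time_loop(device, dimension, repeat):
    """Time `repeat` chained matmuls of 2**dimension square matrices."""
    size = 2 ** dimension
    graph = tf.Graph()
    with graph.as_default():
        with tf.device(device):
            a = tf.constant(1.1, shape=[size, size])
            # Illustrative choice: 1/size keeps the chained product bounded.
            b = tf.constant(1.0 / size, shape=[size, size])

            def cond(i, acc):
                return tf.less(i, repeat)

            def body(i, acc):
                return tf.add(i, 1), tf.matmul(acc, b)

            _, final_acc = tf.while_loop(cond, body, [tf.constant(0), a])
            # Fetching a scalar that depends on the result forces the loop
            # to run while keeping the device-to-host transfer small.
            checksum = tf.reduce_sum(final_acc)
            warmup = tf.reduce_sum(tf.matmul(a, b))

    # Soft placement: falls back to the CPU if the device is unavailable.
    config = tf.ConfigProto(allow_soft_placement=True)
    with tf.Session(graph=graph, config=config) as sess:
        sess.run(warmup)  # one-off initialisation outside the timed region
        t0 = time.perf_counter()
        sess.run(checksum)
        return time.perf_counter() - t0


def sweep(devices=("/cpu:0", "/device:GPU:0"), dimensions=(4, 6, 8), repeat=64):
    """Return {(device, dimension): seconds} for every combination."""
    return {
        (dev, dim): time_loop(dev, dim, repeat)
        for dim in dimensions
        for dev in devices
    }


if __name__ == "__main__":
    for (dev, dim), seconds in sorted(sweep().items(), key=lambda kv: kv[0][1]):
        print("dim=2**{:d} device={} time={:.6f}s".format(dim, dev, seconds))
```

Extending `dimensions` upward (for example to 10 or 11) is where a crossover between the two devices would be expected to show up, if there is one.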

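The difference between the NumPy run time and the CPU-only TensorFlow run time may come from a very subtle difference between the NumPy and TensorFlow code. Note that in the NumPy code you instantiate a and b only once. It looks as if you did the same thing in the TensorFlow code, because you call the initialisation only once, but you are still filling the tensors on every iteration! To see why, note that tf.fill returns a Tensor. By definition, Tensor objects are populated each time sess.run is called on a graph that contains them. The two snippets therefore do slightly different things. A more direct comparison would be to make both a and b a tf.constant in the TensorFlow snippet.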
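The distinction between tf.fill and tf.constant can be inspected at graph-construction time. The sketch below (the helper name `op_types` and the `tensorflow.compat.v1` import are assumptions for illustration) only shows which kind of graph node each call creates: tf.fill adds an op that computes its output when the session runs, whereas tf.constant embeds the value in the graph. It does not by itself measure any cost, and a graph optimizer may fold such a fill into a constant anyway.

```python
# Assumption: the v1 compatibility API is available (TF 1.14+ or TF 2.x).
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()


def op_types():
    """Return the graph op types created by tf.fill and tf.constant."""
    graph = tf.Graph()
    with graph.as_default():
        filled = tf.fill([16, 16], 1.1)           # value produced at run time
        const = tf.constant(1.1, shape=[16, 16])  # value stored in the graph
    return filled.op.type, const.op.type


if __name__ == "__main__":
    print(op_types())  # ('Fill', 'Const')
```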

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/50951007
