Question: "threadgroup_barrier" makes no difference

Stack Overflow user

Asked 2019-08-29 20:07:00

Answers: 1 · Views: 444 · Followers: 0 · Votes: 1

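I am currently working with Metal compute shaders and trying to understand how GPU thread synchronization works there.

I wrote some simple code, but it does not behave the way I expect:

Suppose I have a threadgroup variable, an array into which all threads can write their output simultaneously.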

    kernel void compute_features(device float &output [[ buffer(0) ]],
                                 ushort2 group_pos [[ threadgroup_position_in_grid ]],
                                 ushort2 thread_pos [[ thread_position_in_threadgroup ]],
                                 ushort tid [[ thread_index_in_threadgroup ]])
    {
        threadgroup short blockIndices[288];

        float someValue = 0.0;
        // doing some work here which fills someValue...

        blockIndices[thread_pos.y * THREAD_COUNT_X + thread_pos.x] = someValue;

        // wait until all threads are done with their calculations
        threadgroup_barrier(mem_flags::mem_none);
        // accumulate every thread's result into output
        output += blockIndices[thread_pos.y * THREAD_COUNT_X + thread_pos.x];
    }

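The code above does not work. The output variable does not contain the contributions of all threads; it contains only the value from a single thread, presumably whichever one was last to add its value to output. To me it looks as if threadgroup_barrier does nothing at all.

Now, the interesting part. The following code works: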

    blockIndices[thread_pos.y * THREAD_COUNT_X + thread_pos.x] = someValue;

    threadgroup_barrier(mem_flags::mem_none);  //wait when all threads are done with calculations
    if (tid == 0) {
        for (int i = 0; i < 288; i ++) {
            output += blockIndices[i]; // filling out output variable with threads calculations
        }
    }

This code works just as well as the previous one:

    blockIndices[thread_pos.y * THREAD_COUNT_X + thread_pos.x] = someValue;

    if (tid == 0) {
        for (int i = 0; i < 288; i ++) {
            output += blockIndices[i]; // filling out output variable with threads calculations
        }
    }

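To sum up: my code only works as expected when I process the threadgroup memory in a single GPU thread. Its id does not matter; it can be the last thread in the threadgroup just as well as the first. Whether threadgroup_barrier is present makes absolutely no difference. I also tried threadgroup_barrier with the mem_threadgroup flag, and the code still does not work.

I realize I am probably missing some very important detail, and I would be glad if someone could point out my mistake. Thanks in advance!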

metal

compute-shader

threadgroup

multithreading

gpu

1 Answer

Stack Overflow user

Accepted answer

Posted 2019-08-29 22:16:44

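When you write output += blockIndices[...], all threads try to perform this operation at the same time. Because output is not an atomic variable, this results in a race condition: += is a separate read, add, and write, so concurrent updates can overwrite one another. It is not a thread-safe operation.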
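One way to make the concurrent accumulation safe is an atomic add. The following is only a minimal sketch and not code from the question: the kernel name `compute_features_atomic` is hypothetical, the per-thread value is a placeholder, and it assumes the result can be accumulated as an integer (the question's `short` array suggests this, but that is a guess). It also assumes the host zeroes the buffer before dispatch.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical variant of the question's kernel: `output` is declared as an
// atomic integer so that concurrent additions are not lost.
kernel void compute_features_atomic(device atomic_int *output [[ buffer(0) ]],
                                    ushort tid [[ thread_index_in_threadgroup ]])
{
    // Placeholder for the real per-thread computation (assumption).
    int someValue = int(tid);

    // Atomic read-modify-write: each thread's contribution is added exactly once.
    atomic_fetch_add_explicit(output, someValue, memory_order_relaxed);
}
```

No barrier is needed for this variant, since each thread only touches its own value and the atomic. Many threads contending on a single atomic can be slow, so this is probably best suited to small workloads.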

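Your second solution is the correct one. Only one thread needs to collect the results (although you could also split that work across multiple threads). If it still works after you remove the barrier, that is probably just luck.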
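Splitting the collection across multiple threads could look roughly like the pairwise reduction below. This is a sketch under stated assumptions, not the asker's code: the kernel name `sum_per_threadgroup` and the `group_sums` buffer are hypothetical, the per-thread value is a placeholder, the grid is assumed to be dispatched in one dimension, and each threadgroup is assumed to contain exactly 288 threads as in the question. Note that it uses `mem_flags::mem_threadgroup` so that writes to threadgroup memory are ordered across the barrier.

```metal
#include <metal_stdlib>
using namespace metal;

// Assumption: every threadgroup is dispatched with exactly 288 threads,
// matching the array size used in the question.
#define GROUP_SIZE 288

// Hypothetical kernel: writes one sum per threadgroup into `group_sums`.
kernel void sum_per_threadgroup(device float *group_sums [[ buffer(0) ]],
                                uint group_id [[ threadgroup_position_in_grid ]],
                                uint tid [[ thread_index_in_threadgroup ]])
{
    threadgroup float partial[GROUP_SIZE];

    // Placeholder for the real per-thread computation (assumption).
    float someValue = float(tid);
    partial[tid] = someValue;

    // Make every thread's write visible before any thread reads the array.
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Pairwise reduction that also handles a non-power-of-two size.
    uint active = GROUP_SIZE;
    while (active > 1) {
        uint stride = (active + 1) / 2;
        if (tid + stride < active) {
            partial[tid] += partial[tid + stride];
        }
        // All threads reach this barrier because the loop bounds are uniform.
        threadgroup_barrier(mem_flags::mem_threadgroup);
        active = stride;
    }

    // A single thread publishes the result, so there is no race on the output.
    if (tid == 0) {
        group_sums[group_id] = partial[0];
    }
}
```

The host (or a second pass) would then add up the entries of `group_sums`. Writing one value per threadgroup avoids any cross-threadgroup race, because `threadgroup_barrier` only synchronizes threads within the same threadgroup.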

Votes: 3

Original content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/57709877
