Q: How to pack bits efficiently in CUDA?

Stack Overflow user

Asked 2016-09-14 10:42:59

2 answers, 1.8K views, 0 followers, score 2

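I have a byte array where each byte is either 0 or 1. I want to pack these values into bits, so that 8 original bytes occupy one target byte, with original byte 0 going to bit 0, byte 1 to bit 1, and so on. So far I have the following in the kernel: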

const uint16_t tid = threadIdx.x;
__shared__ uint8_t packing[cBlockSize];

// ... Computation of the original bytes in packing[tid]
__syncthreads();

if ((tid & 4) == 0)
{
    packing[tid] |= packing[tid | 4] << 4;
}
if ((tid & 6) == 0)
{
    packing[tid] |= packing[tid | 2] << 2;
}
if ((tid & 7) == 0)
{
    pOutput[(tid + blockDim.x*blockIdx.x)>>3] = packing[tid] | (packing[tid | 1] << 1);
}

Is this correct and efficient?
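One way to check the correctness half of that question is to compare the kernel's output against a plain host-side reference. The following is only a minimal sketch of such a reference: the function name `packBitsReference` is hypothetical, and it assumes the input length is a multiple of 8, as the kernel does.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference (hypothetical helper, not part of the question):
// packs bytes that are each 0 or 1 into bits, with input byte i landing
// in bit (i % 8) of output byte (i / 8).
// Assumes input.size() is a multiple of 8.
std::vector<std::uint8_t> packBitsReference(const std::vector<std::uint8_t>& input)
{
    std::vector<std::uint8_t> output(input.size() / 8, 0);
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        // Keep only the low bit so a stray non-0/1 value cannot spill
        // into neighbouring bit positions.
        output[i / 8] |= static_cast<std::uint8_t>((input[i] & 1u) << (i % 8));
    }
    return output;
}
```

Copying `pOutput` back to the host and comparing it byte by byte with this result would flag any difference in bit order.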

c++

parallel-processing

cuda

bit-packing

2 Answers

Stack Overflow user

Accepted answer

Posted 2016-09-14 10:57:20

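The warp vote function __ballot() comes in very handy for this. Assuming that you can redefine pOutput to be of type uint32_t, and that your block size is a multiple of the warp size (32):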

unsigned int target = __ballot(packing[tid]);
if (tid % warpSize == 0) {
    pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = target;
}

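Strictly speaking, the if condition isn't even necessary, since all threads of the warp would write the same data to the same address. So a highly optimized version would just be: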

pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = __ballot(packing[tid]);
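To make the bit layout of the vote explicit, here is a small host-side model of what the ballot computes for one 32-lane warp: bit `lane` of the result is set exactly when that lane's predicate is non-zero. The name `ballotReference` and the `lanes` array are hypothetical stand-ins for illustration; this is a sketch for reasoning about and testing the layout, not device code. Note also that CUDA 9 and later deprecate `__ballot` in favour of `__ballot_sync`, which takes a lane mask as an additional first argument.

```cpp
#include <cstdint>

// Host-side model of a warp vote over one 32-lane warp.
// `lanes` is a hypothetical stand-in for the 32 values of packing[tid]
// held by the threads of a single warp.
std::uint32_t ballotReference(const std::uint8_t (&lanes)[32])
{
    std::uint32_t word = 0;
    for (unsigned lane = 0; lane < 32; ++lane)
    {
        // A lane contributes a 1 bit when its predicate is non-zero.
        if (lanes[lane] != 0)
        {
            word |= (std::uint32_t{1} << lane);
        }
    }
    return word;
}
```

With this layout, lane 0 ends up in the least significant bit, which matches the "byte 0 becomes bit 0" ordering asked for in the question, only with 32 inputs per output word instead of 8.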

Score 8

Stack Overflow user

Posted 2016-09-15 15:12:13

For two bits per thread, using uint2 *pOutput:

int lane = tid % warpSize;
uint2 target;
target.x = __ballot(__shfl(packing[tid], lane / 2)                & ((lane & 1) + 1));
target.y = __ballot(__shfl(packing[tid], lane / 2 + warpSize / 2) & ((lane & 1) + 1));
pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = target;

You will have to test whether this is still faster than your conventional solution.
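The intent of the shuffle-and-vote snippet can be modelled on the host as follows. This is a sketch under assumptions: each of the 32 lanes is taken to hold a value in 0..3, `PackedPair` is a hypothetical stand-in for CUDA's `uint2`, and `packTwoBitsReference` is an illustrative name. Each voting lane fetches the value of source lane `lane / 2` (plus 16 for the second word) and tests bit 0 or bit 1 of it, depending on whether the voting lane is even or odd.

```cpp
#include <cstdint>

// Hypothetical host-side stand-in for CUDA's uint2.
struct PackedPair
{
    std::uint32_t x; // bits taken from source lanes 0..15
    std::uint32_t y; // bits taken from source lanes 16..31
};

// Host-side model of two-bits-per-thread packing for one 32-lane warp.
// Output bit `bit` of each word takes bit (bit & 1) of the value held by
// source lane bit / 2 (for x) or bit / 2 + 16 (for y).
PackedPair packTwoBitsReference(const std::uint8_t (&lanes)[32])
{
    PackedPair result{0, 0};
    for (unsigned bit = 0; bit < 32; ++bit)
    {
        // Mask 1 selects bit 0 of the source value, mask 2 selects bit 1.
        const unsigned mask = (bit & 1u) + 1u;
        if (lanes[bit / 2] & mask)
        {
            result.x |= std::uint32_t{1} << bit;
        }
        if (lanes[bit / 2 + 16] & mask)
        {
            result.y |= std::uint32_t{1} << bit;
        }
    }
    return result;
}
```

Comparing a kernel's `uint2` output against this model on a few known inputs is a cheap way to confirm the bit order before measuring performance.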

Score 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/39488441
