文章/答案/技术大牛

发布

社区首页 >问答首页 >使用共享内存计算点间距离

问使用共享内存计算点间距离
EN

Stack Overflow用户

提问于 2019-07-28 18:23:09

回答 1查看 362关注 0票数 1

我试图计算所有点之间的距离(公制加权)。为了加快速度，我在gpu上并通过cuda和numba完成了这个操作，因为我认为它更易读和更容易使用。

我有两个一维点数组，想要计算同一数组中所有点之间的距离以及两个数组之间的所有点之间的距离。我编写了两个cuda内核，一个只使用全局内存，我已经验证了它使用cpu代码给出了正确的答案。就是这个。

@cuda.jit
def gpuSameSample(A,arrSum):
    tx = cuda.blockDim.x*cuda.blockIdx.x + cuda.threadIdx.x
    temp = A[tx]
    tempSum = 0.0
    for i in range(tx+1,A.size):
        distance = (temp - A[i])**2
        tempSum +=  math.exp(-distance/sigma**2)
    arrSum[tx] = tempSum

我现在正试图通过使用共享内存来进一步优化这一点。这就是我到目前为止所拥有的。

@cuda.jit
def gpuSharedSameSample(A,arrSum):
    #my block size is equal to 32                                                                                                                                                                           
    sA = cuda.shared.array(shape=(tpb),dtype=float32)
    bpg = cuda.gridDim.x
    tx = cuda.threadIdx.x + cuda.blockIdx.x *cuda.blockDim.x
    count = len(A)
    #loop through block by block                                                                                                                                                                            
    tempSum = 0.0
    #myPoint = A[tx]                                                                                                                                                                                        

    if(tx < count):
        myPoint = A[tx]
        for currentBlock in range(bpg):

    #load in a block to shared memory                                                                                                                                                                   
            copyIdx = (cuda.threadIdx.x + currentBlock*cuda.blockDim.x)
            if(copyIdx < count):
                sA[cuda.threadIdx.x] = A[copyIdx]
        #syncthreads to ensure copying finishes first                                                                                                                                                       
            cuda.syncthreads()


            if((tx < count)):
                for i in range(cuda.threadIdx.x,cuda.blockDim.x):
                    if(copyIdx != tx):
                        distance = (myPoint - sA[i])**2
                        tempSum += math.exp(-distance/sigma**2)

 #syncthreads here to avoid race conditions if a thread finishes earlier                                                                                                                             
            #arrSum[tx] += tempSum                                                                                                                                                                          
            cuda.syncthreads()
    arrSum[tx] += tempSum

我相信我对同步线程很小心，但是这个答案总是太大了(大约5%)。我猜一定有一些争用条件，但据我所知，每个线程都写入一个唯一的索引，并且tempSum变量对每个线程都是本地的，所以不应该有任何争用条件。我确信我的for循环条件是正确的。如有任何建议，将不胜感激。谢谢。

python

cuda

numba

回答 1

Stack Overflow用户

回答已采纳

发布于 2019-07-28 21:34:13

最好提供完整的代码。这应该是直截了当的，通过对您所展示的内容进行简单的添加--正如我在下面所做的那样。然而，即使有一组限制性的假设，两种实现也有差异。

我假定：

总体数据集大小是线程块大小的整数倍。
您启动的线程总数与数据集的大小完全相同。

我也不打算尝试和评论你的共享实现是否有意义，即应该比非共享实现表现得更好。这似乎不是你问题的症结所在，这就是为什么你得到了两个实现之间的数值差异。

主要问题是，在每种情况下，选择要计算成对的“距离”的元素的方法都不匹配。在非共享实现中，对于输入数据集中的每个元素i，您正在计算i与大于i的每个元素之间的距离之和。

for i in range(tx+1,A.size):
               ^^^^^^^^^^^

要和的项的选择与共享实现不匹配：

            for i in range(cuda.threadIdx.x,cuda.blockDim.x):
                if(copyIdx != tx):

这里有几个问题，但是很明显，对于复制的每个块，位于threadIdx.x位置的给定元素只有在块(数据)中的目标元素大于该索引时才会更新其和。这意味着，当您以块的方式遍历整个数据集时，您将跳过每个块中的元素。这不可能与非共享实现相匹配。如果这并不明显，只需为for循环的范围选择实际值。假设cuda.threadIdx.x为5，cuda.blockDim.x为32。然后，该特定元素将只计算整个数组中每个数据块中的项目6-31的和。

解决这个问题的方法是强迫共享实现与非共享的实现保持一致，因为它是如何选择元素来帮助运行和的。

此外，在非共享实现中，您只更新输出点一次，并且执行直接分配：

arrSum[tx] = tempSum

在共享实现中，您仍然只更新输出点一次，但是没有直接分配。我更改了这个以匹配非共享的：

arrSum[tx] += tempSum

下面是一个完整的代码，其中包含了这些问题：

$ cat t49.py
from numba import cuda
import numpy as np
import math
import time
from numba import float32

sigma = np.float32(1.0)
tpb = 32

@cuda.jit
def gpuSharedSameSample(A,arrSum):
    #my block size is equal to 32                                                                                                                               
    sA = cuda.shared.array(shape=(tpb),dtype=float32)
    bpg = cuda.gridDim.x
    tx = cuda.threadIdx.x + cuda.blockIdx.x *cuda.blockDim.x
    count = len(A)
    #loop through block by block                                                                                                                                
    tempSum = 0.0
    #myPoint = A[tx]                                                                                                                                            

    if(tx < count):
        myPoint = A[tx]
        for currentBlock in range(bpg):

    #load in a block to shared memory                                                                                                                           
            copyIdx = (cuda.threadIdx.x + currentBlock*cuda.blockDim.x)
            if(copyIdx < count): #this should always be true
                sA[cuda.threadIdx.x] = A[copyIdx]
        #syncthreads to ensure copying finishes first                                                                                                           
            cuda.syncthreads()


            if((tx < count)):    #this should always be true
                for i in range(cuda.blockDim.x):
                    if(copyIdx-cuda.threadIdx.x+i > tx):
                        distance = (myPoint - sA[i])**2
                        tempSum += math.exp(-distance/sigma**2)

 #syncthreads here to avoid race conditions if a thread finishes earlier                                                                                        
            #arrSum[tx] += tempSum                                                                                                                              
            cuda.syncthreads()
    arrSum[tx] = tempSum

@cuda.jit
def gpuSameSample(A,arrSum):
    tx = cuda.blockDim.x*cuda.blockIdx.x + cuda.threadIdx.x
    temp = A[tx]
    tempSum = 0.0
    for i in range(tx+1,A.size):
        distance = (temp - A[i])**2
        tempSum +=  math.exp(-distance/sigma**2)
    arrSum[tx] = tempSum

size = 128
threads_per_block = tpb
blocks = (size + (threads_per_block - 1)) // threads_per_block
my_in  = np.ones( size, dtype=np.float32)
my_out = np.zeros(size, dtype=np.float32)
gpuSameSample[blocks, threads_per_block](my_in, my_out)
print(my_out)
gpuSharedSameSample[blocks, threads_per_block](my_in, my_out)
print(my_out)
$ python t49.py
[ 127.  126.  125.  124.  123.  122.  121.  120.  119.  118.  117.  116.
  115.  114.  113.  112.  111.  110.  109.  108.  107.  106.  105.  104.
  103.  102.  101.  100.   99.   98.   97.   96.   95.   94.   93.   92.
   91.   90.   89.   88.   87.   86.   85.   84.   83.   82.   81.   80.
   79.   78.   77.   76.   75.   74.   73.   72.   71.   70.   69.   68.
   67.   66.   65.   64.   63.   62.   61.   60.   59.   58.   57.   56.
   55.   54.   53.   52.   51.   50.   49.   48.   47.   46.   45.   44.
   43.   42.   41.   40.   39.   38.   37.   36.   35.   34.   33.   32.
   31.   30.   29.   28.   27.   26.   25.   24.   23.   22.   21.   20.
   19.   18.   17.   16.   15.   14.   13.   12.   11.   10.    9.    8.
    7.    6.    5.    4.    3.    2.    1.    0.]
[ 127.  126.  125.  124.  123.  122.  121.  120.  119.  118.  117.  116.
  115.  114.  113.  112.  111.  110.  109.  108.  107.  106.  105.  104.
  103.  102.  101.  100.   99.   98.   97.   96.   95.   94.   93.   92.
   91.   90.   89.   88.   87.   86.   85.   84.   83.   82.   81.   80.
   79.   78.   77.   76.   75.   74.   73.   72.   71.   70.   69.   68.
   67.   66.   65.   64.   63.   62.   61.   60.   59.   58.   57.   56.
   55.   54.   53.   52.   51.   50.   49.   48.   47.   46.   45.   44.
   43.   42.   41.   40.   39.   38.   37.   36.   35.   34.   33.   32.
   31.   30.   29.   28.   27.   26.   25.   24.   23.   22.   21.   20.
   19.   18.   17.   16.   15.   14.   13.   12.   11.   10.    9.    8.
    7.    6.    5.    4.    3.    2.    1.    0.]
$

请注意，如果我的两个假设中的任何一个被违反，您的代码就会有其他问题。

在未来，我鼓励您提供一个简短、完整的代码，如我前面所示。对于这样一个问题，它不应该是很多额外的工作。如果没有其他原因(还有其他原因)，那么当您已经有了这段代码时，强迫其他人从头开始编写这段代码是很乏味的，以证明所提供的答案的敏感性。

票数 3

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/57243804

复制

相似问题

问使用共享内存计算点间距离
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问使用共享内存计算点间距离EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问使用共享内存计算点间距离
EN