Q: What is the most efficient way to apply a functor to a subset of a device array?

Stack Overflow user

Asked 2019-02-13 02:20:21

Answers 1 · Views 246 · Followers 0 · Votes 0

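I am rewriting a library that performs computations and other operations on data stored in a contiguous block of memory, so that it can run on GPUs using the CUDA framework. The data represents information that lives on a four-dimensional grid. The total size of the grid can range from a thousand to millions of grid points. Along each direction, the grid may have as few as 8 or as many as 100 points. My question is about the best way to implement operations on a subset of the grid. For example, suppose my grid is [0,nx)x[0,ny)x[0,nz)x[0,nq) and I want to implement a transform that multiplies by -1 all the points whose indices belong to [1,nx-1)x[1,ny-1)x[1,nz-1)x[0,nq-1).

Right now I do this with nested loops. Here is a skeleton of the code: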

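{
  int nx, ny, nz, nq;
  nx = 10, ny = 10, nz = 10, nq = 10;
  typedef thrust::device_vector<double> Array;
  Array A(nx*ny*nz*nq);
  thrust::fill(A.begin(), A.end(), (double) 1);

  // one thrust::transform call per (j, k, q) row of the sub-volume
  for (auto q = 1; q < nq-1; ++q) {
    for (auto k = 1; k < nz-1; ++k) {
      for (auto j = 1; j < ny-1; ++j) {
        int offset1 = 1 + j*nx + k*nx*ny + q*nx*ny*nz;  // first x to modify in this row
        int offset2 = offset1 + nx - 2;                 // one past the last x to modify
        thrust::transform(A.begin()+offset1,
                          A.begin()+offset2,
                          A.begin()+offset1,            // output iterator: negate in place
                          thrust::negate<double>());
      }
    }
  }
}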

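However, I wonder whether this is the most efficient approach, because it seems to me that in this case at most nx-2 threads can run simultaneously. So I was thinking that a better way might be to generate a sequence iterator (returning the linear position in the array), zip it to the array with a zip iterator, and define a functor that examines the second element of the tuple (the position value) and modifies the first element of the tuple if that value falls in the acceptable range. There may well be a better way to do this, though. I am new to CUDA and, worse yet, I really cut my teeth on Fortran, so I have a hard time thinking outside the for-loop box.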

c++

cuda

thrust

1 Answer

Stack Overflow user

Accepted answer

Posted 2019-02-13 04:36:22

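I don't know what the most efficient way is. I can suggest something that is more efficient than your skeleton code.

The suggestion in your text is headed in the right direction. Rather than using a set of nested for-loops, which may iterate a large number of times, we should aim to get all the work done in a single Thrust call. But that single Thrust call must still modify only the array values whose indices lie within the "cubical" volume to be operated on.

However, we do not want to use a method that tests each generated index against the valid index volume, as you suggest. That would require us to launch a grid as large as our array, even though we only want to modify part of it.

Instead, we launch an operation that is only large enough to cover the number of elements to be modified, and we create a functor that performs a linear index -> 4D index -> adjusted linear index conversion. That functor operates in a transform iterator, converting an ordinary linear sequence starting at 0, 1, 2, etc. into a sequence that starts in, and stays within, the volume to be modified. A permutation iterator is then used with this modified sequence to select the values of the array to modify.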

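Here is an example showing the difference in timing between your nested-loop method (1) and mine (2), for a 64x64x64x64 array and a modified volume of 62x62x62x62: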

$ cat t39.cu
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/fill.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <thrust/equal.h>
#include <cassert>
#include <iostream>

struct my_idx
{
  int nx, ny, nz, nq, lx, ly, lz, lq, dx, dy, dz, dq;
  my_idx(int _nx, int _ny, int _nz, int _nq, int _lx, int _ly, int _lz, int _lq, int _hx, int _hy, int _hz, int _hq) {
    nx = _nx;
    ny = _ny;
    nz = _nz;
    nq = _nq;
    lx = _lx;
    ly = _ly;
    lz = _lz;
    lq = _lq;
    dx = _hx - lx;
    dy = _hy - ly;
    dz = _hz - lz;
    dq = _hq - lq;
    // could do a lot of assert checking here
  }

  __host__ __device__
  int operator()(int idx){
    int rx = idx / dx;
    int ix = idx - (rx * dx);
    int ry = rx / dy;
    int iy = rx - (ry * dy);
    int rz = ry / dz;
    int iz = ry - (rz * dz);
    int rq = rz / dq;
    int iq = rz - (rq * dq);
    return (((iq+lq)*nz+iz+lz)*ny+iy+ly)*nx+ix+lx;
  }
};

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}


int main()
{
  int nx,ny,nz,nq,lx,ly,lz,lq,hx,hy,hz,hq;
  nx=64,ny=64,nz=64,nq=64;
  lx=1,ly=1,lz=1,lq=1;
  hx=nx-1,hy=ny-1,hz=nz-1,hq=nq-1;
  thrust::device_vector<double> A(nx*ny*nz*nq);
  thrust::device_vector<double> B(nx*ny*nz*nq);
  thrust::fill(A.begin(), A.end(), (double) 1);
  thrust::fill(B.begin(), B.end(), (double) 1);
  // method 1
  unsigned long long m1_time = dtime_usec(0);
  for (auto q=lq; q<hq; ++q){
    for (auto k=lz; k<hz; ++k){
      for (auto j=ly; j<hy; ++j){
        int offset1=lx+j*nx+k*nx*ny+q*nx*ny*nz;
        int offset2=offset1+(hx-lx);
        thrust::transform(A.begin()+offset1,
                  A.begin()+offset2, A.begin()+offset1,
                  thrust::negate<double>());
      }
    }
  }
  cudaDeviceSynchronize();
  m1_time = dtime_usec(m1_time);

  // method 2
  unsigned long long m2_time = dtime_usec(0);
  auto p = thrust::make_permutation_iterator(B.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), my_idx(nx, ny, nz, nq, lx, ly, lz, lq, hx, hy, hz, hq)));
  thrust::transform(p, p+(hx-lx)*(hy-ly)*(hz-lz)*(hq-lq), p, thrust::negate<double>());
  cudaDeviceSynchronize();
  m2_time = dtime_usec(m2_time);
  if (thrust::equal(A.begin(), A.end(), B.begin()))
    std::cout << "method 1 time: " << m1_time/(float)USECPSEC << "s method 2 time: " << m2_time/(float)USECPSEC << "s" << std::endl;
  else
    std::cout << "mismatch error" << std::endl;
}
$ nvcc -std=c++11 t39.cu -o t39
$ ./t39
method 1 time: 1.6005s method 2 time: 0.013182s
$

Votes: 2

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/54661513