
Passing a constant integer in a CUDA kernel

Stack Overflow user
Asked on 2017-06-28 22:19:11
Answers: 1 · Views: 1.5K · Following: 0 · Votes: 0

There is a problem with the code below. In the global kernel loop_d, the integer value of M is 84. When I try to create a shared array, temp, and use M as the size of the array, I get the following error:

error: expression must have a constant value

I don't know why this happens. I know that if I declare M as a global variable it works, but the problem is that I get the value of M by calling the function d_two from another Fortran program, so I don't see how to get around that. I also know that if I replace temp[M] with temp[84] my program runs perfectly, but that isn't very practical, since different problems may have different values of M. Thanks for your help!

The program:

// Parallelized 2D Three-Point Gaussian Quadrature Numerical Integration Method
// The following program is part of two linked programs, Integral_2D_Cuda.f. 
// This is a CUDA kernel that could be called in the Integral_2D_Cuda.f Fortran code to compute
// the integral of a given 2D-function
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda.h>
#include <cuda_runtime.h>
// The following is a definition for the atomicAddd function that is called in the loop_d kernel
// This is needed because the "regular" atomicAdd function only works for floats and integers
__device__ double atomicAddd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
            __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
// GPU kernel that computes the function of interest. This is good for a two dimensional problem.
__global__ void loop_d(double *a_sx, double *b_swx, double *c_sy, double *d_swy, double *e_ans0, int N, int M)
{
    // Declaring a shared array that threads of the same block have access to
    __shared__ double temp[M];
    int idxX = blockIdx.x * blockDim.x + threadIdx.x; // Thread indices responsible for the swx and sx arrays
    int idxY = threadIdx.y;     // Thread indices responsible for the swy and sy arrays
    // Computing the multiplication of elements
    if (idxX < N && idxY < M)
    {
        temp[idxY] = a_sx[idxX] * b_swx[idxX] * c_sy[idxY] * d_swy[idxY];
    }
    // Synchronizing all threads before summing the multiplied elements in the temp array
    __syncthreads();
    // Allowing the 0th thread of y to do the summation of the multiplied elements in the temp array of one block 
    if (0 == idxY)
    {
        double sum = 0.00;
        for(int k = 0; k < M; k++)
        {
            sum = sum + temp[k];
        }
        // Adding the result of this instance of calculation to the final answer, ans0
        atomicAddd(e_ans0, sum);
    }
}
extern "C" void d_two_(double *sx, double *swx, int *nptx, double *sy, double *swy, int *npty, double *ans0)
{
    // Assigning GPU pointers
    double *sx_d, *swx_d;
    int N = *nptx;
    double *sy_d, *swy_d;
    int M = *npty;
    double *ans0_d;
    dim3 threadsPerBlock(1,M); // Creating a two-dimensional block with 1 thread in the x dimension and M threads in the y dimension
    dim3 numBlocks(N); // Specifying the number of blocks to use, each of dimension 1xM
    // Allocating GPU Memory
    cudaMalloc( (void **)&sx_d, sizeof(double) * N);
    cudaMalloc( (void **)&swx_d, sizeof(double) * N);
    cudaMalloc( (void **)&sy_d, sizeof(double) * M);
    cudaMalloc( (void **)&swy_d, sizeof(double) * M);
    cudaMalloc( (void **)&ans0_d, sizeof(double) );
    // Copying information from CPU to GPU
    cudaMemcpy( sx_d, sx, sizeof(double) * N, cudaMemcpyHostToDevice );
    cudaMemcpy( swx_d, swx, sizeof(double) * N, cudaMemcpyHostToDevice );
    cudaMemcpy( sy_d, sy, sizeof(double) * M, cudaMemcpyHostToDevice );
    cudaMemcpy( swy_d, swy, sizeof(double) * M, cudaMemcpyHostToDevice );
    cudaMemcpy( ans0_d, ans0, sizeof(double), cudaMemcpyHostToDevice );
    // Calling the function on the GPU
    loop_d<<< numBlocks, threadsPerBlock >>>(sx_d, swx_d, sy_d, swy_d, ans0_d, N, M);
    // Copying from GPU to CPU
    cudaMemcpy( ans0, ans0_d, sizeof(double), cudaMemcpyDeviceToHost );
    // freeing GPU memory
    cudaFree(sx_d);
    cudaFree(swx_d);
    cudaFree(sy_d);
    cudaFree(swy_d);
    cudaFree(ans0_d);
    return;
}

1 Answer

Stack Overflow user

Answer accepted

Posted on 2017-06-28 22:26:05

The compiler needs M to be a compile-time constant. At compile time, it cannot determine what M will actually be (it has no way of knowing that only 84 will eventually be passed).

When you want to size shared memory with a value you only know at runtime, use dynamic shared memory.

See this example on the site, or "Using Shared Memory in CUDA" on the Parallel Forall blog.
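Applied to the code in the question, a minimal sketch of the dynamic shared memory pattern looks like the following (only the changed lines are shown; the rest of loop_d and d_two_ stays as in the original, and the sizing assumes one double per y-thread, i.e. M elements):

```cuda
// In loop_d: declare the shared array unsized with "extern".
// Its size is supplied at launch time, not at compile time.
__global__ void loop_d(double *a_sx, double *b_swx, double *c_sy, double *d_swy,
                       double *e_ans0, int N, int M)
{
    extern __shared__ double temp[];  // sized by the launch configuration below
    // ... body unchanged ...
}

// In d_two_: pass the shared-memory size in bytes as the third
// launch-configuration argument between the angle brackets.
loop_d<<< numBlocks, threadsPerBlock, M * sizeof(double) >>>(sx_d, swx_d, sy_d,
                                                             swy_d, ans0_d, N, M);
```

With this change M no longer needs to be a compile-time constant: each block gets M * sizeof(double) bytes of shared memory allocated when the kernel is launched.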

Votes: 1
Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/44813437
