
OpenCL: Can one kernel call another kernel?

Stack Overflow user

Asked on 2014-09-18 10:46:33

2 answers · 881 views · 0 followers · Score 4

Hi,

I am trying to run an available convolution code in OpenCL.

I have a heterogeneous system with -

1) CPU

2) GPU

Please find below the code base running on my system:

convolution.cl

// TODO: Add OpenCL kernel code here.
__kernel 
void convolve(
    const __global uint * const input,
    __constant uint     * const mask,
    __global uint       * const output,
    const int                   inputWidth,
    const int                   maskWidth){

        const int x = get_global_id(0);
        const int y = get_global_id(1);

        uint sum = 0;

        for (int r = 0; r < maskWidth; r++)
        {
            const int idxIntmp = (y + r) * inputWidth + x;
            for (int c = 0; c < maskWidth; c++)
            {
                sum += mask[(r * maskWidth) + c] * input[idxIntmp + c];
            }
        }

        output[y * get_global_size(0) + x] = sum;
}

And convolution.cpp -

// Convolution - applying a 3×3 mask to an 8×8 input signal, producing a 6×6 output signal
    #include <CL/cl.h>
    #include <vector>
    #include <iostream>
    #include <time.h>

    #include <fstream>
    #include <sstream>
    #include <string>

using namespace std;

// Constants
const unsigned int inputSignalWidth = 8;
const unsigned int inputSignalHeight = 8;

cl_uint inputSignal[inputSignalWidth][inputSignalHeight] =
{
    {3, 1, 1, 4, 8, 2, 1, 3},
    {4, 2, 1, 1, 2, 1, 2, 3},
    {4, 4, 4, 4, 3, 2, 2, 2},
    {9, 8, 3, 8, 9, 0, 0, 0},
    {9, 3, 3, 9, 0, 0, 0, 0},
    {0, 9, 0, 8, 0, 0, 0, 0},
    {3, 0, 8, 8, 9, 4, 4, 4},
    {5, 9, 8, 1, 8, 1, 1, 1}
};

const unsigned int outputSignalWidth = 6;
const unsigned int outputSignalHeight = 6;

cl_uint outputSignal[outputSignalWidth][outputSignalHeight];

const unsigned int maskWidth = 3;
const unsigned int maskHeight = 3;

cl_uint mask[maskWidth][maskHeight] =
{
    {1, 1, 1}, 
    {1, 0, 1}, 
    {1, 1, 1},
};

inline void checkErr(cl_int err, const char * name)
{
    if (err != CL_SUCCESS)
    {
        std::cerr << "ERROR: " << name
            << " (" << err << ")" << std::endl;
        exit(EXIT_FAILURE);
    }
}

void CL_CALLBACK contextCallback(
    const char * errInfo,
    const void * private_info,
    size_t cb,
    void * user_data)
{
    std::cout << "Error occurred during context use: "<< errInfo << std::endl;
    exit(EXIT_FAILURE);
}

int main(int argc, char *argv[]){
    cl_int errNum;

    cl_uint numPlatforms;
    cl_uint numDevices;

    cl_platform_id * platformIDs;
    cl_device_id * deviceIDs;

    cl_context context = NULL;
    cl_command_queue queue;
    cl_program program;
    cl_kernel kernel;

    cl_mem inputSignalBuffer;
    cl_mem outputSignalBuffer;
    cl_mem maskBuffer;

    double start,end,Totaltime;//Timer variables

    errNum = clGetPlatformIDs(0, NULL, &numPlatforms);

    checkErr(
        (errNum != CL_SUCCESS) ? errNum :
        (numPlatforms <= 0 ? -1 : CL_SUCCESS),
        "clGetPlatformIDs");

    platformIDs = (cl_platform_id *)malloc(sizeof(cl_platform_id) * numPlatforms);

    errNum = clGetPlatformIDs(numPlatforms, platformIDs, NULL);

    checkErr(
        (errNum != CL_SUCCESS) ? errNum :
        (numPlatforms <= 0 ? -1 : CL_SUCCESS), "clGetPlatformIDs");

    deviceIDs = NULL;

    cl_uint i;

    for (i = 0; i < numPlatforms; i++)
    {
        errNum = clGetDeviceIDs(
            platformIDs[i],
            CL_DEVICE_TYPE_GPU,
            0,
            NULL,
            &numDevices);
        if (errNum != CL_SUCCESS && errNum != CL_DEVICE_NOT_FOUND)
        {
            checkErr(errNum, "clGetDeviceIDs");
        }
        else if (numDevices > 0)
        {
            deviceIDs = (cl_device_id *)malloc(
                sizeof(cl_device_id) * numDevices);

            errNum = clGetDeviceIDs(
                platformIDs[i], 
                CL_DEVICE_TYPE_GPU, 
                numDevices,
                &deviceIDs[0], 
                NULL);

            checkErr(errNum, "clGetDeviceIDs");

            break;
        }
    }
    if (deviceIDs == NULL) {
        std::cout << "No GPU device found" << std::endl;
        exit(-1);
    }
    cl_context_properties contextProperties[] =
    {
        CL_CONTEXT_PLATFORM,(cl_context_properties)platformIDs[i], 0
    };

    context = clCreateContext(
        contextProperties, numDevices, deviceIDs,
        &contextCallback, NULL, &errNum);

    checkErr(errNum, "clCreateContext");

    std::ifstream srcFile("convolution.cl");

    checkErr(srcFile.is_open() ? CL_SUCCESS : -1,
        "reading convolution.cl");

    std::string srcProg(
        std::istreambuf_iterator<char>(srcFile),
        (std::istreambuf_iterator<char>()));

    const char * src = srcProg.c_str();
    size_t length = srcProg.length();

    program = clCreateProgramWithSource(context, 1, &src, &length, &errNum);

    checkErr(errNum, "clCreateProgramWithSource");

    errNum = clBuildProgram(program, numDevices, deviceIDs, NULL, NULL, NULL);

    checkErr(errNum, "clBuildProgram");

    kernel = clCreateKernel(program, "convolve", &errNum);

    checkErr(errNum, "clCreateKernel");

    inputSignalBuffer = clCreateBuffer(
        context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
        sizeof(cl_uint) * inputSignalHeight * inputSignalWidth,
        static_cast<void *>(inputSignal), &errNum);

    checkErr(errNum, "clCreateBuffer(inputSignal)");    

    maskBuffer = clCreateBuffer(
        context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
        sizeof(cl_uint) * maskHeight * maskWidth,
        static_cast<void *>(mask), &errNum);

    checkErr(errNum, "clCreateBuffer(mask)");

    outputSignalBuffer = clCreateBuffer(
        context, CL_MEM_WRITE_ONLY,
        sizeof(cl_uint) * outputSignalHeight * outputSignalWidth,
        NULL, &errNum);

    checkErr(errNum, "clCreateBuffer(outputSignal)");

    queue = clCreateCommandQueue(
        context, deviceIDs[0], 0, &errNum);
    checkErr(errNum, "clCreateCommandQueue");

    errNum = clSetKernelArg(
        kernel, 0, sizeof(cl_mem), &inputSignalBuffer);
    errNum |= clSetKernelArg(
        kernel, 1, sizeof(cl_mem), &maskBuffer);
    errNum |= clSetKernelArg(
        kernel, 2, sizeof(cl_mem), &outputSignalBuffer);
    errNum |= clSetKernelArg(
        kernel, 3, sizeof(cl_uint), &inputSignalWidth);
    errNum |= clSetKernelArg(
        kernel, 4, sizeof(cl_uint), &maskWidth);

    checkErr(errNum, "clSetKernelArg");

    const size_t globalWorkSize[2] = { outputSignalWidth, outputSignalHeight };
    const size_t localWorkSize[2] = { 1, 1 };

    start = clock();

    // The kernel indexes with get_global_id(0)/get_global_id(1), so it must be
    // enqueued as a 2D NDRange (a 1D launch would leave get_global_id(1) == 0).
    errNum = clEnqueueNDRangeKernel(
                                    queue,
                                    kernel,
                                    2,
                                    NULL,
                                    globalWorkSize,
                                    localWorkSize,
                                    0,
                                    NULL,
                                    NULL
                                    );

    checkErr(errNum, "clEnqueueNDRangeKernel");

    errNum = clEnqueueReadBuffer(
        queue, outputSignalBuffer, CL_TRUE, 0,
        sizeof(cl_uint) * outputSignalHeight * outputSignalWidth,
        outputSignal, 0, NULL, NULL);

    checkErr(errNum, "clEnqueueReadBuffer");

    end = clock();
    Totaltime = end - start;
    cout << "Time in ms = " << ((Totaltime / CLOCKS_PER_SEC) * 1000) << endl;

    for (int y = 0; y < outputSignalHeight; y++)
    {
        for (int x = 0; x < outputSignalWidth; x++)
        {
            std::cout << outputSignal[y][x] << " ";
        }
        std::cout << std::endl;
    }

    return 0;
}

Question: I have the following doubts -

1) When I am using device type CL_DEVICE_TYPE_GPU, I get 267 ms. When I use CL_DEVICE_TYPE_CPU, the execution time changes to 467 ms. I want to know the difference between running the convolution code on a CPU without a GPU and on a CPU alongside a GPU (by selecting device type CL_DEVICE_TYPE_CPU).

2) As seen in the convolution.cl file, there is a for loop that executes 3 times. Can I call another kernel to do this operation from the available kernel file?

I am asking because I am new to OpenCL programming and would like to understand this.

2 Answers

Stack Overflow user

Accepted answer

Posted on 2014-09-18 13:04:13

  1. Both the CPU and the GPU are OpenCL devices. By selecting CL_DEVICE_TYPE_CPU you tell the OpenCL runtime to compile the kernel code to CPU assembly and run it on the CPU. When you select CL_DEVICE_TYPE_GPU, the kernel code is compiled to GPU assembly and executed on your video card. The ability to change the device type without rewriting the source code is one of the main features of OpenCL. It doesn't matter whether your CPU has an integrated GPU and/or a discrete GPU installed: you simply pick an available device and run the kernel on it.
  2. You cannot call a kernel from a kernel in OpenCL 1.2 and earlier. Dynamic parallelism was introduced in OpenCL 2.0.
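The OpenCL 2.0 mechanism mentioned in point 2 is device-side enqueue: a kernel submits further work itself. A minimal, illustrative sketch (hypothetical kernel name; the program must be built with -cl-std=CL2.0, and the host must create a default on-device queue with CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT) might look like:

```c
// OpenCL 2.0 device-side enqueue ("dynamic parallelism") - illustrative only.
kernel void parent(global uint *data, int n)
{
    if (get_global_id(0) == 0) {
        // The child work is wrapped in a block; the runtime launches it as a
        // new NDRange of n work items once the flag's condition is satisfied.
        enqueue_kernel(get_default_queue(),
                       CLK_ENQUEUE_FLAGS_WAIT_KERNEL,  // run after parent finishes
                       ndrange_1D(n),
                       ^{ data[get_global_id(0)] += 1u; });
    }
}
```

Note this requires a driver and device that actually support OpenCL 2.0; on a 1.2 stack the only options are multiple host-side enqueues or merging the work into one kernel.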
Score 5

Stack Overflow user

Posted on 2014-09-18 18:04:16

For question 1: you should vectorize the kernel so OpenCL can easily use your CPU's SIMD units, unlocking 4x (or 8x) more compute per core.

__kernel 
void convolve(
    const __global uint8 * const input, // uint8 fits AVX(AVX2?) and uint4 fits SSE(SSE3?)
    __constant uint8     * const mask,
    __global uint8       * const output,
    const int                   inputWidth,
    const int                   maskWidth){

        const int x = get_global_id(0);  // this is 1/8 size now
        const int y = get_global_id(1);  // this is 1/8 size now

        uint8 sum = 0;                      // a vector of 8 unsigneds

        for (int r = 0; r < maskWidth; r++)
        {
            const int idxIntmp = (y + r) * inputWidth + x; 
            for (int c = 0; c < maskWidth; c++)
            {
                sum += mask[(r * maskWidth) + c] * input[idxIntmp + c]; //8 issued per clock
                 // scalars get promoted when used in direct multiplication or addition.
            }
        }

        output[y * get_global_size(0) + x] = sum;
}

Don't forget to reduce the total number of work items by 7/8 (e.g. from 8k down to 1k). Also increase the work done per thread, e.g. 50 convolutions per thread, to improve occupancy, and then add some local-memory optimizations (for the GPU) for even better results, such as 5 ms per kernel.

On my AVX-capable CPU, a simple matrix multiplication got a 2.4x speedup from 8-element vectorization like this.

Running the kernel 3 times is not a problem if you offload enough work. If not, you should concatenate the multiple kernels into a single one using some clever algorithm.

If a profiler is not available right now, you can check GPU/CPU temperatures to get an idea of how close you are to the hardware limits.

Play with the number of local threads per work group. This can change performance, since it lets each thread use more or fewer registers.

Score 2
Page content provided by Stack Overflow. Original link:
https://stackoverflow.com/questions/25910562
