
Halide GPU schedule slower than CPU

Stack Overflow user
Asked 2019-11-21 01:04:35
1 answer · 507 views · 0 followers · 1 vote

I have written a simple piece of Halide code that computes the squares of the numbers from 0 to n, yet it takes 22x longer on the GPU than on the CPU.

#include "stdafx.h"
#include "Halide.h"
#include "HalideRuntimeOpenCL.h"
#include <stdio.h>
#include <time.h>  // clock(), CLOCKS_PER_SEC
using namespace Halide;

#define GPU_TILE 16
#define COMPUTE_SIZE 1024

Target find_gpu_target();

// Define some Vars to use.
Halide::Var x, y, xo, yo, xi, yi;


// We're going to want to schedule a pipeline in several ways, so we
// define the pipeline in a class so that we can recreate it several
// times with different schedules.
class MyPipeline {
public:
    Halide::Func f;

    MyPipeline() {
        f(x) = x * x;
    }

    // Now we define methods that give our pipeline several different
    // schedules.
    void schedule_for_cpu() {

        // JIT-compile the pipeline for the CPU.
        Target target = get_host_target();
        f.compile_jit(target);

    }

    // Now a schedule that uses CUDA or OpenCL.
    bool schedule_for_gpu() {
        Target target = find_gpu_target();
        if (!target.has_gpu_feature()) {
            return false;
        }

        // Schedule f on the GPU in 1D tiles of size 16.
        f.gpu_tile(x, xo, xi, GPU_TILE);
        f.compile_jit(target);

        return true;
    }

    void test_performance() {
        // Test the performance of the scheduled MyPipeline.


        // Run the filter once to initialize any GPU runtime state.
        // Run it.
        Halide::Buffer<int> result = f.realize(COMPUTE_SIZE);

        // Now take the best of 3 runs for timing.
        double best_time = 0.0;
        for (int i = 0; i < 3; i++) {

            clock_t t1 = clock();

            // Run the filter 100 times.
            for (int j = 0; j < 100; j++) {
                // Run it.
                Halide::Buffer<int> result = f.realize(COMPUTE_SIZE);
                // Force any GPU code to finish by copying the buffer back to the CPU.
                result.copy_to_host();
            }

            clock_t t2 = clock();

            // Average time per iteration, converted to milliseconds.
            double elapsed = (t2 - t1) * 1000.0 / CLOCKS_PER_SEC / 100;
            if (i == 0 || elapsed < best_time) {
                best_time = elapsed;
            }
        }

        printf("%1.4f milliseconds\n", best_time);  
    }
    bool test_correctness() {
        Halide::Buffer<int> result = f.realize(COMPUTE_SIZE);
        for (int i = 0; i < COMPUTE_SIZE; i++)
        {
            if (result(i) != i * i)
                return false;
        }
        return true;
    }
};

int main(int argc, char **argv) {

    MyPipeline p1;
    p1.schedule_for_cpu();
    printf("Running pipeline on CPU:\n");
    printf("Test Correctness of cpu scheduler: %d\n",p1.test_correctness());

    MyPipeline p2;
    bool has_gpu_target = p2.schedule_for_gpu();
    printf("Running pipeline on GPU:\n");
    printf("Test Correctness of gpu scheduler: %d\n", p2.test_correctness());


    printf("Testing performance on CPU:\n");
    p1.test_performance();

    if (has_gpu_target) {
        printf("Testing performance on GPU:\n");
        p2.test_performance();
    }

    return 0;
}


Target find_gpu_target() {
    // Start with a target suitable for the machine you're running this on.
    Target target = get_host_target();

    // Request CUDA; change this to Target::OpenCL to try OpenCL instead.
    target.set_feature(Target::CUDA);
    // Enable debugging so that you can see what GPU API calls are made.
    //target.set_feature(Halide::Target::Debug);
    return target;
}

Output

Running pipeline on CPU:
Test Correctness of cpu scheduler: 1
Running pipeline on GPU:
Test Correctness of gpu scheduler: 1
Testing performance on CPU:
1.0000 milliseconds
Testing performance on GPU:
22.0000 milliseconds   

I have tried running the GPU schedule with the debug flag; the recorded timings are as follows:

1.303033e+00 : CUDA: CUDA ms
1.070443e+00 : CUDA: CUDA ms
5.184570e+00 : CUDA: CUDA ms
CUDA: halide_cuda_buffer_copy : 7.340180e-01 ms
1.317381e+00 : CUDA: CUDA ms

Edit 1: Is it possible for Halide to initialize the GPU kernel and do the malloc/free only once, and reuse the kernel for different inputs?


1 Answer

Stack Overflow user

Answered 2019-11-21 08:45:08

This is probably bottlenecked on API overhead on the GPU. Each iteration runs only 1k points, which is nowhere near enough work to fill most GPUs, performs just a single multiply and store per point, and then serializes on kernel launch → copy to host. If you did the same thing in raw CUDA or OpenCL, it would still be well below peak performance.

To measure less API overhead and more raw compute, try running a more substantial kernel for a longer stretch, and perhaps calling the kernel several times before the copy back to the host.
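A sketch of those suggestions applied to the question's pipeline (the larger size, tile width, and the extra per-point arithmetic are illustrative choices, not values from the answer; this needs a Halide install and a CUDA-capable GPU to run). Realizing into a preallocated Buffer should also let the device allocation be reused across calls, which speaks to the question in Edit 1; JIT compilation itself already happens only once, at compile_jit().

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Var x, xo, xi;
    Func f;

    // More arithmetic per point than a single multiply (illustrative).
    Expr v = cast<float>(x);
    for (int i = 0; i < 32; i++) v = sqrt(v * v + 1.0f);
    f(x) = v;
    f.gpu_tile(x, xo, xi, 64);

    Target target = get_host_target();
    target.set_feature(Target::CUDA);
    f.compile_jit(target);

    // Preallocate the output once; realizing into the same Buffer reuses
    // its device allocation instead of allocating/freeing every call.
    Buffer<float> out(1 << 20);  // 1M points instead of 1k (illustrative)

    // Queue up many kernel invocations, then synchronize with one
    // device-to-host copy at the end rather than one per iteration.
    for (int j = 0; j < 100; j++) f.realize(out);
    out.copy_to_host();

    return 0;
}
```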

Votes: 2
Original page content provided by Stack Overflow (translation supported by Tencent Cloud).
Original link: https://stackoverflow.com/questions/58959744