
Halide GPU schedule slower than CPU

Stack Overflow user
Asked 2019-11-21 01:04:35
1 answer · 507 views · 0 followers · 1 vote

I have written a simple piece of Halide code that computes the squares of the numbers from 0 to n, yet it takes 22x longer on the GPU than on the CPU.

#include "stdafx.h"
#include "Halide.h"
#include "HalideRuntimeOpenCL.h"
#include <stdio.h>
#include <time.h>  // clock(), CLOCKS_PER_SEC
using namespace Halide;

#define GPU_TILE 16
#define COMPUTE_SIZE 1024

Target find_gpu_target();

// Define some Vars to use.
Halide::Var x, y, xo, yo, xi, yi;


// We're going to want to schedule a pipeline in several ways, so we
// define the pipeline in a class so that we can recreate it several
// times with different schedules.
class MyPipeline {
public:
    Halide::Func f;

    MyPipeline() {
        f(x) = x * x;
    }

    // Now we define methods that give our pipeline several different
    // schedules.
    void schedule_for_cpu() {

        // JIT-compile the pipeline for the CPU.
        Target target = get_host_target();
        f.compile_jit(target);

    }

    // Now a schedule that uses CUDA or OpenCL.
    bool schedule_for_gpu() {
        Target target = find_gpu_target();
        if (!target.has_gpu_feature()) {
            return false;
        }

        // Schedule f on the GPU in 1D tiles of size 16.
        f.gpu_tile(x, xo, xi, GPU_TILE);
        f.compile_jit(target);

        return true;
    }

    void test_performance() {
        // Test the performance of the scheduled MyPipeline.


        // Run the filter once to initialize any GPU runtime state.
        // Run it.
        Halide::Buffer<int> result = f.realize(COMPUTE_SIZE);

        // Now take the best of 3 runs for timing.
        double best_time = 0.0;
        for (int i = 0; i < 3; i++) {

            clock_t t1 = clock();

            // Run the filter 100 times.
            for (int j = 0; j < 100; j++) {
                // Run it.
                Halide::Buffer<int> result = f.realize(COMPUTE_SIZE);
                // Force any GPU code to finish by copying the buffer back to the CPU.
                result.copy_to_host();
            }

            clock_t t2 = clock();

            // Average time per iteration, converted to milliseconds.
            double elapsed = (t2 - t1) * 1000.0 / CLOCKS_PER_SEC / 100;
            if (i == 0 || elapsed < best_time) {
                best_time = elapsed;
            }
        }

        printf("%1.4f milliseconds\n", best_time);  
    }
    bool test_correctness() {
        Halide::Buffer<int> result = f.realize(COMPUTE_SIZE);
        for (int i = 0; i < COMPUTE_SIZE; i++)
        {
            if (result(i) != i * i)
                return false;
        }
        return true;
    }
};

int main(int argc, char **argv) {

    MyPipeline p1;
    p1.schedule_for_cpu();
    printf("Running pipeline on CPU:\n");
    printf("Test Correctness of cpu scheduler: %d\n",p1.test_correctness());

    MyPipeline p2;
    bool has_gpu_target = p2.schedule_for_gpu();
    printf("Running pipeline on GPU:\n");
    printf("Test Correctness of gpu scheduler: %d\n", p2.test_correctness());


    printf("Testing performance on CPU:\n");
    p1.test_performance();

    if (has_gpu_target) {
        printf("Testing performance on GPU:\n");
        p2.test_performance();
    }

    return 0;
}


Target find_gpu_target() {
    // Start with a target suitable for the machine you're running this on.
    Target target = get_host_target();

    // Request CUDA; change this to Target::OpenCL to try OpenCL instead.
    target.set_feature(Target::CUDA);
    // Enable debugging so that you can see what GPU API calls are made.
    //target.set_feature(Halide::Target::Debug);
    return target;
}

Output

Running pipeline on CPU:
Test Correctness of cpu scheduler: 1
Running pipeline on GPU:
Test Correctness of gpu scheduler: 1
Testing performance on CPU:
1.0000 milliseconds
Testing performance on GPU:
22.0000 milliseconds   

I have tried running the GPU schedule with the debug flag; the recorded timings are as follows:

1.303033e+00 : CUDA: CUDA ms
1.070443e+00 : CUDA: CUDA ms
5.184570e+00 : CUDA: CUDA ms
CUDA: halide_cuda_buffer_copy : 7.340180e-01 ms
1.317381e+00 : CUDA: CUDA ms

Edit 1: Is it possible for Halide to initialize the GPU kernel and do the malloc/free only once, and reuse the kernel for different inputs?


1 Answer

Stack Overflow user

Answered 2019-11-21 08:45:08

This is probably bottlenecked on API overhead on the GPU. Each iteration runs only 1k points, which is nowhere near enough work to fill most GPUs, performs just a single multiply and store per point, and then serializes on kernel launch → copy to host. If you did the same thing in raw CUDA or OpenCL, it would still be well below peak performance.

To measure less API overhead and more raw compute, try running a more substantial kernel for a longer stretch, and perhaps calling the kernel several times before the copy back to the host.
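A sketch of those suggestions applied to the question's pipeline (the larger size, tile width, and the extra per-point arithmetic are illustrative choices, not values from the answer; this needs a Halide install and a CUDA-capable GPU to run). Realizing into a preallocated Buffer should also let the device allocation be reused across calls, which speaks to the question in Edit 1; JIT compilation itself already happens only once, at compile_jit().

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Var x, xo, xi;
    Func f;

    // More arithmetic per point than a single multiply (illustrative).
    Expr v = cast<float>(x);
    for (int i = 0; i < 32; i++) v = sqrt(v * v + 1.0f);
    f(x) = v;
    f.gpu_tile(x, xo, xi, 64);

    Target target = get_host_target();
    target.set_feature(Target::CUDA);
    f.compile_jit(target);

    // Preallocate the output once; realizing into the same Buffer reuses
    // its device allocation instead of allocating/freeing every call.
    Buffer<float> out(1 << 20);  // 1M points instead of 1k (illustrative)

    // Queue up many kernel invocations, then synchronize with one
    // device-to-host copy at the end rather than one per iteration.
    for (int j = 0; j < 100; j++) f.realize(out);
    out.copy_to_host();

    return 0;
}
```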

Votes: 2
Original page content provided by Stack Overflow (translation supported by Tencent Cloud).
Original link: https://stackoverflow.com/questions/58959744