Q: How can I reduce buffer creation overhead in OpenCL/Cloo (C#)?

Stack Overflow user

Asked 2017-02-23 23:13:00

Answers: 1 | Views: 813 | Followers: 0 | Votes: 1

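I am using OpenCL from C# through the Cloo interface, and I am running into some very frustrating problems while trying to get it working in our product.

Our product is a computer vision product that receives a 512x424 grid of pixel values from our camera 30 times per second. We want to run computations on those pixels to generate point clouds relative to certain objects in the scene.

To compute these pixels, I do the following whenever we get a new frame (that is, every frame):

1) Create a CommandQueue.
2) Create a read-only buffer for the input pixel values.
3) Create a write-only zero-copy buffer for the output point values.
4) Pass in the matrices used for the computation on the GPU.
5) Execute the kernel and wait for the response.

An example of the per-frame work: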

        // the command queue is the, well, queue of commands sent to the "device" (GPU)
        ComputeCommandQueue commandQueue = new ComputeCommandQueue(
            _context, // the compute context
            _context.Devices[0], // first device matching the context specifications
            ComputeCommandQueueFlags.None); // no special flags

        Point3D[] realWorldPoints = points.Get(Perspective.RealWorld).Points;
        ComputeBuffer<Point3D> realPointsBuffer = new ComputeBuffer<Point3D>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer,
            realWorldPoints);
        _kernel.SetMemoryArgument(0, realPointsBuffer);

        Point3D[] toPopulate = new Point3D[realWorldPoints.Length];
        PointSet pointSet = points.Get(perspective);

        ComputeBuffer<Point3D> resultBuffer = new ComputeBuffer<Point3D>(_context,
            ComputeMemoryFlags.UseHostPointer,
            toPopulate);
        _kernel.SetMemoryArgument(1, resultBuffer);
        float[] M = new float[3 * 3];
        ReferenceFrame referenceFrame =
            perspectives.ReferenceFrames[(int)Perspective.Floor];
        AffineTransformation transform = referenceFrame.ToReferenceFrame;
        M[0] = transform.M00;
        M[1] = transform.M01;
        M[2] = transform.M02;
        M[3] = transform.M10;
        M[4] = transform.M11;
        M[5] = transform.M12;
        M[6] = transform.M20;
        M[7] = transform.M21;
        M[8] = transform.M22;

        ComputeBuffer<float> Mbuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer,
            M);
        _kernel.SetMemoryArgument(2, Mbuffer);

        float[] b = new float[3];
        b[0] = transform.b0;
        b[1] = transform.b1;
        b[2] = transform.b2;

        ComputeBuffer<float> Bbuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer,
            b);
        _kernel.SetMemoryArgument(3, Bbuffer);

        _kernel.SetValueArgument<int>(4, (int)Perspective.Floor);

        //sw.Start();

        commandQueue.Execute(_kernel,
            new long[] { 0 }, new long[] { toPopulate.Length }, null, null);
        IntPtr retPtr = commandQueue.Map(
            resultBuffer,
            true,
            ComputeMemoryMappingFlags.Read,
            0,
            toPopulate.Length, null);

        commandQueue.Unmap(resultBuffer, ref retPtr, null);

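When profiled, this takes far too long, and 90% of the time is spent creating all the ComputeBuffer objects and so on. The actual computation on the GPU is very fast.

My question is: how do I fix this? The pixel array is different for every frame, so I have to create a new ComputeBuffer for it. Our matrices can also change periodically as we update the scene (again, I can't go into all the details). Is there a way to update the buffers on the GPU? I'm using an Intel GPGPU, so I have shared memory and should, in theory, be able to do this.

It is getting frustrating because, time and again, the speed gains I find on the GPU are swamped by the overhead of setting everything up for every frame.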

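Edit 1:

I don't think my original code example showed what I'm doing well enough, so I created a real working example and posted it on GitHub here.

For legacy reasons and for lack of time, I can't change the overarching architecture of our current product. I'm trying to "drop in" GPU code in certain slow parts to speed them up. Given the constraints I'm seeing, that may not be possible. Still, let me explain what I'm doing.

I'll give the code, but I'll be referring to the function "GPUComputePoints" in the class "ComputePoints".

As you can see in my ComputePoints function, a CameraFrame is passed in each time, along with the transformation matrices M and b.

public static Point3D[] ComputePoints(CameraFrame frame, float[] M, float[] b)

These are new arrays generated by our pipeline, not arrays I can keep around. So I create a new ComputeBuffer for each one: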

        ComputeBuffer<ushort> inputBuffer = new ComputeBuffer<ushort>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            frame.RawData);
        _kernel.SetMemoryArgument(0, inputBuffer);

        Point3D[] ret = new Point3D[frame.Width * frame.Height];
        ComputeBuffer<Point3D> outputBuffer = new ComputeBuffer<Point3D>(_context,
            ComputeMemoryFlags.WriteOnly | ComputeMemoryFlags.UseHostPointer,
            ret);
        _kernel.SetMemoryArgument(1, outputBuffer);

        ComputeBuffer<float> mBuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            M);
        _kernel.SetMemoryArgument(2, mBuffer);

        ComputeBuffer<float> bBuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            b);
        _kernel.SetMemoryArgument(3, bBuffer);

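I believe this is where the performance drain lies. Someone mentioned using the map/unmap functionality to get around this. But I can't see how that would help, since I would still need to create a buffer every time to wrap the new arrays being passed in, right?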

opencl

cloo

1 Answer

Stack Overflow user

Answered 2017-02-24 00:05:56

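"The pixel array is different for every frame, so I have to create a new ComputeBuffer for it."

You can create one big buffer and then use ranges of it for multiple different frames. Then you don't need to re-create (or release) anything on each frame.

"Our matrices can also change periodically as we update the scene (again, I can't go into all the details)."

You can release any buffer that has gone unused for N iterations/frames. Whenever an existing buffer is not large enough, you can release it and create a replacement that is 2x bigger, so that it serves more frames before it has to be released again.

If the number and order of the kernel parameters stay the same, there is also no need to set them on every frame.

"Is there a way to update the buffers on the GPU?"

For OpenCL versions <= 1.2 (no shared virtual memory?), using a device-side pointer on the host side, or a host-side pointer on the device side, is not advised.

It may still work, though, as long as it doesn't conflict with the video adapter or whatever else is producing the video frames (and possibly with use_host_ptr).

There is no need to re-create the CommandQueue. Create it once and use it for all in-order work.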
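The three points above (one command queue, buffers created once, kernel arguments set once) could be combined roughly as follows. This is only a sketch under stated assumptions: the class name FrameProcessor, the Point3D layout, and the fixed 512x424 frame size are illustrative and do not come from the question's repository. It copies each new array into an existing buffer with Cloo's WriteToBuffer/ReadFromBuffer instead of wrapping the array in a new ComputeBuffer, and it needs the Cloo package and an OpenCL device to run.

```csharp
using System;
using System.Runtime.InteropServices;
using Cloo;

// Assumed layout; the real Point3D in the question's code base may differ.
[StructLayout(LayoutKind.Sequential)]
public struct Point3D { public float X, Y, Z; }

// Hypothetical wrapper: every OpenCL object is created once and reused per frame.
public sealed class FrameProcessor : IDisposable
{
    private const int PixelCount = 512 * 424; // frame size quoted in the question

    private readonly ComputeCommandQueue _queue;
    private readonly ComputeKernel _kernel;
    private readonly ComputeBuffer<ushort> _input;
    private readonly ComputeBuffer<Point3D> _output;
    private readonly ComputeBuffer<float> _m;
    private readonly ComputeBuffer<float> _b;

    public FrameProcessor(ComputeContext context, ComputeKernel kernel)
    {
        _kernel = kernel;
        _queue = new ComputeCommandQueue(
            context, context.Devices[0], ComputeCommandQueueFlags.None);

        // Buffers are sized by element count only; no host array is attached.
        _input = new ComputeBuffer<ushort>(context, ComputeMemoryFlags.ReadOnly, PixelCount);
        _output = new ComputeBuffer<Point3D>(context, ComputeMemoryFlags.WriteOnly, PixelCount);
        _m = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly, 9);
        _b = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly, 3);

        // The argument bindings never change, so they are set once here.
        _kernel.SetMemoryArgument(0, _input);
        _kernel.SetMemoryArgument(1, _output);
        _kernel.SetMemoryArgument(2, _m);
        _kernel.SetMemoryArgument(3, _b);
    }

    // Per-frame work: copy in, run, copy out. Nothing is allocated on the device.
    // rawData and results are expected to hold PixelCount elements each.
    public void ComputePoints(ushort[] rawData, float[] M, float[] b, Point3D[] results)
    {
        _queue.WriteToBuffer(rawData, _input, true, null);
        _queue.WriteToBuffer(M, _m, true, null);
        _queue.WriteToBuffer(b, _b, true, null);
        _queue.Execute(_kernel, new long[] { 0 }, new long[] { PixelCount }, null, null);
        // On an in-order queue the blocking read also waits for the kernel.
        _queue.ReadFromBuffer(_output, ref results, true, null);
    }

    public void Dispose()
    {
        _input.Dispose();
        _output.Dispose();
        _m.Dispose();
        _b.Dispose();
        _queue.Dispose();
    }
}
```

The copies are blocking here for simplicity. Whether explicit copies or a mapped zero-copy buffer is faster on a shared-memory Intel GPU is something to measure rather than assume.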

If you are re-creating all of these because the software design is something like:

 float [] results = test(videoFeedData);

then you could try something like this instead:

float [] results = new float[n];
test(videoFeedData,results);

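This way, instead of creating everything on each call, it takes the size of the result or input data, creates the OpenCL buffer once, caches it somewhere such as a map/dictionary, and re-uses it whenever an array of similar size comes in.

In practice it would work like this: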

new frame feed-0: 1kB data ---> allocate 1kB
feed-1: 10 MB data ---> allocate 10 MB, delete 1kB one
feed-2: 3 MB data ---> re-use 10MB one
feed-3: 2 kB data ---> re-use 10MB 
feed-4: 100 MB data ---> delete 10MB, allocate 100MB
feed-5: 110 MB data ----> delete 100MB, allocate 200MB
feed-6: 120 MB data  ---> re-use 200 MB
feed-7: 150 MB data  ---> re-use 200 MB 
feed-8: 90 MB data  ---> re-use 200 MB

This applies to both input and output data.
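The allocation pattern in the feed list above appears to follow the rule "re-allocate only when the request does not fit, and then take the larger of the request and twice the old capacity". A minimal sketch of that rule, assuming a hypothetical GrowOnlyBuffer helper (not part of Cloo) whose allocate and release delegates would wrap the real buffer creation and disposal:

```csharp
using System;

// Hypothetical helper: keeps one reusable buffer and re-creates it only when
// a request does not fit. Releasing after N unused frames is left out.
public sealed class GrowOnlyBuffer<TBuffer> where TBuffer : class
{
    private readonly Func<long, TBuffer> _allocate;
    private readonly Action<TBuffer> _release;
    private TBuffer _buffer;

    public long Capacity { get; private set; }
    public int Allocations { get; private set; }

    public GrowOnlyBuffer(Func<long, TBuffer> allocate, Action<TBuffer> release)
    {
        _allocate = allocate;
        _release = release;
    }

    // Returns a buffer that holds at least requiredSize units.
    public TBuffer Get(long requiredSize)
    {
        if (_buffer == null || requiredSize > Capacity)
        {
            if (_buffer != null) _release(_buffer);
            // Grow to the request or to 2x the old capacity, whichever is larger.
            Capacity = Math.Max(requiredSize, Capacity * 2);
            _buffer = _allocate(Capacity);
            Allocations++;
        }
        return _buffer;
    }
}

public static class GrowOnlyBufferDemo
{
    public static void Main()
    {
        // Stand-in "buffers" are strings; sizes are in kB and mirror the feed list.
        var pool = new GrowOnlyBuffer<string>(size => "buffer of " + size + " kB", _ => { });
        long[] feedsInKb = { 1, 10240, 3072, 2, 102400, 112640, 122880, 153600, 92160 };
        foreach (long kb in feedsInKb)
        {
            pool.Get(kb);
            Console.WriteLine(kb + " kB requested, capacity " + pool.Capacity + " kB");
        }
        Console.WriteLine("allocations: " + pool.Allocations);
    }
}
```

With real OpenCL buffers, the data for each frame would then be copied into the first part of the returned buffer, and the kernel would be launched over only that range.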

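On top of the actual re-creation overhead, re-creating many things also hinders and resets the driver's optimizations.

Maybe something like this:

 CoresGpu gpu = new CoresGpu(kernelString,options,"gpu");

 for (int i = 0; i < 100; i++)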
 {
   float [] results = new float[n];

   // allocate new, if only not enough, deallocate old, if only not used
   gpu.compute(new object[]{getVideoFeedBuffer(),brush21x21array,results},
             new string[]{"input","input","output"},
             kernelName,numberOfThreads);

   toCloudDb(results.toList());
 }

 gpu.release(); // everything is released here

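If re-creation is a must and there is no way to get rid of it, you can at least hide the re-creation latency by pipelining (though this is still slower than the ideal case).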

push data
thread-0:get video feed

push data
thread-0:get next video feed
thread-1:send old video feed to gpu

push data
thread-0:get third video feed
thread-1:send second video feed to gpu
thread-2:compute on gpu

push data
thread-0:get fourth video feed
thread-1:send third video feed to gpu
thread-2:compute second frame on gpu
thread-3:get result of first frame from gpu to RAM

push data
thread-0:get fifth video feed
thread-1:send fourth video feed to gpu
thread-2:compute third frame on gpu
thread-3:get result of second frame from gpu to RAM
pop first data

...
...
pop second data

Keep going like this, using something like:

var result=gpu.pipeline.push(videoFeed);
if(result!=null)
{ /* result has been popped! */ }
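One way the push/pop behaviour above could be built is with chained tasks. The sketch below is an assumption about how such a pipeline might look, not the answer's actual gpu.pipeline implementation; StagePipeline and its members are hypothetical names. Each stage handles one frame at a time, different stages work on different frames concurrently, and Push returns null until as many frames as there are stages are already in flight.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stage pipeline. Push is meant to be called from a single thread.
public sealed class StagePipeline<TFrame> where TFrame : class
{
    private readonly Action<TFrame>[] _stages;
    private readonly Task[] _lastInStage; // most recent task scheduled per stage
    private readonly Queue<Task<TFrame>> _inFlight = new Queue<Task<TFrame>>();

    public StagePipeline(params Action<TFrame>[] stages)
    {
        _stages = stages;
        _lastInStage = new Task[stages.Length];
        for (int i = 0; i < stages.Length; i++) _lastInStage[i] = Task.CompletedTask;
    }

    // Schedules all stages for the frame. Returns the oldest finished frame
    // once the pipeline is full, otherwise null.
    public TFrame Push(TFrame frame)
    {
        Task previousStage = Task.CompletedTask;
        for (int s = 0; s < _stages.Length; s++)
        {
            Action<TFrame> stage = _stages[s];
            // Stage s of this frame waits for stage s-1 of this frame
            // and for stage s of the previous frame.
            Task current = Task.WhenAll(previousStage, _lastInStage[s])
                .ContinueWith(done => { done.Wait(); stage(frame); });
            _lastInStage[s] = current;
            previousStage = current;
        }
        _inFlight.Enqueue(previousStage.ContinueWith(done => { done.Wait(); return frame; }));

        if (_inFlight.Count <= _stages.Length) return null;
        return _inFlight.Dequeue().Result; // blocks only if the oldest frame is unfinished
    }

    // Returns the frames still in flight, oldest first.
    public IEnumerable<TFrame> Drain()
    {
        while (_inFlight.Count > 0) yield return _inFlight.Dequeue().Result;
    }
}
```

With four stages (get feed, send to GPU, compute, read back) the first non-null result comes out of the fifth Push, which matches the diagram above. Because a stage never runs for two frames at once, a stage that touches a shared kernel or queue is not entered concurrently.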

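Part of the re-creation latency is hidden behind the compute, copy, video-feed and pop operations. If re-creation takes 90% of the total time, only 10% is hidden. If it takes 50%, the other 50% is hidden.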
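The two figures above are consistent with a simple idealized model, offered here only as a reading of the answer: let r be the fraction of the unpipelined per-frame time spent on re-creation, and assume the remaining work overlaps with it perfectly.

```latex
T_{\text{pipelined}} = \max(r,\ 1 - r), \qquad
\text{hidden} = 1 - T_{\text{pipelined}} = \min(r,\ 1 - r)

r = 0.9:\ \text{hidden} = \min(0.9,\ 0.1) = 0.1
\qquad
r = 0.5:\ \text{hidden} = \min(0.5,\ 0.5) = 0.5
```

So pipelining helps most when re-creation and the rest of the work are about the same size, and it helps little when re-creation dominates, as it does in the question.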

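"5) Execute the kernel and wait for the response."

Why wait? Do the frames depend on each other? If not, you can also use multiple pipelines. Then you can re-create multiple buffers concurrently, one per pipeline, so more work gets done, but too many cycles are still wasted. Using one big buffer for everything is probably the fastest.

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/42427951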
