So, using OpenMP, I am trying to find out whether a CPU or a GPU runs an image blur faster, slower, or about the same. In my mind the GPU should run it somewhat faster, since GPUs are relatively fast at simple, highly parallel operations, while the CPU is what handles complex operations in reasonable time. Is that right?
Here is the code I used to test it:
IplImage* gaussian_blur_parallel(IplImage* image, double r) {
    IplImage* result = cvCloneImage(image);
    int h = image->height;
    int w = image->width;
    double rs = ceil(r * 2.57); // significant radius
    std::clock_t start;
    start = std::clock();
    #pragma omp parallel for schedule(guided) num_threads(4)
    for (int i = 0; i < h; i++) {
        int current_num_threads = omp_get_num_threads();
        std::cout << "threads" << current_num_threads << std::endl;
        for (int j = 0; j < w; j++) {
            Weights weights = {}; // zero-initialize the accumulators
            for (int iy = i - rs; iy < i + rs + 1; iy++) {
                for (int ix = j - rs; ix < j + rs + 1; ix++) {
                    int x = myMin(w - 1, myMax(0, ix));
                    int y = myMin(h - 1, myMax(0, iy));
                    double dsq = (ix - j) * (ix - j) + (iy - i) * (iy - i);
                    double wght = exp(-dsq / (2 * r * r)) / (PI * 2 * r * r);
                    CvScalar channels = cvGet2D(image, y, x);
                    // accumulate the weighted value for each channel
                    for (int c = 0; c < 3; c++) {
                        weights.value[c] += channels.val[c] * wght;
                        weights.weight[c] += wght;
                    }
                }
            }
            // set the value for each channel in the resulting image.
            // printf("i=%d, j=%d, r=%f, g=%f, b=%f\n", i, j, weights.value[0], weights.value[1], weights.value[2]);
            CvScalar resultingChannels = cvGet2D(result, i, j);
            for (int c = 0; c < 3; c++) {
                resultingChannels.val[c] = round(weights.value[c] / weights.weight[c]);
                weights.value[c] = 0.0;
                weights.weight[c] = 0.0;
            }
            cvSet2D(result, i, j, resultingChannels);
        }
    }
    std::cout << "Time: " << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
    return result;
}

From what I have seen in the documentation, anything that runs inside the pragma should be doing its work on the GPU. Is that correct?
However, if I run the same thing without the pragma (which I assume means it runs on the CPU):
IplImage* gaussian_blur(IplImage* image, double r) {
    IplImage* result = cvCloneImage(image);
    int h = image->height;
    int w = image->width;
    printf("h=%d, w=%d", h, w);
    double rs = ceil(r * 2.57); // significant radius
    std::clock_t start;
    start = std::clock();
    for (int i = 0; i < h; i++) {
        for (int j = 0; j < w; j++) {
            Weights weights = {}; // zero-initialize the accumulators
            for (int iy = i - rs; iy < i + rs + 1; iy++) {
                for (int ix = j - rs; ix < j + rs + 1; ix++) {
                    int x = myMin(w - 1, myMax(0, ix));
                    int y = myMin(h - 1, myMax(0, iy));
                    double dsq = (ix - j) * (ix - j) + (iy - i) * (iy - i);
                    double wght = exp(-dsq / (2 * r * r)) / (PI * 2 * r * r);
                    CvScalar channels = cvGet2D(image, y, x);
                    // accumulate the weighted value for each channel
                    for (int c = 0; c < 3; c++) {
                        weights.value[c] += channels.val[c] * wght;
                        weights.weight[c] += wght;
                    }
                }
            }
            // set the value for each channel in the resulting image.
            // printf("i=%d, j=%d, r=%f, g=%f, b=%f\n", i, j, weights.value[0], weights.value[1], weights.value[2]);
            CvScalar resultingChannels = cvGet2D(result, i, j);
            for (int c = 0; c < 3; c++) {
                resultingChannels.val[c] = round(weights.value[c] / weights.weight[c]);
                weights.value[c] = 0.0;
                weights.weight[c] = 0.0;
            }
            cvSet2D(result, i, j, resultingChannels);
        }
    }
    std::cout << "Time: " << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
    return result;
}

it takes about the same amount of time as the supposedly GPU-accelerated version.
So my questions are: is what I am doing not actually touching the GPU at all? Did I put the timing calls in the wrong place? And how do I correctly check how many threads OpenMP is really running in parallel?
Posted on 2015-12-07 13:09:36
First of all, you are measuring time the wrong way. std::clock measures CPU time, not wall-clock time, and on most systems the CPU time of all threads adds up. Therefore, unless some super-linear speedup effect comes into play, you will never see the measured value drop when running with multiple threads. Use omp_get_wtime() instead.
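To make the difference concrete, here is a minimal, self-contained sketch (the loop body is just placeholder work, not the blur code): on POSIX systems std::clock sums the CPU time of all threads, so with four threads it reports roughly four times what omp_get_wtime() reports.

#include <omp.h>
#include <cmath>
#include <ctime>
#include <iostream>

int main() {
    std::clock_t c0 = std::clock();  // CPU time, summed over all threads
    double w0 = omp_get_wtime();     // wall-clock time

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) num_threads(4)
    for (int i = 0; i < 100000000; i++)
        sum += std::sin(i * 1e-7);   // placeholder work

    std::cout << "clock():         "
              << (std::clock() - c0) / (double)CLOCKS_PER_SEC << " s\n"
              << "omp_get_wtime(): " << omp_get_wtime() - w0 << " s\n"
              << "(checksum: " << sum << ")\n";
    return 0;
}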
Second, the parallel for construct will not execute on an accelerator device unless it is nested within the scope of a target construct:
double start;
start = omp_get_wtime();
#pragma omp target ...
#pragma omp parallel for schedule(guided)
for (int i = 0; i < h; i++) {
    ...
}
std::cout << "Time: " << (omp_get_wtime() - start) * 1000.0 << " ms" << std::endl;

Executing on GPUs and other accelerators that have their own memory space requires that the data environment be set up properly. That means adding the appropriate map clauses to the target directive; these instruct the compiler to copy certain data to the device before execution and to copy it back out once the offloaded region finishes. Check the documentation of the map clause, then carefully go through all the variables used in the code and write the corresponding clauses.
Also be aware of the fact that all functions called inside a target region must be explicitly marked with the declare target construct. This applies to functions such as cvGet2D(), cvSet2D(), myMin(), and myMax(); only if they are implemented as preprocessor macros do they not need to be declared as target functions. Otherwise the compiler will not generate device-callable versions of them, which results in errors.
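Assuming myMin() and myMax() are ordinary functions rather than macros, marking them would look like this (OpenMP 4.x syntax); doing the same for the OpenCV calls would mean building parts of OpenCV for the device, which is why inlining the clamping, as in the sketch above, is usually the more practical route:

#pragma omp declare target
int myMin(int a, int b) { return a < b ? a : b; }
int myMax(int a, int b) { return a > b ? a : b; }
#pragma omp end declare target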
Once all of that is in place, you have to pass the proper command-line options to the compiler so that it actually generates target code, and you have to make sure the compiler supports your GPU as an offload target in the first place. For example, the Intel compilers only support offloading to Intel Xeon Phi. Recent versions of GCC are supposed to support offloading to Intel Xeon Phi and to NVIDIA GPUs, but there still seem to be some issues. Currently your best bet for OpenMP offloading to NVIDIA GPUs is the PGI compiler suite. I have no idea what the situation is with offloading to AMD GPUs.
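Finally, a quick way to check whether a target region actually ran on the device or silently fell back to the host is omp_is_initial_device(), which is part of OpenMP 4.0, so any offload-capable compiler should provide it:

#include <omp.h>
#include <cstdio>

int main() {
    int on_host = 1;
    #pragma omp target map(from: on_host)
    {
        // returns true on the host, false on an accelerator device
        on_host = omp_is_initial_device();
    }
    std::printf("target region ran on the %s\n",
                on_host ? "host (no offloading happened)" : "device");
    return 0;
}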
https://stackoverflow.com/questions/34123947