Following up on a Code Review question, I'd like to know why a simple element-wise addition of two vectors using std::plus<int> (via concurrency::parallel_transform) is so much slower than both sequential std::transform and a for loop with OpenMP. My timings: sequential (vectorized): 25 ms, sequential (no vectorization): 28 ms, C++ AMP: 131 ms, PPL: 51 ms, OpenMP: 24 ms.
I profiled with the following code, compiled in Visual Studio 2013 with full optimizations:
#include <amp.h>
#include <ppl.h>        // concurrency::parallel_transform
#include <algorithm>    // std::transform (was missing)
#include <iostream>
#include <numeric>
#include <random>
#include <vector>       // std::vector (was missing)
#include <assert.h>
#include <functional>
#include <chrono>
using namespace concurrency;
const std::size_t size = 30737418;
//----------------------------------------------------------------------------
// Program entry point.
//----------------------------------------------------------------------------
int main( )
{
    accelerator default_device;
    std::wcout << "Using device : " << default_device.get_description( ) << std::endl;
    if( default_device == accelerator( accelerator::direct3d_ref ) )
        std::cout << "WARNING!! Running on very slow emulator! Only use this accelerator for debugging." << std::endl;

    std::mt19937 engine;
    std::uniform_int_distribution<int> dist( 0, 10000 );
    std::vector<int> vecTest( size );
    std::vector<int> vecTest2( size );
    std::vector<int> vecResult( size );
    for( int i = 0; i < size; ++i )
    {
        vecTest[i] = dist( engine );
        vecTest2[i] = dist( engine );
    }
    std::vector<int> vecCorrectResult( size );

    std::chrono::high_resolution_clock clock;
    auto beginTime = clock.now();
    std::transform( std::begin( vecTest ), std::end( vecTest ), std::begin( vecTest2 ), std::begin( vecCorrectResult ), std::plus<int>() );
    auto endTime = clock.now();
    auto timeTaken = endTime - beginTime;
    std::cout << "The time taken for the sequential function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;

    beginTime = clock.now();
    #pragma loop(no_vector)
    for( int i = 0; i < size; ++i )
    {
        vecResult[i] = vecTest[i] + vecTest2[i];
    }
    endTime = clock.now();
    timeTaken = endTime - beginTime;
    std::cout << "The time taken for the sequential function (with auto-vectorization disabled) to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;

    beginTime = clock.now();
    concurrency::array_view<const int, 1> av1( vecTest );
    concurrency::array_view<const int, 1> av2( vecTest2 );
    concurrency::array_view<int, 1> avResult( vecResult );
    avResult.discard_data();
    concurrency::parallel_for_each( avResult.extent, [=]( concurrency::index<1> index ) restrict(amp) {
        avResult[index] = av1[index] + av2[index];
    } );
    avResult.synchronize();
    endTime = clock.now();
    timeTaken = endTime - beginTime;
    std::cout << "The time taken for the AMP function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << std::boolalpha << "The AMP function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    beginTime = clock.now();
    concurrency::parallel_transform( std::begin( vecTest ), std::end( vecTest ), std::begin( vecTest2 ), std::begin( vecResult ), std::plus<int>() );
    endTime = clock.now();
    timeTaken = endTime - beginTime;
    std::cout << "The time taken for the PPL function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << "The PPL function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    beginTime = clock.now();
    #pragma omp parallel
    #pragma omp for
    for( int i = 0; i < size; ++i )
    {
        vecResult[i] = vecTest[i] + vecTest2[i];
    }
    endTime = clock.now();
    timeTaken = endTime - beginTime;
    std::cout << "The time taken for the OpenMP function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << "The OpenMP function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    return 0;
}

Posted on 2014-07-07 10:59:15
According to MSDN, the default partitioner for concurrency::parallel_transform is concurrency::auto_partitioner. MSDN says of it:

    This method of partitioning employs range stealing for load balancing as well as per-iterate cancellation.

Using this partitioner is overkill for a simple (and memory-bound) operation such as adding two arrays, because its overhead is significant. You should use concurrency::static_partitioner instead. Static partitioning is exactly what most OpenMP implementations default to for a for construct when the schedule clause is absent.
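The difference between the two partitioning strategies can be illustrated without PPL. Below is a minimal, portable sketch of what static partitioning does (the helper name `add_static` is hypothetical, not a PPL API): the range is split up front into one contiguous chunk per thread, with no range stealing and no per-iteration bookkeeping. With PPL itself you would instead pass the partitioner as the last argument, e.g. `concurrency::parallel_transform( ..., std::plus<int>(), concurrency::static_partitioner() )`.

```cpp
#include <thread>
#include <vector>
#include <cstddef>

// Hypothetical helper (not part of PPL): statically partition [0, n) into
// one contiguous chunk per thread, analogous to OpenMP's schedule(static)
// or concurrency::static_partitioner. Each thread receives its range up
// front, so there is no load-balancing overhead during the loop.
void add_static( const std::vector<int>& a, const std::vector<int>& b,
                 std::vector<int>& out, unsigned nThreads )
{
    std::vector<std::thread> workers;
    const std::size_t n = a.size();
    const std::size_t chunk = ( n + nThreads - 1 ) / nThreads;
    for( unsigned t = 0; t < nThreads; ++t )
    {
        const std::size_t lo = t * chunk < n ? t * chunk : n;
        const std::size_t hi = lo + chunk < n ? lo + chunk : n;
        workers.emplace_back( [&a, &b, &out, lo, hi]( ) {
            for( std::size_t i = lo; i < hi; ++i )
                out[i] = a[i] + b[i];
        } );
    }
    for( auto& w : workers )
        w.join();
}
```

For a kernel this cheap, the fixed chunk assignment is the right trade-off: the per-iteration work is a single addition, so any dynamic scheduling machinery costs more than the work it balances.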
As was already mentioned on Code Review, this code is strongly memory-bound. It is also the SUM kernel of the STREAM benchmark, which was designed specifically to measure a system's memory bandwidth. The a[i] = b[i] + c[i] operation has very low operational intensity (in ops/byte), so its speed is determined entirely by the bandwidth of the main memory bus. That is why the OpenMP code and the vectorized serial code deliver essentially the same performance, which in turn is not much higher than that of the non-vectorized serial code.
The way to get higher parallel performance is to run the code on a modern multi-socket system and to spread the data of each array evenly across the sockets. You can then obtain a speedup nearly equal to the number of CPU sockets.
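On such a NUMA system, "spread the data across the sockets" in practice means exploiting the default first-touch page placement: initialize each array in parallel, with the same static schedule that the compute loop will use, so each thread's chunk ends up in its local socket's memory. A hedged sketch (the helper name `numaAwareAdd` is hypothetical; the pragmas are simply ignored and the code still runs correctly, just serially, if OpenMP is not enabled):

```cpp
#include <memory>
#include <cstddef>

// First-touch placement: a memory page is physically allocated on the NUMA
// node of the thread that first writes it. new int[n] leaves the ints
// uninitialized, so the parallel loop below performs the first touch;
// reusing the same schedule(static) for initialization and computation
// keeps every thread's chunk in socket-local memory.
std::unique_ptr<int[]> numaAwareAdd( std::size_t n )
{
    std::unique_ptr<int[]> a( new int[n] ), b( new int[n] ), sum( new int[n] );
    #pragma omp parallel for schedule(static)
    for( long long i = 0; i < static_cast<long long>( n ); ++i )
    {
        a[i] = static_cast<int>( i );       // first touch of a's pages
        b[i] = static_cast<int>( 2 * i );   // first touch of b's pages
    }
    #pragma omp parallel for schedule(static)
    for( long long i = 0; i < static_cast<long long>( n ); ++i )
        sum[i] = a[i] + b[i];               // same chunks -> local accesses
    return sum;
}
```

Note that a std::vector would defeat this: its constructor zero-fills the elements on the allocating thread, so the first touch would happen serially on one node.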
https://stackoverflow.com/questions/24594454