Q: Difference between the results of intrinsics and naive vector reduction

Stack Overflow user

Asked on 2021-12-30 14:13:40

Answers: 1, Views: 104, Followers: 0, Votes: 1

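I have been comparing the run times of an intrinsics-based vector reduction, a naive vector reduction, and a vector reduction using OpenMP pragmas. However, I found that the results differ between these scenarios. The code is as follows (the intrinsics vector reduction is taken from "Fastest way to do horizontal SSE vector sum (or other reduction)"):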

#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>
#include <algorithm>
#include <immintrin.h>
#include <cstdio>   // printf


inline float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);        // broadcast elements 3,1 to 2,0
    __m128 sums = _mm_add_ps(v, shuf);
    shuf        = _mm_movehl_ps(shuf, sums); // high half -> low half
    sums        = _mm_add_ss(sums, shuf);
    return        _mm_cvtss_f32(sums);
}


float hsum256_ps_avx(__m256 v) {
    __m128 vlow  = _mm256_castps256_ps128(v);
    __m128 vhigh = _mm256_extractf128_ps(v, 1); // high 128
           vlow  = _mm_add_ps(vlow, vhigh);     // add the low 128
    return hsum_ps_sse3(vlow);         // and inline the sse3 version, which is optimal for AVX
    // (no wasted instructions, and all of them are the 4B minimum)
}

void reduceVector_Naive(std::vector<float> values){
    float result = 0;
    for(int i=0; i<int(1e8); i++){
        result  += values.at(i);
    }
    printf("Reduction Naive = %f \n", result);
}


void reduceVector_openmp(std::vector<float> values){
    float result = 0;
    #pragma omp simd reduction(+: result)
    for(int i=0; i<int(1e8); i++){
        result  += values.at(i);
    }

    printf("Reduction OpenMP = %f \n", result);
}

void reduceVector_intrinsics(std::vector<float> values){
    float result = 0;
    float* data_ptr = values.data();

    for(int i=0; i<1e8; i+=8){
        result  += hsum256_ps_avx(_mm256_loadu_ps(data_ptr + i));
    }

    printf("Reduction Intrinsics = %f \n", result);
}


int main(){

    std::vector<float> values;

    for(int i=0; i<1e8; i++){
        values.push_back(1);
    }


    reduceVector_Naive(values);
    reduceVector_openmp(values);
    reduceVector_intrinsics(values);

// The result should be 1e8 in each case
}

However, my output is as follows:

Reduction Naive = 16777216.000000 
Reduction OpenMP = 16777216.000000 
Reduction Intrinsics = 100000000.000000

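As can be seen, only the intrinsics version computes the sum correctly, while the others run into precision issues. I am fully aware of the precision problems that rounding can cause with floats, so my question is: why does the intrinsics version get the correct answer even though it is also floating-point arithmetic?

I am compiling it as: g++ -mavx2 -march=native -O3 -fopenmp main.cpp

Tried with versions 7.5.0 and 10.3.0.

TIA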

c++

vector

simd

ieee-754

intrinsics

1 Answer

Stack Overflow user

Accepted answer

Answered on 2021-12-30 14:48:10

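The naive loop adds 1.0 on each iteration, and the sum stops growing at 16777216.000000 because a binary32 float does not have enough significand bits to represent 16777217.

See this Q&A: Why does a float variable stop incrementing at 16777216 in C#?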

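When you add up the computed horizontal sums, each addition is 8.0, so the value at which the sum would stop growing is about 16777216*8 = 134217728. You simply did not reach it in your experiment.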

Votes: 6

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: https://stackoverflow.com/questions/70532839
