Q: More intensive optimization of FMA operations

Stack Overflow user

Asked 2020-11-04 14:47:22

3 answers · 153 views · 0 followers · 3 votes

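I want to build a data type that represents multiple (say N) values of an arithmetic type and, through operator overloading, offers the same interface as an arithmetic type, so that I end up with something like Agner Fog's vector class.

See this example: godbolt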

#include <array>

using std::size_t;

template<class T, size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    friend LoopSIMD operator*(const T a, const LoopSIMD& x){
        LoopSIMD result;
        for(size_t i=0;i<S;++i)
            result[i] = a*x[i];
        return result;
    }

    LoopSIMD& operator +=(const LoopSIMD& x){
        for(size_t i=0;i<S;++i){
            (*this)[i] += x[i];
        }
        return *this;
    }
};

constexpr size_t N = 7;
typedef LoopSIMD<double,N> SIMD;

SIMD foo(double a, SIMD x, SIMD y){
    x += a*y;
    return x;
}

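This seems to work quite well up to a certain number of elements: 6 for gcc-10 and 27 for clang-11. For more elements, the compilers stop using FMA operations (e.g. vfmadd213pd). Instead they perform the multiplications (e.g. vmulpd) and the additions (e.g. vaddpd) separately.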

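Questions:

Is there a good reason for this behavior?
Is there a compiler flag that lets me raise the limits mentioned above (6 for gcc and 27 for clang)?

Thanks!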

c++

gcc

clang

fma

3 Answers

Stack Overflow user

Posted 2020-11-04 15:07:48

I did the following and got some pretty good results with gcc 10.2, using the same -Ofast -march=skylake -ffast-math flags as in your link.

// Members of LoopSIMD; std::transform needs #include <algorithm>.
friend LoopSIMD operator*(const T a, const LoopSIMD& x) {
    LoopSIMD result;
    std::transform(x.cbegin(), x.cend(), result.begin(),
                   [a](auto const& i) { return a * i; });
    return result;
}

LoopSIMD& operator+=(const LoopSIMD& x) {
    std::transform(this->cbegin(), this->cend(), x.cbegin(), this->begin(),
                   [](auto const& a, auto const& b) { return a + b; });
    return *this;
}

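std::transform has some crazy overloads, so I think I need to explain.

The first overload takes a lambda that captures a, multiplies each value of x by it, and stores the products starting at the beginning of result.

The second overload acts as a zip: it adds the corresponding values from this and x and stores the sums back into this.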
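To make the difference between the two overloads concrete, here is a minimal standalone sketch on plain std::array. The helper names scale and add_in_place are made up for illustration and are not part of the answer's class.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Unary overload: one input range and one output range.
// Writes a * x[i] into result[i].
template <class T, std::size_t S>
std::array<T, S> scale(T a, const std::array<T, S>& x) {
    std::array<T, S> result{};
    std::transform(x.cbegin(), x.cend(), result.begin(),
                   [a](const T& v) { return a * v; });
    return result;
}

// Binary overload: two input ranges are walked in lockstep ("zip").
// Writes acc[i] + x[i] back into acc[i]; the output range may
// start at the same position as the first input range.
template <class T, std::size_t S>
void add_in_place(std::array<T, S>& acc, const std::array<T, S>& x) {
    std::transform(acc.cbegin(), acc.cend(), x.cbegin(), acc.begin(),
                   [](const T& l, const T& r) { return l + r; });
}
```

Calling add_in_place(x, scale(a, y)) then has the same effect as x += a*y in the question's class.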

If you aren't married to operator+= and operator*, you can write your own fma like this:

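    // Needs <numeric> (std::transform_inclusive_scan) and <functional> (std::plus).
    LoopSIMD& fma(const LoopSIMD& x, double a ){
        // Scales each x[i] by a, then writes the running (inclusive) sum
        // of those products into *this, starting from 0.0.
        std::transform_inclusive_scan(
            x.cbegin(),
            x.cend(),
            this->begin(),
            std::plus{},
            [a](auto const& i){return i * a;},
            0.0);
        return *this;
    }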

This requires C++17, but the loop keeps the fused multiply-add instruction (here the scalar vfmadd231sd):

foo(double, LoopSIMD<double, 40ul>&, LoopSIMD<double, 40ul> const&):
        xor     eax, eax
        vxorpd  xmm1, xmm1, xmm1
.L2:
        vfmadd231sd     xmm1, xmm0, QWORD PTR [rsi+rax]
        vmovsd  QWORD PTR [rdi+rax], xmm1
        add     rax, 8
        cmp     rax, 320
        jne     .L2
        ret
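Note that std::transform_inclusive_scan produces running totals: with an initial value of 0.0, element i of the output is a*x[0] + ... + a*x[i], and the previous contents of *this are never read. The loop-carried xmm1 accumulator in the listing above reflects this. If the goal is the element-wise update from the question (x[i] += a*y[i]), a minimal sketch could use the binary std::transform with std::fma instead. The name fma_in_place is hypothetical, and whether the compiler vectorizes it for large sizes would still need to be checked in the generated assembly.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not from the answer): acc[i] = a * x[i] + acc[i].
// std::fma requests a fused multiply-add for every element.
template <std::size_t S>
void fma_in_place(std::array<double, S>& acc, double a,
                  const std::array<double, S>& x) {
    std::transform(x.cbegin(), x.cend(), acc.cbegin(), acc.begin(),
                   [a](double xi, double acci) { return std::fma(a, xi, acci); });
}
```

On targets without hardware FMA, std::fma may fall back to a slow library routine, so this sketch assumes a flag such as -march=skylake as used in the answer.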

0 votes

Stack Overflow user

Posted 2020-11-04 17:39:54

You could also simply write your own fma function:

#include <cmath>  // std::fma

template<class T, size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    friend LoopSIMD fma(const LoopSIMD& x, const T y, const LoopSIMD& z) {
        LoopSIMD result;
        for (size_t i = 0; i < S; ++i) {
            result[i] = std::fma(x[i], y, z[i]);
        }
        return result;
    }
    friend LoopSIMD fma(const T y, const LoopSIMD& x, const LoopSIMD& z) {
        LoopSIMD result;
        for (size_t i = 0; i < S; ++i) {
            result[i] = std::fma(y, x[i], z[i]);
        }
        return result;
    }
    // And more variants, taking `const LoopSIMD&, const LoopSIMD&, const T`, `const LoopSIMD&, const T, const T`, etc
};

SIMD foo(double a, SIMD x, SIMD y){
    return fma(a, y, x);
}

However, to allow better optimization from the start, you should align the array. If you do, your original code optimizes well:

constexpr size_t next_power_of_2_not_less_than(size_t n) {
    size_t pow = 1;
    while (pow < n) pow *= 2;
    return pow;
}

template<class T, size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    // operators
} __attribute__((aligned(next_power_of_2_not_less_than(sizeof(T[S])))));

// Or with a c++11 attribute
/*
template<class T, size_t S>
class [[gnu::aligned(next_power_of_2_not_less_than(sizeof(T[S])))]] LoopSIMD : std::array<T,S>
{
public:
    // operators
};
*/

SIMD foo(double a, SIMD x, SIMD y){
    x += a * y;
    return x;
}
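The same alignment can also be requested with the standard alignas specifier rather than a GNU attribute. The following is a sketch under a few assumptions: the class name AlignedLoopSIMD is made up, the operators are omitted, the constexpr loop needs C++14, and support for alignments beyond alignof(std::max_align_t) is implementation-defined (GCC and Clang on x86-64 accept it).

```cpp
#include <array>
#include <cstddef>

// Smallest power of two that is >= n (same helper as in the answer).
constexpr std::size_t next_power_of_2_not_less_than(std::size_t n) {
    std::size_t pow = 1;
    while (pow < n) pow *= 2;
    return pow;
}

// Hypothetical class name; alignas goes between the class key and the name.
template <class T, std::size_t S>
class alignas(next_power_of_2_not_less_than(sizeof(T[S]))) AlignedLoopSIMD
    : public std::array<T, S> {
    // operators
};

// 7 doubles occupy 56 bytes, so the alignment is rounded up to 64,
// and sizeof is padded to a multiple of the alignment.
static_assert(alignof(AlignedLoopSIMD<double, 7>) == 64, "alignment rounded up");
static_assert(sizeof(AlignedLoopSIMD<double, 7>) == 64, "size padded to alignment");
```

The padding is the price of this approach: for N = 7 one extra double of storage is carried per object.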

0 votes

Stack Overflow user

Posted 2021-09-20 10:29:11

I found an improvement for the given example.

With #pragma omp simd added in front of the loops, GCC keeps the FMA optimization up to N=71.

https://godbolt.org/z/Y3T1rs37W

With AVX512, the size can be increased further:

https://godbolt.org/z/jWWPP7W5G
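For readers who cannot open the links, the placement of the pragma might look like the sketch below. It is a reconstruction based on the question's class, not a copy of the linked code, and the using-declaration is an addition made only so that elements can be read from outside the class. GCC and Clang honor the pragma only when OpenMP SIMD support is enabled, for example with -fopenmp-simd or -fopenmp; without such a flag it is ignored.

```cpp
#include <array>
#include <cstddef>

template <class T, std::size_t S>
class LoopSIMD : std::array<T, S> {
public:
    // Added for illustration: expose element access despite private inheritance.
    using std::array<T, S>::operator[];

    friend LoopSIMD operator*(const T a, const LoopSIMD& x) {
        LoopSIMD result;
        // Tells the compiler the iterations are independent and may be vectorized.
        #pragma omp simd
        for (std::size_t i = 0; i < S; ++i)
            result[i] = a * x[i];
        return result;
    }

    LoopSIMD& operator+=(const LoopSIMD& x) {
        #pragma omp simd
        for (std::size_t i = 0; i < S; ++i)
            (*this)[i] += x[i];
        return *this;
    }
};
```

The limit of N=71 reported above is specific to the compiler version and flags in the linked example, so it is worth re-checking the assembly for other setups.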

0 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/64682270