文章/答案/技术大牛

发布

社区首页 >问答首页 >用SIMD (System.Numerics)编写向量和函数，使其比for循环更快

问用SIMD (System.Numerics)编写向量和函数，使其比for循环更快
EN

Stack Overflow用户

提问于 2021-05-19 14:57:38

回答 2查看 1.3K关注 0票数 4

我编写了一个函数，使用SIMD (System.Numerics.Vector)将double[]数组的所有元素相加，其性能比天真的方法差。

在我的计算机上，Vector<double>.Count是4，这意味着我可以创建一个由4个值组成的累加器，并通过数组将元素按组相加。

例如，一个10元素数组，其中包含4个元素累加器和2个剩余的元素。

//     | loop                  | remainder
acc[0] = vector[0] + vector[4] + vector[8]
acc[1] = vector[1] + vector[5] + vector[9]
acc[2] = vector[2] + vector[6] 
acc[3] = vector[3] + vector[7]

和结果sum = acc[0]+acc[1]+acc[2]+acc[3]

下面的代码产生了正确的结果，但是速度与仅仅一个循环相加值相比是不存在的

public static double SumSimd(this Span<double> a)
{
    var n = System.Numerics.Vector<double>.Count;
    var count = a.Length;
    // divide array into n=4 element groups
    // Example, 57 = 14*4 + 3
    var groups = Math.DivRem(count, n, out int remain);
    var buffer = new double[n];
    // Create buffer with remaining elements (not in groups)
    a.Slice(groups*n, remain).CopyTo(buffer);
    // Scan through all groups and accumulate
    var accumulator = new System.Numerics.Vector<double>(buffer);
    for (int i = 0; i < groups; i++)
    {
        //var next = new System.Numerics.Vector<double>(a, n * i);
        var next = new System.Numerics.Vector<double>(a.Slice(n * i, n));
        accumulator += next;
    }
    var sum = 0.0;
    // Add up the elements of the accumulator vs
    for (int j = 0; j < n; j++)
    {
        sum += accumulator[j];
    }
    return sum;
}

，所以我的问题是为什么我没有意识到SIMD?在这里有什么好处

基线

基线代码如下所示

public static double LinAlgSum(this ReadOnlySpan<double> span)
{
    double sum = 0;
    for (int i = 0; i < span.Length; i++)
    {
        sum += span[i];
    }
    return sum;
}

在对SIMD代码进行基准测试时，size=7的SIMD代码为5×慢速，size=144为2.5×慢速，size=770为2.5×。

我正在使用BenchmarkDotNet运行发布模式。这是驱动程序类

[GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByCategory)]
[CategoriesColumn]
public class LinearAlgebraBench
{
    [Params(7, 35, 77, 144, 195, 311, 722)]
    public int Size { get; set; }

    [IterationSetup]
    public void SetupData()
    {
        A = new LinearAlgebra.Vector(Size, (iter) => 2 * Size - iter).ToArray();
        B = new LinearAlgebra.Vector(Size, (iter) => Size/2 + 2* iter).ToArray();
    }

    public double[] A { get; set; }
    public double[] B { get; set; }

    [BenchmarkCategory("Sum"), Benchmark(Baseline = true)]
    public double BenchLinAlgSum()
    {
        return LinearAlgebra.LinearAlgebra.Sum(A.AsSpan().AsReadOnly());
    }
    [BenchmarkCategory("Sum"), Benchmark]
    public double BenchSimdSum()
    {
        return LinearAlgebra.LinearAlgebra.SumSimd(A);
    }
}

performance

simd

avx

arrays

回答 2

Stack Overflow用户

发布于 2021-05-19 18:28:47

我建议您看看这篇探索.Net中的SIMD性能的文章。

对于正则矢量化的求和，整个算法看起来是相同的。一个不同之处是，在对数组进行切片时，可以避免乘法：

while (i < lastBlockIndex)
{
    vresult += new Vector<int>(source.Slice(i));
    i += Vector<int>.Count;
}

一个乘法对性能的影响应该相当小，但是对于这类代码，它可能是相关的。

还值得注意的是，编译器似乎没有使用泛型API生成非常高效的SIMD代码。32768项产品的总结性能：

SumUnrolled - 8,979.690 ns
SumVectorT - 6,689.829 ns
SumIntrinstics - 2,200.996 ns

因此，SIMD的一般版本只获得了30%的性能，而本质版本则获得了~400%的性能，接近理论上的最大值。

票数 4

Stack Overflow用户

发布于 2021-05-20 16:07:21

试试这个版本。它使用四个独立的累加器来隐藏vaddpd指令的延迟，在现代AVX上是3-4个周期。未经测试。

public static double vectorSum( this ReadOnlySpan<double> span )
{
    int vs = Vector<double>.Count;
    int end = span.Length;
    int endVectors = ( end / vs ) * vs;

    // Using 4 independent accumulators because on modern CPUs the latency of `vaddpd` is 3-4 cycles.
    // One batch consumes 4 vectors.
    int endBatches = ( endVectors / 4 ) * 4;

    Vector<double> a0 = Vector<double>.Zero;
    Vector<double> a1 = Vector<double>.Zero;
    Vector<double> a2 = Vector<double>.Zero;
    Vector<double> a3 = Vector<double>.Zero;

    // Handle majority of data unrolling by 4 vectors (e.g. 16 scalars with AVX)
    int i;
    for( i = 0; i < endBatches; i += vs * 4 )
    {
        a0 += new Vector<double>( span.Slice( i, vs ) );
        a1 += new Vector<double>( span.Slice( i + vs, vs ) );
        a2 += new Vector<double>( span.Slice( i + vs * 2, vs ) );
        a3 += new Vector<double>( span.Slice( i + vs * 3, vs ) );
    }

    // Handle a few remaining complete vectors
    for( ; i < endVectors; i += vs )
        a0 += new Vector<double>( span.Slice( i, vs ) );

    // Add the accumulators together
    a0 += a1;
    a2 += a3;
    a0 += a2;

    // Compute horizontal sum of a0
    double sum = 0;
    for( int j = 0; j < vs; j++ )
        sum += a0[ j ];

    // Add the remaining few scalars
    for( ; i < end; i++ )
        sum += span[ i ];
    return sum;
}

没有设定基准，而是查看了sharplab.io上的反汇编。虽然不像等效的C++那样好(循环中的标量代码太多)，但它看起来并不糟糕:使用AVX，在主循环中没有函数调用或不必要的加载/存储。

票数 2

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/67605744

复制

相似问题

问用SIMD (System.Numerics)编写向量和函数，使其比for循环更快
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问用SIMD (System.Numerics)编写向量和函数，使其比for循环更快EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问用SIMD (System.Numerics)编写向量和函数，使其比for循环更快
EN