我编写了一个函数,使用SIMD (System.Numerics.Vector)将double[]数组的所有元素相加,其性能比天真的方法差。
在我的计算机上,Vector<double>.Count是4,这意味着我可以创建一个由4个值组成的累加器,并通过数组将元素按组相加。
例如,一个10元素数组,其中包含4个元素累加器和2个剩余的元素。
// | loop | remainder
acc[0] = vector[0] + vector[4] + vector[8]
acc[1] = vector[1] + vector[5] + vector[9]
acc[2] = vector[2] + vector[6]
acc[3] = vector[3] + vector[7] 和结果sum = acc[0]+acc[1]+acc[2]+acc[3]
下面的代码产生了正确的结果,但是速度与仅仅一个循环相加值相比是不存在的
public static double SumSimd(this Span<double> a)
{
var n = System.Numerics.Vector<double>.Count;
var count = a.Length;
// divide array into n=4 element groups
// Example, 57 = 14*4 + 3
var groups = Math.DivRem(count, n, out int remain);
var buffer = new double[n];
// Create buffer with remaining elements (not in groups)
a.Slice(groups*n, remain).CopyTo(buffer);
// Scan through all groups and accumulate
var accumulator = new System.Numerics.Vector<double>(buffer);
for (int i = 0; i < groups; i++)
{
//var next = new System.Numerics.Vector<double>(a, n * i);
var next = new System.Numerics.Vector<double>(a.Slice(n * i, n));
accumulator += next;
}
var sum = 0.0;
// Add up the elements of the accumulator vs
for (int j = 0; j < n; j++)
{
sum += accumulator[j];
}
return sum;
},所以我的问题是为什么我没有意识到SIMD?在这里有什么好处
基线
基线代码如下所示
public static double LinAlgSum(this ReadOnlySpan<double> span)
{
double sum = 0;
for (int i = 0; i < span.Length; i++)
{
sum += span[i];
}
return sum;
}在对SIMD代码进行基准测试时,size=7的SIMD代码为5×慢速,size=144为2.5×慢速,size=770为2.5×。
我正在使用BenchmarkDotNet运行发布模式。这是驱动程序类
[GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByCategory)]
[CategoriesColumn]
public class LinearAlgebraBench
{
[Params(7, 35, 77, 144, 195, 311, 722)]
public int Size { get; set; }
[IterationSetup]
public void SetupData()
{
A = new LinearAlgebra.Vector(Size, (iter) => 2 * Size - iter).ToArray();
B = new LinearAlgebra.Vector(Size, (iter) => Size/2 + 2* iter).ToArray();
}
public double[] A { get; set; }
public double[] B { get; set; }
[BenchmarkCategory("Sum"), Benchmark(Baseline = true)]
public double BenchLinAlgSum()
{
return LinearAlgebra.LinearAlgebra.Sum(A.AsSpan().AsReadOnly());
}
[BenchmarkCategory("Sum"), Benchmark]
public double BenchSimdSum()
{
return LinearAlgebra.LinearAlgebra.SumSimd(A);
}
}发布于 2021-05-19 18:28:47
我建议您看看这篇探索.Net中的SIMD性能的文章。
对于正则矢量化的求和,整个算法看起来是相同的。一个不同之处是,在对数组进行切片时,可以避免乘法:
while (i < lastBlockIndex)
{
vresult += new Vector<int>(source.Slice(i));
i += Vector<int>.Count;
}一个乘法对性能的影响应该相当小,但是对于这类代码,它可能是相关的。
还值得注意的是,编译器似乎没有使用泛型API生成非常高效的SIMD代码。32768项产品的总结性能:
因此,SIMD的一般版本只获得了30%的性能,而本质版本则获得了~400%的性能,接近理论上的最大值。
发布于 2021-05-20 16:07:21
试试这个版本。它使用四个独立的累加器来隐藏vaddpd指令的延迟,在现代AVX上是3-4个周期。未经测试。
public static double vectorSum( this ReadOnlySpan<double> span )
{
int vs = Vector<double>.Count;
int end = span.Length;
int endVectors = ( end / vs ) * vs;
// Using 4 independent accumulators because on modern CPUs the latency of `vaddpd` is 3-4 cycles.
// One batch consumes 4 vectors.
int endBatches = ( endVectors / 4 ) * 4;
Vector<double> a0 = Vector<double>.Zero;
Vector<double> a1 = Vector<double>.Zero;
Vector<double> a2 = Vector<double>.Zero;
Vector<double> a3 = Vector<double>.Zero;
// Handle majority of data unrolling by 4 vectors (e.g. 16 scalars with AVX)
int i;
for( i = 0; i < endBatches; i += vs * 4 )
{
a0 += new Vector<double>( span.Slice( i, vs ) );
a1 += new Vector<double>( span.Slice( i + vs, vs ) );
a2 += new Vector<double>( span.Slice( i + vs * 2, vs ) );
a3 += new Vector<double>( span.Slice( i + vs * 3, vs ) );
}
// Handle a few remaining complete vectors
for( ; i < endVectors; i += vs )
a0 += new Vector<double>( span.Slice( i, vs ) );
// Add the accumulators together
a0 += a1;
a2 += a3;
a0 += a2;
// Compute horizontal sum of a0
double sum = 0;
for( int j = 0; j < vs; j++ )
sum += a0[ j ];
// Add the remaining few scalars
for( ; i < end; i++ )
sum += span[ i ];
return sum;
}没有设定基准,而是查看了sharplab.io上的反汇编。虽然不像等效的C++那样好(循环中的标量代码太多),但它看起来并不糟糕:使用AVX,在主循环中没有函数调用或不必要的加载/存储。
https://stackoverflow.com/questions/67605744
复制相似问题