文章/答案/技术大牛

发布

社区首页 >问答首页 >大型矩阵线性组合的C++性能优化

问大型矩阵线性组合的C++性能优化
EN

Stack Overflow用户

提问于 2022-04-24 04:12:00

回答 3查看 173关注 0票数 1

我有一个很大的浮点数据张量，它的维数是35k(rows) x 45(cols) x 150(slices)，我已经将它存储在一个紫禁城立方体容器中。我需要线性组合所有150片在35毫秒以下(我的申请必须)。线性组合浮点重量也存储在一个鲤鱼容器中。到目前为止，我最快的实现速度达到了70毫秒，平均超过了30帧，而且我似乎无法超越这一点。请注意，我被允许CPU并行计算，但不允许GPU。

我尝试过多种不同的方法来执行这种线性组合，但是下面的代码似乎是我所能得到的最快的(70 ms)，因为我相信我正在通过每次迭代获取最大可能的连续内存块来最大化缓存命中机会。

请注意，Armadillo以列的主要格式存储数据。所以在张量中，它首先存储第一个通道的列，然后是第二个通道的列，然后是第三个，等等。

typedef std::chrono::system_clock Timer;
typedef std::chrono::duration<double> Duration;

int rows = 35000;
int cols = 45;
int slices = 150;
arma::fcube tensor(rows, cols, slices, arma::fill::randu);
arma::fvec w(slices, arma::fill::randu);

double overallTime = 0;
int window = 30;
for (int n = 0; n < window; n++) {

    Timer::time_point start = Timer::now();

    arma::fmat result(rows, cols, arma::fill::zeros);
    for (int i = 0; i < slices; i++)
        result += tensor.slice(i) * w(i);

    Timer::time_point end = Timer::now();
    Duration span = end - start;
    double t = span.count();
    overallTime += t;
    cout << "n = " << n << " --> t = " << t * 1000.0 << " ms" << endl;
}

cout << endl << "average time = " << overallTime * 1000.0 / window << " ms" << endl;

我需要通过(至少2倍)优化这段代码，我非常感谢任何建议。

performance

multiplication

armadillo

c++

回答 3

Stack Overflow用户

回答已采纳

发布于 2022-04-26 21:32:52

正如@hbrerkere在注释部分中所建议的那样，通过使用-O3标志并进行以下更改，性能几乎提高了65%。代码现在运行在45 ms，而不是最初的70 ms。

int lastStep = (slices / 4 - 1) * 4;
int i = 0;
while (i <= lastStep) {
    result += tensor.slice(i) * w_id(i) + tensor.slice(i + 1) * w_id(i + 1) + tensor.slice(i + 2) * w_id(i + 2) + tensor.slice(i + 3) * w_id(i + 3);
    i += 4;
}
while (i < slices) {
    result += tensor.slice(i) * w_id(i);
    i++;
}

票数 1

Stack Overflow用户

发布于 2022-04-24 04:54:50

首先，我必须承认，我对arma框架或内存布局并不熟悉；至少如果语法result += slice(i) * weight计算迟缓的话。

两个主要的问题及其解决方法在于内存布局和内存与算术计算的比率。

要说a+=b*c是有问题的，因为它需要读取b和a，编写a并使用最多两个算术操作(如果体系结构没有将乘法和累加结合起来，则使用两个算术操作)。

如果内存布局为float tensor[rows][columns][channels]格式，则将问题转换为制作长度为channels的rows * columns点积，并应表示为这样。

如果是float tensor[c][h][w]，最好将循环展开到result+= slice(i) + slice(i+1)+...。一次读四片可以减少50%的内存传输。

甚至最好以4*N结果块(从所有150个通道/片读取N<16 )来处理结果，这样编译器就可以显式或隐式地将累加器分配给SIMD寄存器。

通过将切片计数填充到4或8的倍数，使用-ffast-math进行编译以启用融合乘法累加(如果可用的话)和多线程，可能会有轻微的改进。

这些限制表明需要执行13.5GFlops，这在算法上是一个合理的数目(对于许多现代体系结构而言)，但也意味着至少54 Gb/s的内存带宽，可以通过fp16或16位不动点算法来放松。

编辑

知道内存顺序为float tensor[150][45][35000]或float tensor[kSlices][kRows * kCols == kCols * kRows]对我建议，首先尝试将外部循环展开4(甚至5，因为150不能被4除以4，需要额外的特殊情况)流。

void blend(int kCols, int kRows, float const *tensor, float *result, float const *w) {
    // ensure that the cols*rows is a multiple of 4 (pad if necessary)
    // - allows the auto vectorizer to skip handling the 'excess' code where the data
    //   length mod simd width != 0
    // one could try even SIMD width of 16*4, as clang 14
    // can further unroll the inner loop to 4 ymm registers
    auto const stride = (kCols * kRows + 3) & ~3;
    // try also s+=6, s+=3, or s+=4, which would require a dedicated inner loop (for s+=2)
    for (int s = 0; s < 150; s+=5) {
        auto src0 = tensor  + s * stride;
        auto src1 = src0 + stride;
        auto src2 = src1 + stride;
        auto src3 = src2 + stride;
        auto src4 = src3 + stride;
        auto dst = result;
        for (int x = 0; x < stride; x++) {
            // clang should be able to optimize caching the weights
            // to registers outside the innerloop
            auto add = src0[x] * w[s] +
                       src1[x] * w[s+1] +
                       src2[x] * w[s+2] +
                       src3[x] * w[s+3] +
                       src4[x] * w[s+4];
            // clang should be able to optimize this comparison
            // out of the loop, generating two inner kernels
            if (s == 0) {
                dst[x] = add;
            } else {
                dst[x] += add;
            }
        }
    }
}

编辑2

另一个起点(在添加多线程之前)将考虑将布局更改为

float tensor[kCols][kRows][kSlices + kPadding]; // padding is optional

现在的缺点是kSlices = 150不能再适应寄存器中的所有权重(其次，kSlices不是4或8的倍数)。此外，最后的削减需要是横向的。

好处是，减少不再需要经过内存，这是一个很大的事情增加多线程。

void blendHWC(float const *tensor, float const *w, float *dst, int n, int c) {
     // each thread will read from 4 positions in order
     // to share the weights -- finding the best distance
     // might need some iterations
     auto src0 = tensor;
     auto src1 = src0 + c;
     auto src2 = src1 + c;
     auto src3 = src2 + c; 
     for (int i = 0; i < n/4; i++) {
         vec8 acc0(0.0f), acc1(0.0f), acc2(0.0f), acc3(0.0f);
         // #pragma unroll?
         for (auto j = 0; j < c / 8; c++) {
             vec8 w(w + j);
             acc0 += w * vec8(src0 + j);
             acc1 += w * vec8(src1 + j);
             acc2 += w * vec8(src2 + j);
             acc3 += w * vec8(src3 + j);
         }
         vec4 sum = horizontal_reduct(acc0,acc1,acc2,acc3);
         sum.store(dst); dst+=4;
     } 
}

这些vec4和vec8是一些自定义的SIMD类，它们通过本质映射到SIMD指令，或者借助编译器能够将using vec4 = float __attribute__ __attribute__((vector_size(16)));编译成高效的SIMD代码。

票数 2

Stack Overflow用户

发布于 2022-04-26 21:46:59

如果没有真正的代码，我猜

+= tensor.slice(i) * w_id(i)

创建一个临时对象，然后将其添加到lhs。是的，重载操作符看起来不错，但是我会编写一个函数。

外接程序( lhs，slice1，w1，slice2，w2，....unroll to 4.)

它转化为元素上的纯循环：

for (i=....)
  for (j=...)
    lhs[i][j] += slice1[i][j]*w1[j] + slice2[i][j] &c

如果这不给你买一个额外的因素，我会大吃一惊的。

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/71985447

复制

相似问题

问大型矩阵线性组合的C++性能优化
EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问大型矩阵线性组合的C++性能优化EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问大型矩阵线性组合的C++性能优化
EN