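Q: Optimizing C++ code for performance

Stack Overflow user

Asked 2010-09-08 21:59:27

Answers: 6  Views: 874  Followers: 0  Votes: 3

Can you think of ways to optimize this piece of code? It runs on an ARMv7 processor (iPhone 3GS). The percentages on the left are the profiler's share of time per line: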

4.0%  inline float BoxIntegral(IplImage *img, int row, int col, int rows, int cols) 
      {
0.7%    float *data = (float *) img->imageData;
1.4%    int step = img->widthStep/sizeof(float);

        // The subtraction by one for row/col is because row/col is inclusive.
1.1%    int r1 = std::min(row,          img->height) - 1;
1.0%    int c1 = std::min(col,          img->width)  - 1;
2.7%    int r2 = std::min(row + rows,   img->height) - 1;
3.7%    int c2 = std::min(col + cols,   img->width)  - 1;

        float A(0.0f), B(0.0f), C(0.0f), D(0.0f);
8.5%    if (r1 >= 0 && c1 >= 0) A = data[r1 * step + c1];
11.7%   if (r1 >= 0 && c2 >= 0) B = data[r1 * step + c2];
7.6%    if (r2 >= 0 && c1 >= 0) C = data[r2 * step + c1];
9.2%    if (r2 >= 0 && c2 >= 0) D = data[r2 * step + c2];

21.9%   return std::max(0.f, A - B - C + D);
3.8%  }

All of this code comes from the OpenSURF library. Here is the context in which the function is called (some people asked for it):

//! Calculate DoH responses for supplied layer
void FastHessian::buildResponseLayer(ResponseLayer *rl)
{
  float *responses = rl->responses;         // response storage
  unsigned char *laplacian = rl->laplacian; // laplacian sign storage
  int step = rl->step;                      // step size for this filter
  int b = (rl->filter - 1) * 0.5 + 1;         // border for this filter
  int l = rl->filter / 3;                   // lobe for this filter (filter size / 3)
  int w = rl->filter;                       // filter size
  float inverse_area = 1.f/(w*w);           // normalisation factor
  float Dxx, Dyy, Dxy;

  for(int r, c, ar = 0, index = 0; ar < rl->height; ++ar) 
  {
    for(int ac = 0; ac < rl->width; ++ac, index++) 
    {
      // get the image coordinates
      r = ar * step;
      c = ac * step; 

      // Compute response components
      Dxx = BoxIntegral(img, r - l + 1, c - b, 2*l - 1, w)
          - BoxIntegral(img, r - l + 1, c - l * 0.5, 2*l - 1, l)*3;
      Dyy = BoxIntegral(img, r - b, c - l + 1, w, 2*l - 1)
          - BoxIntegral(img, r - l * 0.5, c - l + 1, l, 2*l - 1)*3;
      Dxy = + BoxIntegral(img, r - l, c + 1, l, l)
            + BoxIntegral(img, r + 1, c - l, l, l)
            - BoxIntegral(img, r - l, c - l, l, l)
            - BoxIntegral(img, r + 1, c + 1, l, l);

      // Normalise the filter responses with respect to their size
      Dxx *= inverse_area;
      Dyy *= inverse_area;
      Dxy *= inverse_area;

      // Get the determinant of hessian response & laplacian sign
      responses[index] = (Dxx * Dyy - 0.81f * Dxy * Dxy);
      laplacian[index] = (Dxx + Dyy >= 0 ? 1 : 0);

#ifdef RL_DEBUG
      // create list of the image coords for each response
      rl->coords.push_back(std::make_pair<int,int>(r,c));
#endif
    }
  }
}

Some questions:

The function is inlined. Is that a good idea? Would inline assembly provide a significant speedup?

c++

iphone

performance

optimization

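6 Answers

Stack Overflow user

Accepted answer

Posted 2010-09-08 22:12:25

Specialize the edges so that you don't need to check for them at every row and column. I assume this call sits inside a nested loop and is called many times. The function would become: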

inline float BoxIntegralNonEdge(IplImage *img, int row, int col, int rows, int cols) 
{
  float *data = (float *) img->imageData;
  int step = img->widthStep/sizeof(float);

  // The subtraction by one for row/col is because row/col is inclusive.
  int r1 = row - 1;
  int c1 = col - 1;
  int r2 = row + rows - 1;
  int c2 = col + cols - 1;

  float A(data[r1 * step + c1]), B(data[r1 * step + c2]), C(data[r2 * step + c1]), D(data[r2 * step + c2]);

  return std::max(0.f, A - B - C + D);
}

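You get rid of a conditional and a branch for each min, and two conditionals and a branch for each if. You may only call this function when the conditions are already met: check them once in the caller for the whole row instead of at every pixel.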
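One way the caller-side check could look is sketched below. It is only an illustration under stated assumptions: `IntegralImage` is a hypothetical stand-in for the `IplImage` fields the function reads, and `BoxIntegralChecked` and `BoxRow` are made-up names. The row test is done once, the columns are split into a left edge, an interior run and a right edge, and only the edge runs pay for the checks.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the IplImage fields that BoxIntegral reads.
struct IntegralImage {
    int width;
    int height;
    int step;                 // floats per row, i.e. widthStep / sizeof(float)
    std::vector<float> data;  // row-major integral image
};

// General version with the same clamping and checks as the question's code.
inline float BoxIntegralChecked(const IntegralImage& img, int row, int col,
                                int rows, int cols) {
    const float* data = img.data.data();
    const int r1 = std::min(row, img.height) - 1;
    const int c1 = std::min(col, img.width) - 1;
    const int r2 = std::min(row + rows, img.height) - 1;
    const int c2 = std::min(col + cols, img.width) - 1;
    float A = 0.0f, B = 0.0f, C = 0.0f, D = 0.0f;
    if (r1 >= 0 && c1 >= 0) A = data[r1 * img.step + c1];
    if (r1 >= 0 && c2 >= 0) B = data[r1 * img.step + c2];
    if (r2 >= 0 && c1 >= 0) C = data[r2 * img.step + c1];
    if (r2 >= 0 && c2 >= 0) D = data[r2 * img.step + c2];
    return std::max(0.0f, A - B - C + D);
}

// Unchecked version; only valid when the whole box lies inside the image.
inline float BoxIntegralNonEdge(const IntegralImage& img, int row, int col,
                                int rows, int cols) {
    assert(row >= 1 && col >= 1 && rows >= 0 && cols >= 0);
    assert(row + rows <= img.height && col + cols <= img.width);
    const float* data = img.data.data();
    const int r1 = row - 1, c1 = col - 1;
    const int r2 = row + rows - 1, c2 = col + cols - 1;
    return std::max(0.0f, data[r1 * img.step + c1] - data[r1 * img.step + c2]
                        - data[r2 * img.step + c1] + data[r2 * img.step + c2]);
}

// Evaluates the box whose top-left corner is (row, c) for every column c.
inline std::vector<float> BoxRow(const IntegralImage& img, int row,
                                 int rows, int cols) {
    std::vector<float> out(img.width);
    // Row condition tested once per row, not once per pixel.
    const bool rowInside =
        rows >= 0 && cols >= 0 && row >= 1 && row + rows <= img.height;
    // Columns in [interiorBegin, interiorEnd) satisfy c >= 1 and c + cols <= width.
    const int interiorBegin = rowInside ? std::min(1, img.width) : img.width;
    const int interiorEnd = rowInside
        ? std::min(img.width, std::max(interiorBegin, img.width - cols + 1))
        : img.width;
    int c = 0;
    for (; c < interiorBegin; ++c) out[c] = BoxIntegralChecked(img, row, c, rows, cols);
    for (; c < interiorEnd; ++c)   out[c] = BoxIntegralNonEdge(img, row, c, rows, cols);
    for (; c < img.width; ++c)     out[c] = BoxIntegralChecked(img, row, c, rows, cols);
    return out;
}
```

Whether the split pays off on a given device would still have to be measured; the gain depends on how much of the image is interior for the filter sizes in use.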

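I wrote up some tips for optimizing image processing when you have to do work on every pixel:

http://www.atalasoft.com/cs/blogs/loufranco/archive/2006/04/28/9985.aspx

Other points from the blog:

1. You are recalculating a position in the image data with 2 multiplications (indexing is a multiplication). You should be incrementing a pointer instead.
2. Instead of passing in img and the row and column arguments (row, rows, col and cols), pass in pointers to the exact pixels to process, obtained by incrementing pointers rather than by indexing.
3. If you don't do the above, step is the same for all pixels, so calculate it in the caller and pass it in. If you do 1 and 2, you don't need step at all.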
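The pointer-increment idea from points 1 and 2 could be sketched roughly as follows. The function name `BoxRowByPointer` and its parameters (including `stride`, the distance in pixels between consecutive boxes) are hypothetical, and the sketch assumes every box in the run lies fully inside the image, so the caller has to guarantee `row + rows <= height` and `col + (count - 1) * stride + cols <= width`.

```cpp
#include <algorithm>
#include <cassert>

// Computes `count` box integrals along one row of a row-major integral image.
// Box k has its top-left corner at (row, col + k * stride).
// `step` is the number of floats per image row; it is used once, up front.
inline void BoxRowByPointer(const float* integral, int step,
                            int row, int col, int rows, int cols,
                            int count, int stride, float* out) {
    assert(row >= 1 && col >= 1 && rows >= 0 && cols >= 0);
    assert(count >= 0 && stride >= 1);

    // Locate the four corners of the first box once.
    const float* a = integral + (row - 1) * step + (col - 1);  // top-left
    const float* b = a + cols;                                  // top-right
    const float* c = a + rows * step;                           // bottom-left
    const float* d = c + cols;                                  // bottom-right

    for (int k = 0; k < count; ++k) {
        out[k] = std::max(0.0f, *a - *b - *c + *d);
        // Advance all four corners; no multiplications inside the loop.
        a += stride;
        b += stride;
        c += stride;
        d += stride;
    }
}
```

In `buildResponseLayer` the eight box shapes per response would each need their own set of corner pointers, so this trades index arithmetic for more live pointers; on a register-constrained target that trade-off is worth profiling rather than assuming.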

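Votes: 8

Stack Overflow user

Posted 2010-09-08 22:11:46

There are a few places where temporary variables can be reused, but whether that improves performance would have to be measured, as another user here said:

Change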

  if (r1 >= 0 && c1 >= 0) A = data[r1 * step + c1]; 
  if (r1 >= 0 && c2 >= 0) B = data[r1 * step + c2]; 
  if (r2 >= 0 && c1 >= 0) C = data[r2 * step + c1]; 
  if (r2 >= 0 && c2 >= 0) D = data[r2 * step + c2];

to

  if (r1 >= 0) {
    int r1Step = r1 * step;
    if (c1 >= 0) A = data[r1Step + c1]; 
    if (c2 >= 0) B = data[r1Step + c2]; 
  }
  if (r2 >= 0) {
    int r2Step = r2 * step;
    if (c1 >= 0) C = data[r2Step + c1]; 
    if (c2 >= 0) D = data[r2Step + c2]; 
  }

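In cases where the if conditions are rarely true, you may actually end up performing the temporary multiplication more often than the original code does.

Votes: 1

Stack Overflow user

Posted 2010-09-08 22:36:02

You are not interested in the four variables A, B, C, D individually, only in the combination A - B - C + D.

Try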

float result(0.0f);
if (r1 >= 0 && c1 >= 0) result += data[r1 * step + c1];
if (r1 >= 0 && c2 >= 0) result -= data[r1 * step + c2];
if (r2 >= 0 && c1 >= 0) result -= data[r2 * step + c1];
if (r2 >= 0 && c2 >= 0) result += data[r2 * step + c2];

if (result > 0.0f) return result;  // 0f is not a valid C++ literal
return 0.0f;
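Put together as a complete function, the accumulating variant might look like the sketch below. The name `BoxIntegralAccumulate` and the plain-pointer signature (with `width`, `height` and `step` passed explicitly instead of an `IplImage`) are assumptions made so the example compiles on its own.

```cpp
#include <algorithm>
#include <cassert>

// Same clamping as the question's BoxIntegral, but the four corner values are
// folded into one running sum instead of being stored in A, B, C and D.
inline float BoxIntegralAccumulate(const float* data, int step,
                                   int width, int height,
                                   int row, int col, int rows, int cols) {
    assert(data != nullptr && step >= width);

    // The subtraction by one is because row/col is inclusive.
    const int r1 = std::min(row, height) - 1;
    const int c1 = std::min(col, width) - 1;
    const int r2 = std::min(row + rows, height) - 1;
    const int c2 = std::min(col + cols, width) - 1;

    float result = 0.0f;
    if (r1 >= 0 && c1 >= 0) result += data[r1 * step + c1];  // + A
    if (r1 >= 0 && c2 >= 0) result -= data[r1 * step + c2];  // - B
    if (r2 >= 0 && c1 >= 0) result -= data[r2 * step + c1];  // - C
    if (r2 >= 0 && c2 >= 0) result += data[r2 * step + c2];  // + D

    return std::max(0.0f, result);
}
```

The terms are applied in the same order as in `A - B - C + D`, so the rounding behaviour should match the original expression; any speed difference comes only from dropping the four temporaries, and a compiler may already do that on its own.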

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/3668454
