首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >gcc原子结构如此缓慢,这是正常的吗?

gcc原子结构如此缓慢,这是正常的吗?
EN

Stack Overflow用户
提问于 2012-07-23 08:27:16
回答 3查看 9.4K关注 0票数 8

我有一个应用程序,在这个应用程序中,我必须在多线程方法中增加一些统计计数器。增量必须是线程安全的,所以我决定使用gcc原子内置的__sync_add_and_fetch()函数。为了了解它们的影响,我做了一些简单的性能测试,并注意到这些函数比简单的前后增量要慢得多。

下面是我创建的测试程序:

代码语言:javascript
复制
#include <iostream>
#include <pthread.h>
#include <time.h>

using namespace std;

uint64_t diffTimes(struct timespec &start, struct timespec &end)
{
  if(start.tv_sec == end.tv_sec)
  {
    return end.tv_nsec - start.tv_nsec;
  }
  else if(start.tv_sec < end.tv_sec)
  {
    uint64_t nsecs = (end.tv_sec - start.tv_sec) * 1000000000;
    return nsecs + end.tv_nsec - start.tv_nsec;
  }
  else
  {
    // this is actually an error
    return 0;
  }
}

void outputResult(const char *msg, struct timespec &start, struct timespec &end, uint32_t numIterations, uint64_t val)
{
  uint64_t diff = diffTimes(start, end);
  cout << msg << ": "
       << "\n\t iterations: " << numIterations
       << ", result: " << val
       << "\n\t times [start, end] =  [" << start.tv_sec << ", " << start.tv_nsec << "]"
       << "\n\t [" << end.tv_sec << ", " << end.tv_nsec << "]"
       << "\n\t [total, avg] = [" << diff
       << ", " << (diff/numIterations) << "] nano seconds"
       << endl;
}

int main(int argc, char **argv)
{
  struct timespec start, end;
  uint64_t val = 0;
  uint32_t numIterations = 1000000;

  //
  // NON ATOMIC pre increment
  //
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
  for(uint32_t i = 0; i < numIterations; ++i)
  {
    ++val;
  }
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

  outputResult("Non-Atomic pre-increment", start, end, numIterations, val);
  val = 0;

  //
  // NON ATOMIC post increment
  //
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
  for(uint32_t i = 0; i < numIterations; ++i)
  {
    val++;
  }
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

  outputResult("Non-Atomic post-increment", start, end, numIterations, val);
  val = 0;

  //
  // ATOMIC add and fetch
  //
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
  for(uint32_t i = 0; i < numIterations; ++i)
  {
    __sync_add_and_fetch(&val, 1);
  }
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

  outputResult("Atomic add and fetch", start, end, numIterations, val);
  val = 0;

  //
  // ATOMIC fetch and add
  //
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
  for(uint32_t i = 0; i < numIterations; ++i)
  {
    __sync_fetch_and_add(&val, 1);
  }
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

  outputResult("Atomic fetch and add", start, end, numIterations, val);
  val = 0;

  //
  // Mutex protected post-increment
  //
  pthread_mutex_t mutex;
  pthread_mutex_init(&mutex, NULL);
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
  for(uint32_t i = 0; i < numIterations; ++i)
  {
    pthread_mutex_lock(&mutex);
    val++;
    pthread_mutex_unlock(&mutex);
  }
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

  outputResult("Mutex post-increment", start, end, numIterations, val);
  val = 0;

  //
  // RWlock protected post-increment
  //
  pthread_rwlock_t rwlock;
  pthread_rwlock_init(&rwlock, NULL);
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
  for(uint32_t i = 0; i < numIterations; ++i)
  {
    pthread_rwlock_wrlock(&rwlock);
    val++;
    pthread_rwlock_unlock(&rwlock);
  }
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

  outputResult("RWlock post-increment", start, end, numIterations, val);
  val = 0;

  return 0;
}

以下是研究结果:

代码语言:javascript
复制
# ./atomicVsNonAtomic
Non-Atomic pre-increment:
         iterations: 1000000, result: 1000000
         times [start, end] =  [0, 1585375]
         [0, 1586185]
         [total, avg] = [810, 0] nano seconds
Non-Atomic post-increment:
         iterations: 1000000, result: 1000000
         times [start, end] =  [0, 1667489]
         [0, 1667920]
         [total, avg] = [431, 0] nano seconds
Atomic add and fetch:
         iterations: 1000000, result: 1000000
         times [start, end] =  [0, 1682037]
         [0, 16595016]
         [total, avg] = [14912979, 14] nano seconds
Atomic fetch and add:
         iterations: 1000000, result: 1000000
         times [start, end] =  [0, 16617178]
         [0, 31499571]
         [total, avg] = [14882393, 14] nano seconds
Mutex post-increment:
         iterations: 1000000, result: 1000000
         times [start, end] =  [0, 31526810]
         [0, 68515763]
         [total, avg] = [36988953, 36] nano seconds
RWlock post-increment:
         iterations: 1000000, result: 1000000
         times [start, end] =  [0, 68547649]
         [0, 110877351]
         [total, avg] = [42329702, 42] nano seconds

这是gcc的汇编:

代码语言:javascript
复制
g++ -o atomicVsNonAtomic.o -c -march=i686 -O2 -I. atomicVsNonAtomic.cc
g++ -o atomicVsNonAtomic atomicVsNonAtomic.o -lrt -lpthread

以及相关信息和版本:

代码语言:javascript
复制
# gcc --version
gcc (GCC) 4.3.2
Copyright (C) 2008 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

# uname -a
Linux gtcba2v1 2.6.32.12-0.7-default #1 SMP 2010-05-20 11:14:20 +0200 i686 i686 i386 GNU/Linux

现在,对于实际的问题:)原子操作这么慢是正常的吗?

一百万次迭代的差异是:

  • 简单后增量: 431纳米秒
  • 原子获取和添加操作: 14882393纳米秒

当然,我理解原子操作应该更昂贵,但这似乎被夸大了。为了完整起见,我还检查了一个p线程互斥锁和rwlock。至少原子操作比线程操作更快,但是我仍然想知道我是否做错了什么。如果不指定-march=i686选项,我无法让它链接,这可能会产生影响吗?

更新:

我对-O2编译器进行了优化,得到了更一致的结果如下:

代码语言:javascript
复制
# ./atomicVsNonAtomic
Non-Atomic pre-increment:
         iterations: 1000000, result: 1000000
         times [start, end] =  [0, 1647303]
         [0, 4171164]
         [total, avg] = [2523861, 2] nano seconds
Non-Atomic post-increment:
         iterations: 1000000, result: 1000000
         times [start, end] =  [0, 4310230]
         [0, 7262704]
         [total, avg] = [2952474, 2] nano seconds
Atomic add and fetch:
         iterations: 1000000, result: 1000000
         times [start, end] =  [0, 7285996]
         [0, 25919067]
         [total, avg] = [18633071, 18] nano seconds
Atomic fetch and add:
         iterations: 1000000, result: 1000000
         times [start, end] =  [0, 25941677]
         [0, 44544234]
         [total, avg] = [18602557, 18] nano seconds
Mutex post-increment:
         iterations: 1000000, result: 1000000
         times [start, end] =  [0, 44573933]
         [0, 82318615]
         [total, avg] = [37744682, 37] nano seconds
RWlock post-increment:
         iterations: 1000000, result: 1000000
         times [start, end] =  [0, 82344866]
         [0, 125124498]
         [total, avg] = [42779632, 42] nano seconds
EN

回答 3

Stack Overflow用户

回答已采纳

发布于 2012-07-23 08:40:18

答案是GCC优化了你的非原子增量。当它看到一个循环时:

代码语言:javascript
复制
for (int i=0; i<N; i++) x++;

它将其替换为:

代码语言:javascript
复制
x += N;

这可以在生成的程序集中看到,该程序集包含:

代码语言:javascript
复制
call    clock_gettime
leal    -32(%ebp), %edx
addl    $1000000, -40(%ebp)     <- increment by 1000000
adcl    $0, -36(%ebp)
movl    %edx, 4(%esp)
movl    $2, (%esp)
call    clock_gettime

所以你不是在衡量你认为自己是什么。

您可以使变量volatile来防止这种优化。在我的计算机上,在这样做之后,非原子访问的速度大约是原子访问的8倍。当使用32位变量而不是64位(我正在编译为32位)时,差异下降到大约3倍。

票数 19
EN

Stack Overflow用户

发布于 2012-07-23 08:40:38

我猜gcc正在优化你的非原子增量操作

代码语言:javascript
复制
val += numIterations;

您说10^6的增量需要431纳秒,每循环迭代要计算到0.000431 ns。在一个4 GHz处理器上,时钟周期是0.25 ns,所以很明显,循环被优化了。这解释了你所看到的巨大的性能差异。

编辑:您测量了一个原子操作需要14 ns --假设再次使用一个4 GHz处理器,计算出56个周期,这是相当不错的!

票数 6
EN

Stack Overflow用户

发布于 2012-07-23 09:35:10

任何同步机制的慢度都不能用一个线程来衡量。像POSIX互斥/Windows关键部分这样的单进程同步对象在被竞争时实际上只需要花费时间。

您需要引入几个线程--做其他反映实际应用程序时间的工作--这样同步方法才能真正了解需要多长时间。

票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/11608869

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档