Q: Cache line alignment optimization does not reduce cache misses

Stack Overflow user

Asked on 2021-12-17 21:57:24

1 answer, 98 views, 0 followers, score 1

I got this piece of code from http://blog.kongfy.com/2016/10/cache-coherence-sequential-consistency-and-memory-barrier/. It demonstrates how cache line alignment optimizes performance by reducing "false sharing".

Code:

/*
 * Demo program for showing the drawback of "false sharing"
 *
 * Use it with perf!
 *
 * Compile: g++ -O2 -o false_share false_share.cpp -lpthread
 * Usage: perf stat -e cache-misses ./false_share <loopcount> <is_aligned>
 */

#include <pthread.h>
#include <stdint.h>  /* int64_t */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>

#define CACHE_ALIGN_SIZE 64
#define CACHE_ALIGNED __attribute__((aligned(CACHE_ALIGN_SIZE)))

int gLoopCount;

inline int64_t current_time()
{
  struct timeval t;
  if (gettimeofday(&t, NULL) < 0) {
  }
  return (static_cast<int64_t>(t.tv_sec) * static_cast<int64_t>(1000000) + static_cast<int64_t>(t.tv_usec));
}

struct value {
  int64_t val;
};
value data[2] CACHE_ALIGNED;

struct aligned_value {
  int64_t val;
} CACHE_ALIGNED;
aligned_value aligned_data[2] CACHE_ALIGNED;

void* worker1(int64_t *val)
{
  printf("worker1 start...\n");

  volatile int64_t &v = *val;
  for (int i = 0; i < gLoopCount; ++i) {
    v += 1;
  }

  printf("worker1 exit...\n");
  return NULL;  /* a non-void function must return a value */
}

// duplicate worker function for perf report
void* worker2(int64_t *val)
{
  printf("worker2 start...\n");

  volatile int64_t &v = *val;
  for (int i = 0; i < gLoopCount; ++i) {
    v += 1;
  }

  printf("worker2 exit...\n");
  return NULL;  /* a non-void function must return a value */
}

int main(int argc, char *argv[])
{
  pthread_t race_thread_1;
  pthread_t race_thread_2;

  bool is_aligned;

  /* Check arguments to program*/
  if(argc != 3) {
    fprintf(stderr, "USAGE: %s <loopcount> <is_aligned>\n", argv[0]);
    exit(1);
  }

  /* Parse argument */
  gLoopCount = atoi(argv[1]); /* Don't bother with format checking */
  is_aligned = atoi(argv[2]); /* Don't bother with format checking */

  printf("size of unaligned data : %zu\n", sizeof(data));          /* %zu matches size_t */
  printf("size of aligned data   : %zu\n", sizeof(aligned_data));

  void *val_0, *val_1;
  if (is_aligned) {
    val_0 = (void *)&aligned_data[0].val;
    val_1 = (void *)&aligned_data[1].val;
  } else {
    val_0 = (void *)&data[0].val;
    val_1 = (void *)&data[1].val;
  }

  int64_t start_time = current_time();

  /* Start the threads */
  pthread_create(&race_thread_1, NULL, (void* (*)(void*))worker1, val_0);
  pthread_create(&race_thread_2, NULL, (void* (*)(void*))worker2, val_1);

  /* Wait for the threads to end */
  pthread_join(race_thread_1, NULL);
  pthread_join(race_thread_2, NULL);

  int64_t end_time = current_time();

  printf("time : %lld us\n", (long long)(end_time - start_time));  /* %d is wrong for int64_t */

  return 0;
}

Expected result:

[jingyan.kfy@OceanBase224006 work]$ perf stat -e cache-misses ./false_share 100000000 0
size of unaligned data : 16
size of aligned data   : 128
worker2 start...
worker1 start...
worker1 exit...
worker2 exit...
time : 452451 us

 Performance counter stats for './false_share 100000000 0':

         3,105,245 cache-misses

       0.455033803 seconds time elapsed

[jingyan.kfy@OceanBase224006 work]$ perf stat -e cache-misses ./false_share 100000000 1
size of unaligned data : 16
size of aligned data   : 128
worker1 start...
worker2 start...
worker1 exit...
worker2 exit...
time : 326994 us

 Performance counter stats for './false_share 100000000 1':

            27,735 cache-misses

       0.329737667 seconds time elapsed

However, when I ran the code myself, I got very close run times, and the cache-miss count was even lower without alignment:

My result:

$ perf stat -e cache-misses ./false_share 100000000 0
size of unaligned data : 16
size of aligned data   : 128
worker1 start...
worker2 start...
worker2 exit...
worker1 exit...
time : 169465 us

 Performance counter stats for './false_share 100000000 0':

            37,698      cache-misses:u                                              

       0.171625603 seconds time elapsed

       0.334919000 seconds user
       0.001988000 seconds sys


$ perf stat -e cache-misses ./false_share 100000000 1
size of unaligned data : 16
size of aligned data   : 128
worker2 start...
worker1 start...
worker2 exit...
worker1 exit...
time : 118798 us

 Performance counter stats for './false_share 100000000 1':

            38,375      cache-misses:u                                              

       0.121072715 seconds time elapsed

       0.230043000 seconds user
       0.001973000 seconds sys

How should I understand this inconsistency?

c++

gcc

caching

optimization

g++

1 answer

Stack Overflow user

Accepted answer

Posted on 2021-12-25 03:33:59

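It's hard to help, since the blog you refer to is in Chinese. Still, I noticed that the first figure seems to show a multi-socket architecture, so I ran a few experiments.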

a) My PC, Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz, single socket, two cores, two threads per core:

0:

time : 195389 us

 Performance counter stats for './a.out 100000000 0':

             8 980      cache-misses:u                                              

       0,198584628 seconds time elapsed

       0,391694000 seconds user
       0,000000000 seconds sys

and 1:

time : 191413 us

 Performance counter stats for './a.out 100000000 1':

             9 020      cache-misses:u                                              

       0,192953853 seconds time elapsed

       0,378434000 seconds user
       0,000000000 seconds sys

Not much difference.

b) Now a 2-socket workstation:

Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Model name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz

0:

time : 454679 us

 Performance counter stats for './a.out 100000000 0':

         5,644,133      cache-misses                                                

       0.456665966 seconds time elapsed

       0.738173000 seconds user

1:

time : 346871 us

 Performance counter stats for './a.out 100000000 1':

            42,217      cache-misses                                                

       0.348814583 seconds time elapsed

       0.539676000 seconds user
       0.000000000 seconds sys

Big difference.

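One last remark. You write:

"the cache-miss count is even lower without alignment"

No, it isn't. Your processor is running various tasks besides your program. You are also running two threads, which may access the cache in different time sequences. All of this can affect cache utilization. You need to repeat your measurements several times and compare. Personally, when I see performance results that differ by less than 10%, I consider them indistinguishable.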

Update

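I also experimented with the code extended to 3 threads, so that some of them certainly have to run on different cores and hence share only the L3 cache.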

I looked at "How to catch the L3-cache hits and misses by perf tool in Linux" and came up with the following command:

 perf stat -e cache-misses,cache-references,LLC-loads,LLC-stores,L1-dcache-load-misses,L1-dcache-prefetch-misses,L1-dcache-store-misses ./a.out 100000000 0

0:

time : 214253 us

 Performance counter stats for './a.out 100000000 0':

             4 765      cache-misses:u            #    0,018 % of all cache refs      (57,39%)
        25 992 887      cache-references:u                                            (57,56%)
        17 430 736      LLC-loads:u                                                   (57,56%)
         8 591 378      LLC-stores:u                                                  (57,56%)
        28 110 342      L1-dcache-load-misses:u                                       (57,40%)
        14 661 378      L1-dcache-prefetch-misses:u                                     (57,80%)
            32 269      L1-dcache-store-misses:u                                      (57,49%)

       0,215484922 seconds time elapsed

       0,627426000 seconds user
       0,006635000 seconds sys

1:

time : 194253 us

 Performance counter stats for './a.out 100000000 1':

             4 509      cache-misses:u            #   30,715 % of all cache refs      (57,15%)
            14 680      cache-references:u                                            (57,45%)
             7 954      LLC-loads:u                                                   (57,49%)
             1 565      LLC-stores:u                                                  (57,92%)
             4 442      L1-dcache-load-misses:u                                       (57,91%)
               836      L1-dcache-prefetch-misses:u                                     (57,02%)
               984      L1-dcache-store-misses:u                                      (56,85%)

       0,195145645 seconds time elapsed

       0,569986000 seconds user
       0,000000000 seconds sys

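Hence:

- The aligned (3-thread) version runs systematically (a bit) faster than the unaligned one (I repeated the tests many times), even on a single-socket machine.
- It's not very clear what cache-misses actually reports.
- There's a large (numeric) penalty for "false data sharing" in the L1 cache and LLC counters. These statistics are hardware-based: if other processes are running, they add their contribution to these results.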

Score: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/70399271
