I ran a Java test on Ubuntu 16.04 and found a performance difference between running with PTI on and off.
My host has an Ivy Bridge CPU (2 cores, 4 hyper-threads) at 1.6 GHz and 16 GB of memory.
I tried to use perf to analyze where the difference comes from, as shown below.
With pti=off in grub.cfg:
# perf stat -e bus-cycles,cache-misses,cache-references,L1-dcache-load-misses,dTLB-load-misses,L1-dcache-prefetch-misses,LLC-prefetches ./test.sh
Performance counter stats for './test.sh':
774,986,827 bus-cycles (59.13%)
24,044,906 cache-misses # 12.803 % of all cache refs (58.17%)
187,799,652 cache-references (57.51%)
207,345,039 L1-dcache-load-misses (57.65%)
13,081,612 dTLB-load-misses (58.85%)
22,678,453 L1-dcache-prefetch-misses (59.62%)
24,089,506 LLC-prefetches (59.99%)
6.210151360 seconds time elapsed

With pti=on (the default in the Linux kernel), I get:
# perf stat -e bus-cycles,cache-misses,cache-references,L1-dcache-load-misses,dTLB-load-misses,L1-dcache-prefetch-misses,LLC-prefetches ./test.sh
Performance counter stats for './test.sh':
1,205,903,578 bus-cycles (57.92%)
23,877,107 cache-misses # 13.167 % of all cache refs (57.31%)
181,340,147 cache-references (57.46%)
206,177,901 L1-dcache-load-misses (58.42%)
63,285,591 dTLB-load-misses (59.06%)
24,012,988 L1-dcache-prefetch-misses (58.65%)
24,928,410 LLC-prefetches (58.23%)
10.344839116 seconds time elapsed

test.sh is the program being profiled. From the perf output above, test.sh takes noticeably more time with pti=on than with pti=off, but the event counts don't make clear where the difference comes from.
Are there other perf events that could help in this case?
Update: added more perf events.
PTI=off
# perf stat --repeat 5 -e cache-references,cache-misses,cpu-cycles,ref-cycles,faults,L1-dcache-loads,L1-dcache-load-misses,L1-icache-load-misses,branches,branch-misses,node-loads,node-load-misses,instructions,cs java mytest
Performance counter stats for 'java mytest' (5 runs):
8,711,306 cache-references ( +- 4.13% ) (48.49%)
1,290,234 cache-misses # 14.811 % of all cache refs ( +- 4.04% ) (49.44%)
709,587,381 cpu-cycles ( +- 1.44% ) (48.91%)
671,299,480 ref-cycles ( +- 1.95% ) (58.09%)
5,918 faults ( +- 0.12% )
185,928,475 L1-dcache-loads ( +- 4.29% ) (35.90%)
9,249,983 L1-dcache-load-misses # 4.98% of all L1-dcache hits ( +- 5.91% ) (27.84%)
4,718,632 L1-icache-load-misses ( +- 5.47% ) (20.83%)
106,021,866 branches ( +- 1.98% ) (31.56%)
4,487,091 branch-misses # 4.23% of all branches ( +- 5.35% ) (40.34%)
450,170 node-loads ( +- 9.18% ) (38.32%)
0 node-load-misses (40.62%)
509,344,631 instructions # 0.72 insns per cycle ( +- 5.59% ) (49.64%)
458 cs ( +- 2.05% )
0.216794242 seconds time elapsed ( +- 3.44% )

PTI=ON
# perf stat --repeat 5 -e cache-references,cache-misses,cpu-cycles,ref-cycles,faults,L1-dcache-loads,L1-dcache-load-misses,L1-icache-load-misses,branches,branch-misses,node-loads,node-load-misses,instructions,cs java mytest
Performance counter stats for 'java mytest' (5 runs):
10,109,469 cache-references ( +- 4.10% ) (44.67%)
1,360,012 cache-misses # 13.453 % of all cache refs ( +- 2.16% ) (45.28%)
1,199,960,141 cpu-cycles ( +- 2.44% ) (46.13%)
1,086,243,141 ref-cycles ( +- 1.28% ) (54.64%)
5,923 faults ( +- 0.24% )
163,902,394 L1-dcache-loads ( +- 3.46% ) (41.91%)
8,588,505 L1-dcache-load-misses # 5.24% of all L1-dcache hits ( +- 5.59% ) (27.82%)
5,576,811 L1-icache-load-misses ( +- 3.87% ) (18.41%)
117,508,300 branches ( +- 3.98% ) (27.34%)
4,878,640 branch-misses # 4.15% of all branches ( +- 2.28% ) (35.55%)
585,464 node-loads ( +- 9.05% ) (34.55%)
0 node-load-misses (36.68%)
614,773,322 instructions # 0.51 insns per cycle ( +- 4.11% ) (46.10%)
476 cs ( +- 2.75% )
0.375871969 seconds time elapsed ( +- 0.81% )

Posted on 2018-06-13 08:58:55
I don't know which event "bus-cycles" actually counts. Core clock cycles are usually more relevant.
Either way, PTI=on makes every system call (and every other entry into the kernel) more expensive, because it has to modify the x86 CR3 control register (installing a new top-level page-table pointer). That's how it isolates user-space access from the kernel's page tables.
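One way to see this per-entry cost directly (a sketch, not from the original question) is to time a syscall-heavy command under both settings; here dd is used to issue roughly a million read()/write() syscall pairs:

```shell
# Each 1-byte block costs one read() plus one write() syscall, so this
# performs roughly two million kernel entries. Comparing the elapsed
# time with pti=on vs. pti=off gives a rough per-syscall overhead.
time dd if=/dev/zero of=/dev/null bs=1 count=1000000 2>/dev/null
```

With PTI on, each of those kernel entries pays for the CR3 write (plus any extra TLB refills), so the difference in elapsed time divided by the syscall count approximates the added cost per syscall.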
Note the big increase in dTLB-load-misses. With process-context ID (PCID) support, PTI may be able to avoid fully flushing the TLB on every entry into the kernel, but I don't know the details. (Without PCID, replacing the page tables invalidates the entire TLB.)
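To check whether your CPU advertises PCID (and INVPCID, which lets the kernel flush individual context IDs), you can look at the flags in /proc/cpuinfo; a quick check assuming an x86 Linux host:

```shell
# Prints "pcid" and/or "invpcid" if the CPU supports them; no output
# means PTI must fully flush the TLB on each CR3 switch.
grep -o -w -e pcid -e invpcid /proc/cpuinfo | sort -u
```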
You can use strace -c to get per-system-call time totals.
With perf record (given sufficient permissions), you can record samples that include kernel code, so you can see which instructions in the kernel are actually taking a long time. (The mov to CR3 itself takes time, and Spectre mitigations also cost time, separately from the Meltdown mitigation (PTI). But I think most of the Meltdown-mitigation cost comes from TLB misses, inside the kernel for a while and again after returning to user space, rather than from the page-table swap itself.)
https://stackoverflow.com/questions/50827560