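I'm running a simulation of DNA strands, which involves a lot of floating-point arithmetic. The full code is here: https://github.com/RoaldFre/DNA
After compiling with both gcc and clang, I did some profiling with google-perftools. In both cases, the Google tools say that about 28% of the time is spent in fegetexcept(). This appears to be a C library function that queries the CPU's floating-point exception flags.
Note that I'm using -ffast-math with gcc, which should ignore (all?) floating-point exceptions, if I'm not mistaken! I'm also using -O4 with clang (is there a separate flag to enable unsafe floating-point instructions there?).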
Profiling output for the binary compiled with clang:
Total: 1561 samples
438 28.1% 28.1% 438 28.1% fegetexcept
263 16.8% 44.9% 263 16.8% cos
224 14.3% 59.3% 224 14.3% Vdihedral
131 8.4% 67.6% 131 8.4% nearestImageVector
102 6.5% 74.2% 102 6.5% Fexclusion
70 4.5% 78.7% 70 4.5% Fdihedral
65 4.2% 82.8% 65 4.2% Fangle
53 3.4% 86.2% 53 3.4% integratorTaskTick
46 2.9% 89.2% 46 2.9% nearestImageDistance
45 2.9% 92.1% 45 2.9% mutiallyExclusivePairForces
32 2.0% 94.1% 32 2.0% FCoulomb
24 1.5% 95.6% 24 1.5% forEveryPairD
17 1.1% 96.7% 17 1.1% calculateForces
14 0.9% 97.6% 14 0.9% atan2
8 0.5% 98.1% 8 0.5% Fstack
8 0.5% 98.7% 8 0.5% pairWrapper
6 0.4% 99.0% 6 0.4% sin
3 0.2% 99.2% 3 0.2% __finite
3 0.2% 99.4% 3 0.2% _init
2 0.1% 99.6% 2 0.1% acos
2 0.1% 99.7% 2 0.1% exp
2 0.1% 99.8% 2 0.1% log
1 0.1% 99.9% 1 0.1% _IO_file_xsputn
1 0.1% 99.9% 1 0.1% log2
1 0.1% 100.0% 1 0.1% significand
Profiling output for the binary compiled with gcc:
Total: 1561 samples
438 28.1% 28.1% 438 28.1% fegetexcept
352 22.5% 50.6% 352 22.5% nearestImageVector (inline)
263 16.8% 67.5% 263 16.8% cos
131 8.4% 75.8% 131 8.4% measurementTask
52 3.3% 79.2% 331 21.2% Vbond.isra.1.part.2 (inline)
50 3.2% 82.4% 562 36.0% dumpStatsSample.2372 (inline)
46 2.9% 85.3% 46 2.9% nearestImageVector
42 2.7% 88.0% 42 2.7% FdihedralParticle (inline)
33 2.1% 90.1% 59 3.8% Vstack.isra.12.part.13 (inline)
26 1.7% 91.8% 26 1.7% neighbourStackDistance2 (inline)
16 1.0% 92.8% 16 1.0% die
14 0.9% 93.7% 44 2.8% VangleP5SB (inline)
14 0.9% 94.6% 14 0.9% atan2
13 0.8% 95.5% 88 5.6% Vangle.isra.4.part.5 (inline)
12 0.8% 96.2% 343 22.0% Vbond.isra.1 (inline)
8 0.5% 96.7% 8 0.5% printUsage.3788
6 0.4% 97.1% 6 0.4% Fangle.part.8 (inline)
6 0.4% 97.5% 6 0.4% boxFromParticle (inline)
6 0.4% 97.9% 279 17.9% nearestImageDistance (inline)
6 0.4% 98.3% 6 0.4% sin
3 0.2% 98.5% 3 0.2% Vdihedral.isra.9 (inline)
3 0.2% 98.7% 3 0.2% __finite
3 0.2% 98.8% 3 0.2% _init
3 0.2% 99.0% 3 0.2% numParticles (inline)
2 0.1% 99.2% 2 0.1% acos
2 0.1% 99.3% 8 0.5% addToGrid
2 0.1% 99.4% 2 0.1% exp
2 0.1% 99.6% 2 0.1% getAngleBaseInfo (inline)
2 0.1% 99.7% 2 0.1% log
1 0.1% 99.7% 1 0.1% Fbond.part.6 (inline)
1 0.1% 99.8% 1 0.1% _IO_file_xsputn
1 0.1% 99.9% 563 36.1% dumpStatsSample.2372
1 0.1% 99.9% 1 0.1% log2
1 0.1% 100.0% 1 0.1% significand
0 0.0% 100.0% 6 0.4% Fangle (inline)
0 0.0% 100.0% 1 0.1% Fbond (inline)
0 0.0% 100.0% 42 2.7% Fdihedral.part.11 (inline)
0 0.0% 100.0% 4 0.3% Fstack.part.14 (inline)
0 0.0% 100.0% 53 3.4% calculateForces.2880
0 0.0% 100.0% 3 0.2% getKineticTemperature (inline)
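0 0.0% 100.0% 4 0.3% nearestImageDistance2 (inline)
Now, I call a lot of functions through function pointers, and gcc is compiling with LTO and -O4. I have reason to believe this may distort the profiling output of the gcc binary to some extent. For example, it says there are 16 samples in "die()". However, that is impossible, because that function immediately terminates the program!
Either way, both binaries seem to agree on the 28% of time spent in fegetexcept(). Can I get rid of all these checks?
Secondly, my full compiler optimization flags are as follows: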
gcc -march=core2 -O4 -flto -mmmx -msse -msse2 -msse3 -fexcess-precision=fast -ffast-math -finline-limit=2000 -fmerge-all-constants -fmodulo-sched -fmodulo-sched-allow-regmoves -fgcse-sm -fgcse-las -fgcse-after-reload -funsafe-loop-optimizations
and
clang -march=core2 -O4
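Is there anything I can add to improve performance further? I don't care if compile times go through the roof; I need every bit of performance I can get! (Regarding clang: I couldn't find many specific performance flags there. Maybe I should manually go to LLVM bytecode and then pass flags to the LLVM compiler there?)
TL;DR:
(1) The code spends 28% of its time in fegetexcept(). Can opting into "unsafe floating-point code" avoid this?
(2) What flags can I pass to gcc and clang to get maximum performance, even if that sends compile times through the roof?
EDIT
I updated glibc from 2.13-r2 to 2.15-r2, and the profiling output has now changed to:
clang: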
Total: 1654 samples
381 23.0% 23.0% 381 23.0% __asin_finite
244 14.8% 37.8% 244 14.8% significand
203 12.3% 50.1% 203 12.3% Vdihedral
141 8.5% 58.6% 141 8.5% nearestImageVector
116 7.0% 65.6% 116 7.0% Fexclusion
81 4.9% 70.5% 81 4.9% integratorTaskTick
70 4.2% 74.7% 70 4.2% Fangle
63 3.8% 78.5% 63 3.8% FdihedralParticle
56 3.4% 81.9% 56 3.4% mutiallyExclusivePairForces
45 2.7% 84.6% 45 2.7% FCoulomb
42 2.5% 87.2% 42 2.5% _init
39 2.4% 89.5% 39 2.4% __isinf
35 2.1% 91.7% 35 2.1% nearestImageDistance
29 1.8% 93.4% 29 1.8% __lgamma_r_finite
21 1.3% 94.7% 21 1.3% forEveryPairD
16 1.0% 95.6% 16 1.0% Fbond
13 0.8% 96.4% 13 0.8% __isnan
11 0.7% 97.1% 11 0.7% __cosh_finite
10 0.6% 97.7% 10 0.6% Fstack
10 0.6% 98.3% 10 0.6% __acosh_finite
9 0.5% 98.9% 9 0.5% pairWrapper
6 0.4% 99.2% 6 0.4% atan2
5 0.3% 99.5% 5 0.3% Fdihedral
5 0.3% 99.8% 5 0.3% calculateForces
2 0.1% 99.9% 2 0.1% GLIBC_2.15
1 0.1% 100.0% 1 0.1% exp
gcc:
Total: 1768 samples
385 21.8% 21.8% 385 21.8% __asin_finite
275 15.6% 37.3% 275 15.6% significand
252 14.3% 51.6% 252 14.3% nearestImageVector
199 11.3% 62.8% 299 16.9% Vdihedral.isra.4.part.5.2808
55 3.1% 66.0% 902 51.0% FdihedralParticle.2836
47 2.7% 68.6% 150 8.5% Fexclusion.part.15 (inline)
44 2.5% 71.1% 87 4.9% FCoulomb.part.16.2891
36 2.0% 73.1% 36 2.0% _init
33 1.9% 75.0% 236 13.3% mutiallyExclusivePairForces.2699
30 1.7% 76.7% 30 1.7% __lgamma_r_finite
29 1.6% 78.3% 29 1.6% isSaneNumber (inline)
28 1.6% 79.9% 28 1.6% feelExclusion (inline)
27 1.5% 81.4% 27 1.5% __isinf
25 1.4% 82.9% 35 2.0% Fangle.part.11.2855
22 1.2% 84.1% 40 2.3% Fangle.part.11 (inline)
22 1.2% 85.4% 24 1.4% randNorm.part.1.3194
20 1.1% 86.5% 20 1.1% __isnan
19 1.1% 87.6% 23 1.3% nearestImageUnitVector (inline)
19 1.1% 88.6% 19 1.1% pairWrapper.3570
18 1.0% 89.6% 105 5.9% langevinBBKhelper.3161
17 1.0% 90.6% 23 1.3% Fbond.part.10 (inline)
15 0.8% 91.5% 20 1.1% Vdihedral.isra.4.part.5 (inline)
14 0.8% 92.3% 14 0.8% length (inline)
13 0.7% 93.0% 13 0.7% getBasePairInfo (inline)
12 0.7% 93.7% 190 10.7% Fexclusion (inline)
12 0.7% 94.3% 15 0.8% Fstack.part.13 (inline)
12 0.7% 95.0% 12 0.7% __acosh_finite
12 0.7% 95.7% 12 0.7% reboxParticles (inline)
9 0.5% 96.2% 23 1.3% randNorm (inline)
8 0.5% 96.7% 8 0.5% __cosh_finite
8 0.5% 97.1% 221 12.5% visitNeighbours.part.1 (inline)
7 0.4% 97.5% 360 20.4% forEveryPairD
7 0.4% 97.9% 7 0.4% sincos
6 0.3% 98.2% 1467 83.0% calculateForces
5 0.3% 98.5% 943 53.3% Fdihedral.part.12 (inline)
4 0.2% 98.8% 33 1.9% debugVectorSanity (inline)
4 0.2% 99.0% 19 1.1% nearestImageDistance (inline)
3 0.2% 99.2% 317 17.9% Vdihedral.isra.4 (inline)
3 0.2% 99.3% 3 0.2% getAngleBaseInfo (inline)
2 0.1% 99.4% 2 0.1% resetForce.2703
2 0.1% 99.5% 2 0.1% tinymt64_generate_doubleOC (inline)
2 0.1% 99.7% 223 12.6% visitNeighbours (inline)
1 0.1% 99.7% 94 5.3% Fangle (inline)
1 0.1% 99.8% 1 0.1% calcInvDebyeLength (inline)
1 0.1% 99.8% 1 0.1% forEveryParticle
1 0.1% 99.9% 131 7.4% forEveryParticleD
1 0.1% 99.9% 1 0.1% munmap
1 0.1% 100.0% 1 0.1% neighbourStackDistance2 (inline)
0 0.0% 100.0% 1 0.1% 0x3e1341e250d56f1d
0 0.0% 100.0% 23 1.3% Fbond (inline)
0 0.0% 100.0% 1690 95.6% __libc_start_main
0 0.0% 100.0% 380 21.5% forEveryPair (inline)
0 0.0% 100.0% 1689 95.5% integratorTaskTick.3198
0 0.0% 100.0% 1690 95.6% main
0 0.0% 100.0% 1690 95.6% run (inline)
0 0.0% 100.0% 1689 95.5% seqTick.2114
0 0.0% 100.0% 1 0.1% taskStop (inline)
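0 0.0% 100.0% 1689 95.5% taskTick (inline)
So it looks like fegetexcept was probably just a wrong name that got attributed to some code in the glibc math routines. I guess this is a shortcoming of the Google tools?
However, part (2) of my question still stands: what flags can I pass to gcc and clang to get maximum performance, even if that sends compile times through the roof?
EDIT2
Using 'perf' (see, e.g., https://stackoverflow.com/a/10958510/153105) gives a decent profiling output. It looks like most of the time is spent in atan2() and cos(), and that the SSE2 versions are being used. For completeness, I'll add the output: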
# Events: 17K cycles
#
# Overhead Command Shared Object Symbol
# ........ ....... .................... .......................................................
#
21.67% hairpin libm-2.15.so [.] __ieee754_atan2_sse2
14.12% hairpin hairpin [.] nearestImageVector
13.94% hairpin libm-2.15.so [.] __cos_sse2
11.94% hairpin hairpin [.] Vdihedral.isra.4.part.5.2808
8.27% hairpin hairpin [.] mutiallyExclusivePairForces.2699
4.81% hairpin hairpin [.] calculateForces
4.45% hairpin hairpin [.] FdihedralParticle.2836
3.89% hairpin hairpin [.] FCoulomb.part.16.2891
2.17% hairpin hairpin [.] langevinBBKhelper.3161
1.85% hairpin hairpin [.] Fangle.part.11.2855
1.83% hairpin libc-2.15.so [.] __isinf
1.64% hairpin hairpin [.] randNorm.part.1.3194
1.45% hairpin libm-2.15.so [.] __ieee754_log_sse2
1.02% hairpin hairpin [.] forEveryPairD
0.93% hairpin libm-2.15.so [.] __ieee754_acos_sse2
0.76% hairpin hairpin [.] pairWrapper.3570
0.76% hairpin hairpin [.] __isnan@plt
0.74% hairpin libc-2.15.so [.] __isnan
0.68% hairpin hairpin [.] __isinf@plt
0.59% hairpin libm-2.15.so [.] __ieee754_exp_sse2
0.58% hairpin libm-2.15.so [.] __sincos
0.55% hairpin hairpin [.] integratorTaskTick.3198
0.29% hairpin hairpin [.] __atan2_finite@plt
0.23% hairpin hairpin [.] cos@plt
0.19% hairpin libm-2.15.so [.] csloww1
0.07% hairpin hairpin [.] resetForce.2703
0.07% hairpin hairpin [.] forEveryParticle
0.06% hairpin libm-2.15.so [.] __dubcos
0.05% hairpin [kernel.kallsyms] [k] mutex_unlock
0.02% hairpin hairpin [.] __log_finite@plt
0.02% hairpin hairpin [.] forEveryParticleD
0.02% hairpin [kernel.kallsyms] [k] do_raw_spin_lock
0.02% hairpin hairpin [.] __acos_finite@plt
0.02% hairpin [kernel.kallsyms] [k] update_cpu_load
0.01% hairpin [kernel.kallsyms] [k] tick_sched_timer
0.01% hairpin [kernel.kallsyms] [k] ktime_get
0.01% hairpin hairpin [.] __exp_finite@plt
0.01% hairpin [kernel.kallsyms] [k] run_timer_softirq
0.01% hairpin [kernel.kallsyms] [k] apic_timer_interrupt
0.01% hairpin [kernel.kallsyms] [k] __cycles_2_ns
0.01% hairpin [kernel.kallsyms] [k] __local_bh_enable
0.01% hairpin [kernel.kallsyms] [k] intel_pmu_disable_all
0.01% hairpin [kernel.kallsyms] [k] r100_mm_rreg
0.01% hairpin [kernel.kallsyms] [k] perf_adjust_freq_unthr_context
0.01% hairpin [kernel.kallsyms] [k] update_stats_wait_end.clone.15
0.01% hairpin [kernel.kallsyms] [k] ttwu_do_activate.clone.50
0.01% hairpin [kernel.kallsyms] [k] do_signal
0.01% hairpin [kernel.kallsyms] [k] tty_hung_up_p
0.01% hairpin hairpin [.] main
0.01% hairpin [kernel.kallsyms] [k] prepare_signal
0.01% hairpin libprofiler.so.0.3.0 [.] ProfileData::Evict(ProfileData::Entry const&)
0.01% hairpin [kernel.kallsyms] [k] uhci_check_ports
0.01% hairpin [kernel.kallsyms] [k] copy_siginfo_to_user
0.01% hairpin [kernel.kallsyms] [k] fxrstor_checking
0.01% hairpin [kernel.kallsyms] [k] calc_global_load
0.01% hairpin [kernel.kallsyms] [k] account_group_user_time
0.01% hairpin [kernel.kallsyms] [k] tg_load_down
0.01% hairpin [kernel.kallsyms] [k] irq_enter
0.01% hairpin [kernel.kallsyms] [k] __schedule
0.01% hairpin [kernel.kallsyms] [k] n_tty_write
0.01% hairpin libprofiler.so.0.3.0 [.] ProfileHandler::SignalHandler(int, siginfo*, void*)
0.01% hairpin [kernel.kallsyms] [k] get_cycles
0.01% hairpin [kernel.kallsyms] [k] enqueue_hrtimer
0.01% hairpin hairpin [.] seqTick.2114
0.01% hairpin [kernel.kallsyms] [k] idle_cpu
0.01% hairpin hairpin [.] sincos@plt
0.01% hairpin [kernel.kallsyms] [k] tick_program_event
0.01% hairpin [kernel.kallsyms] [k] clear_page_c
0.01% hairpin [kernel.kallsyms] [k] number.clone.1
0.01% hairpin [kernel.kallsyms] [k] task_waking_fair
0.01% hairpin [kernel.kallsyms] [k] save_i387_xstate
0.01% hairpin [kernel.kallsyms] [k] __rcu_pending
0.01% hairpin [kernel.kallsyms] [k] jiffies_to_timeval
0.01% hairpin [kernel.kallsyms] [k] iowrite16
0.01% hairpin [kernel.kallsyms] [k] hrtimer_interrupt
0.01% hairpin [kernel.kallsyms] [k] finish_task_switch
0.01% hairpin [kernel.kallsyms] [k] clockevents_program_event
0.01% hairpin [kernel.kallsyms] [k] ioread16
0.01% hairpin [kernel.kallsyms] [k] lapic_next_event
0.00% hairpin [kernel.kallsyms] [k] read_tsc
0.00% hairpin [kernel.kallsyms] [k] __zone_watermark_ok
0.00% hairpin libpthread-2.15.so [.] __libc_read
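0.00% hairpin [kernel.kallsyms] [k] intel_pmu_enable_all
Answered 2015-02-24 05:20:00
You should use perf together with Brendan Gregg's scripts to create a flame graph, so you get a good visual representation of where the time is going. A flame graph will make it clear which functions are calling fegetexcept, because it is a way of visualizing and summarizing call stacks: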
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
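CPU sampling without call stacks is often useless.
Make sure you have all the symbols installed, because many profilers attribute samples to the nearest exported name, which can be badly misleading. That could explain why "die" shows up with some samples.
You could also try loading your program in gdb and setting a breakpoint on fegetexcept. If you have both the symbols and the source code for libc installed, you can walk up the call stack and see why fegetexcept is being called. My guess is that you are passing out-of-range values to acos, or something similar.
This article discusses how to install the symbols and source code for libc: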
https://randomascii.wordpress.com/2013/01/08/symbols-on-linux-part-one-g-library-symbols/
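Answered 2012-09-15 16:10:19
Your source code doesn't contain any calls to fegetexcept(), which means some library function you use is calling it. Judging by your other sampling data, it is probably one or more functions in the system math library.
Can you try adding -fno-math-errno? On some platforms, this helps avoid unnecessary FP environment manipulation.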
https://stackoverflow.com/questions/12438726