首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >分析数字处理代码: 28%的时间在feget以外()&最佳编译器标志?

分析数字处理代码: 28%的时间在feget以外()&最佳编译器标志?
EN

Stack Overflow用户
提问于 2012-09-15 15:39:15
回答 2查看 456关注 0票数 3

我正在运行DNA链的模拟,涉及到大量的浮点数运算。完整的代码在这里:https://github.com/RoaldFre/DNA

在与gcc和clang一起编译后,我用google-perftools做了一些分析。在这两种情况下,谷歌工具都表示,大约28%的时间都花在fegetexcept()上。这似乎是查询CPU浮点异常标志的C库的一个函数。

请注意,我用的是gcc的数学,如果我没有弄错的话,应该忽略(全部?)浮点例外!我还在使用-O4和clang (是否有一个单独的标志来启用不安全的浮点指令)。

用clang:编译的二进制文件的分析输出

代码语言:javascript
复制
Total: 1561 samples
     438  28.1%  28.1%      438  28.1% fegetexcept
     263  16.8%  44.9%      263  16.8% cos
     224  14.3%  59.3%      224  14.3% Vdihedral
     131   8.4%  67.6%      131   8.4% nearestImageVector
     102   6.5%  74.2%      102   6.5% Fexclusion
      70   4.5%  78.7%       70   4.5% Fdihedral
      65   4.2%  82.8%       65   4.2% Fangle
      53   3.4%  86.2%       53   3.4% integratorTaskTick
      46   2.9%  89.2%       46   2.9% nearestImageDistance
      45   2.9%  92.1%       45   2.9% mutiallyExclusivePairForces
      32   2.0%  94.1%       32   2.0% FCoulomb
      24   1.5%  95.6%       24   1.5% forEveryPairD
      17   1.1%  96.7%       17   1.1% calculateForces
      14   0.9%  97.6%       14   0.9% atan2
       8   0.5%  98.1%        8   0.5% Fstack
       8   0.5%  98.7%        8   0.5% pairWrapper
       6   0.4%  99.0%        6   0.4% sin
       3   0.2%  99.2%        3   0.2% __finite
       3   0.2%  99.4%        3   0.2% _init
       2   0.1%  99.6%        2   0.1% acos
       2   0.1%  99.7%        2   0.1% exp
       2   0.1%  99.8%        2   0.1% log
       1   0.1%  99.9%        1   0.1% _IO_file_xsputn
       1   0.1%  99.9%        1   0.1% log2
       1   0.1% 100.0%        1   0.1% significand

用gcc:编译的二进制文件的分析输出

代码语言:javascript
复制
Total: 1561 samples
     438  28.1%  28.1%      438  28.1% fegetexcept
     352  22.5%  50.6%      352  22.5% nearestImageVector (inline)
     263  16.8%  67.5%      263  16.8% cos
     131   8.4%  75.8%      131   8.4% measurementTask
      52   3.3%  79.2%      331  21.2% Vbond.isra.1.part.2 (inline)
      50   3.2%  82.4%      562  36.0% dumpStatsSample.2372 (inline)
      46   2.9%  85.3%       46   2.9% nearestImageVector
      42   2.7%  88.0%       42   2.7% FdihedralParticle (inline)
      33   2.1%  90.1%       59   3.8% Vstack.isra.12.part.13 (inline)
      26   1.7%  91.8%       26   1.7% neighbourStackDistance2 (inline)
      16   1.0%  92.8%       16   1.0% die
      14   0.9%  93.7%       44   2.8% VangleP5SB (inline)
      14   0.9%  94.6%       14   0.9% atan2
      13   0.8%  95.5%       88   5.6% Vangle.isra.4.part.5 (inline)
      12   0.8%  96.2%      343  22.0% Vbond.isra.1 (inline)
       8   0.5%  96.7%        8   0.5% printUsage.3788
       6   0.4%  97.1%        6   0.4% Fangle.part.8 (inline)
       6   0.4%  97.5%        6   0.4% boxFromParticle (inline)
       6   0.4%  97.9%      279  17.9% nearestImageDistance (inline)
       6   0.4%  98.3%        6   0.4% sin
       3   0.2%  98.5%        3   0.2% Vdihedral.isra.9 (inline)
       3   0.2%  98.7%        3   0.2% __finite
       3   0.2%  98.8%        3   0.2% _init
       3   0.2%  99.0%        3   0.2% numParticles (inline)
       2   0.1%  99.2%        2   0.1% acos
       2   0.1%  99.3%        8   0.5% addToGrid
       2   0.1%  99.4%        2   0.1% exp
       2   0.1%  99.6%        2   0.1% getAngleBaseInfo (inline)
       2   0.1%  99.7%        2   0.1% log
       1   0.1%  99.7%        1   0.1% Fbond.part.6 (inline)
       1   0.1%  99.8%        1   0.1% _IO_file_xsputn
       1   0.1%  99.9%      563  36.1% dumpStatsSample.2372
       1   0.1%  99.9%        1   0.1% log2
       1   0.1% 100.0%        1   0.1% significand
       0   0.0% 100.0%        6   0.4% Fangle (inline)
       0   0.0% 100.0%        1   0.1% Fbond (inline)
       0   0.0% 100.0%       42   2.7% Fdihedral.part.11 (inline)
       0   0.0% 100.0%        4   0.3% Fstack.part.14 (inline)
       0   0.0% 100.0%       53   3.4% calculateForces.2880
       0   0.0% 100.0%        3   0.2% getKineticTemperature (inline)
       0   0.0% 100.0%        4   0.3% nearestImageDistance2 (inline)

现在,我通过函数指针调用了很多函数,gcc正在用lto和-O4进行编译。我有理由相信,这可能会导致gcc二进制文件的剖析输出受到一定程度的影响。例如,它说有16个样本在“die()”中。然而,这是不可能的,因为该功能立即停止程序!

无论哪种方式,这两个二进制文件似乎都同意在fegetexcept()上花费的28%的时间。我能把所有这些支票都扔掉吗?

其次是,我的完整编译器优化标志如下:

gcc -march=core2 -O4 -flto -mmmx -msse -msse2 -msse3 -fexcess-precision=fast -ffast-math -finline-limit=2000 -fmerge-all-constants -fmodulo-sched -fmodulo-sched-allow-regmoves -fgcse-sm -fgcse-las -fgcse-after-reload -funsafe-loop-optimizations

clang -march=core2 -O4

有什么东西我可以补充,以进一步提高性能吗?我不在乎编译时间是否会跃过屋顶,我需要我能得到的每一点性能!(关于clang:我在那里找不到多少特定的性能标志,也许我应该手动转到llvm字节码,然后将标志交给那里的llvm编译器?)

TL;DR:

(1)代码将28%的时间用于feget除()之外。选择“不安全的浮点代码”可以避免这种情况吗?

(2)我能把什么旗子传递给gcc和clang,以获得最大的性能--即使这会把编译时间从屋顶上传过去?

编辑

我将glibc从2.13-r2更新为2.15-r2,现在分析输出已更改为:

嘎吱声:

代码语言:javascript
复制
Total: 1654 samples
     381  23.0%  23.0%      381  23.0% __asin_finite
     244  14.8%  37.8%      244  14.8% significand
     203  12.3%  50.1%      203  12.3% Vdihedral
     141   8.5%  58.6%      141   8.5% nearestImageVector
     116   7.0%  65.6%      116   7.0% Fexclusion
      81   4.9%  70.5%       81   4.9% integratorTaskTick
      70   4.2%  74.7%       70   4.2% Fangle
      63   3.8%  78.5%       63   3.8% FdihedralParticle
      56   3.4%  81.9%       56   3.4% mutiallyExclusivePairForces
      45   2.7%  84.6%       45   2.7% FCoulomb
      42   2.5%  87.2%       42   2.5% _init
      39   2.4%  89.5%       39   2.4% __isinf
      35   2.1%  91.7%       35   2.1% nearestImageDistance
      29   1.8%  93.4%       29   1.8% __lgamma_r_finite
      21   1.3%  94.7%       21   1.3% forEveryPairD
      16   1.0%  95.6%       16   1.0% Fbond
      13   0.8%  96.4%       13   0.8% __isnan
      11   0.7%  97.1%       11   0.7% __cosh_finite
      10   0.6%  97.7%       10   0.6% Fstack
      10   0.6%  98.3%       10   0.6% __acosh_finite
       9   0.5%  98.9%        9   0.5% pairWrapper
       6   0.4%  99.2%        6   0.4% atan2
       5   0.3%  99.5%        5   0.3% Fdihedral
       5   0.3%  99.8%        5   0.3% calculateForces
       2   0.1%  99.9%        2   0.1% GLIBC_2.15
       1   0.1% 100.0%        1   0.1% exp

gcc:

代码语言:javascript
复制
Total: 1768 samples
     385  21.8%  21.8%      385  21.8% __asin_finite
     275  15.6%  37.3%      275  15.6% significand
     252  14.3%  51.6%      252  14.3% nearestImageVector
     199  11.3%  62.8%      299  16.9% Vdihedral.isra.4.part.5.2808
      55   3.1%  66.0%      902  51.0% FdihedralParticle.2836
      47   2.7%  68.6%      150   8.5% Fexclusion.part.15 (inline)
      44   2.5%  71.1%       87   4.9% FCoulomb.part.16.2891
      36   2.0%  73.1%       36   2.0% _init
      33   1.9%  75.0%      236  13.3% mutiallyExclusivePairForces.2699
      30   1.7%  76.7%       30   1.7% __lgamma_r_finite
      29   1.6%  78.3%       29   1.6% isSaneNumber (inline)
      28   1.6%  79.9%       28   1.6% feelExclusion (inline)
      27   1.5%  81.4%       27   1.5% __isinf
      25   1.4%  82.9%       35   2.0% Fangle.part.11.2855
      22   1.2%  84.1%       40   2.3% Fangle.part.11 (inline)
      22   1.2%  85.4%       24   1.4% randNorm.part.1.3194
      20   1.1%  86.5%       20   1.1% __isnan
      19   1.1%  87.6%       23   1.3% nearestImageUnitVector (inline)
      19   1.1%  88.6%       19   1.1% pairWrapper.3570
      18   1.0%  89.6%      105   5.9% langevinBBKhelper.3161
      17   1.0%  90.6%       23   1.3% Fbond.part.10 (inline)
      15   0.8%  91.5%       20   1.1% Vdihedral.isra.4.part.5 (inline)
      14   0.8%  92.3%       14   0.8% length (inline)
      13   0.7%  93.0%       13   0.7% getBasePairInfo (inline)
      12   0.7%  93.7%      190  10.7% Fexclusion (inline)
      12   0.7%  94.3%       15   0.8% Fstack.part.13 (inline)
      12   0.7%  95.0%       12   0.7% __acosh_finite
      12   0.7%  95.7%       12   0.7% reboxParticles (inline)
       9   0.5%  96.2%       23   1.3% randNorm (inline)
       8   0.5%  96.7%        8   0.5% __cosh_finite
       8   0.5%  97.1%      221  12.5% visitNeighbours.part.1 (inline)
       7   0.4%  97.5%      360  20.4% forEveryPairD
       7   0.4%  97.9%        7   0.4% sincos
       6   0.3%  98.2%     1467  83.0% calculateForces
       5   0.3%  98.5%      943  53.3% Fdihedral.part.12 (inline)
       4   0.2%  98.8%       33   1.9% debugVectorSanity (inline)
       4   0.2%  99.0%       19   1.1% nearestImageDistance (inline)
       3   0.2%  99.2%      317  17.9% Vdihedral.isra.4 (inline)
       3   0.2%  99.3%        3   0.2% getAngleBaseInfo (inline)
       2   0.1%  99.4%        2   0.1% resetForce.2703
       2   0.1%  99.5%        2   0.1% tinymt64_generate_doubleOC (inline)
       2   0.1%  99.7%      223  12.6% visitNeighbours (inline)
       1   0.1%  99.7%       94   5.3% Fangle (inline)
       1   0.1%  99.8%        1   0.1% calcInvDebyeLength (inline)
       1   0.1%  99.8%        1   0.1% forEveryParticle
       1   0.1%  99.9%      131   7.4% forEveryParticleD
       1   0.1%  99.9%        1   0.1% munmap
       1   0.1% 100.0%        1   0.1% neighbourStackDistance2 (inline)
       0   0.0% 100.0%        1   0.1% 0x3e1341e250d56f1d
       0   0.0% 100.0%       23   1.3% Fbond (inline)
       0   0.0% 100.0%     1690  95.6% __libc_start_main
       0   0.0% 100.0%      380  21.5% forEveryPair (inline)
       0   0.0% 100.0%     1689  95.5% integratorTaskTick.3198
       0   0.0% 100.0%     1690  95.6% main
       0   0.0% 100.0%     1690  95.6% run (inline)
       0   0.0% 100.0%     1689  95.5% seqTick.2114
       0   0.0% 100.0%        1   0.1% taskStop (inline)
       0   0.0% 100.0%     1689  95.5% taskTick (inline)

所以看起来,这个feget除了可能只是一个错误的名字,它与glibc数学例程中的一些代码进行了识别。我想这是google工具的缺点吧?

然而,我问题的第(2)部分仍然存在:为了获得最大的性能,我能把什么旗帜传递给gcc和clang --即使这会把编译时间从屋顶上传过去?

EDIT2

使用'perf‘(例如,参见https://stackoverflow.com/a/10958510/153105 )提供了一个不错的分析输出。看起来大部分时间都花在atan2()和cos()上,并且使用了sse2版本。为了完整起见,我将添加输出:

代码语言:javascript
复制
# Events: 17K cycles
#
# Overhead  Command         Shared Object                                                   Symbol
# ........  .......  ....................  .......................................................
#
    21.67%  hairpin  libm-2.15.so          [.] __ieee754_atan2_sse2                               
    14.12%  hairpin  hairpin               [.] nearestImageVector                                 
    13.94%  hairpin  libm-2.15.so          [.] __cos_sse2                                         
    11.94%  hairpin  hairpin               [.] Vdihedral.isra.4.part.5.2808                       
     8.27%  hairpin  hairpin               [.] mutiallyExclusivePairForces.2699                   
     4.81%  hairpin  hairpin               [.] calculateForces                                    
     4.45%  hairpin  hairpin               [.] FdihedralParticle.2836                             
     3.89%  hairpin  hairpin               [.] FCoulomb.part.16.2891                              
     2.17%  hairpin  hairpin               [.] langevinBBKhelper.3161                             
     1.85%  hairpin  hairpin               [.] Fangle.part.11.2855                                
     1.83%  hairpin  libc-2.15.so          [.] __isinf                                            
     1.64%  hairpin  hairpin               [.] randNorm.part.1.3194                               
     1.45%  hairpin  libm-2.15.so          [.] __ieee754_log_sse2                                 
     1.02%  hairpin  hairpin               [.] forEveryPairD                                      
     0.93%  hairpin  libm-2.15.so          [.] __ieee754_acos_sse2                                
     0.76%  hairpin  hairpin               [.] pairWrapper.3570                                   
     0.76%  hairpin  hairpin               [.] __isnan@plt                                        
     0.74%  hairpin  libc-2.15.so          [.] __isnan                                            
     0.68%  hairpin  hairpin               [.] __isinf@plt                                        
     0.59%  hairpin  libm-2.15.so          [.] __ieee754_exp_sse2                                 
     0.58%  hairpin  libm-2.15.so          [.] __sincos                                           
     0.55%  hairpin  hairpin               [.] integratorTaskTick.3198                            
     0.29%  hairpin  hairpin               [.] __atan2_finite@plt                                 
     0.23%  hairpin  hairpin               [.] cos@plt                                            
     0.19%  hairpin  libm-2.15.so          [.] csloww1                                            
     0.07%  hairpin  hairpin               [.] resetForce.2703                                    
     0.07%  hairpin  hairpin               [.] forEveryParticle                                   
     0.06%  hairpin  libm-2.15.so          [.] __dubcos                                           
     0.05%  hairpin  [kernel.kallsyms]     [k] mutex_unlock                                       
     0.02%  hairpin  hairpin               [.] __log_finite@plt                                   
     0.02%  hairpin  hairpin               [.] forEveryParticleD                                  
     0.02%  hairpin  [kernel.kallsyms]     [k] do_raw_spin_lock                                   
     0.02%  hairpin  hairpin               [.] __acos_finite@plt                                  
     0.02%  hairpin  [kernel.kallsyms]     [k] update_cpu_load                                    
     0.01%  hairpin  [kernel.kallsyms]     [k] tick_sched_timer                                   
     0.01%  hairpin  [kernel.kallsyms]     [k] ktime_get                                          
     0.01%  hairpin  hairpin               [.] __exp_finite@plt                                   
     0.01%  hairpin  [kernel.kallsyms]     [k] run_timer_softirq                                  
     0.01%  hairpin  [kernel.kallsyms]     [k] apic_timer_interrupt                               
     0.01%  hairpin  [kernel.kallsyms]     [k] __cycles_2_ns                                      
     0.01%  hairpin  [kernel.kallsyms]     [k] __local_bh_enable                                  
     0.01%  hairpin  [kernel.kallsyms]     [k] intel_pmu_disable_all                              
     0.01%  hairpin  [kernel.kallsyms]     [k] r100_mm_rreg                                       
     0.01%  hairpin  [kernel.kallsyms]     [k] perf_adjust_freq_unthr_context                     
     0.01%  hairpin  [kernel.kallsyms]     [k] update_stats_wait_end.clone.15                     
     0.01%  hairpin  [kernel.kallsyms]     [k] ttwu_do_activate.clone.50                          
     0.01%  hairpin  [kernel.kallsyms]     [k] do_signal                                          
     0.01%  hairpin  [kernel.kallsyms]     [k] tty_hung_up_p                                      
     0.01%  hairpin  hairpin               [.] main                                               
     0.01%  hairpin  [kernel.kallsyms]     [k] prepare_signal                                     
     0.01%  hairpin  libprofiler.so.0.3.0  [.] ProfileData::Evict(ProfileData::Entry const&)      
     0.01%  hairpin  [kernel.kallsyms]     [k] uhci_check_ports                                   
     0.01%  hairpin  [kernel.kallsyms]     [k] copy_siginfo_to_user                               
     0.01%  hairpin  [kernel.kallsyms]     [k] fxrstor_checking                                   
     0.01%  hairpin  [kernel.kallsyms]     [k] calc_global_load                                   
     0.01%  hairpin  [kernel.kallsyms]     [k] account_group_user_time                            
     0.01%  hairpin  [kernel.kallsyms]     [k] tg_load_down                                       
     0.01%  hairpin  [kernel.kallsyms]     [k] irq_enter                                          
     0.01%  hairpin  [kernel.kallsyms]     [k] __schedule                                         
     0.01%  hairpin  [kernel.kallsyms]     [k] n_tty_write                                        
     0.01%  hairpin  libprofiler.so.0.3.0  [.] ProfileHandler::SignalHandler(int, siginfo*, void*)
     0.01%  hairpin  [kernel.kallsyms]     [k] get_cycles                                         
     0.01%  hairpin  [kernel.kallsyms]     [k] enqueue_hrtimer                                    
     0.01%  hairpin  hairpin               [.] seqTick.2114                                       
     0.01%  hairpin  [kernel.kallsyms]     [k] idle_cpu                                           
     0.01%  hairpin  hairpin               [.] sincos@plt                                         
     0.01%  hairpin  [kernel.kallsyms]     [k] tick_program_event                                 
     0.01%  hairpin  [kernel.kallsyms]     [k] clear_page_c                                       
     0.01%  hairpin  [kernel.kallsyms]     [k] number.clone.1                                     
     0.01%  hairpin  [kernel.kallsyms]     [k] task_waking_fair                                   
     0.01%  hairpin  [kernel.kallsyms]     [k] save_i387_xstate                                   
     0.01%  hairpin  [kernel.kallsyms]     [k] __rcu_pending                                      
     0.01%  hairpin  [kernel.kallsyms]     [k] jiffies_to_timeval                                 
     0.01%  hairpin  [kernel.kallsyms]     [k] iowrite16                                          
     0.01%  hairpin  [kernel.kallsyms]     [k] hrtimer_interrupt                                  
     0.01%  hairpin  [kernel.kallsyms]     [k] finish_task_switch                                 
     0.01%  hairpin  [kernel.kallsyms]     [k] clockevents_program_event                          
     0.01%  hairpin  [kernel.kallsyms]     [k] ioread16                                           
     0.01%  hairpin  [kernel.kallsyms]     [k] lapic_next_event                                   
     0.00%  hairpin  [kernel.kallsyms]     [k] read_tsc                                           
     0.00%  hairpin  [kernel.kallsyms]     [k] __zone_watermark_ok                                
     0.00%  hairpin  libpthread-2.15.so    [.] __libc_read                                        
     0.00%  hairpin  [kernel.kallsyms]     [k] intel_pmu_enable_all                               
EN

回答 2

Stack Overflow用户

发布于 2015-02-24 05:20:00

你应该使用perf和Brendan的脚本来创建一个火焰图,这样你就可以得到一个很好的时间去向的视觉表示。火焰图将使人清楚地看到哪些函数是fegetexcept,除非它是一种可视化和总结调用堆栈的方法:

http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html

没有调用堆栈的CPU采样通常是无用的。

确保安装了所有的符号,因为许多分析器会将样本与最近导出的名称相关联,这可能会导致严重的错误--这可以解释为什么“die”会显示一些示例。

您还可以尝试将您的程序加载到gdb中,并在fegetexcept上设置一个断点。如果您同时安装了libc的符号和源代码,那么您可以沿着调用堆栈走一走,看看为什么要调用fegetexcept。我猜你是在把超出范围的值传递给aco,或者类似的东西。

本文讨论如何安装libc的符号和源代码。

https://randomascii.wordpress.com/2013/01/08/symbols-on-linux-part-one-g-library-symbols/

票数 1
EN

Stack Overflow用户

发布于 2012-09-15 16:10:19

您的源代码不包含对fegetexcept( )的任何调用,这意味着您使用的某个库函数正在调用它。可能是系统数学库中的一个或多个函数,根据您的其他抽样数据来判断。

您能尝试添加-fno-math-errno吗?在某些平台上,这将有助于避免不必要的FP环境操作。

票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/12438726

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档