Question: Nsight Compute does not show achieved occupancy in its metrics

Stack Overflow user

Asked 2022-10-29 12:22:11

Answers: 1, Views: 64, Followers: 0, Votes: 0

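I want to calculate the achieved occupancy and compare it with the value displayed in Nsight Compute. ncu reports Theoretical Occupancy [%] 100 and Achieved Occupancy [%] 93,04. Which parameters do I need to calculate this value?

I can get the theoretical occupancy from the occupancy API, which comes out as 1.0, i.e. 100%.

I tried looking for the metrics achieved_occupancy, sm__active_warps_sum, and sm__active_cycles_sum, but each of them fails with: Failed to find metric sm__active_warps_sum. I found the formula for calculating achieved occupancy in this SO answer.

A few details, in case they help: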

There are 1 CUDA devices.

CUDA Device #0
Major revision number:         7
Minor revision number:         5
Name:                          NVIDIA GeForce GTX 1650
Total global memory:           4093181952
Total constant memory:         65536
Total shared memory per block: 49152
Total registers per block:     65536
Total registers per multiprocessor: 65536
Warp size:                     32
Maximum threads per block:     1024
Maximum threads per multiprocessor: 1024
Maximum blocks per multiprocessor:     16
Maximum dimension 0 of block:  1024
Maximum dimension 1 of block:  1024
Maximum dimension 2 of block:  64
Maximum dimension 0 of grid:   2147483647
Maximum dimension 1 of grid:   65535
Maximum dimension 2 of grid:   65535
Clock rate:                    1515000
Maximum memory pitch:          2147483647
Total constant memory:         65536
Texture alignment:             512
Concurrent copy and execution: Yes
Number of multiprocessors:     14
Kernel execution timeout:      Yes

ptxas info    : Used 18 registers, 360 bytes cmem[0]

Tags: cuda, profiling, nvidia, nsight

1 Answer

Stack Overflow user

Answered 2022-10-29 17:08:21

Short version:

In a nutshell, the theoretical occupancy is given by the metric sm__maximum_warps_per_active_cycle_pct, and the achieved occupancy is given by the metric sm__warps_active.avg.pct_of_peak_sustained_active.
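To relate the reported percentage to a warp count: achieved occupancy is the average number of active warps per active SM cycle, divided by the maximum number of warps an SM can hold. The following is a minimal Python sketch of that arithmetic, assuming the device limits printed in the question (1024 maximum threads per multiprocessor, warp size 32). The function names are illustrative only and are not part of any NVIDIA API.

```python
def max_warps_per_sm(max_threads_per_sm: int, warp_size: int = 32) -> int:
    """Maximum number of resident warps on one SM."""
    return max_threads_per_sm // warp_size


def achieved_occupancy_pct(avg_active_warps: float, max_warps: int) -> float:
    """Achieved occupancy in percent from the average active warps per active cycle."""
    return 100.0 * avg_active_warps / max_warps


def avg_active_warps(achieved_pct: float, max_warps: int) -> float:
    """Inverse conversion: average active warps per SM from an occupancy percentage."""
    return achieved_pct / 100.0 * max_warps


# Values taken from the question: 1024 threads per SM, 93.04% achieved occupancy.
limit = max_warps_per_sm(1024)
warps = avg_active_warps(93.04, limit)
print(f"max warps per SM: {limit}, average active warps: {warps:.2f}")
```

Under these assumptions, 93.04% corresponds to a little under 30 of the 32 possible warps being active on average.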

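Longer version:

The metrics you mentioned:

"I tried looking for the metrics achieved_occupancy, sm__active_warps_sum, sm__active_cycles_sum, but all of them say: Failed to find metric sm__active_warps_sum."

are not applicable to Nsight Compute. NVIDIA has made a variety of different profilers, and those metric names apply to other profilers. The article you referenced refers to a different profiler (the original profiler on Windows used the Nsight name, but it was not Nsight Compute).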

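This blog post discusses different ways to get valid Nsight Compute metric names, with links to the documentation that presents the metrics in various ways.

I would also point out, for others, that Nsight Compute has an entire report section dedicated to occupancy, so for typical interest that is probably the easiest way to go. Additional instructions for how to run Nsight Compute are available in this blog.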

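To come up with metrics that represent occupancy the way the Nsight Compute designers intended, my suggestion is to look at their definitions. Each report section in Nsight Compute has "human-readable" files that indicate how that section is assembled. Since there is a report section that includes both theoretical and achieved occupancy, we can discover how those are computed by inspecting those files.

The methodology for computing the occupancy section is contained in two files that are part of a CUDA install. On a standard Linux install these will be at /usr/local/cuda-XX.X/nsight-compute-zzzzzz/sections/Occupancy.py and .../sections/Occupancy.section. The Python file gives the exact names of the metrics that are used, as well as the calculation methods for other displayed topics related to occupancy (e.g. notes, warnings, etc.). In a nutshell, the theoretical occupancy is given by the metric sm__maximum_warps_per_active_cycle_pct, and the achieved occupancy is given by the metric sm__warps_active.avg.pct_of_peak_sustained_active.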
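Because the version parts of that path (XX.X and zzzzzz) differ between installations, one way to locate the two files is to search with wildcards. This is a small Python sketch that assumes the standard Linux layout described above; the helper name and the default root are illustrative, and the result is simply empty if CUDA is installed somewhere else.

```python
from pathlib import Path


def find_occupancy_section_files(cuda_root: str = "/usr/local") -> list[Path]:
    """Return Occupancy.py / Occupancy.section files found under cuda_root.

    Assumes the layout <cuda_root>/cuda-*/nsight-compute-*/sections/.
    """
    pattern = "cuda-*/nsight-compute-*/sections/Occupancy.*"
    matches = Path(cuda_root).glob(pattern)
    return sorted(p for p in matches if p.suffix in {".py", ".section"})


if __name__ == "__main__":
    for path in find_occupancy_section_files():
        print(path)
```

Opening the Occupancy.py file found this way and searching for "metric" is a quick way to see which metric names the section uses.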

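You can retrieve both the Occupancy section report (which is part of the "default" "set") and these specific metrics with a command line like this: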

ncu --set default --metrics sm__maximum_warps_per_active_cycle_pct,sm__warps_active.avg.pct_of_peak_sustained_active  ./my-app

Here is example output from such a run:

$ ncu --set default --metrics sm__maximum_warps_per_active_cycle_pct,sm__warps_active.avg.pct_of_peak_sustained_active  ./t2140
Testing with mask size = 3

==PROF== Connected to process 31551 (/home/user2/misc/t2140)
==PROF== Profiling "convolution_2D" - 1: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 2: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 460.922913 ms.

________________________________________________________________________

Testing with mask size = 5

==PROF== Profiling "convolution_2D" - 3: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 4: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 429.748230 ms.

________________________________________________________________________

Testing with mask size = 7

==PROF== Profiling "convolution_2D" - 5: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 6: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 500.704254 ms.

________________________________________________________________________

Testing with mask size = 9

==PROF== Profiling "convolution_2D" - 7: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 8: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 449.445892 ms.

________________________________________________________________________

==PROF== Disconnected from process 31551
[31551] t2140@127.0.0.1
  convolution_2D(float *, const float *, float *, unsigned long, unsigned long, unsigned long), 2022-Oct-29 13:02:44, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    sm__maximum_warps_per_active_cycle_pct                                               %                             50
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          40.42
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/usecond                         815.21
    SM Frequency                                                             cycle/nsecond                           1.14
    Elapsed Cycles                                                                   cycle                         47,929
    Memory [%]                                                                           %                          23.96
    DRAM Throughput                                                                      %                          15.23
    Duration                                                                       usecond                          42.08
    L1/TEX Cache Throughput                                                              %                          26.90
    L2 Cache Throughput                                                                  %                          10.54
    SM Active Cycles                                                                 cycle                      42,619.88
    Compute (SM) [%]                                                                     %                          37.09
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance
          of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate
          latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                      1,024
    Function Cache Configuration                                                                  cudaFuncCachePreferNone
    Grid Size                                                                                                       1,024
    Registers Per Thread                                                   register/thread                             38
    Shared Memory Configuration Size                                                  byte                              0
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                      1,048,576
    Waves Per SM                                                                                                    12.80
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                              1
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                              2
    Theoretical Active Warps per SM                                                   warp                             32
    Theoretical Occupancy                                                                %                             50
    Achieved Occupancy                                                                   %                          40.42
    Achieved Active Warps Per SM                                                      warp                          25.87
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (50.0%) is limited by the number of required registers

  convolution_2D_tiled(float *, const float *, float *, unsigned long, unsigned long, unsigned long), 2022-Oct-29 13:02:45, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    sm__maximum_warps_per_active_cycle_pct                                               %                            100
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          84.01
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/usecond                         771.98
    SM Frequency                                                             cycle/nsecond                           1.07
    Elapsed Cycles                                                                   cycle                         31,103
    Memory [%]                                                                           %                          40.61
    DRAM Throughput                                                                      %                          24.83
    Duration                                                                       usecond                          29.12
    L1/TEX Cache Throughput                                                              %                          46.39
    L2 Cache Throughput                                                                  %                          18.43
    SM Active Cycles                                                                 cycle                      27,168.03
    Compute (SM) [%]                                                                     %                          60.03
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis report section to see
          what the compute pipelines are spending their time doing. Also, consider whether any computation is
          redundant and could be reduced or moved to look-up tables.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                      1,024
    Function Cache Configuration                                                                  cudaFuncCachePreferNone
    Grid Size                                                                                                       1,156
    Registers Per Thread                                                   register/thread                             31
    Shared Memory Configuration Size                                                 Kbyte                           8.19
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                             Kbyte/block                           4.10
    Threads                                                                         thread                      1,183,744
    Waves Per SM                                                                                                     7.22
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                              2
    Block Limit Shared Mem                                                           block                             24
    Block Limit Warps                                                                block                              2
    Theoretical Active Warps per SM                                                   warp                             64
    Theoretical Occupancy                                                                %                            100
    Achieved Occupancy                                                                   %                          84.01
    Achieved Active Warps Per SM                                                      warp                          53.77
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated
          theoretical (100.0%) and measured achieved occupancy (84.0%) can be the result of warp scheduling overheads
          or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block
          as well as across blocks of the same kernel.

<sections repeat for each kernel launch>
$
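The numbers in the two Occupancy sections above can be cross-checked by hand. The sketch below is a plausible reconstruction of the arithmetic, not the code from Occupancy.py. It assumes that the number of resident blocks per SM is the smallest of the reported block limits, and that this GPU holds at most 64 warps per SM (inferred from the second kernel reporting 64 theoretical active warps at 100% occupancy). The function name is illustrative.

```python
def theoretical_occupancy(block_limits, threads_per_block, max_warps_per_sm, warp_size=32):
    """Return (theoretical active warps per SM, theoretical occupancy in percent)."""
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    resident_blocks = min(block_limits.values())          # tightest limit wins
    active_warps = min(resident_blocks * warps_per_block, max_warps_per_sm)
    return active_warps, 100.0 * active_warps / max_warps_per_sm


MAX_WARPS = 64  # assumption inferred from the report, see the text above

# Block limits copied from the two Occupancy sections in the output.
convolution_2d = {"SM": 32, "Registers": 1, "Shared Mem": 32, "Warps": 2}
convolution_2d_tiled = {"SM": 32, "Registers": 2, "Shared Mem": 24, "Warps": 2}

for name, limits, achieved_pct in [
    ("convolution_2D", convolution_2d, 40.42),
    ("convolution_2D_tiled", convolution_2d_tiled, 84.01),
]:
    warps, pct = theoretical_occupancy(limits, threads_per_block=1024, max_warps_per_sm=MAX_WARPS)
    achieved_warps = achieved_pct / 100.0 * MAX_WARPS
    print(f"{name}: theoretical {warps} warps ({pct:.0f}%), achieved {achieved_warps:.2f} warps")
```

With these inputs the sketch gives 32 warps (50%) and 64 warps (100%) for the two kernels, and 25.87 and 53.77 achieved active warps, which agrees with the report. The first kernel is capped by its register limit of one block per SM, which is what the report's warning says.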

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/74245226
