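I want to calculate the achieved occupancy and compare it with the value shown in Nsight Compute. ncu reports: Theoretical Occupancy [%] 100 and Achieved Occupancy [%] 93,04. What parameters do I need to calculate this value?
I can get the theoretical occupancy from the occupancy API, which returns 1.0, i.e. 100%.
I tried looking for the metrics achieved_occupancy, sm__active_warps_sum, and sm__active_cycles_sum, but they all report: Failed to find metric sm__active_warps_sum. I have seen the formula for calculating achieved occupancy from these in this SO answer.
Some device information, in case it helps: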
There are 1 CUDA devices.
CUDA Device #0
Major revision number: 7
Minor revision number: 5
Name: NVIDIA GeForce GTX 1650
Total global memory: 4093181952
Total constant memory: 65536
Total shared memory per block: 49152
Total registers per block: 65536
Total registers per multiprocessor: 65536
Warp size: 32
Maximum threads per block: 1024
Maximum threads per multiprocessor: 1024
Maximum blocks per multiprocessor: 16
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 2147483647
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1515000
Maximum memory pitch: 2147483647
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 14
Kernel execution timeout: Yes
ptxas info : Used 18 registers, 360 bytes cmem[0]
Posted on 2022-10-29 17:08:21
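Short answer:
Briefly, the theoretical occupancy is given by the metric sm__maximum_warps_per_active_cycle_pct, and the achieved occupancy is given by the metric sm__warps_active.avg.pct_of_peak_sustained_active.
Longer answer:
The metrics you indicated:
"I tried looking for the metrics achieved_occupancy, sm__active_warps_sum, sm__active_cycles_sum, but they all report: Failed to find metric sm__active_warps_sum."
are not applicable to Nsight Compute. NVIDIA has produced several different profilers, and these metric names belong to other profilers. The article you reference refers to a different profiler (the original profiler on Windows used the Nsight name, but it was not Nsight Compute).
This blog article discusses different ways to get valid Nsight Compute metric names, and links to documentation that presents the metrics in different ways.
I'll also point out for others that Nsight Compute has an entire report section dedicated to occupancy, so for typical interest that is probably the easiest way to go. Additional instructions on how to run Nsight Compute are available in this blog.
To come up with the metrics that represent occupancy the way the Nsight Compute designers intended, my suggestion is to look at their definitions. Each report section in Nsight Compute has "human-readable" files that indicate how that section is assembled. Since there is a report section that includes both theoretical and achieved occupancy, we can find out how these are computed by inspecting those files.
The method for computing the occupancy section is contained in two files that are part of a CUDA install. On a standard Linux install, these are located at /usr/local/cuda-XX.X/nsight-compute-zzzzzz/sections/Occupancy.py and .../sections/Occupancy.section. The Python file gives the exact names of the metrics that are used, as well as the calculation methods for the other displayed items related to occupancy (e.g. notes, warnings, etc.). Briefly, the theoretical occupancy is given by the metric sm__maximum_warps_per_active_cycle_pct, and the achieved occupancy is given by the metric sm__warps_active.avg.pct_of_peak_sustained_active.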
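As a rough cross-check of how those two percentages relate to the other figures in the Occupancy section, the Python sketch below redoes the arithmetic with numbers taken from the sample report shown later in this answer. It is only an illustration, not the contents of Occupancy.py: the function names are made up, and the 64-warps-per-SM maximum is an assumption inferred from the report (64 theoretical active warps at 100% occupancy). For the GTX 1650 in the question the limit would instead be 1024 threads per SM / 32 threads per warp = 32 warps.

```python
def theoretical_occupancy_pct(block_limits, warps_per_block, max_warps_per_sm):
    """Warps allowed by the tightest block limit, as a percentage of the SM maximum."""
    resident_blocks = min(block_limits)
    active_warps = min(resident_blocks * warps_per_block, max_warps_per_sm)
    return 100.0 * active_warps / max_warps_per_sm


def achieved_occupancy_pct(achieved_active_warps, max_warps_per_sm):
    """Measured average active warps per SM, as a percentage of the SM maximum."""
    return 100.0 * achieved_active_warps / max_warps_per_sm


# Assumption: inferred from "Theoretical Active Warps per SM 64" at 100% occupancy.
MAX_WARPS_PER_SM = 64
# Block Size 1,024 threads with a warp size of 32.
WARPS_PER_BLOCK = 1024 // 32

# Block limits in report order: SM, Registers, Shared Mem, Warps.
naive_theoretical = theoretical_occupancy_pct([32, 1, 32, 2], WARPS_PER_BLOCK, MAX_WARPS_PER_SM)
tiled_theoretical = theoretical_occupancy_pct([32, 2, 24, 2], WARPS_PER_BLOCK, MAX_WARPS_PER_SM)

# "Achieved Active Warps Per SM" values from the report.
naive_achieved = achieved_occupancy_pct(25.87, MAX_WARPS_PER_SM)
tiled_achieved = achieved_occupancy_pct(53.77, MAX_WARPS_PER_SM)

print(f"convolution_2D:       theoretical {naive_theoretical:.2f}%, achieved {naive_achieved:.2f}%")
print(f"convolution_2D_tiled: theoretical {tiled_theoretical:.2f}%, achieved {tiled_achieved:.2f}%")
```

The results agree with the report to within rounding (the report's 53.77 warps is itself a rounded value, so the tiled kernel comes out a hundredth of a percent off the reported 84.01).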
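You can retrieve the occupancy section report (which is part of the "default" "set"), as well as these specific metrics, with a command line like this:
ncu --set default --metrics sm__maximum_warps_per_active_cycle_pct,sm__warps_active.avg.pct_of_peak_sustained_active ./my-app
Here is sample output from such a run: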
$ ncu --set default --metrics sm__maximum_warps_per_active_cycle_pct,sm__warps_active.avg.pct_of_peak_sustained_active ./t2140
Testing with mask size = 3
==PROF== Connected to process 31551 (/home/user2/misc/t2140)
==PROF== Profiling "convolution_2D" - 1: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 2: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 460.922913 ms.
________________________________________________________________________
Testing with mask size = 5
==PROF== Profiling "convolution_2D" - 3: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 4: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 429.748230 ms.
________________________________________________________________________
Testing with mask size = 7
==PROF== Profiling "convolution_2D" - 5: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 6: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 500.704254 ms.
________________________________________________________________________
Testing with mask size = 9
==PROF== Profiling "convolution_2D" - 7: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 8: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 449.445892 ms.
________________________________________________________________________
==PROF== Disconnected from process 31551
[31551] t2140@127.0.0.1
convolution_2D(float *, const float *, float *, unsigned long, unsigned long, unsigned long), 2022-Oct-29 13:02:44, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
sm__maximum_warps_per_active_cycle_pct % 50
sm__warps_active.avg.pct_of_peak_sustained_active % 40.42
---------------------------------------------------------------------- --------------- ------------------------------
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 815.21
SM Frequency cycle/nsecond 1.14
Elapsed Cycles cycle 47,929
Memory [%] % 23.96
DRAM Throughput % 15.23
Duration usecond 42.08
L1/TEX Cache Throughput % 26.90
L2 Cache Throughput % 10.54
SM Active Cycles cycle 42,619.88
Compute (SM) [%] % 37.09
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance
of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate
latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 1,024
Function Cache Configuration cudaFuncCachePreferNone
Grid Size 1,024
Registers Per Thread register/thread 38
Shared Memory Configuration Size byte 0
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
Threads thread 1,048,576
Waves Per SM 12.80
---------------------------------------------------------------------- --------------- ------------------------------
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 1
Block Limit Shared Mem block 32
Block Limit Warps block 2
Theoretical Active Warps per SM warp 32
Theoretical Occupancy % 50
Achieved Occupancy % 40.42
Achieved Active Warps Per SM warp 25.87
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel's theoretical occupancy (50.0%) is limited by the number of required registers
convolution_2D_tiled(float *, const float *, float *, unsigned long, unsigned long, unsigned long), 2022-Oct-29 13:02:45, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
sm__maximum_warps_per_active_cycle_pct % 100
sm__warps_active.avg.pct_of_peak_sustained_active % 84.01
---------------------------------------------------------------------- --------------- ------------------------------
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 771.98
SM Frequency cycle/nsecond 1.07
Elapsed Cycles cycle 31,103
Memory [%] % 40.61
DRAM Throughput % 24.83
Duration usecond 29.12
L1/TEX Cache Throughput % 46.39
L2 Cache Throughput % 18.43
SM Active Cycles cycle 27,168.03
Compute (SM) [%] % 60.03
---------------------------------------------------------------------- --------------- ------------------------------
WRN Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis report section to see
what the compute pipelines are spending their time doing. Also, consider whether any computation is
redundant and could be reduced or moved to look-up tables.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 1,024
Function Cache Configuration cudaFuncCachePreferNone
Grid Size 1,156
Registers Per Thread register/thread 31
Shared Memory Configuration Size Kbyte 8.19
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block Kbyte/block 4.10
Threads thread 1,183,744
Waves Per SM 7.22
---------------------------------------------------------------------- --------------- ------------------------------
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 2
Block Limit Shared Mem block 24
Block Limit Warps block 2
Theoretical Active Warps per SM warp 64
Theoretical Occupancy % 100
Achieved Occupancy % 84.01
Achieved Active Warps Per SM warp 53.77
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated
theoretical (100.0%) and measured achieved occupancy (84.0%) can be the result of warp scheduling overheads
or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block
as well as across blocks of the same kernel.
<sections repeat for each kernel launch>
$
https://stackoverflow.com/questions/74245226
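To compare the reported values with your own calculation programmatically, the two metrics can be pulled out of the text report. The Python sketch below is a minimal parser written against the layout of the sample output above (a "Section: Command line profiler metrics" header followed by name, unit, value lines). It is not an official Nsight Compute interface, parse_metrics is a hypothetical helper name, and it assumes commas are thousands separators as in the sample, so a locale that prints 93,04 would need different handling.

```python
import re

# Matches lines such as "sm__warps_active.avg.pct_of_peak_sustained_active % 40.42".
METRIC_LINE = re.compile(r"^\s*(sm__\S+)\s+(\S+)\s+(-?[\d.,]+)\s*$")


def parse_metrics(report_text):
    """Return one {metric_name: value} dict per "Command line profiler metrics" section."""
    kernels = []
    current = None
    for line in report_text.splitlines():
        if "Section:" in line:
            # Only collect values while inside the command-line metrics section.
            if "Command line profiler metrics" in line:
                current = {}
                kernels.append(current)
            else:
                current = None
            continue
        match = METRIC_LINE.match(line)
        if match and current is not None:
            name, _unit, value = match.groups()
            # Assumption: "," is a thousands separator, as in "47,929" above.
            current[name] = float(value.replace(",", ""))
    return kernels


# Excerpt copied from the sample report above.
SAMPLE = """\
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
sm__maximum_warps_per_active_cycle_pct % 50
sm__warps_active.avg.pct_of_peak_sustained_active % 40.42
---------------------------------------------------------------------- --------------- ------------------------------
Section: GPU Speed Of Light Throughput
SM Active Cycles cycle 42,619.88
Section: Command line profiler metrics
sm__maximum_warps_per_active_cycle_pct % 100
sm__warps_active.avg.pct_of_peak_sustained_active % 84.01
"""

result = parse_metrics(SAMPLE)
for index, metrics in enumerate(result, start=1):
    print(index, metrics)
```

Each dict corresponds to one profiled kernel launch, in the order the launches appear in the report.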