PCI-e transactions in UVA

Stack Overflow user
Asked 2014-07-14 18:26:22
Answers: 1, Views: 260, Followers: 0, Votes: 0

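With Unified Virtual Addressing (UVA) in CUDA, memory copy calls from CPU to GPU, and vice versa, are scheduled internally. However, the nvprof CUDA profiler does not report PCI-e bus transactions for UVA. Is there a way to find out which data transfers are taking place from host to device and from device to host?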

1 Answer

Stack Overflow user

Answered 2014-07-14 22:19:02

Yes, nvprof can be made to report Unified Memory activity. You may want to study the available options with the following command:

nvprof --help

If you combine the --print-gpu-trace and --unified-memory-profiling per-process-device options, you should get results that indicate Unified Memory (UM) activity.

Here is an example:

$ cat t476.cu
#include <stdio.h>
#include <stdlib.h>
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__global__ void mykernel(int *d_data){

  printf("Data = %d\n", *d_data);
  *d_data = 0;
}

int main(){

  cudaDeviceProp myprop;
  int mydevice;
  int numdevices;
  cudaGetDeviceCount(&numdevices);
  cudaCheckErrors("get dev count fail");
  for (mydevice = 0; mydevice < numdevices; mydevice++){
    cudaGetDeviceProperties(&myprop, mydevice);
    printf("device %d: %s\n", mydevice, myprop.name);
    printf("device %d supports unified addressing: ", mydevice);
    if (myprop.unifiedAddressing) printf(" yes\n");
    else printf("  no\n");
    printf("device %d supports managed memory: ", mydevice);
    if (myprop.managedMemory) printf(" yes\n");
    else printf("  no\n");
    }
  cudaSetDevice(--mydevice);
  printf("using device %d\n", mydevice);
  int h_data = 1;
  int *d_data;
  cudaMalloc(&d_data, sizeof(int));
  cudaMemcpy(d_data, &h_data, sizeof(int), cudaMemcpyHostToDevice);
  mykernel<<<1,1>>>(d_data);
  cudaMemcpy(&h_data, d_data, sizeof(int), cudaMemcpyDeviceToHost);
  printf("data = %d\n", h_data);
  printf("now testing managed memory\n");
  int *m_data;
  cudaMallocManaged(&m_data, sizeof(int));
  cudaCheckErrors("managed mem fail");
  *m_data = 1;
  mykernel<<<1,1>>>(m_data);
  cudaDeviceSynchronize();
  printf("data = %d\n", *m_data);
  cudaCheckErrors("some error");
  return 0;
}
$ nvcc -arch=sm_35 -o t476 t476.cu                                                                             
$ nvprof --print-gpu-trace --unified-memory-profiling per-process-device ./t476
==5114== NVPROF is profiling process 5114, command: ./t476
device 0: GeForce GT 640
device 0 supports unified addressing:  yes
device 0 supports managed memory:  yes
using device 0
Data = 1
data = 0
now testing managed memory
Data = 1
data = 0
==5114== Profiling application: ./t476
==5114== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream             Unified Memory  Name
1.10622s  1.1200us                    -               -         -         -         -        4B  3.5714MB/s  GeForce GT 640          1         7                          -  [CUDA memcpy HtoD]
1.10687s  64.481us              (1 1 1)         (1 1 1)        32        0B        0B         -           -  GeForce GT 640          1         7                          -  mykernel(int*) [102]
1.10693s  2.3360us                    -               -         -         -         -        4B  1.7123MB/s  GeForce GT 640          1         7                          -  [CUDA memcpy DtoH]
1.12579s         -                    -               -         -         -         -         -           -  GeForce GT 640          -         -                          0  [Unified Memory CPU page faults]
1.12579s         -                    -               -         -         -         -         -           -  GeForce GT 640          -         -                        0 B  [Unified Memory Memcpy DtoH]
1.12579s         -                    -               -         -         -         -         -           -  GeForce GT 640          -         -                        0 B  [Unified Memory Memcpy HtoD]
1.12590s  64.097us              (1 1 1)         (1 1 1)        32        0B        0B         -           -  GeForce GT 640          1         7                          -  mykernel(int*) [108]
1.12603s         -                    -               -         -         -         -         -           -  GeForce GT 640          -         -                     4096 B  [Unified Memory Memcpy DtoH]
1.12603s         -                    -               -         -         -         -         -           -  GeForce GT 640          -         -                     4096 B  [Unified Memory Memcpy HtoD]
1.12603s         -                    -               -         -         -         -         -           -  GeForce GT 640          -         -                          1  [Unified Memory CPU page faults]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$
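
To answer the original question from a trace like the one above, the Unified Memory traffic in each direction can be totaled from the rows named [Unified Memory Memcpy HtoD] and [Unified Memory Memcpy DtoH]. The following is a minimal host-only C++ sketch, not an nvprof feature: `UmTotals`, `all_digits`, and `sum_um_transfers` are hypothetical names introduced here, and the code assumes the row layout shown above, where the Unified Memory column holds a plain byte count followed by `B`. Other nvprof versions may format that column differently.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Byte totals of the "Unified Memory" column, per transfer direction.
struct UmTotals {
    std::uint64_t htod_bytes = 0;
    std::uint64_t dtoh_bytes = 0;
};

// True if s is non-empty and consists only of decimal digits.
static bool all_digits(const std::string& s) {
    if (s.empty()) return false;
    for (unsigned char c : s) {
        if (!std::isdigit(c)) return false;
    }
    return true;
}

// Hypothetical helper: sums the byte counts on Unified Memory memcpy rows.
// Assumption: each such row ends with "<count> B  [Unified Memory Memcpy ...]".
// Rows that do not match this layout are skipped rather than guessed at.
UmTotals sum_um_transfers(const std::vector<std::string>& lines) {
    const std::string htod = "[Unified Memory Memcpy HtoD]";
    const std::string dtoh = "[Unified Memory Memcpy DtoH]";
    UmTotals totals;
    for (const std::string& line : lines) {
        const std::size_t h = line.find(htod);
        const std::size_t d = line.find(dtoh);
        const bool is_htod = (h != std::string::npos);
        const bool is_dtoh = (d != std::string::npos);
        if (!is_htod && !is_dtoh) continue;

        // Split everything before the row name on whitespace.
        std::istringstream in(line.substr(0, is_htod ? h : d));
        std::vector<std::string> tokens;
        for (std::string t; in >> t;) tokens.push_back(t);

        // The last two tokens should be the byte count and the unit "B".
        const std::size_t n = tokens.size();
        if (n < 2 || tokens[n - 1] != "B" || !all_digits(tokens[n - 2])) continue;

        const std::uint64_t bytes = std::stoull(tokens[n - 2]);
        if (is_htod) {
            totals.htod_bytes += bytes;
        } else {
            totals.dtoh_bytes += bytes;
        }
    }
    return totals;
}
```

Applied to the trace above, the two HtoD rows (0 B and 4096 B) and the two DtoH rows (0 B and 4096 B) add up to 4096 bytes in each direction. The explicit [CUDA memcpy HtoD] and [CUDA memcpy DtoH] rows are deliberately not counted, since their sizes appear in the separate Size column.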
Votes: 2
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link: https://stackoverflow.com/questions/24734445