NVLink misbehaving with 2080 Ti cards?

Server Fault user
Asked on 2020-05-08 14:34:50

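I'm having problems with NVLink on RTX cards, and I'd like to know whether someone more experienced could look at the output below and tell me if something is wrong.

The setup is a pair of MSI 2080 Ti cards and an ASUS RTX NVLink bridge on a Ryzen/X370 system, running Ubuntu 18.04 Linux. I have tried several versions of the Nvidia driver.

nvidia-smi calls run very slowly, and Caffe and the CUDA sample programs misbehave.

Programs like Caffe behave badly when run on both GPUs (i.e. running caffe on GPUs 0,1). Setup and scaffolding can take 20 minutes to complete (for GoogLeNet, which stands up in a few seconds on a single GPU). After that, training sometimes proceeds as expected and sometimes freezes after a few iterations.

I see the output below, which looks wrong to me. Am I mistaken?

Is this output odd, or is my understanding incorrect? Any help is much appreciated!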

I am running nvidia-persistenced as a daemon under my user account ID.

Details...

$ nvidia-smi -L    # Takes over a minute to finish running.
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-dd1093e0-466f-7322-e214-351b015045d9)
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-2a386612-018c-e3fe-3fd4-1dde588af45d)
$ nvidia-smi nvlink --status    # Takes over a minute to finish running.
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-dd1093e0-466f-7322-e214-351b015045d9)
         Link 0: 25.781 GB/s
         Link 1: <inactive>
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-2a386612-018c-e3fe-3fd4-1dde588af45d)
         Link 0: 25.781 GB/s
         Link 1: <inactive>

Shouldn't both links (0 and 1) be active?
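For anyone checking the same thing on their own machine, here is a minimal Python sketch that counts active links per GPU from the text printed by `nvidia-smi nvlink --status`. It assumes the line layout shown above; the function name `count_active_links` is made up for illustration, and the sample text is an abbreviated copy of the output with the UUIDs dropped. A link is counted as active only when its line reports a speed such as `25.781 GB/s`, so a blank value or any other marker counts as inactive.

```python
import re

# "GPU 0: ..." starts a new GPU section; "Link 0: ..." lines belong to it.
GPU_RE = re.compile(r"^GPU (\d+):")
LINK_RE = re.compile(r"^\s*Link (\d+):\s*(.*)$")
# A link is treated as active only if its value looks like "25.781 GB/s".
SPEED_RE = re.compile(r"^\d+(?:\.\d+)?\s*GB/s$")


def count_active_links(status_text):
    """Return {gpu_index: (active_links, total_links)} for the given status text."""
    result = {}
    gpu = None
    for line in status_text.splitlines():
        gpu_match = GPU_RE.match(line)
        if gpu_match:
            gpu = int(gpu_match.group(1))
            result[gpu] = (0, 0)
            continue
        link_match = LINK_RE.match(line)
        if link_match and gpu is not None:
            active, total = result[gpu]
            is_active = bool(SPEED_RE.match(link_match.group(2).strip()))
            result[gpu] = (active + int(is_active), total + 1)
    return result


# Abbreviated sample in the layout shown above (UUIDs omitted).
sample = """GPU 0: GeForce RTX 2080 Ti
         Link 0: 25.781 GB/s
         Link 1: <inactive>
GPU 1: GeForce RTX 2080 Ti
         Link 0: 25.781 GB/s
         Link 1:
"""

print(count_active_links(sample))
```

On this sample, each GPU comes out with one active link out of two, which matches the `NV1` entry in the topology matrix further down.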

$ nvidia-smi nvlink -c     # Takes several minutes to finish running.
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-dd1093e0-466f-7322-e214-351b015045d9)
         Link 0, P2P is supported: true
         Link 0, Access to system memory supported: true
         Link 0, P2P atomics supported: true
         Link 0, System memory atomics supported: true
         Link 0, SLI is supported: true
         Link 0, Link is supported: false
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-2a386612-018c-e3fe-3fd4-1dde588af45d)
         Link 0, P2P is supported: true
         Link 0, Access to system memory supported: true
         Link 0, P2P atomics supported: true
         Link 0, System memory atomics supported: true
         Link 0, SLI is supported: true
         Link 0, Link is supported: false

Shouldn't both Link 0 and Link 1 be listed here?

$ nvidia-smi nvlink --capabilities     # Takes several minutes to finish running.
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-dd1093e0-466f-7322-e214-351b015045d9)
         Link 0, P2P is supported: true
         Link 0, Access to system memory supported: true
         Link 0, P2P atomics supported: true
         Link 0, System memory atomics supported: true
         Link 0, SLI is supported: true
         Link 0, Link is supported: false
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-2a386612-018c-e3fe-3fd4-1dde588af45d)
         Link 0, P2P is supported: true
         Link 0, Access to system memory supported: true
         Link 0, P2P atomics supported: true
         Link 0, System memory atomics supported: true
         Link 0, SLI is supported: true
         Link 0, Link is supported: false

Shouldn't both Link 0 and Link 1 be listed here?

$ nvidia-smi topo --matrix    # Takes over a minute to finish running.

        GPU0    GPU1    CPU Affinity
GPU0     X      NV1     0-11
GPU1    NV1      X      0-11

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

Shouldn't we see NV2 here (i.e. the 2080 Ti bridge carries a pair of NVLink "links")?

$ simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = "GeForce RTX 2080 Ti" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "GeForce RTX 2080 Ti" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from GeForce RTX 2080 Ti (GPU0) -> GeForce RTX 2080 Ti (GPU1) : Yes
> Peer access from GeForce RTX 2080 Ti (GPU1) -> GeForce RTX 2080 Ti (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> GeForce RTX 2080 Ti (GPU0) supports UVA: Yes
> GeForce RTX 2080 Ti (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 22.52GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed

AFAIK cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1 should manage around 44 GB/s, not 22 GB/s?

$ p2pBandwidthLatencyTest

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce RTX 2080 Ti, pciBusID: a, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce RTX 2080 Ti, pciBusID: b, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 529.48   3.20
     1   3.19 532.01
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 531.71  24.23
     1  24.23 530.74
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 533.58   6.30
     1   6.31 526.98
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 525.50  48.37
     1  48.41 523.52
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.26  12.74
     1  15.19   1.44

   CPU     0      1
     0   3.92   9.03
     1   8.86   3.82
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.25   0.92
     1   0.96   1.44

   CPU     0      1
     0   4.22   2.78
     1   2.78   3.76

But here, the "Bidirectional P2P=Enabled Bandwidth Matrix" should show 96 GB/s, not 48.41.
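As a rough sanity check on these numbers, the sketch below works out the theoretical peak for one and two active links. It assumes that the 25.781 GB/s figure from `nvidia-smi nvlink --status` is a per-link, per-direction rate and that bandwidth scales linearly with the number of bonded links. Both are assumptions for illustration, not something the tool output confirms, and `expected_peak` is a hypothetical helper.

```python
# Per-link rate as reported by `nvidia-smi nvlink --status` above.
# Assumption: this is the rate in ONE direction over ONE link.
LINK_RATE_GBPS = 25.781


def expected_peak(active_links, bidirectional=False):
    """Theoretical peak in GB/s under the linear-scaling assumption."""
    directions = 2 if bidirectional else 1
    return LINK_RATE_GBPS * active_links * directions


# Measured P2P=Enabled figures from p2pBandwidthLatencyTest above.
measured_uni = 24.23
measured_bi = 48.41

for links in (1, 2):
    uni = expected_peak(links)
    bi = expected_peak(links, bidirectional=True)
    print(f"{links} link(s): peak {uni:.2f} GB/s one way, {bi:.2f} GB/s both ways; "
          f"measured is {measured_uni / uni:.0%} and {measured_bi / bi:.0%} of peak")
```

Under those assumptions, the measured 24.23 and 48.41 GB/s sit at roughly 94% of the single-link peak and under half of the two-link peak, which fits a bridge running on one link rather than two.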


1 Answer

Server Fault user

Posted on 2021-04-21 07:41:19

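I had a very similar experience, and the root cause was that the NVLink bridge was not seated properly. You can remove the NVLink bridge and re-check how quickly the nvidia-smi commands respond. The bridge has a connector on each side, and both must make proper contact with the GPUs. Otherwise the `nvidia-smi nvlink --status` command shows "inactive", and you may get this kind of slow response or a hang.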
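To make the before/after comparison less subjective, the response times can be measured. The following is a minimal Python sketch, assuming `nvidia-smi` is on the `PATH`; `time_command` and `report` are hypothetical helper names, and the 600-second timeout is an arbitrary choice so that a hung call does not block forever.

```python
import subprocess
import sys
import time

# The two commands from the question that took over a minute to return.
COMMANDS = (
    ["nvidia-smi", "-L"],
    ["nvidia-smi", "nvlink", "--status"],
)


def time_command(cmd, timeout_s=600):
    """Run cmd and return (elapsed_seconds, return_code).

    return_code is None if the command timed out or could not be found.
    """
    start = time.perf_counter()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        code = proc.returncode
    except (subprocess.TimeoutExpired, FileNotFoundError):
        code = None
    return time.perf_counter() - start, code


def report(commands=COMMANDS, out=sys.stdout):
    """Print one timing line per command."""
    for cmd in commands:
        elapsed, code = time_command(cmd)
        print(f"{' '.join(cmd)}: {elapsed:.1f} s (exit code {code})", file=out)
```

Calling `report()` once with the bridge installed and once with it removed gives two sets of timings to compare. If the calls return in about a second without the bridge, that points at the bridge or its seating.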

Original page content provided by Server Fault.
Original link:

https://serverfault.com/questions/1016290
