
PAGE_FAULT_IN_NONPAGED_AREA - PyTorch

Stack Overflow user
Asked on 2022-01-22 09:19:09
Answers: 1 · Views: 416 · Followers: 0 · Votes: 1

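A BSOD follows the opening of a VRAM-intensive application while training is running on the GPU. In the past I regularly opened such applications without problems, even with larger models; if they didn't have enough memory (which training won't free), the applications would use shared memory. I don't know what is different now.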

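Training runs under a try-except in __main__, and the dump file states (as does this reference):

"Typically, the referenced address is in freed memory or is simply invalid. This cannot be protected by a try-except handler; it can only be protected by a probe or similar programming techniques."

If I replace it with try-finally, the message is nearly the same, except that it actually says try-finally, even though many try-except blocks lead up to it in non-__main__ code. I haven't yet tried running __main__ without a try.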

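Training is done via PyTorch Lightning with AMP16 and no multiprocessing. The crash is not easily reproducible, and the dump file doesn't seem to point to Python code. How can this be debugged?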
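A bugcheck halts the whole machine, so no Python handler (try-except or try-finally) gets a chance to run. One low-tech way to at least narrow down what the process was doing is to write a breadcrumb to disk before each step and force it to stable storage, so the last line survives the crash. The following is a minimal sketch under that assumption; `breadcrumb`, `last_breadcrumb`, and the log path are hypothetical helpers, not part of PyTorch or PyTorch Lightning.

```python
import os
import time


def breadcrumb(path, message):
    """Append one timestamped line to `path` and force it to disk."""
    line = f"{time.strftime('%Y-%m-%d %H:%M:%S')} {message}\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to write it to the device
    return line


def last_breadcrumb(path):
    """Return the last recorded line, or None if the file is missing or empty."""
    try:
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
    except FileNotFoundError:
        return None
    return lines[-1] if lines else None
```

For example, calling `breadcrumb("train_steps.log", f"step {i} start")` at the top of each training step and reading `last_breadcrumb("train_steps.log")` after the reboot shows the last step that began. This narrows down when the crash happened, not why; the fault itself is in kernel-mode code.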

Env info: Windows 10 x64, RTX 2080 Super, AMD Ryzen 5 3600, BIOS 4602, latest drivers

Python 3.8.12, PyTorch 1.10.0, PyTorch-Lightning 1.5.7
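Version details like these can be gathered programmatically so that they are exact in a bug report. The sketch below uses only the standard library; the function name `collect_env` is hypothetical, and the distribution names `torch` and `pytorch-lightning` are assumed to match what pip installed.

```python
import platform
from importlib import metadata


def collect_env(distributions=("torch", "pytorch-lightning")):
    """Return OS, Python, and installed package versions as a dict."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
    }
    for name in distributions:
        try:
            # Reads package metadata without importing the package itself.
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

PyTorch also ships its own report script, `python -m torch.utils.collect_env`, which additionally lists CUDA and driver details where it can detect them.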

Dump summary

************* Path validation summary **************
Response                         Time (ms)     Location
Deferred                                       srv*
Symbol search path is: srv*
Executable search path is: 
Windows 10 Kernel Version 19041 MP (12 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS Personal
Edition build lab: 19041.1.amd64fre.vb_release.191206-1406
Machine Name:
Kernel base = 0xfffff803`80e00000 PsLoadedModuleList = 0xfffff803`81a2a2b0
Debug session time: Fri Jan 21 19:46:53.513 2022 (UTC + 4:00)
System Uptime: 0 days 0:57:47.148
Loading Kernel Symbols
...............................................................
................................................................
................................................................
..........................
Loading User Symbols
PEB is paged out (Peb.Ldr = 00000099`5954d018).  Type ".hh dbgerr001" for details
Loading unloaded module list
............
For analysis of this file, run !analyze -v
nt!KeBugCheckEx:
fffff803`811f72e0 48894c2408      mov     qword ptr [rsp+8],rcx ss:0018:ffffd48c`001d3c10=0000000000000050
0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced.  This cannot be protected by try-except.
Typically the address is just plain bad or it is pointing at freed memory.
Arguments:
Arg1: ffff9b84f54fe000, memory referenced.
Arg2: 0000000000000002, value 0 = read operation, 1 = write operation.
Arg3: fffff803966bd413, If non-zero, the instruction address which referenced the bad memory
    address.
Arg4: 0000000000000000, (reserved)

Debugging Details:
------------------


KEY_VALUES_STRING: 1

    Key  : AV.Type
    Value: Write

    Key  : Analysis.CPU.mSec
    Value: 4937

    Key  : Analysis.DebugAnalysisManager
    Value: Create

    Key  : Analysis.Elapsed.mSec
    Value: 7978

    Key  : Analysis.Init.CPU.mSec
    Value: 734

    Key  : Analysis.Init.Elapsed.mSec
    Value: 7797

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 85

    Key  : WER.OS.Branch
    Value: vb_release

    Key  : WER.OS.Timestamp
    Value: 2019-12-06T14:06:00Z

    Key  : WER.OS.Version
    Value: 10.0.19041.1


FILE_IN_CAB:  MEMORY.DMP

BUGCHECK_CODE:  50

BUGCHECK_P1: ffff9b84f54fe000

BUGCHECK_P2: 2

BUGCHECK_P3: fffff803966bd413

BUGCHECK_P4: 0

READ_ADDRESS:  ffff9b84f54fe000 Paged pool

MM_INTERNAL_CODE:  0

IMAGE_NAME:  nvlddmkm.sys

MODULE_NAME: nvlddmkm

FAULTING_MODULE: fffff80395f30000 nvlddmkm

BLACKBOXBSD: 1 (!blackboxbsd)


BLACKBOXNTFS: 1 (!blackboxntfs)


BLACKBOXPNP: 1 (!blackboxpnp)


BLACKBOXWINLOGON: 1

PROCESS_NAME:  python.exe

TRAP_FRAME:  ffffd48c001d3eb0 -- (.trap 0xffffd48c001d3eb0)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=ffff9b84f54fd278 rbx=0000000000000000 rcx=ffff9b84f54fe010
rdx=00005c7ea1422a48 rsi=0000000000000000 rdi=0000000000000000
rip=fffff803966bd413 rsp=ffffd48c001d4048 rbp=ffff9b84f54e2000
 r8=0000000000000000  r9=0000000000000012 r10=0000000000000000
r11=fffff80396920ed8 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl nz na po nc
nvlddmkm+0x78d413:
fffff803`966bd413 0f2941f0        movaps  xmmword ptr [rcx-10h],xmm0 ds:ffff9b84`f54fe000=????????????????????????????????
Resetting default scope

STACK_TEXT:  
ffffd48c`001d3c08 fffff803`8124a81f     : 00000000`00000050 ffff9b84`f54fe000 00000000`00000002 ffffd48c`001d3eb0 : nt!KeBugCheckEx
ffffd48c`001d3c10 fffff803`8109f4d0     : 00000001`00000000 00000000`00000002 ffffd48c`001d3f30 00000000`00000000 : nt!MiSystemFault+0x18d32f
ffffd48c`001d3d10 fffff803`8120545e     : 00000000`00000000 00000000`00000001 fffff803`81a50bc0 ffff9b84`f54fd000 : nt!MmAccessFault+0x400
ffffd48c`001d3eb0 fffff803`966bd413     : fffff803`96c589ad fffff803`968dee68 00000000`00000000 00000000`00000000 : nt!KiPageFault+0x35e
ffffd48c`001d4048 fffff803`96c589ad     : fffff803`968dee68 00000000`00000000 00000000`00000000 00000000`00000000 : nvlddmkm+0x78d413
ffffd48c`001d4050 fffff803`96c70d46     : 00000000`00000000 ffffd48c`001d4100 ffff8b06`09f02000 ffff9b84`f54e2000 : nvlddmkm!nvDumpConfig+0x3e750d
ffffd48c`001d4090 fffff803`96d03f47     : ffffd48c`001d4260 fffff803`81a4f0c0 00000000`00000001 ffffd48c`001d46b8 : nvlddmkm!nvDumpConfig+0x3ff8a6
ffffd48c`001d4140 fffff803`96d05607     : 00000000`c000000d ffffd48c`001d4280 00000000`00000001 ffff8b06`09f02000 : nvlddmkm!nvDumpConfig+0x492aa7
ffffd48c`001d4180 fffff803`96c8b9a3     : 00000000`c000000d ffffd48c`001d4469 ffffd48c`001d46b8 00000000`c000000d : nvlddmkm!nvDumpConfig+0x494167
ffffd48c`001d43e0 fffff803`90e1afba     : 00000000`00000000 ffff8b06`744ad300 00000000`00000000 00000000`4e562a2a : nvlddmkm!nvDumpConfig+0x41a503
ffffd48c`001d44d0 fffff803`90cf9c39     : ffff8b06`0f21b868 ffffd48c`00000000 ffff8b06`0f21b868 ffffffff`ffffffff : dxgkrnl!TdrIsEnabled+0x821ca
ffffd48c`001d4580 fffff803`81208cb8     : 00000206`40efb900 ffff8b06`1d426080 00000000`00000000 ffff8b06`00000000 : dxgkrnl!NtGdiDdDDIEscape+0x1879
ffffd48c`001d4b00 00007fff`b2bf4be4     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x28
00000099`5b30def8 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x00007fff`b2bf4be4


SYMBOL_NAME:  nvlddmkm+78d413

STACK_COMMAND:  .cxr; .ecxr ; kb

BUCKET_ID_FUNC_OFFSET:  78d413

FAILURE_BUCKET_ID:  AV_W_(null)_nvlddmkm!unknown_function

OS_VERSION:  10.0.19041.1

BUILDLAB_STR:  vb_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {838100fa-f28b-2ef7-d702-e31713cb338c}

Followup:     MachineOwner
---------

Full dump (~5 GB uncompressed)
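A few numbers in the dump summary can be cross-checked by hand. The sketch below only does arithmetic on values copied from the dump: it confirms that the faulting instruction lies at offset 0x78d413 inside nvlddmkm, that the `movaps` store to `[rcx-10h]` targets exactly the address in Arg1, and that this address is the first byte of a 4 KiB page. The variable names are illustrative, and the 4 KiB page size is an assumption (the standard small-page size on x64).

```python
PAGE_SIZE = 0x1000  # assumed 4 KiB pages on x64

# Values copied from the dump summary.
fault_address = 0xFFFF9B84F54FE000  # Arg1 / BUGCHECK_P1
fault_rip = 0xFFFFF803966BD413      # Arg3 / BUGCHECK_P3
module_base = 0xFFFFF80395F30000    # FAULTING_MODULE (nvlddmkm)
rcx = 0xFFFF9B84F54FE010            # rcx in the trap frame

rip_offset = fault_rip - module_base      # offset of the instruction in the driver
store_target = rcx - 0x10                 # movaps xmmword ptr [rcx-10h], xmm0
page_offset = fault_address % PAGE_SIZE   # position of the write within its page

print(hex(rip_offset))    # 0x78d413
print(hex(store_target))  # 0xffff9b84f54fe000
print(page_offset)        # 0
```

The offset matches `SYMBOL_NAME: nvlddmkm+78d413`, and the analysis classifies the access as a write (`AV.Type: Write`, `AV_W` in the failure bucket). A 16-byte store landing on the very first bytes of a page is consistent with, though it does not prove, a copy inside the driver running past the end of a mapped buffer. Either way, the faulting code is in the NVIDIA kernel-mode driver rather than in Python.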


1 Answer

Stack Overflow user

Answered on 2022-02-09 07:39:31

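I had the same problem: PAGE_FAULT_IN_NONPAGED_AREA, nvlddmkm.sys, python.exe running torch. I was able to reproduce it consistently.

The solution was to nuke (completely uninstall) the NVIDIA driver and then let Windows install its own version of the NVIDIA driver, which is more stable.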
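The answer does not say which driver versions were involved, so it can help to record the installed version before and after the reinstall in order to compare them. A minimal sketch, assuming `nvidia-smi` is on the PATH (it is normally installed with the NVIDIA driver); the function names are hypothetical.

```python
import shutil
import subprocess


def parse_driver_query(output):
    """Parse 'name, driver_version' CSV lines into (name, version) tuples."""
    rows = []
    for line in output.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma in case the GPU name contains one.
        name, _, version = line.rpartition(",")
        rows.append((name.strip(), version.strip()))
    return rows


def installed_nvidia_driver():
    """Return [(gpu_name, driver_version)], or None if nvidia-smi is not found."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if nvidia-smi fails
    )
    return parse_driver_query(result.stdout)
```

Running `installed_nvidia_driver()` once before removing the driver and once after Windows has installed its own version gives two values to note alongside whether the crash still reproduces.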

Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/70811473
