My OS hit a kernel panic (it looks like it triggered another kernel dump, kdump?):
[ 124.674715] core: Uncorrected hardware memory error in user-access at xxxxxxx
[ 124.684140] BUG: scheduling while atomic: einj_mem_uc/5151/0xxxxxxxxx
[ 124.684310] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
r = 0xxxxxxxxxxx[ 124.691839] Memory failure: 0x25eae3: Killing einj_mem_uc:6161 due to hardware memory corruption
[ 124.700827] {1}[Hardware Error]: event severity: recoverable
[ 124.700828] {1}[Hardware Error]: Error 0, type: recoverable
00 paddr = xxxxx[ 124.700829] {1}[Hardware Error]: fru_text: Card01, ChnE, DIMM0
[ 124.700830] {1}[Hardware Error]: section_type: memory error
[ 124.700835] {1}[Hardware Error]: error_status: 0x0000000000000400
[ 124.712309] Memory failure: 0x25eae3: recovery action for dirty LRU page: Recovered
[ 124.718713] {1}[Hardware Error]: physical_address: 0x000000015ace3400
[ 124.718715] {1}[Hardware Error]: node: 0 card: 4 module: 0 rank: 0 bank: 21 device: 0 row: 10455 column: 1408
[ 124.718716] {1}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[ 124.718718] {1}[Hardware Error]: DIMM location: _Node0_Channel4_Dimm0 CPU0_E0
[ 124.791089] Memory failure: 0x25eae3: already hardware poisoned
3 116
400
[ 0.000000] Linux version 4.18.0-348.el8.x86_64

I checked the source code:
https://elixir.bootlin.com/linux/v4.18/source/kernel/sched/core.c#L3287
The OS should panic only when panic_on_warn == 1, but I checked my system:
sudo sysctl -a | grep -i panic_on
...
kernel.panic_on_warn = 0

Posted on 2022-08-27 00:35:16
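OK, to confirm my comment here, and thanks for the additional information:
The kernel is not panicking because of the "BUG: scheduling while atomic" message (with kernel.panic_on_warn = 0, that is not a plausible cause). The more likely reason is the repeated hardware memory failure detected by the MCE interrupt handler, which is possibly the source of some fatal problem in that handler. The log points the same way: page 0x25eae3 is first reported as "Recovered" and is then hit again as "already hardware poisoned".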
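To look at more than panic_on_warn, a minimal sketch along the following lines could list the other knobs that decide whether a warning or a memory failure escalates to a panic. The function names are hypothetical and the directory arguments are there only so the functions can be pointed at test data. It assumes a Linux system with /proc mounted, and it reports configuration only; it does not prove what caused this particular panic.

```shell
#!/bin/sh
# Sketch: summarize the sysctl knobs related to panics and memory failures.
# Function names are hypothetical; the optional path arguments default to
# the live /proc files.

# Print "name = value" for kernel.panic and every kernel.panic_on_* knob.
show_panic_knobs() {
    dir="${1:-/proc/sys/kernel}"
    for f in "$dir"/panic "$dir"/panic_on_*; do
        [ -r "$f" ] || continue          # skip knobs this kernel lacks
        printf '%s = %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}

# Print the vm.memory_failure_* knobs. Per the kernel sysctl docs,
# memory_failure_recovery = 0 means "always panic on a memory failure".
show_memory_failure_knobs() {
    dir="${1:-/proc/sys/vm}"
    for f in "$dir"/memory_failure_*; do
        [ -r "$f" ] || continue
        printf '%s = %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}

# Print how much memory (in kB) the kernel has marked as hardware poisoned.
hw_corrupted_kb() {
    file="${1:-/proc/meminfo}"
    [ -r "$file" ] || return 0           # nothing to report without meminfo
    awk '$1 == "HardwareCorrupted:" { print $2 }' "$file"
}

show_panic_knobs
show_memory_failure_knobs
echo "HardwareCorrupted kB: $(hw_corrupted_kb)"
```

A non-zero HardwareCorrupted value after the test run would be consistent with the "Memory failure" lines in the log above.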
https://unix.stackexchange.com/questions/715005