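I have a server built on a SuperMicro MBD-X9DRD-EF motherboard. For a year it ran well under CentOS 7 with one CPU (a genuine Intel Xeon X6 E5-2620v2) and 128 GB (8x16 GB) of LVDDR 1600 Crucial ECC (PC3-12800) memory. Last month we upgraded the server by adding a second CPU and another 128 GB of memory identical to the existing modules. After 3-4 days of heavy use, however, we started receiving errors like the following, very frequently: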
[root@GBserver log]# dmesg
[614781.869098] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[614781.869104] EDAC sbridge MC1: CPU 6: Machine Check Event: 0 Bank 7: 8c00004000010090
[614781.869106] EDAC sbridge MC1: TSC 0
[614781.869108] EDAC sbridge MC1: ADDR 38126a6c40
[614781.869110] EDAC sbridge MC1: MISC 14066ca86
[614781.869112] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1473082855 SOCKET 1 APIC 20
[614782.595676] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x38126a6 offset:0xc40 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:1 rank:1)
edac-util output:
[root@GBserver log]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 296182 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 296182 Corrected Errors
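With sixteen DIMM counters in the report, the nonzero one is easy to miss. A minimal Python sketch for filtering the `edac-util -v` text is shown here; it assumes the per-DIMM lines always look like the ones in the output above, and the names `DIMM_LINE` and `dimms_with_errors` are made up for illustration.

```python
import re
from typing import Dict

# Matches per-DIMM lines in the format shown above, for example:
#   mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 296182 Corrected Errors
# Summary lines ("mc1: csrow0: 0 Uncorrected Errors") do not match.
DIMM_LINE = re.compile(
    r"^(?P<mc>mc\d+): (?P<csrow>csrow\d+): (?P<label>\S+): "
    r"(?P<count>\d+) Corrected Errors\s*$"
)


def dimms_with_errors(report: str) -> Dict[str, int]:
    """Return {DIMM label: corrected-error count} for nonzero counters only."""
    hits: Dict[str, int] = {}
    for line in report.splitlines():
        match = DIMM_LINE.match(line.strip())
        if match and int(match.group("count")) > 0:
            hits[match.group("label")] = int(match.group("count"))
    return hits


# Three lines copied from the report above.
sample = """\
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 296182 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
"""

print(dimms_with_errors(sample))
```

On a live system the same function could be fed the real report, for instance with `subprocess.run(["edac-util", "-v"], capture_output=True, text=True).stdout`. Here every corrected error is attributed to a single label, `CPU_SrcID#1_Ha#0_Chan#0_DIMM#0`.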
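Are these errors caused by a fault in the motherboard, the CPU, or the OS, or is a memory module damaged? What should we do? How can we identify the damaged module?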
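As a starting point for the last question, the EDAC line in dmesg already names the socket, channel, and slot. The sketch below pulls those fields out of one such line; it assumes the line format shown in the log above and a 4 KiB page size, and `CE_LINE` and `parse_ce_line` are illustrative names. Which physical slot on the board corresponds to the label is not in the log and would have to come from the motherboard manual.

```python
import re

# Pattern for the corrected-error (CE) summary line printed by the EDAC core,
# as it appears in the dmesg output above.
CE_LINE = re.compile(
    r"EDAC (?P<mc>MC\d+): \d+ CE .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:(?P<page>0x[0-9a-fA-F]+) offset:(?P<offset>0x[0-9a-fA-F]+)"
    r".*?socket:(?P<socket>\d+)"
)


def parse_ce_line(line):
    """Extract location fields from one EDAC CE line, or return None."""
    match = CE_LINE.search(line)
    if match is None:
        return None
    page = int(match.group("page"), 16)
    offset = int(match.group("offset"), 16)
    return {
        "mc": match.group("mc"),
        "label": match.group("label"),
        "socket": int(match.group("socket")),
        "channel": int(match.group("channel")),
        "slot": int(match.group("slot")),
        # Assumption: 4 KiB pages, so address = page * 4096 + offset.
        "address": hex((page << 12) | offset),
    }


# The CE line from the dmesg output in the question.
LINE = (
    "[614782.595676] EDAC MC1: 1 CE memory read error on "
    "CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x38126a6 "
    "offset:0xc40 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 "
    "socket:1 ha:0 channel_mask:1 rank:1)"
)

print(parse_ce_line(LINE))
```

For this line the reconstructed address is `0x38126a6c40`, which agrees with the `ADDR 38126a6c40` field of the machine check record. That suggests the page and offset are being read correctly. Socket 1, channel 0, slot 0 points at the same DIMM that carries the nonzero counter in the `edac-util` report.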
Posted on 2016-09-19 13:55:18
After 3 weeks, about 11M corrected errors had been logged. After checking the BIOS log, I found that a memory module had failed.

That is the answer to my question.
Next, I will remove the damaged module and replace it with another one.
https://serverfault.com/questions/801150