
How do I check/repair NVMe health?

Unix & Linux user
Asked 2019-05-13 08:55:00
2 answers, 8.8K views, 0 followers, 6 votes

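I run Debian stable on 2 x NVMe drives in RAID 1.

This is the hardware/host it runs on: https://www.hetzner.com/dedicated-rootserver/ex62-nvme?country=us

Almost every other day, mdadm monitoring reports a Fail event and the array becomes degraded.

Only one partition gets kicked out, as shown here: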

This is an automatically generated mail message from mdadm
running on xxx

A Fail event had been detected on md device /dev/md/2.

It could be related to component device /dev/nvme1n1p3.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md2 : active raid1 nvme1n1p3[1](F) nvme0n1p3[0]
      465895744 blocks super 1.2 [2/1] [U_]
      bitmap: 4/4 pages [16KB], 65536KB chunk

md0 : active (auto-read-only) raid1 nvme1n1p1[1] nvme0n1p1[0]
      33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 nvme0n1p2[0] nvme1n1p2[1]
      523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>

This happens on both disks: one time it is nvme0n1p3, the next time nvme1n1p3.

I then re-add the failed partition to the array with
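As a rough illustration of what the mail above reports, the following sketch lists degraded arrays and their failed members from /proc/mdstat-formatted text. The function name `degraded_arrays` is hypothetical, and the parsing assumes the layout shown in the mail (arrays named `mdN`, failed members marked `(F)`, status such as `[U_]` at the end of the "blocks" line); other layouts may need adjustments.

```shell
#!/bin/sh
# Hypothetical helper: print one line per degraded md array in the form
# "<array> <failed member> ...". Reads /proc/mdstat-formatted text on stdin.
degraded_arrays() {
    awk '
        # Array header, e.g. "md2 : active raid1 nvme1n1p3[1](F) nvme0n1p3[0]"
        /^md[0-9]+ :/ {
            name = $1
            failed = ""
            for (i = 3; i <= NF; i++)
                if ($i ~ /\(F\)$/) {
                    member = $i
                    sub(/\[.*$/, "", member)   # strip "[1](F)"
                    failed = failed " " member
                }
            next
        }
        # Status line, e.g. "465895744 blocks super 1.2 [2/1] [U_]";
        # an underscore in the last field marks a missing member.
        name != "" && /blocks/ {
            if ($NF ~ /^\[[U_]*_[U_]*\]$/) print name failed
            name = ""
        }
    '
}
```

On a live system this would be called as `degraded_arrays < /proc/mdstat`; for the state quoted in the mail it reports md2 with nvme1n1p3 as the failed member.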

mdadm --re-add /dev/md2 /dev/nvme0n1p3

mdadm --re-add /dev/md2 /dev/nvme1n1p3

After the resync, it works for a day or two.

In dmesg I found this:

[94879.144892] nvme nvme1: I/O 311 QID 1 timeout, reset controller
[94879.252851] nvme nvme1: completing aborted command with status: 0007
[94879.252970] blk_update_request: I/O error, dev nvme1n1, sector 452352001
[94879.253091] nvme nvme1: completing aborted command with status: fffffffc
[94879.253223] blk_update_request: I/O error, dev nvme1n1, sector 68159504
[94879.253418] md: super_written gets error=-5
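To see whether such controller resets cluster on one drive or hit both, the kernel log can be tallied per controller. This is only a sketch: the function name `nvme_timeout_counts` is hypothetical, and it assumes the message wording shown above ("nvme nvmeN: ... timeout"), which may differ between kernel versions.

```shell
#!/bin/sh
# Hypothetical helper: count NVMe timeout messages per controller.
# Reads kernel log text (for example the output of dmesg) on stdin and
# prints lines of the form "<controller> <count>".
nvme_timeout_counts() {
    awk '
        /nvme nvme[0-9]+: .*timeout/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^nvme[0-9]+:$/) {
                    ctrl = $i
                    sub(/:$/, "", ctrl)   # "nvme1:" -> "nvme1"
                    count[ctrl]++
                    break
                }
        }
        END { for (c in count) print c, count[c] }
    ' | sort
}
```

A typical call would be `dmesg | nvme_timeout_counts`.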

I tried to check the health of the devices with the commands below, but they do not report statistics such as "Reallocated_Sector_Ct" or "Reported_Uncorrect".

smartctl -x /dev/nvme1

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-8-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       KXG50ZNV512G TOSHIBA
Serial Number:                      28SS10F6TYST
Firmware Version:                   AAGA4102
PCI Vendor/Subsystem ID:            0x1179
IEEE OUI Identifier:                0x00080d
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Mon May 13 10:34:11 2019 CEST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     78 Celsius
Critical Comp. Temp. Threshold:     82 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W       -        -    0  0  0  0        0       0
 1 +     2.40W       -        -    1  1  1  1        0       0
 2 +     1.90W       -        -    2  2  2  2        0       0
 3 -   0.0500W       -        -    3  3  3  3     1500    1500
 4 -   0.0050W       -        -    4  4  4  4     6000   14000
 5 -   0.0030W       -        -    5  5  5  5    50000   80000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        47 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    57%
Data Units Read:                    31,858,921 [16.3 TB]
Data Units Written:                 293,589,002 [150 TB]
Host Read Commands:                 4,130,502,428
Host Write Commands:                889,121,505
Controller Busy Time:               13,552
Power Cycles:                       7
Power On Hours:                     6,720
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               47 Celsius

Error Information (NVMe Log 0x01, max 128 entries)
No Errors Logged

nvme smart-log /dev/nvme1

Smart Log for NVME device:nvme1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 47 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 57%
data_units_read                     : 31,858,921
data_units_written                  : 293,589,023
host_read_commands                  : 4,130,502,429
host_write_commands                 : 889,122,059
controller_busy_time                : 13,552
power_cycles                        : 7
power_on_hours                      : 6,720
unsafe_shutdowns                    : 0
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 47 C
Temperature Sensor 2                : 0 C
Temperature Sensor 3                : 0 C
Temperature Sensor 4                : 0 C
Temperature Sensor 5                : 0 C
Temperature Sensor 6                : 0 C
Temperature Sensor 7                : 0 C
Temperature Sensor 8                : 0 C

nvme smart-log-add /dev/nvme1

NVMe Status:INVALID_LOG_PAGE(4109)

smartctl -A /dev/nvme1

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-8-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        46 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    57%
Data Units Read:                    31,858,924 [16.3 TB]
Data Units Written:                 293,591,327 [150 TB]
Host Read Commands:                 4,130,502,490
Host Write Commands:                889,172,096
Controller Busy Time:               13,552
Power Cycles:                       7
Power On Hours:                     6,721
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               46 Celsius
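Names such as "Reallocated_Sector_Ct" are ATA SMART attributes; NVMe drives report the standardized SMART/Health log (Log 0x02) shown above instead. A minimal sketch for condensing that log into the fields that usually signal media trouble follows. The function names are hypothetical, the field labels are assumed to match the smartctl 6.6 output above, and the "attention" rule (non-zero critical warning, any media error, or spare below threshold) is an illustrative choice, not an official health criterion. The second helper converts the "Data Units" counters, which the NVMe specification defines in thousands of 512-byte units.

```shell
#!/bin/sh
# Hypothetical helper: summarise NVMe health fields from `smartctl -A`
# style text on stdin (labels as printed by smartctl 6.6 above).
nvme_health_summary() {
    awk -F': *' '
        # Strip blanks and thousands separators from a value.
        function val(s) { gsub(/[ \t,]/, "", s); return s }
        /^Critical Warning:/                { crit   = val($2) }
        /^Available Spare:/                 { spare  = val($2) + 0 }
        /^Available Spare Threshold:/       { thresh = val($2) + 0 }
        /^Percentage Used:/                 { used   = val($2) + 0 }
        /^Media and Data Integrity Errors:/ { media  = val($2) }
        END {
            # Illustrative rule only, see the lead-in text.
            status = "ok"
            if (crit != "0x00" || media != "0" || spare < thresh)
                status = "attention"
            printf "critical_warning=%s spare=%d%% threshold=%d%% used=%d%% media_errors=%s status=%s\n", crit, spare, thresh, used, media, status
        }
    '
}

# One NVMe "data unit" is 1000 blocks of 512 bytes, i.e. 512000 bytes.
data_units_to_bytes() {
    echo $(( $1 * 512000 ))
}
```

Possible usage: `smartctl -A /dev/nvme1 | nvme_health_summary`. For the 293,589,002 data units written above, `data_units_to_bytes 293589002` gives 150317569024000 bytes, which matches the "[150 TB]" that smartctl prints.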

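I only noticed the problem after apache failed to start; I repaired the filesystem with fsck.ext4 -f. Until then, mail to root had not been set up correctly.

So it looks to me like a hardware fault, and I should get rid of both NVMe drives.

Is there anything I can try to fix these problems and keep the NVMe drives? Or at least a way to get SMART values such as "Reported_Uncorrect" or "Offline_Uncorrectable"?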


2 Answers

Unix & Linux user

Posted 2019-05-13 08:58:05

What about smartctl -A /dev/nvme<xxxx>?

From the help:

-A, --attributes Show device SMART vendor-specific Attributes and values

Votes: 0

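Unix & Linux user

Posted 2021-07-08 19:36:47

Since Media and Data Integrity Errors is 0 in the SMART log you shared, there do not appear to be any uncorrectable ECC or CRC problems. To get information about uncorrectable errors occurring on the PCIe link, you can try reading the device's PCIe AER (Advanced Error Reporting) data.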
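One possible way to read AER data, sketched under assumptions: newer kernels expose per-device AER counters as the sysfs files `aer_dev_correctable`, `aer_dev_nonfatal` and `aer_dev_fatal` in the PCI device directory. The 4.9 kernel shown in the question may not provide these files, in which case the AER capability section of `lspci -vvv` output is an alternative. The function name `print_aer_counters` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the AER counter files found in a PCI device
# directory, e.g. /sys/bus/pci/devices/<domain:bus:dev.fn>.
print_aer_counters() {
    dev_dir=$1
    for f in aer_dev_correctable aer_dev_nonfatal aer_dev_fatal; do
        if [ -r "$dev_dir/$f" ]; then
            echo "== $f =="
            cat "$dev_dir/$f"
        else
            # Older kernels or devices without AER lack these files.
            echo "$f: not available"
        fi
    done
}
```

Assuming the controller's sysfs entry links to its PCI device, a call could look like `print_aer_counters "$(readlink -f /sys/class/nvme/nvme1/device)"`.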

Votes: 0
Original page content provided by Unix & Linux.
Original link:

https://unix.stackexchange.com/questions/518643
