Monitoring for failed drives
One of the servers I've been setting up, which has an md RAID0 for temporary
storage, has just had a disk error.
root@storage2:~# ls -l /disk/scratch/scratch/path/to/file
ls: cannot access /disk/scratch/scratch/path/to/file/file.4000.new.1521.rsi: Remote I/O error
ls: cannot access /disk/scratch/scratch/path/to/file/file.4000.new.1522.rsi: Remote I/O error
ls: cannot access /disk/scratch/scratch/path/to/file/file.4000.new.1523.rsi: Remote I/O error
...
dmesg shows:
[ 1232.406491] mpt2sas1: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1232.406497] mpt2sas1: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1232.406512] sd 5:0:0:0: [sdr] Unhandled sense code
[ 1232.406514] sd 5:0:0:0: [sdr] Result: hostbyte=invalid driverbyte=DRIVER_SENSE
[ 1232.406518] sd 5:0:0:0: [sdr] Sense Key : Medium Error [current]
[ 1232.406522] Info fld=0x30000588
[ 1232.406524] sd 5:0:0:0: [sdr] Add. Sense: Unrecovered read error
[ 1232.406528] sd 5:0:0:0: [sdr] CDB: Read(10): 28 00 30 00 05 80 00 00 10 00
[ 1232.406537] end_request: critical target error, dev sdr, sector 805307776
OK, so that's fairly obviously a failed drive.
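The fields in the dmesg output above can be cross-checked against each other. The following is a minimal sketch, assuming the standard READ(10) layout (opcode 0x28, logical block address in bytes 2-5, transfer length in bytes 7-8); the function name decode_read10 is made up for illustration.

```python
def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes.

    Returns (starting_lba, number_of_blocks).
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, blocks


# CDB copied from the dmesg line for sdr
lba, blocks = decode_read10("28 00 30 00 05 80 00 00 10 00")
print(lba, blocks)
```

The decoded starting LBA equals the sector number in the end_request line (805307776), and the "Info fld" value 0x30000588 lies inside the 16-block range requested by that read, which is what one would expect if that field holds the address of the unreadable block.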
The problem is, how to detect and report this? At the md RAID level,
`cat /proc/mdstat` and `mdadm --detail` show nothing amiss.
# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid0 sdk[8] sdf[4] sdb[0] sdj[9] sdc[1] sde[2] sdd[3] sdi[6] sdg[5] sdh[7] sdv[20] sdw[21] sdl[11] sdu[19] sdt[18] sdn[13] sds[17] sdq[14] sdm[10] sdx[22] sdr[16] sdo[12] sdp[15] sdy[23]
70326362112 blocks super 1.2 512k chunks
unused devices: <none>
root@storage2:~# mdadm --detail /dev/md/scratch
/dev/md/scratch:
Version : 1.2
Creation Time : Mon Apr 23 16:53:59 2012
Raid Level : raid0
Array Size : 70326362112 (67068.45 GiB 72014.19 GB)
Raid Devices : 24
Total Devices : 24
Persistence : Superblock is persistent
Update Time : Mon Apr 23 16:53:59 2012
State : clean
Active Devices : 24
Working Devices : 24
Failed Devices : 0
Spare Devices : 0
Chunk Size : 512K
Name : storage2:scratch (local to host storage2)
UUID : e5d2dce6:91d1d3b9:ae08f838:5e12132a
Events : 0
Number Major Minor RaidDevice State
0 8 16 0 active sync /dev/sdb
1 8 32 1 active sync /dev/sdc
2 8 64 2 active sync /dev/sde
3 8 48 3 active sync /dev/sdd
4 8 80 4 active sync /dev/sdf
5 8 96 5 active sync /dev/sdg
6 8 128 6 active sync /dev/sdi
7 8 112 7 active sync /dev/sdh
8 8 160 8 active sync /dev/sdk
9 8 144 9 active sync /dev/sdj
10 8 192 10 active sync /dev/sdm
11 8 176 11 active sync /dev/sdl
12 8 224 12 active sync /dev/sdo
13 8 208 13 active sync /dev/sdn
14 65 0 14 active sync /dev/sdq
15 8 240 15 active sync /dev/sdp
16 65 16 16 active sync /dev/sdr
17 65 32 17 active sync /dev/sds
18 65 48 18 active sync /dev/sdt
19 65 64 19 active sync /dev/sdu
20 65 80 20 active sync /dev/sdv
21 65 96 21 active sync /dev/sdw
22 65 112 22 active sync /dev/sdx
23 65 128 23 active sync /dev/sdy
So the first question is this: what does it take for a drive to be marked as
"failed" by md RAID? Is there some threshold I can set?
Second question: what's a better way of monitoring this proactively, rather
than just waiting for applications to fail and then digging into dmesg?
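Digging into dmesg can at least be automated. The sketch below counts block-layer error lines per device; the regular expression is based only on the "end_request: ..., dev sdr, sector N" format shown above, and other kernel versions word this message differently, so the pattern is an assumption that needs adjusting. The function name count_end_request_errors is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# [ 1232.406537] end_request: critical target error, dev sdr, sector 805307776
END_REQUEST_RE = re.compile(
    r"end_request: (?P<kind>[^,]+), dev (?P<dev>\w+), sector (?P<sector>\d+)"
)


def count_end_request_errors(log_text):
    """Return a Counter mapping device name to number of end_request errors."""
    counts = Counter()
    for line in log_text.splitlines():
        match = END_REQUEST_RE.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts
```

A periodic job could pass the text printed by dmesg to this function and raise an alert whenever the result is non-empty.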
Recently I installed an excellent set of snmp plugins and MIBs for exposing
both md-raid and smartctl information via SNMP, which I got from
http://www.mad-hacking.net/software/index.xml
http://downloads.mad-hacking.net/software/
Here's the md RAID output (which really is just reformatting of info
from mdadm --detail)
root@storage2:~# snmptable -c XXXXXXXX -v 2c storage2 MD-RAID-MIB::mdRaidTable
SNMP table: MD-RAID-MIB::mdRaidTable
mdRaidArrayIndex mdRaidArrayDev mdRaidArrayVersion mdRaidArrayUUID mdRaidArrayLevel mdRaidArrayLayout mdRaidArrayChunkSize mdRaidArraySize mdRaidArrayDeviceSize mdRaidArrayHealthOK mdRaidArrayHasFailedComponents mdRaidArrayHasAvailableSpares mdRaidArrayTotalComponents mdRaidArrayActiveComponents mdRaidArrayWorkingComponents mdRaidArrayFailedComponents mdRaidArraySpareComponents
1 /dev/md/scratch 1.2 e5d2dce6:91d1d3b9:ae08f838:5e12132a raid0 N/A 512K 70326362112 N/A true false false 24 24 24 0 0
And here's the output for SMART (which combines smartctl -i, -H and -A):
root@storage2:~# snmptable -c XXXXXXXX -v 2c storage2 SMARTCTL-MIB::smartCtlTable
SNMP table: SMARTCTL-MIB::smartCtlTable
smartCtlDeviceIndex smartCtlDeviceDev smartCtlDeviceModelFamily smartCtlDeviceDeviceModel smartCtlDeviceSerialNumber smartCtlDeviceUserCapacity smartCtlDeviceATAVersion smartCtlDeviceHealthOK smartCtlDeviceTemperatureCelsius smartCtlDeviceReallocatedSectorCt smartCtlDeviceCurrentPendingSector smartCtlDeviceOfflineUncorrectable smartCtlDeviceUDMACRCErrorCount smartCtlDeviceReadErrorRate smartCtlDeviceSeekErrorRate smartCtlDeviceHardwareECCRecovered
1 /dev/sda ST1000DM003-9YN162 Z1D0BQHF 1,000,204,886,016 bytes [1.00 TB] 8 true 28 0 0 0 0 105 30 ?
2 /dev/sdb ST3000DM001-9YN166 S1F01Z36 3,000,592,982,016 bytes [3.00 TB] 8 true 28 0 0 0 0 105 31 ?
3 /dev/sdc ST3000DM001-9YN166 S1F01932 3,000,592,982,016 bytes [3.00 TB] 8 true 24 0 0 0 0 103 31 ?
4 /dev/sdd ST3000DM001-9YN166 S1F04Y7G 3,000,592,982,016 bytes [3.00 TB] 8 true 26 0 0 0 0 104 31 ?
5 /dev/sde ST3000DM001-9YN166 S1F00KF2 3,000,592,982,016 bytes [3.00 TB] 8 true 25 0 0 0 0 104 31 ?
6 /dev/sdf ST3000DM001-9YN166 S1F01C0D 3,000,592,982,016 bytes [3.00 TB] 8 true 27 0 0 0 0 103 31 ?
7 /dev/sdg ST3000DM001-9YN166 S1F01DFM 3,000,592,982,016 bytes [3.00 TB] 8 true 25 0 0 0 0 104 31 ?
8 /dev/sdh ST3000DM001-9YN166 S1F054EP 3,000,592,982,016 bytes [3.00 TB] 8 true 27 0 0 0 0 105 31 ?
9 /dev/sdi ST3000DM001-9YN166 S1F05304 3,000,592,982,016 bytes [3.00 TB] 8 true 25 0 0 0 0 105 31 ?
10 /dev/sdj ST3000DM001-9YN166 S1F015X5 3,000,592,982,016 bytes [3.00 TB] 8 true 25 0 0 0 0 105 31 ?
11 /dev/sdk ST3000DM001-9YN166 S1F046FB 3,000,592,982,016 bytes [3.00 TB] 8 true 27 0 0 0 0 103 31 ?
12 /dev/sdl ST3000DM001-9YN166 S1F024DW 3,000,592,982,016 bytes [3.00 TB] 8 true 26 0 0 0 0 103 31 ?
13 /dev/sdm ST3000DM001-9YN166 S1F04DKQ 3,000,592,982,016 bytes [3.00 TB] 8 true 25 0 0 0 0 104 31 ?
14 /dev/sdn ST3000DM001-9YN166 S1F014NH 3,000,592,982,016 bytes [3.00 TB] 8 true 25 0 0 0 0 104 31 ?
15 /dev/sdo ST3000DM001-9YN166 S1F049KM 3,000,592,982,016 bytes [3.00 TB] 8 true 26 0 0 0 0 105 31 ?
16 /dev/sdp ST3000DM001-9YN166 S1F01D5A 3,000,592,982,016 bytes [3.00 TB] 8 true 26 0 0 0 0 103 31 ?
17 /dev/sdq ST3000DM001-9YN166 S1F00L20 3,000,592,982,016 bytes [3.00 TB] 8 true 24 0 0 0 0 103 31 ?
18 /dev/sdr ST3000DM001-9YN166 S1F07PN8 3,000,592,982,016 bytes [3.00 TB] 8 true 28 0 8 8 0 81 31 ?
19 /dev/sds ST3000DM001-9YN166 S1F03PS8 3,000,592,982,016 bytes [3.00 TB] 8 true 25 0 0 0 0 104 31 ?
20 /dev/sdt ST3000DM001-9YN166 S1F04SM4 3,000,592,982,016 bytes [3.00 TB] 8 true 25 0 0 0 0 103 31 ?
21 /dev/sdu ST3000DM001-9YN166 S1F00MCQ 3,000,592,982,016 bytes [3.00 TB] 8 true 27 0 0 0 0 105 31 ?
22 /dev/sdv ST3000DM001-9YN166 S1F020YG 3,000,592,982,016 bytes [3.00 TB] 8 true 28 0 0 0 0 104 31 ?
23 /dev/sdw ST3000DM001-9YN166 S1F03NXP 3,000,592,982,016 bytes [3.00 TB] 8 true 26 0 0 0 0 103 31 ?
24 /dev/sdx ST3000DM001-9YN166 S1F054Y7 3,000,592,982,016 bytes [3.00 TB] 8 true 26 0 0 0 0 104 31 ?
25 /dev/sdy ST3000DM001-9YN166 S1F04A0Y 3,000,592,982,016 bytes [3.00 TB] 8 true 27 0 40 40 0 105 31 ?
All drives report smartCtlDeviceHealthOK = true, which is derived from the
"PASSED" result of smartctl -H:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0.0-16-server] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
The only anomaly I can see here is that sdr has reported 8 pending and 8
offline-uncorrectable sectors, and sdy has reported 40 of each!
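Since the overall health result says PASSED even for these two drives, a check on the raw sector counters is more useful than the health flag. A minimal sketch, assuming the usual ten-column attribute table printed by smartctl -A for ATA drives (ID#, name, flag, value, worst, thresh, type, updated, when_failed, raw value); the function name and the choice of watched attributes are illustrative assumptions.

```python
# Attribute names as printed by `smartctl -A` for ATA drives.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def nonzero_sector_counters(smartctl_a_output):
    """Return {attribute_name: raw_value} for watched attributes with raw value > 0."""
    found = {}
    for line in smartctl_a_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the 10th column is the raw value.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = fields[9]
            if raw.isdigit() and int(raw) > 0:
                found[fields[1]] = int(raw)
    return found
```

Any non-empty result would be grounds for an alert, regardless of what smartctl -H reports.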
So based on this information, I am going to return sdr and sdy to the
manufacturer for replacement.
But is there any better way that I can be notified quickly of I/O errors
and/or retries, for example counters being maintained in the kernel?
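One candidate is the per-device ioerr_cnt attribute that the Linux SCSI layer exposes in sysfs as a hexadecimal string. The sketch below reads it for every sd* device; whether a given medium error increments this counter on a particular kernel and driver is an assumption to verify, and the function name and the sys_block parameter are made up for illustration.

```python
import glob
import os


def read_scsi_error_counts(sys_block="/sys/block"):
    """Return {device_name: ioerr_cnt} for every sd* device under sys_block."""
    counts = {}
    pattern = os.path.join(sys_block, "sd*", "device", "ioerr_cnt")
    for path in glob.glob(pattern):
        # path looks like <sys_block>/sdr/device/ioerr_cnt
        dev = os.path.basename(os.path.dirname(os.path.dirname(path)))
        with open(path) as fh:
            # The kernel prints the counter in hex, e.g. "0x3"
            counts[dev] = int(fh.read().strip(), 16)
    return counts
```

Comparing successive readings and alerting on any increase would give notice of errors without waiting for an application to fail.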
Thanks,
Brian.