Re[2]: Raid 6 Fail Event
From: Justin Stephenson <hidden>
Date: 2014-11-17 01:34:38
Thank-you, Chris. I appreciate your help with this.
Backups are good. I'm a regular disk-to-disk-to-LTO guy. Here is what I
have turned up:
================================
# smartctl -x /dev/sdh
(A big long list of output; I found the serial number.)
I also tried smartctl -H /dev/sdh and received:
Overall-health self-assessment test result: PASSED
184 End-to-End_Error {flag value worst thresh} Old_age FAILING_NOW_6
I did not find anything for the serial in results from dmesg
# smartctl -l scterc /dev/sdh
Warning: device does not support SCT Commands
# cat /sys/block/sdh/device/state
Running
# cat /sys/block/sdh/device/timeout
30
================================
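The last three results are usually read together: a drive without SCT ERC can keep retrying a bad sector for longer than the kernel's command timeout (30 seconds here), after which the kernel resets the link and md may fail the device. Below is a minimal sketch of that comparison. The helper name `timeout_advice` and the 180-second default are assumptions for illustration (180 is a commonly suggested value, not something stated in this thread), and the function only reports; it does not write to sysfs.

```shell
#!/bin/sh
# Sketch only: timeout_advice is a hypothetical helper, not part of smartctl or mdadm.
#   $1: text printed by `smartctl -l scterc /dev/sdX`
#   $2: value read from /sys/block/sdX/device/timeout (seconds)
#   $3: timeout considered long enough (assumed default: 180 seconds)
timeout_advice() {
    scterc_out=$1
    timeout_s=$2
    wanted_s=${3:-180}
    case "$scterc_out" in
        *"does not support SCT"*)
            # The drive cannot cap its own error recovery time.
            if [ "$timeout_s" -lt "$wanted_s" ]; then
                echo "risk: no SCT ERC and kernel timeout ${timeout_s}s < ${wanted_s}s"
            else
                echo "ok: no SCT ERC, but kernel timeout is ${timeout_s}s"
            fi
            ;;
        *)
            # SCT was not reported as unsupported; inspect the ERC values by hand.
            echo "check: SCT not reported as unsupported, review the ERC read/write values"
            ;;
    esac
}

# Example use on the live system (read-only):
#   timeout_advice "$(smartctl -l scterc /dev/sdh)" "$(cat /sys/block/sdh/device/timeout)"
```

With the values above (no SCT support, timeout of 30), this sketch would print the "risk" line.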
Should I replace the drive, or re-add it and resync?
I also went through and reseated all the SATA and power connections as I
understand these can cause issues as well.
Best,
- J
------ Original Message ------
From: "Chris Murphy" <redacted>
To: "Justin Stephenson" <redacted>
Cc: linux-raid@vger.kernel.org
Sent: 16/11/2014 2:52:02 PM
Subject: Re: Raid 6 Fail Event
On Nov 16, 2014, at 8:39 AM, Justin Stephenson [off-list ref] wrote:

> Hello,
>
> I am new to MDADM and have just experienced my first device fail on my
> raid 6. I am wondering if someone might be able to help by outlining a
> proper protocol for troubleshooting and rebuilding this array
> (proc/mdstat below). Here is how I might approach it:
>
> - remove the device
> - test the device
> - if the device tests OK then re add the device
> - if the device fails, then replace the device
> - resync
>
> Thank-you for your consideration.
>
> Best,
> - Justin
>
> Here is the mdstat email
> -----------------
> This is an automatically generated mail message from mdadm running on
> BigBlue
>
> A Fail event had been detected on md device /dev/md0.
>
> It could be related to component device /dev/sdh1.

First step is getting the backup current. Second, you can do this without
removing the device:

# smartctl -x /dev/sdh

And then look in dmesg for errors related to its ata designation. You
should be able to get a serial number from the smartctl output and can
search for that with dmesg | grep <serial#> to find out what its ata
designation (port and device number) is; then you can dmesg | grep ataX.YY
to get any read/write error events that explain what's going on.

While you're at it, the following would be helpful as well:

# smartctl -l scterc /dev/sdh
# cat /sys/block/sdh/device/state
# cat /sys/block/sdh/device/timeout

These are read-only commands to determine states; they don't change
states, so they are safe.

Chris Murphy
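The lookup described above (serial number, then ata designation, then the matching dmesg lines) can be sketched in shell. This is a hedged illustration, not a definitive procedure: the helper names `serial_from_smartctl` and `ata_port_from_path` are hypothetical. The second helper is an alternative for cases where the serial does not show up in dmesg; it assumes the controller's libata driver exposes an `ataN` component in the resolved sysfs path of the disk, which may not hold on every system.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical, not part of smartctl or the kernel.

# Print the serial number from `smartctl -i` or `smartctl -x` text on stdin.
serial_from_smartctl() {
    sed -n 's/^Serial Number:[[:space:]]*//p' | head -n 1
}

# Print the first "ataN" component of a resolved sysfs path given as $1.
# Assumption: the driver places the ATA port in the device path.
# Prints nothing if no such component exists.
ata_port_from_path() {
    printf '%s\n' "$1" | tr '/' '\n' | grep -E '^ata[0-9]+$' | head -n 1
}

# Example use on the live system (all read-only):
#   smartctl -i /dev/sdh | serial_from_smartctl
#   port=$(ata_port_from_path "$(readlink -f /sys/block/sdh)")
#   dmesg | grep -E "${port}[.:]"
```

The final grep pattern matches both port-level lines (`ataN:`) and device-level lines (`ataN.YY:`), so link resets and read/write errors for that port appear together.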