Re: Raid 6 Fail Event

From: Chris Murphy <hidden>
Date: 2014-11-16 19:52:02

On Nov 16, 2014, at 8:39 AM, Justin Stephenson [off-list ref] wrote:

Hello,

I am new to MDADM and have just experienced my first device fail on my raid 6.

I am wondering if someone might be able to help by outlining a proper protocol for troubleshooting and rebuilding this array (proc/mdstat below).

Here is how I might approach it:

- remove the device
- test the device
- if the device tests OK then re add the device
- if the device fails, then replace the device
- resync

Thank-you for your consideration.

Best,

- Justin

Here is the mdstat email

-----------------

This is an automatically generated mail message from mdadm
running on BigBlue

A Fail event had been detected on md device /dev/md0.

It could be related to component device /dev/sdh1.

The first step is to get your backup current.

Second, you can test the device without removing it from the array:

# smartctl -x /dev/sdh

Then look in dmesg for errors related to its ata designation. The smartctl output includes a serial number; search for it with dmesg | grep <serial#> to find the device’s ata designation (port and device number), then run dmesg | grep ataX.YY to see any read/write error events that explain what’s going on.
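The lookup described above could be sketched roughly as follows. This is only an illustration: the helper names serial_from_smartctl and ata_ids_from_log are made up for this example, and it assumes that smartctl prints a "Serial Number:" line and that your kernel log actually mentions the serial (not every system logs it, so the grep may come back empty).

```shell
#!/bin/sh
# Hypothetical helpers, not part of mdadm or smartmontools.

# Print the value of the "Serial Number:" line from smartctl output on stdin.
serial_from_smartctl() {
    awk -F': *' '/^Serial Number:/ { print $2; exit }'
}

# Print the distinct ataX.YY designations found in log lines on stdin.
ata_ids_from_log() {
    grep -o 'ata[0-9][0-9]*\.[0-9][0-9]*' | sort -u
}

# Usage as root, with /dev/sdh taken from the Fail event mail:
#   serial=$(smartctl -i /dev/sdh | serial_from_smartctl)
#   dmesg | grep "$serial" | ata_ids_from_log
#   dmesg | grep 'ataX.YY'    # substitute the designation printed above
```

Both helpers only filter text, so they can be tried on saved smartctl or dmesg output before being pointed at the live system.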

While you’re at it, the following would be helpful as well:

# smartctl -l scterc /dev/sdh
# cat /sys/block/sdh/device/state
# cat /sys/block/sdh/device/timeout

These are read-only commands that report state; they don’t change anything, so they are safe to run.
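If it helps when posting results to the list, the two sysfs reads can be wrapped in a small function. This is a sketch only: report_disk_state and the SYSFS_ROOT override are names invented for this example, and the function does nothing beyond cat on the same two files shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the SCSI device state and command timeout
# for one disk, reading the same sysfs files as the commands above.
report_disk_state() {
    disk=$1                       # kernel name without /dev/, e.g. sdh
    sysfs=${SYSFS_ROOT:-/sys}     # assumed override, only for dry runs
    for attr in state timeout; do
        printf '%s: %s\n' "$attr" "$(cat "$sysfs/block/$disk/device/$attr")"
    done
}

# Usage:
#   report_disk_state sdh
#   smartctl -l scterc /dev/sdh   # run separately, as root
```

Like the commands it wraps, the function only reads files and leaves the device state untouched.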

Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html