Re: RAID6 dead on the water after Controller failure

From: Phil Turmel <hidden>
Date: 2014-02-15 15:12:49

Good morning Florian,

On 02/15/2014 07:31 AM, Florian Lampel wrote:

Greetings,

first of all - thanks to Phil Turmel for pointing me in the right direction. I checked all the cables and true enough, the System SSD's cable's shielding was halfway peeled off.

Very good.

Anyway, the current state is as follows:

*) The missing HDDs came up right after the reboot, and I had to use the "bootdegraded=true" kernel option.
*) All 12 drives are functional.

Here is a link to the requested output of

--- mdadm -E /dev/sd[abcd]1 ---
--- for x in /dev/sd[a-z] ; do echo $x : ; smartctl -x $x ; done ----

as well as

---- mdadm --examine /dev/sd[abcdefghijklmnop]1 ------

Link:
h__p://pastebin.com/v6yzn3KX

Device order has changed, summary:

/dev/sda1: WD-WMC300595440 Device #4 @442
/dev/sdb1: WD-WMC300595880 Device #5 @442
/dev/sdc1: WD-WMC1T1521826 Device #6 @442
/dev/sdd1: WD-WMC300314126 spare
/dev/sde1: WD-WMC300595645 Device #8 @435
/dev/sdf1: WD-WMC300314217 Device #9 @435
/dev/sdg1: WD-WMC300595957 Device #10 @435
/dev/sdh1: WD-WMC300313432 Device #11 @435
/dev/sdj1: WD-WMC300312702 Device #0 @442
/dev/sdk1: WD-WMC300248734 Device #1 @442
/dev/sdl1: WD-WMC300314248 Device #2 @442
/dev/sdm1: WD-WMC300585843 Device #3 @442

and your SSD is now /dev/sdi.
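
The event counts in the summary above can be pulled out of the "mdadm --examine" output with a short filter. This is only a sketch: the summarize_events helper name is made up for illustration, and it assumes each device's section starts with a "/dev/...:" line and contains an "Events :" line, as is usual with version 1.x metadata.

```shell
# Hypothetical helper: reads `mdadm --examine` text on stdin and prints
# one "device events" pair per member.
summarize_events() {
    awk '
        /^\/dev\// { dev = $1; sub(/:$/, "", dev) }   # remember the current device
        $1 == "Events" { print dev, $NF }              # last field is the event count
    '
}

# Usage (as root), with the device letters from the summary above:
# mdadm --examine /dev/sd[abcdefghjklm]1 | summarize_events
```

Members whose counts are close together, as here (442 vs. 435), are the usual candidates for a forced assembly.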

My findings:
The Event count does differ, but not by much. As my next step, I would follow Phil Turmel's advice and reassemble the Array using the --force option, to be precise:

mdadm -Afv /dev/md0 /dev/sd[abcdefgjklm]1

Not quite.  What was 'h' is now 'd'.  Use:

mdadm -Afv /dev/md0 /dev/sd[abcefghjklm]1

Could you please advise me whether this next step is all right to do now that we have new logs etc.?

Yes.  You may also need "mdadm --stop /dev/md0" first if your boot
process partially assembled the array already.

After assembly, your array will be single-degraded but fully functional.
That would be a good time to back up any critical data that isn't
already in a backup.

Then you can add /dev/sdd1 back into the array and let it rebuild.
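
Put together, the steps above might look like the sketch below. The run wrapper, the DRY_RUN switch and the function names are illustrative additions, not part of mdadm; by default the commands are only printed so they can be reviewed before anything touches the array.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints each command.
# Set DRY_RUN=0 and run as root to actually execute them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_md0() {
    # Stop a partially assembled array left over from boot (may fail
    # harmlessly if nothing was assembled).
    run mdadm --stop /dev/md0
    # -A assemble, -f force, -v verbose; sdd1 (the spare) and sdi
    # (the SSD) are deliberately left out.
    run mdadm -Afv /dev/md0 /dev/sd[abcefghjklm]1
    # Confirm the array is up (degraded) before doing anything else.
    run cat /proc/mdstat
}

readd_spare() {
    # Only after critical data has been backed up: add the remaining
    # drive and let the array rebuild.
    run mdadm /dev/md0 --add /dev/sdd1
}
```

Call recover_md0 first, take the backup, then call readd_spare; rebuild progress shows up in /proc/mdstat.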

Thanks in advance,
Florian Lampel

PS: Thanks again to Phil for pointing out that --create would be madness.

One more thing:  your drives report never having a self-test run.  You
should have a cron job that triggers a long background self-test on a
regular basis.  Weekly, perhaps.
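
A minimal sketch of such a job follows. The function name, the SMARTCTL override, the script path and the schedule are all assumptions made for illustration; "smartctl -t long" is the standard way to start an extended background self-test.

```shell
#!/bin/sh
# Start a long (extended) background self-test on every drive passed
# as an argument. SMARTCTL can be overridden, e.g. for a dry run.
start_long_tests() {
    for dev in "$@"; do
        echo "$dev:"
        "${SMARTCTL:-smartctl}" -t long "$dev"
    done
}

# Example call covering the twelve array drives (sdi is the SSD):
# start_long_tests /dev/sd[a-h] /dev/sd[j-m]
#
# Example weekly crontab entry (hypothetical path, Saturdays at 02:00):
# 0 2 * * 6  /usr/local/sbin/smart-long-test.sh
```

The test runs inside the drive itself; the result can be read later with "smartctl -l selftest" on the device.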

Similarly, you should have a cron job trigger an occasional "check"
scrub on the array, too.  Not at the same time as the self-tests,
though.  (I understand some distributions have this already.)
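
For distributions that do not ship a scrub job, something along these lines could be run from cron. It is a sketch only: the start_check name and the MD_SYSFS override are invented for illustration, and it relies on the md driver's sysfs file sync_action.

```shell
#!/bin/sh
# Start a "check" scrub on an md array, but only if it is currently idle
# (i.e. not already resyncing, rebuilding or checking).
start_check() {
    md=${1:-md0}
    sysfs=${MD_SYSFS:-/sys/block}/$md/md
    if [ "$(cat "$sysfs/sync_action")" = "idle" ]; then
        echo check > "$sysfs/sync_action"
    else
        echo "$md is busy; not starting a check" >&2
        return 1
    fi
}

# Example monthly crontab entry (hypothetical path, first day of the
# month at 03:00, away from the self-test schedule):
# 0 3 1 * *  /usr/local/sbin/md-check.sh
```

Progress appears in /proc/mdstat, and the mismatch_cnt file in the same sysfs directory reports what the check found.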

HTH,

Phil