Thread: 12 messages, 4 authors, 2014-02-16

Re: RAID6 dead on the water after Controller failure

From: Phil Turmel <hidden>
Date: 2014-02-15 15:12:49

Good morning Florian,

On 02/15/2014 07:31 AM, Florian Lampel wrote:
Greetings,

first of all - thanks to Phil Turmel for pointing me in the right direction. I checked all the cables and true enough, the System SSD's cable's shielding was halfway peeled off.
Very good.
Anyway, the current state is as follows:

*) The missing HDDs came up right after the reboot, and I had to use the "bootdegraded=true" kernel option.
*) All 12 drives are functional.

Here is a link to the requested output of 
--- mdadm -E /dev/sd[abcd]1 ---
--- for x in /dev/sd[a-z] ; do echo $x : ; smartctl -x $x ; done ----
as well as

---- mdadm --examine /dev/sd[abcdefghijklmnop]1 ------

Link:
h__p://pastebin.com/v6yzn3KX
Device order has changed, summary:

/dev/sda1: WD-WMC300595440 Device #4 @442
/dev/sdb1: WD-WMC300595880 Device #5 @442
/dev/sdc1: WD-WMC1T1521826 Device #6 @442
/dev/sdd1: WD-WMC300314126 spare
/dev/sde1: WD-WMC300595645 Device #8 @435
/dev/sdf1: WD-WMC300314217 Device #9 @435
/dev/sdg1: WD-WMC300595957 Device #10 @435
/dev/sdh1: WD-WMC300313432 Device #11 @435
/dev/sdj1: WD-WMC300312702 Device #0 @442
/dev/sdk1: WD-WMC300248734 Device #1 @442
/dev/sdl1: WD-WMC300314248 Device #2 @442
/dev/sdm1: WD-WMC300585843 Device #3 @442

and your SSD is now /dev/sdi.
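
A summary like the one above can be pulled out of the "mdadm --examine" output mechanically. What follows is only a minimal sketch, assuming version 1.x metadata, where each member prints an "Events :" line and a "Device Role :" line. The function name summarize_examine is made up for illustration and is not part of mdadm.

```sh
#!/bin/sh
# Reads "mdadm --examine" output on stdin and prints one line per member:
#   device <TAB> role <TAB> @event-count
# Assumes v1.x metadata field names ("Events", "Device Role").
summarize_examine() {
    awk '
        /^\/dev\//      { dev = $1; sub(/:$/, "", dev); order[++n] = dev }
        /Events :/      { events[dev] = $NF }
        /Device Role :/ { sub(/.*Device Role : */, ""); role[dev] = $0 }
        END {
            for (i = 1; i <= n; i++)
                printf "%s\t%s\t@%s\n", order[i], role[order[i]], events[order[i]]
        }
    '
}

# Example use (run as root):
#   mdadm --examine /dev/sd[a-m]1 | summarize_examine
```

Members whose event counts lag the rest are the ones that dropped out first, which is what decides whether a forced assembly is reasonable.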
My findings:
The Event count does differ, but not by much. As my next step, I would follow Phil Turmel's advice and reassemble the Array using the --force option, to be precise:

mdadm -Afv /dev/md0 /dev/sd[abcdefgjklm]1
Not quite.  What was 'h' is now 'd'.  Use:

mdadm -Afv /dev/md0 /dev/sd[abcefghjklm]1
Could you please advise me whether this next step is all right to do now that we have new logs etc.?
Yes.  You may also need "mdadm --stop /dev/md0" first if your boot
process partially assembled the array already.

After assembly, your array will be single-degraded but fully functional.
That would be a good time to back up any critical data that isn't
already in a backup.

Then you can add /dev/sdd1 back into the array and let it rebuild.
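
Put together, the sequence might look like the sketch below. It uses the device names from the summary in this message, which can change again on the next reboot, so re-check them first. The run wrapper and the two function names are hypothetical helpers added for illustration: by default they only print what would be executed, and they run the commands for real only when RUN=1 is set.

```sh
#!/bin/sh
# Dry-run wrapper: prints each command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Step 1: stop any partial assembly, then force-assemble the eleven members.
# sdd1 (now a spare) and sdi (the SSD) are deliberately left out.
assemble_degraded() {
    run mdadm --stop /dev/md0
    run mdadm -Afv /dev/md0 /dev/sd[abcefghjklm]1
    run cat /proc/mdstat
}

# Step 2, only after critical data has been backed up:
# add the remaining drive back and let the array rebuild.
readd_spare() {
    run mdadm /dev/md0 --add /dev/sdd1
    run cat /proc/mdstat
}

# Example use (run as root):
#   assemble_degraded          # review the printed commands
#   RUN=1 assemble_degraded    # actually assemble
#   RUN=1 readd_spare          # later, once the backup is done
```

If the array was not partially assembled at boot, the --stop step simply reports an error and can be ignored.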
Thanks in advance,
Florian Lampel

PS: Thanks again to Phil for pointing out that --create would be madness.
One more thing:  your drives report never having a self-test run.  You
should have a cron job that triggers a long background self-test on a
regular basis.  Weekly, perhaps.

Similarly, you should have a cron job trigger an occasional "check"
scrub on the array, too.  Not at the same time as the self-tests,
though.  (I understand some distributions have this already.)
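
As a rough sketch of what those two jobs could call: the first function starts a long background self-test with "smartctl -t long", and the second requests a scrub by writing "check" to the array's sync_action file in sysfs. The function names and the SMARTCTL and SYSFS_ROOT override variables are assumptions added here so the script can be tried without touching real hardware. Scheduling (which day, which hour) is left to your crontab.

```sh
#!/bin/sh
# Start a long (extended) background self-test on each drive given.
# SMARTCTL is a hypothetical override, e.g. SMARTCTL=echo for a dry run.
start_long_selftests() {
    for dev in "$@"; do
        "${SMARTCTL:-smartctl}" -t long "$dev"
    done
}

# Ask the md driver to run a "check" scrub on the named array (e.g. md0).
# SYSFS_ROOT is a hypothetical override that defaults to /sys.
start_md_check() {
    echo check > "${SYSFS_ROOT:-/sys}/block/$1/md/sync_action"
}

# Example use from two separate cron jobs on different days (run as root):
#   start_long_selftests /dev/sd[a-h] /dev/sd[j-m]
#   start_md_check md0
```

Scrub progress shows up in /proc/mdstat, and self-test results appear in the "smartctl -x" output for each drive once the test completes.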

HTH,

Phil