Re: RAID6 dead on the water after Controller failure
From: Florian Lampel <hidden>
Date: 2014-02-15 18:52:27
On 15.02.2014 at 16:12, Phil Turmel [off-list ref] wrote:
Good morning Florian,
Good Evening - it's 19:37 here in Austria.
Device order has changed, summary:
/dev/sda1: WD-WMC300595440 Device #4 @442
/dev/sdb1: WD-WMC300595880 Device #5 @442
/dev/sdc1: WD-WMC1T1521826 Device #6 @442
/dev/sdd1: WD-WMC300314126 spare
/dev/sde1: WD-WMC300595645 Device #8 @435
/dev/sdf1: WD-WMC300314217 Device #9 @435
/dev/sdg1: WD-WMC300595957 Device #10 @435
/dev/sdh1: WD-WMC300313432 Device #11 @435
/dev/sdj1: WD-WMC300312702 Device #0 @442
/dev/sdk1: WD-WMC300248734 Device #1 @442
/dev/sdl1: WD-WMC300314248 Device #2 @442
/dev/sdm1: WD-WMC300585843 Device #3 @442
and your SSD is now /dev/sdi.
Thank you again for going through all those logs and helping me.
Not quite. What was 'h' is now 'd'. Use:
mdadm -Afv /dev/md0 /dev/sd[abcefghjklm]1
Well, that did not go as well as I had hoped. Here is what happened:
root@Lserve:~# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
root@Lserve:~# mdadm -Afv /dev/md0 /dev/sd[abcefghjklm]1
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 4.
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 5.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 6.
mdadm: /dev/sde1 is identified as a member of /dev/md0, slot 8.
mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 9.
mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 10.
mdadm: /dev/sdh1 is identified as a member of /dev/md0, slot 11.
mdadm: /dev/sdj1 is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdk1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdl1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdm1 is identified as a member of /dev/md0, slot 3.
mdadm: forcing event count in /dev/sde1(8) from 435 upto 442
mdadm: forcing event count in /dev/sdf1(9) from 435 upto 442
mdadm: forcing event count in /dev/sdg1(10) from 435 upto 442
mdadm: forcing event count in /dev/sdh1(11) from 435 upto 442
mdadm: clearing FAULTY flag for device 3 in /dev/md0 for /dev/sde1
mdadm: clearing FAULTY flag for device 4 in /dev/md0 for /dev/sdf1
mdadm: clearing FAULTY flag for device 5 in /dev/md0 for /dev/sdg1
mdadm: clearing FAULTY flag for device 6 in /dev/md0 for /dev/sdh1
mdadm: Marking array /dev/md0 as 'clean'
mdadm: added /dev/sdk1 to /dev/md0 as 1
mdadm: added /dev/sdl1 to /dev/md0 as 2
mdadm: added /dev/sdm1 to /dev/md0 as 3
mdadm: added /dev/sda1 to /dev/md0 as 4
mdadm: added /dev/sdb1 to /dev/md0 as 5
mdadm: added /dev/sdc1 to /dev/md0 as 6
mdadm: no uptodate device for slot 7 of /dev/md0
mdadm: added /dev/sde1 to /dev/md0 as 8
mdadm: added /dev/sdf1 to /dev/md0 as 9
mdadm: added /dev/sdg1 to /dev/md0 as 10
mdadm: added /dev/sdh1 to /dev/md0 as 11
mdadm: added /dev/sdj1 to /dev/md0 as 0
mdadm: /dev/md0 assembled from 11 drives - not enough to start the array.
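For a quick cross-check of assembly output like the above, the slots mdadm actually filled can be compared against the expected member count. This is only a rough sketch: the helper name `missing_slots` and the saved log file name are hypothetical, the member count of 12 is taken from the slot numbers 0 to 11 in the log, and the function only parses text, it never touches the array.

```shell
#!/bin/sh
# missing_slots N: read `mdadm -Afv` output on stdin and print every slot
# number in 0..N-1 for which no "added ... as <slot>" line was seen.
# Hypothetical helper for illustration; it only parses text.
missing_slots() {
    awk -v n="$1" '
        # Lines look like: mdadm: added /dev/sdk1 to /dev/md0 as 1
        /^mdadm: added .* as [0-9]+$/ { seen[$NF] = 1 }
        END { for (i = 0; i < n; i++) if (!(i in seen)) print i }
    '
}

# Usage (assumes the assembly output was saved to assemble.log):
#   missing_slots 12 < assemble.log
```

Run against the log above, this should report only slot 7, matching the "no uptodate device for slot 7" message.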
And here is /proc/mdstat:
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sdj1[0](S) sdh1[11](S) sdg1[10](S) sdf1[9](S) sde1[8](S) sdc1[6](S) sdb1[5](S) sda1[4](S) sdm1[3](S) sdl1[2](S) sdk1[1](S)
21488646696 blocks super 1.0
unused devices: <none>
Seems like every HDD got marked as a spare. Why would mdadm do this, and how can I convince mdadm that they are not spares?
That would be a good time to back up any critical data that isn't already in a backup.
CrashPlan had uploaded about 30% before it happened. 20 TB is a lot to upload.
One more thing: your drives report never having a self-test run. You should have a cron job that triggers a long background self-test on a regular basis. Weekly, perhaps. Similarly, you should have a cron job trigger an occasional "check" scrub on the array, too. Not at the same time as the self-tests, though. (I understand some distributions have this already.)
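As a rough illustration of the two jobs described above, a small maintenance script could look like the following. Everything specific here is an assumption for the sake of the example: the script path, the device names, the array name md0 and the schedule are placeholders, and drives behind some controllers may need an extra `-d` type option for smartctl. The scrub is requested by writing `check` to the array's `sync_action` file in sysfs.

```shell
#!/bin/sh
# Sketch of a maintenance script; names, paths and schedule are assumptions.

# Start a long (extended) SMART self-test on each given drive.
# The test runs inside the drive in the background.
start_selftests() {
    for dev in "$@"; do
        smartctl -t long "$dev"
    done
}

# Request a read-only "check" scrub of an md array.
# $1 = array name (default md0), $2 = optional override of the sysfs file.
start_scrub() {
    md="${1:-md0}"
    action="${2:-/sys/block/$md/md/sync_action}"
    echo check > "$action"
}

# Dispatch on the first argument so cron can call either job.
case "${1:-}" in
    selftest) shift; start_selftests "$@" ;;
    scrub)    shift; start_scrub "$@" ;;
esac

# Example root crontab entries (hypothetical path and schedule):
# long self-tests every Saturday at 02:00, scrub on the first Wednesday
# of the month at 02:00, so the two do not start on the same day.
#   0 2 * * 6    /usr/local/sbin/md-maint.sh selftest /dev/sda /dev/sdb
#   0 2 1-7 * *  [ "$(date +\%u)" -eq 3 ] && /usr/local/sbin/md-maint.sh scrub md0
```

Scrub progress shows up in /proc/mdstat while it runs. Distributions that already ship a periodic check job would make the scrub entry redundant.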
I will certainly do so in the future.
Thanks again everyone, and I hope this will all end well.
Thanks,
Florian Lampel