Re: RAID5 with 2 drive failure at the same time
From: Robin Hill <hidden>
Date: 2013-01-31 11:38:20
On Thu Jan 31, 2013 at 11:42:54 +0100, Christoph Nelles wrote:
Hi, I hope somebody on this ML can help me. My RAID5 died last night during a rebuild when two drives failed (looks like a sata_mv problem). The RAID5 was rebuilding because one of the two drives had failed earlier; after running badblocks on it for 2 days, I re-added it to the RAID.
Probably only one drive failed. If the rebuild was incomplete then a single drive failure would cause the array to fail. Can you post the errors? If the issue was a read failure then you'll need to fix that before the array can be recovered properly.
The drives used are /dev/sdb1 through /dev/sdj1 (9 drives, RAID5); the failed drives are sdj1 and sdg1.
You also seriously need to look at moving to RAID6. Using RAID5 for a 9-drive array is not a good idea, and with 3TB drives it's absolutely crazy. The odds of a single read error out of the 24TB that needs to be read to recover a drive are not insignificant.
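To put a rough number on that risk, the sketch below estimates the chance of hitting at least one unrecoverable read error (URE) while reading 24TB. The rate of one URE per 1e14 bits is an assumption (a figure commonly quoted on consumer drive datasheets, not something stated in this thread), and treating bit errors as independent is a simplification, so read the result as an order-of-magnitude illustration only.

```python
import math

def prob_read_error(bytes_to_read, ure_per_bit=1e-14):
    """Chance of at least one URE, assuming independent errors per bit."""
    bits = bytes_to_read * 8
    # P(no error) = (1 - p)^bits; log1p/expm1 keep precision for tiny p.
    return -math.expm1(bits * math.log1p(-ure_per_bit))

# 8 surviving 3TB drives = 24TB to read (decimal terabytes assumed).
for rate in (1e-14, 1e-15):
    print(f"URE rate {rate:g}/bit: {prob_read_error(24e12, rate):.0%}")
```

Under those assumptions the two rates come out at roughly 85% and 17%, which is why a second parity drive matters at this array size.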
The current situation is that I cannot start the RAID. I wanted to try
re-adding one of the drives, so I removed it beforehand, making it a
spare :\ The layout is as follows:
    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       0        0        1      removed
       2       8      113        2      active sync   /dev/sdh1
       3       8       49        3      active sync   /dev/sdd1
       4       8      129        4      active sync   /dev/sdi1
       5       0        0        5      removed
       6       8       17        6      active sync   /dev/sdb1
       7       8       81        7      active sync   /dev/sdf1
       8       8       65        8      active sync   /dev/sde1
Re-adding fails with a simple message:
# mdadm -v /dev/md0 --re-add /dev/sdg1
mdadm: --re-add for /dev/sdg1 to /dev/md0 is not possible
I tried re-adding both failed drives at the same time, with the same result.
That's good anyway - it prevented the loss of the existing metadata, which would definitely have reduced your chances of recovery.
When examining the drives, sdj1 has the information from before the crash:
Device Role : Active device 5
Array State : AAAAAAAAA ('A' == active, '.' == missing)
sdg1 looks like this
Device Role : spare
Array State : A.AAA.AAA ('A' == active, '.' == missing)
The others look like
Device Role : Active device 6
Array State : A.AAA.AAA ('A' == active, '.' == missing)
From the looks of it, sdg1 was the drive you were originally adding back into the array, and sdj1 is the drive that failed part-way through the rebuild?
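The Array State strings quoted above can be read mechanically: each character is one RaidDevice slot, counted from 0. A minimal sketch follows, where missing_slots is a hypothetical helper name and not part of mdadm.

```python
def missing_slots(array_state):
    """Return RaidDevice numbers marked '.' (missing) in an Array State string."""
    return [i for i, c in enumerate(array_state) if c == "."]

print(missing_slots("A.AAA.AAA"))  # [1, 5]
print(missing_slots("AAAAAAAAA"))  # []
```

Slots 1 and 5 match the two "removed" entries in the layout listing, and sdj1 still records itself as active device 5 in its older superblock.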
So it looks like my repair attempts made sdg1 a spare :\ I attached the full output to this mail. Is there any way to restart the RAID from the information contained in drive sdj1? Perhaps via an incremental build starting from one drive? Could that work? If the RAID hadn't been rebuilding before the crash, I would just recreate it with --assume-clean.
The first thing to try should _always_ be a forced assemble. Recreating
the array is very much a last-ditch move and should never be attempted
before asking the list for help (any mismatch in your create command, or
in the mdadm/kernel versions could cause data corruption). Stop the
array, then reassemble with the --force flag. It'll probably restart
with sdj1 added back into the array, and you can then add sdg1 back in
again and restart the rebuild.
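The sequence described above might look like the following dry run, which only prints the commands and does not execute anything. The array name /dev/md0 and the device names come from this thread. Leaving sdg1 out of the assemble and adding it back afterwards with --add is one reading of the advice, not a verified procedure, and recovery_commands is a hypothetical helper.

```python
import shlex

def recovery_commands(array, members, stale):
    """Build (not run) the stop / forced-assemble / add command lines."""
    current = [d for d in members if d != stale]
    return [
        ["mdadm", "--stop", array],
        ["mdadm", "--assemble", "--force", array, *current],
        # Only once the array is running again: add the stale drive back.
        ["mdadm", array, "--add", stale],
    ]

# /dev/sdb1 through /dev/sdj1, as listed in the original post.
members = [f"/dev/sd{c}1" for c in "bcdefghij"]
for cmd in recovery_commands("/dev/md0", members, "/dev/sdg1"):
    print(shlex.join(cmd))
```

Check the state of the array in /proc/mdstat after the forced assemble before issuing the final command.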
Cheers,
Robin
--
___
( ' } | Robin Hill [off-list ref] |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |