Re: RAID5 with 2 drive failure at the same time

From: Robin Hill <hidden>
Date: 2013-01-31 13:45:31

On Thu Jan 31, 2013 at 02:15:00PM +0100, Christoph Nelles wrote:

Hello Robin,

thanks for the answers :)

On 31.01.2013 12:38, Robin Hill wrote:

quoted

Probably only one drive failed. If the rebuild was incomplete then a
single drive failure would cause the array to fail. Can you post the
errors? If the issue was a read failure then you'll need to fix that
before the array can be recovered properly.

All drives are available again, and the second failed device reports
UREs. I will run badblocks on that device before continuing.
I attached the kernel logs of the first error and of the second error. I
hope I filtered them reasonably.

Okay, those show that sdj had a read error during the rebuild. That
would have kicked the drive and failed the rebuild (and the array).

Your earlier error with sdg is a different issue. It looks to have timed
out on a write and then errored again when resetting the drive.

If you're using standard desktop drives then you may be running into
issues with the drive timeout being longer than the kernel's. You need
to reset one or the other to ensure that the drive times out (and is
available for subsequent commands) before the kernel does. Most current
consumer drives don't allow resetting the timeout, but it's worth trying
that first before changing the kernel timeout. For each
drive, do:
    smartctl -l scterc,70,70 /dev/sdX \
        || echo 180 > /sys/block/sdX/device/timeout

That'll need to be run on every boot (or whenever a drive is
hot-plugged).
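
Because the setting is lost on reboot, one option is to wrap the command
above in a small function and call it from a boot script. The following
is only a minimal sketch: the function name and the SMARTCTL and
SYSFS_ROOT variables are hypothetical (they exist here just so the logic
can be tried without touching real drives), and it assumes, as the
one-liner above does, that smartctl exits non-zero when the drive
rejects the scterc setting.

```shell
#!/bin/sh
# Sketch only: set_drive_timeout, SMARTCTL and SYSFS_ROOT are
# illustrative names, not part of smartctl or the kernel.

set_drive_timeout() {
    dev=$1                                  # bare name, e.g. sdg
    smartctl_bin=${SMARTCTL:-smartctl}
    sysfs_root=${SYSFS_ROOT:-/sys}

    # scterc values are in units of 100 ms, so 70 means 7.0 seconds.
    if "$smartctl_bin" -l scterc,70,70 "/dev/$dev" >/dev/null 2>&1; then
        echo "$dev: SCT ERC set to 7.0s"
    else
        # Drive refused the setting: raise the kernel timeout instead.
        echo 180 > "$sysfs_root/block/$dev/device/timeout"
        echo "$dev: kernel timeout set to 180s"
    fi
}

# Example call from a boot script (run as root, adjust the drive list):
#   for d in sdb sdc sdd sde sdf sdg sdh sdi sdj; do
#       set_drive_timeout "$d"
#   done
```

Trying the drive-side setting first keeps error recovery short where the
drive supports it; the 180-second kernel timeout is only the fallback.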

quoted

When examining the drives, sdj1 has the information from before the crash:
   Device Role : Active device 5
   Array State : AAAAAAAAA ('A' == active, '.' == missing)

sdg1 looks like this
   Device Role : spare
   Array State : A.AAA.AAA ('A' == active, '.' == missing)

The others look like
   Device Role : Active device 6
   Array State : A.AAA.AAA ('A' == active, '.' == missing)

From the looks of it, sdg1 was the drive you were originally adding back
into the array, and sdj1 is the drive that failed part-way through the
rebuild?

Exactly. I am running badblocks on that device. SMART reports one
"Pending Sector Count" :(

That means you'll end up with some corruption. Whether that affects any
data or not will depend on exactly where it is.

quoted

So it looks like my repair attempts made sdg1 a spare :\ I attached the
full output to this mail.

Is there any way to restart the RAID from the information contained in
drive sdj1? Perhaps via Incremental Build starting from one drive? Could
that work? If the RAID hadn't been rebuilding before the crash, I
would just recreate it with --assume-clean.

The first thing to try should _always_ be a forced assemble. Recreating
the array is very much a last-ditch move and should never be attempted
before asking the list for help (any mismatch in your create command, or
in the mdadm/kernel versions could cause data corruption). Stop the
array, then reassemble with the --force flag. It'll probably restart
with sdj1 added back into the array, and you can then add sdg1 back in
again and restart the rebuild.

So
# mdadm -A /dev/md0 -f /dev/sdc1 /dev/sdg1 /dev/sdh1 /dev/sdd1 \
/dev/sdi1 /dev/sdj1 /dev/sdb1 /dev/sdf1 /dev/sde1

should work? That would be a really simple solution :)


On sdj1 there is still a superblock from before the crash, while the
others have newer, updated superblocks. Is there any way to say that
the RAID should be assembled with the older information from this
particular superblock?

That'll be done automatically - mdadm looks at the event counters for
all the disks and assembles the array using the best set (if possible).
As sdj failed during the rebuild, taking down the array, there shouldn't
be any issues with doing this.
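
To see what mdadm will be comparing, the event counter of each member
can be pulled out of the `mdadm --examine` output. A minimal sketch,
assuming the examine output starts each device section with a
"/dev/...:" line and contains an "Events : N" line (as v1.x superblocks
print it); the function name event_counts is made up for illustration.

```shell
#!/bin/sh
# Sketch only: event_counts is an illustrative helper name.
# Reads `mdadm --examine` output on stdin and prints "device events".

event_counts() {
    awk '
        /^\/dev\// { sub(/:$/, "", $1); dev = $1 }   # section header
        $1 == "Events" { print dev, $NF }            # "Events : N"
    '
}

# Example (run as root):
#   mdadm --examine /dev/sd[b-j]1 | event_counts
```

A member whose count lags the others (here that would be expected for
sdj1) is the one a forced assemble has to bring back in.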

However, given that you have unreadable blocks on sdj, you'll need to
sort that out first (or you'll never be able to complete the rebuild).
Use ddrescue to copy the whole of sdj onto sdg (barring the unreadable
blocks). You can then force assemble the array with sdg1 standing in
for sdj1:

    mdadm -A /dev/md0 -f /dev/sdc1 /dev/sdg1 /dev/sdh1 /dev/sdd1 \
        /dev/sdi1 /dev/sdb1 /dev/sdf1 /dev/sde1

If that starts up okay then you can add sdj1 back into the array. You'll
need to run a fsck on the array afterwards to pick up what corruption
there's been (fsck -f /dev/md0).
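
Put together, the sequence described in this mail could look like the
sketch below. It only prints the commands so they can be reviewed and
run by hand one at a time, since the ddrescue step overwrites sdg. The
function name recovery_plan and the map file path are assumptions for
illustration, and the sketch assumes the copied partition layout on sdg
is picked up as sdg1.

```shell
#!/bin/sh
# Sketch only: prints the recovery steps instead of running them.
# recovery_plan and /root/sdj-to-sdg.map are illustrative names.

recovery_plan() {
    md=${1:-/dev/md0}
    cat <<EOF
mdadm --stop $md
ddrescue -f /dev/sdj /dev/sdg /root/sdj-to-sdg.map
mdadm -A $md -f /dev/sdc1 /dev/sdg1 /dev/sdh1 /dev/sdd1 /dev/sdi1 /dev/sdb1 /dev/sdf1 /dev/sde1
mdadm $md --add /dev/sdj1
fsck -f $md
EOF
}

# Example: print the plan for /dev/md0 and check every device name
# before running anything.
#   recovery_plan /dev/md0
```

The map file lets ddrescue resume and records which sectors could not be
read, which helps when judging how much corruption to expect.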

Good luck,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        [off-list ref] |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |
