RE: RAID5 / 6 Growth

From: Leslie Rhorer <hidden>
Date: 2009-12-19 01:11:59

quoted

the entire array.  The question is particularly pertinent given the fact the
growth is going to take nearly 5 days (a lot can happen in 5 days), and the
fact the system was having the rather squirrelly issue a few days back which
seems - emphasis on SEEMS - to have been resolved by disabling NCQ.  What
happens if the system kicks a couple of drives, especially if one drive gets
kicked, a bunch of data gets written and then a few minutes later another
drive gets kicked?  In particular, what if neither of the two drives that
get kicked are the new drive?

Well, what happens if two drives get kicked in normal use over the
course of 5 days?

	Nothing of any consequence, unless it happens in quick succession.
When drive A is kicked, if it is spurious, then the drive is simply added
back and a resync performed.  If the drive actually failed, then it is
replaced, and once again a resync is done.  Either way, it takes vastly less
time than a growth.  Assuming at least one of the kicks is not an
out-and-out drive failure, then recovering the bulk of the data is fairly
easy.  That may not be the case with two drives kicked during a growth,
since a big chunk of the data on the last drive will be completely missing.
What's more, one is left with an array which properly has neither N nor N +
1 drives, but is in the process of changing from one to the other.  Again,
recovering from a failed resync or a sudden non-drive failure (like a power
failure or a drive cable being accidentally yanked) is fairly easy.  I don't
know what will happen if one of the drive cables feeding three of the drives
is accidentally yanked.  That's why I am asking.

I think you're being overly cautious, and I'll try to
explain why.

The reshape only reduces redundancy during the "critical section". After
that, you're as redundant as usual and can tolerate a drive failure. On
RAID-6, 2 drive failures.

	Yes, I know.  I've experienced a number of issues where two or more
drives have been taken offline by md, though.  As I say, recovering from
this when the array was in a stable configuration is not too difficult,
perhaps even without data loss.  What happens when the array is taken
offline and it has neither properly 7 nor 8 drives is a real question,
though.  Obviously, if the array can resume its expansion where it left off
after a failure event, then it is not an issue, but according to one of the
other correspondents, this feature is not available in my version of mdadm.

A reshape should be considerably safer than
doing a resync to a replacement drive, because in the reshape case if
you get bad sectors md can regenerate the data from the parity info.

	Except that it takes many times longer, significantly increasing the
likelihood of such a failure during the event.
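
	To put a rough shape on that, here is a minimal sketch, assuming kicks
arrive independently at a constant rate (a simple Poisson model; that is an
assumption for illustration, not something md guarantees).  The kick rate and
the 6-hour resync time are hypothetical figures.  Only the 5-day growth
estimate comes from this thread.

```python
import math


def p_at_least_one_kick(rate_per_day: float, duration_days: float) -> float:
    """Chance of one or more kicks during the window (Poisson assumption)."""
    return 1.0 - math.exp(-rate_per_day * duration_days)


def p_at_least_two_kicks(rate_per_day: float, duration_days: float) -> float:
    """Chance of two or more kicks during the window (Poisson assumption)."""
    lam = rate_per_day * duration_days
    # Subtract the probabilities of exactly zero and exactly one kick.
    return 1.0 - math.exp(-lam) * (1.0 + lam)


if __name__ == "__main__":
    rate = 0.05          # hypothetical: kicks per day across the whole array
    resync_days = 0.25   # hypothetical: a 6-hour resync
    growth_days = 5.0    # from the thread: growth takes nearly 5 days

    for label, days in (("resync", resync_days), ("growth", growth_days)):
        print(f"{label}: P(>=1 kick) = {p_at_least_one_kick(rate, days):.4f}, "
              f"P(>=2 kicks) = {p_at_least_two_kicks(rate, days):.4f}")
```

	Whatever rate is plugged in, the longer window always comes out with the
higher probability, and the chance of two kicks grows faster than the chance
of one.  The model says nothing about correlated failures such as a shared
cable or a controller hiccup, which would make matters worse.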

Do you regularly run a check on your array? Or have you done one
recently? And does the SMART info on all your drives look OK? These
should be the case before attempting any reshape anyway,

	Yes, but that did not stop md from halting the array multiple times
during resyncs when NCQ was enabled.  Disabling NCQ seems to have alleviated
the issue, but I have no guarantees it won't happen again during the growth.
