Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock
From: Phil Turmel <hidden>
Date: 2014-01-15 13:35:22
On 01/15/2014 07:50 AM, Wilson Jonathan wrote:
> On Tue, 2014-01-14 at 13:43 -0500, Phil Turmel wrote:
> > On 01/14/2014 12:47 PM, Wilson Jonathan wrote:
> > [trim /]
> > > I understand the issue of "timeout" on drives that might perform long error checking which then causes mdadm, via the device (block?) driver issuing a time out, to then kick the drive. In this instance you allow some time for a drive to try and fix things at the expense of a hung array for a longer period of time. I also understand that with scterc the drive gives up (in effect timing itself out) when it hits the 7 second, or thereabouts, mark and subsequently mdadm kicks the drive out. In this specific instance the idea is to kill a drive quickly so that the raid doesn't hang longer than a few seconds.
> >
> > No. The intent is to fail the read without failing the controller channel.
>
> Arrr, thanks for the clarification... I hadn't realised that instead of the drive returning an "Error, I can't get the data, I'm dead in the water" message it instead returned a "warning, I can't get the data, you deal with it and get back to me, I'm still working" kind of affair.
Let me emphasize one point here: while a drive is performing error recovery, it *stops talking to the controller*. The drive isn't replying with a warning as you suggest--it isn't replying *at all*. Modern desktop drives try *very hard* to recover bad sectors, under the assumption that they have the only copy of the data. Typically, they'll work at it for two *minutes* or more. The Linux kernel driver will give up after 30 seconds and try to reset the drive. The drive firmware ignores the reset, possibly multiple times, until it is done retrying the original read. When it does finally reset, it is too late--it's been bumped from the array. But the drive didn't really fail, leading to:
> > When you, the admin, get around to looking, the drive is idle but apparently fine. (It gains a "pending" sector, which stays until the drive is told to write over that spot.) HTH,
>
> It does, thanks for the information :-)
You are welcome. Phil
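
For anyone who wants to check their own disks against the timing described above, here is a rough Python sketch, not a definitive tool. It assumes the kernel's per-device command timeout is exposed at /sys/block/<dev>/device/timeout, which is the usual place for SCSI/SATA disks but is not mentioned in this thread. The function names are made up for illustration. The drive-side figure (about 7 seconds with scterc, two minutes or more without) has to be supplied by hand, since the sketch does not query the drive.

```python
from pathlib import Path


def read_kernel_timeout(dev: str, sysfs_root: str = "/sys/block") -> int:
    """Return the kernel command timeout, in seconds, for a disk such as 'sda'.

    Assumes the value lives at <sysfs_root>/<dev>/device/timeout.
    """
    return int(Path(sysfs_root, dev, "device", "timeout").read_text().strip())


def outlasts_kernel(drive_recovery_s: float, kernel_timeout_s: float) -> bool:
    """True if the drive may still be retrying when the kernel gives up.

    That is the case where the driver resets the link and the drive can be
    bumped from the array even though it has not really failed.
    """
    return drive_recovery_s >= kernel_timeout_s
```

With the figures from this thread, outlasts_kernel(120, 30) is true (desktop drive, default timeout) and outlasts_kernel(7, 30) is false (drive limited to roughly 7 seconds by scterc), so only the first combination risks a needless kick.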