Re: Software RAID when it works and when it doesn't
From: Support <hidden>
Date: 2007-10-17 21:53:52
On Tue, 2007-10-16 at 17:57 -0400, Mike Accetta wrote:
> Was the disk driver generating any low-level errors or otherwise indicating that it might be retrying operations on the bad drive at the time (e.g. console diagnostics)? As Neil mentioned later, the md layer is at the mercy of the low-level disk driver. We've observed abysmal RAID1 recovery times on failing SATA disks because all the time is being spent in the driver retrying operations which will never succeed. Also, read errors don't tend to fail the array, so when the bad disk is again accessed for some subsequent read the whole hopeless retry process begins anew.
The console was full of errors like:

    end_request: I/O error, dev sdb, sector 42644555

I don't know what generates those messages. As I asked before but never got an answer, is there a way to do timeouts within the md code so that we are not at the mercy of the lower layer drivers?
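In case it helps anyone compare notes, here is a rough sketch of how console lines like the one above could be tallied per device. It assumes the messages have exactly the format shown; the function name tally_io_errors is made up for illustration and the script is not part of md or any existing tool.

```python
import re
from collections import Counter

# Matches lines such as: end_request: I/O error, dev sdb, sector 42644555
# (format assumed from the console message quoted above)
IO_ERROR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def tally_io_errors(lines):
    """Return (error count per device, set of failing sectors per device)."""
    counts = Counter()
    sectors = {}
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match is None:
            continue  # not an end_request I/O error line
        dev = match.group(1)
        sector = int(match.group(2))
        counts[dev] += 1
        sectors.setdefault(dev, set()).add(sector)
    return counts, sectors
```

Feeding it the saved console output (for example, the lines of a file captured from dmesg) shows whether the errors keep hitting the same few sectors or are spread across the disk.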
> I posted a patch about 6 weeks ago which attempts to improve this situation for RAID1 by telling the driver not to retry on failures and giving some weight to read errors for failing the array. Hopefully, Neil is still mulling it over and it or something similar will eventually make it into the mainline kernel as a solution for this problem.
> -- Mike Accetta
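To check my own understanding of "giving some weight to read errors", here is a toy user-space model of that kind of policy. It is not Mike's patch and not md code; the class name MirrorHealth and the threshold, weight and decay values are all invented for illustration.

```python
class MirrorHealth:
    """Toy model: read errors add weight, successful reads slowly remove it.

    All names and numbers here are hypothetical, not taken from md.
    """

    def __init__(self, fail_threshold=20, error_weight=5, decay=1):
        self.fail_threshold = fail_threshold
        self.error_weight = error_weight
        self.decay = decay
        self.score = 0
        self.failed = False

    def record_read(self, ok):
        """Record one read result; return True once the member counts as failed."""
        if ok:
            # A good read reduces the score, but never below zero.
            self.score = max(0, self.score - self.decay)
        else:
            self.score += self.error_weight
        if self.score >= self.fail_threshold:
            # Failure is sticky: later good reads do not revive the member.
            self.failed = True
        return self.failed
```

With a policy along these lines, a disk that keeps returning read errors would eventually be dropped from the mirror instead of being retried on every subsequent read.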
Thanks, Alberto