Thread: 13 messages, 5 authors, 2014-12-04

Re: Raid5 drive fail during grow and no backup

From: Jason Keltz <hidden>
Date: 2014-11-10 03:20:22

On 07/11/2014 10:36 PM, Phil Turmel wrote:
On 11/07/2014 11:06 AM, P. Gautschi wrote:
 > This is a problem you haven't solved yet, I think. The raid array
 > should have fixed this bad sector for you without kicking the drive out.
 > The scenario is common with "green" drives and/or consumer-grade drives
 > in general.
 > ...
 > Then you can set up your array to properly correct bad sectors, and
 > set your system to look for bad sectors on a regular basis.

What is the behavior of mdadm when a disk reports a read error?
- reconstruct the data, deliver it to the fs and otherwise ignore it?
- set the disk to fail?
- reconstruct the data, rewrite the failed data and continue with any
action?
- rewrite the failed data and reread it (bypassing the cache on the HD)?
Option 3.  Reconstruct and rewrite.
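
For readers following along, "reconstruct and rewrite" can be sketched for a single RAID5 stripe as below. This is a simplified model, not md's actual code: the names `xor_blocks` and `read_with_repair` are hypothetical, the stripe is an in-memory list rather than real disks, and a mirrored level such as RAID 1 or RAID 10 would copy from the other mirror instead of XORing.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_with_repair(stripe, bad_index):
    """Model 'reconstruct and rewrite' for one RAID5 stripe.

    stripe:    list of equal-length bytes objects, one per member disk
               (data blocks plus the parity block).
    bad_index: the member whose read failed.
    """
    # Reconstruct: XOR of all surviving members yields the missing block.
    survivors = [blk for i, blk in enumerate(stripe) if i != bad_index]
    rebuilt = xor_blocks(survivors)
    # Rewrite: put the recovered data back on the member that failed the
    # read, which gives the drive the chance to remap the bad sector.
    stripe[bad_index] = rebuilt
    return rebuilt
```

In this model the rewrite always succeeds. The failure case discussed in the thread is the one where that write-back step itself fails.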

However, if the device with the bad sector is trying to recover longer
than the Linux low-level driver's timeout, bad things(TM) happen.
Specifically, the driver resets the SATA (or SCSI) connection and
attempts to reconnect.  During this brief time, it will not accept
further I/O, so the write back of the reconstructed data fails.  Then
the device has experienced a *write* error, so MD fails the drive.
This is the out-of-the-box behavior of consumer-grade drives in raid
arrays.
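
The timing condition described above can be checked with a rough sketch like the following. It assumes the disk is exposed through the SCSI layer, so that its command timeout is readable from `/sys/block/<device>/device/timeout`. The function names are hypothetical, and the drive's worst-case recovery time has to be supplied by the caller, since it is a property of the drive and is not read from sysfs here.

```python
from pathlib import Path


def driver_timeout_seconds(device, sysfs_root="/sys/block"):
    """Read the kernel's command timeout (in seconds) for a disk such as 'sda'."""
    timeout_file = Path(sysfs_root, device, "device", "timeout")
    return int(timeout_file.read_text().strip())


def recovery_outlasts_driver(drive_recovery_seconds, driver_timeout):
    """True if the drive may still be retrying when the driver gives up.

    drive_recovery_seconds is an assumed worst-case figure for the drive's
    internal error recovery; driver_timeout is the kernel-side limit.
    """
    return drive_recovery_seconds >= driver_timeout
```

If `recovery_outlasts_driver(...)` returns True for a member disk, that disk is a candidate for the reset-then-failed-write sequence described above.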
Hi Phil,
Sorry to interject.
Since I'm in the midst of setting up a 22-disk RAID 10 with 2 TB WD
Black (desktop) drives, I want to be sure I understand the particular
scenario that you bring up.  If a drive enters deep error recovery,
am I correct that the worst that should happen is a hang for the users
during the recovery time and, if the driver does reset the SATA
connection (as it likely would), the possible removal of the disk from
the array, but not the destruction of the array?  If I had a spare
disk, it would be used for the rebuild, and I could test the original
disk and re-add it to the pool at another time.

Any feedback would be helpful.

Thanks!

Jason.