Re: [PATCH] MD: Quickly return errors if too many devices have failed.
From: NeilBrown <hidden>
Date: 2013-03-17 23:49:05
On Wed, 13 Mar 2013 12:29:24 -0500 Jonathan Brassow [off-list ref] wrote:
Neil,
I've noticed that when too many devices fail in a RAID array that
additional I/O will hang, yielding an endless supply of:
Mar 12 11:52:53 bp-01 kernel: Buffer I/O error on device md1, logical block 3
Mar 12 11:52:53 bp-01 kernel: lost page write due to I/O error on md1
Mar 12 11:52:53 bp-01 kernel: sector=800 i=3 (null) (null) (null) (null) 1
This is the third report in as many weeks that mentions that WARN_ON.
The first two had quite different causes.
I think this one is the same as the first one, which means it would be fixed
by
md/raid5: schedule_construction should abort if nothing to do.
which is commit 29d90fa2adbdd9f in linux-next.
Mar 12 11:52:53 bp-01 kernel: ------------[ cut here ]------------
Mar 12 11:52:53 bp-01 kernel: WARNING: at drivers/md/raid5.c:354 init_stripe+0x2d4/0x370 [raid456]()
Are other people seeing this, or is this an artifact of the way I am killing
devices ('echo offline > /sys/block/$dev/device/state')?
That is a perfectly good way to kill a device.
I would prefer to get immediate errors if nothing can be done to satisfy the request and I've been thinking of something like the attached patch. The patch below is incomplete. It does not take into account any reshaping that is going on, nor does it try to figure out if a mirror set in RAID10 has died; but I hope it gets the basic idea across. Is this a good way to handle this situation, or am I missing something?
I think we do get immediate errors (once all bugs are fixed).
Your patch does extra work for every request which is only of value if the array has failed - and it really doesn't make sense to optimise for a failed array.
The current approach is to just try to satisfy a request and once we find that we need to do something that is impossible - return an error at that point. I think that is best.
Can you try the commit I identified and see if it makes the problem go away?
Thanks,
NeilBrown
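The trade-off discussed above can be sketched in a few lines of user-space C. This is only a toy model under stated assumptions, not md code: `struct toy_array`, `toy_read_eager`, `toy_read_lazy` and `count_failed` are hypothetical names, and the model ignores reshaping, RAID10 mirror sets and everything else a real array has to handle. It only shows where the "has the array failed?" test sits in each approach.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_MAX_DISKS 8

/* Hypothetical toy model of an array; not the md driver's structures. */
struct toy_array {
    int ndisks;                   /* number of member devices */
    int max_degraded;             /* failures tolerated, e.g. 1 for RAID5 */
    bool faulty[TOY_MAX_DISKS];   /* per-device failure flag */
};

/* Walk all members and count the failed ones. */
static int count_failed(const struct toy_array *a)
{
    int n = 0;
    for (int i = 0; i < a->ndisks; i++)
        if (a->faulty[i])
            n++;
    return n;
}

/* Eager: check array health up front on every request, healthy or not. */
static int toy_read_eager(const struct toy_array *a, int disk)
{
    (void)disk;
    if (count_failed(a) > a->max_degraded)
        return -EIO;
    return 0; /* would go on to handle the request as usual */
}

/* Lazy: just try; only look further once the needed device turns out bad. */
static int toy_read_lazy(const struct toy_array *a, int disk)
{
    if (!a->faulty[disk])
        return 0;   /* common case: no array-wide check at all */
    if (count_failed(a) <= a->max_degraded)
        return 0;   /* data could be rebuilt from the surviving members */
    return -EIO;    /* the required step is impossible: fail here */
}
```

In this sketch the eager variant pays for `count_failed()` on every call, while the lazy variant pays for it only on the path that touches a failed device, which is the cost argument made above.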