Thread: 17 messages, 7 authors, 2011-05-02

Re: Bug#624343: linux-image-2.6.38-2-amd64: frequent message "bio too big device md0 (248 > 240)" in kern.log

From: Ben Hutchings <hidden>
Date: 2011-05-02 01:04:18

On Sun, 2011-05-01 at 20:42 -0400, Daniel Kahn Gillmor wrote:
On 05/01/2011 08:00 PM, Ben Hutchings wrote:
On Sun, 2011-05-01 at 15:06 -0700, Jameson Graef Rollins wrote:
Hi, Ben.  Can you explain why this is not expected to work?  Which part
exactly is not expected to work and why?
Adding another type of disk controller (USB storage versus whatever the
SSD interface is) to a RAID that is already in use.
 [...]
The normal state of a RAID set is that all disks are online.  You have
deliberately turned this on its head; the normal state of your RAID set
is that one disk is missing.  This is such a basic principle that most
documentation won't mention it.
This is somewhat worrisome to me.  Consider a fileserver with
non-hotswap disks.  One disk fails in the morning, but the machine is in
production use, and the admin's goals are:

 * minimize downtime,
 * reboot only during off-hours, and
 * minimize the amount of time that the array spends de-synced.

A responsible admin might reasonably expect to attach a disk via a
well-tested USB or ieee1394 adapter, bring the array back into sync,
and announce to the rest of the organization that there will be a
scheduled reboot later in the evening.

Then, at the scheduled reboot, move the disk from the USB/ieee1394
adapter to the direct ATA interface on the machine.

If this sequence of operations is likely (or even possible) to cause
data loss, it should be spelled out in BIG RED LETTERS someplace.
So far as I'm aware, the RAID may stop working, but without loss of data
that's already on disk.
I don't think any of the above steps seem unreasonable, and the goals
the admin is attempting to meet are certainly commonplace.
The error is that you changed the I/O capabilities of the RAID while it
was already in use.  But what I was describing as 'correct' was that an
error code was returned, rather than the error condition only being
logged.  If the error condition is not properly propagated then it could
lead to data loss.
How is an admin to know which I/O capabilities to check before adding a
device to a RAID array?  When is it acceptable to mix I/O capabilities?
Can a RAID array which is not currently being used as a backing store
for a filesystem be assembled of unlike disks?  What if it is then
(later) used as a backing store for a filesystem?
[...]

I think the answers are:
- Not easily
- When the RAID does not have another device on top
- Yes
- Yes
but Neil can correct me on this.
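
To make the "not easily" answer a little more concrete, here is a rough
sketch of the kind of check an admin could do by hand. It is an
illustration only, not something anyone in this thread has proposed. It
assumes the relevant limit is the one exposed as max_hw_sectors_kb under
/sys/block/<dev>/queue/, and that the 248 and 240 in the kernel message
are counts of 512-byte sectors. The function names and the device names
in the demo are hypothetical.

```python
from pathlib import Path


def sectors_to_kb(sectors):
    """Convert a count of 512-byte sectors to KiB."""
    return sectors * 512 // 1024


def read_queue_limit_kb(device, attribute="max_hw_sectors_kb",
                        sysfs_root="/sys/block"):
    """Read a request-size limit (in KiB) for a whole-disk device.

    'device' is a name such as 'sda' or 'md0'. Which attribute actually
    matters for the "bio too big" message is an assumption here.
    """
    path = Path(sysfs_root) / device / "queue" / attribute
    return int(path.read_text().strip())


def members_below_limit(array_limit_kb, member_limits_kb):
    """Return the names of members whose limit is lower than the array's.

    Adding such a member to an array that already has another device
    stacked on top is the situation described in this bug report.
    """
    return sorted(
        name
        for name, limit_kb in member_limits_kb.items()
        if limit_kb < array_limit_kb
    )


if __name__ == "__main__":
    # Hard-coded demo values taken from the message in the subject line:
    # the bio was 248 sectors, the new limit 240 sectors.
    array_limit = sectors_to_kb(248)           # 124 KiB
    candidates = {"sda": 512, "sdb": sectors_to_kb(240)}
    print(members_below_limit(array_limit, candidates))
```

On a live system the dictionary would be filled in with
read_queue_limit_kb() for each candidate device. A clean result from a
check like this is no guarantee that other I/O capabilities match.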

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
