Thread: 2 messages, 2 authors, 2008-05-06

Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition

From: Mike Snitzer <hidden>
Date: 2008-05-06 11:58:25
Also in: lkml

On Tue, May 6, 2008 at 2:53 AM, Neil Brown [off-list ref] wrote:
On Wednesday April 2, snitzer@gmail.com wrote:
 > resync via bitmap if faulty's events+1 == bitmap's events_cleared
 >
 > For more background please see:
 > http://marc.info/?l=linux-raid&m=120703208715865&w=2
 >
 > Without this change validate_super() will prevent the previously faulty
 > member from recovering via bitmap, e.g.:

 I can't help thinking that you are misinterpreting something.  I don't
 think there is a clean->dirty transition happening here.
 You could confirm this by using --examine on both devices after the
 messy shutdown and before re-assembling the array.

 Even allowing for that possible confusion, I cannot quite see what is
 going on.
 It is fairly clear from the event counts that the NBD device is marked
 clean, but if this is happening at array-shutdown time, I cannot see
 why md would try to write to the NBD device and thereby detect an
 error...

 Do you have an internal bitmap or a bitmap in an external file?

 In general, I would not like to make decisions based on the
 oddness/evenness of the event counter.  I consider that to be an
 internal implementation detail.  I am happy to make decisions based on
 a difference-of-1.  I need to understand the big picture first though.

Hi Neil,

I definitely could be misinterpreting something.  However, I did
determine that if the write-mostly NBD member of the raid1 becomes
faulty while the raid1 is being written to, it frequently has an
'events' count that is one less than the 'events_cleared' of the local
raid1 member (the member the array gets reassembled with first).  The
event counts indicate the NBD member is clean and the local member is
dirty.

I'm using internal bitmaps.  I've focused on the even->odd
(clean->dirty) transition to rationalize the safety of allowing the
NBD member to be off by one _and_ clean.  That could easily be
superficial but it seems significant.

It looks like bitmap_update_sb()'s incrementing of events_cleared (on
behalf of the local member) could be racing with the NBD member
becoming faulty (thereby making the array degraded).  This allows
events_cleared to reflect a clean->dirty transition that last occurred
before the array became degraded.  My reasoning is: if it was a
clean->dirty transition, the associated dirty bit is still set in the
local member's bitmap, so using the bitmap to resync is valid.
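
To make the condition concrete, here is a minimal standalone sketch of
the check being discussed.  It is not the kernel's validate_super();
the helper name and parameters are hypothetical, and the baseline rule
(member events >= events_cleared) is an assumption about the existing
behaviour.  The even/odd test encodes the even == clean reading
described above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only (not kernel code).
 * member_events:  'events' from the previously faulty member's superblock
 * events_cleared: 'events_cleared' from the surviving member's bitmap
 */
static bool can_resync_via_bitmap(uint64_t member_events,
                                  uint64_t events_cleared)
{
    /* Assumed existing rule: the member is not older than the bitmap. */
    if (member_events >= events_cleared)
        return true;

    /*
     * Proposed relaxation: the member is exactly one event behind and
     * its own count is even, which is read here as "clean".
     */
    if (member_events + 1 == events_cleared && (member_events % 2) == 0)
        return true;

    /* Otherwise the bitmap cannot be trusted; a full sync is needed. */
    return false;
}
```

Dropping the parity test from the second condition would give the plain
difference-of-1 rule instead of one that depends on even/odd.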

thanks,
Mike