Re: raid5 hang on get_active_stripe

From: dean gaudet <hidden>
Date: 2006-05-27 00:28:25

On Sat, 27 May 2006, Neil Brown wrote:

On Friday May 26, dean@arctic.org wrote:

On Tue, 23 May 2006, Neil Brown wrote:

i applied them against 2.6.16.18 and two days later i got my first hang... 
below is the stripe_cache foo.

thanks
-dean

neemlark:~# cd /sys/block/md4/md/
neemlark:/sys/block/md4/md# cat stripe_cache_active 
255
0 preread
bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active 
255
0 preread
bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active 
255
0 preread
bitlist=0 delaylist=255
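
For catching the stall as it happens, the repeated `cat` above can be turned into a small timestamped sampler. This is only a sketch: the function name `sample_stripe_cache` is made up for illustration, and it assumes the first line of the file is the active stripe count, as in the output above (the preread/bitlist/delaylist lines presumably come from the debugging patches and are ignored here).

```shell
#!/bin/sh
# sample_stripe_cache FILE [COUNT] [INTERVAL_SECONDS]
# Print the first line of FILE (assumed to be the active stripe count)
# COUNT times, one timestamped line per sample.
sample_stripe_cache() {
    file=$1
    count=${2:-3}
    interval=${3:-1}
    i=0
    while [ "$i" -lt "$count" ]; do
        active=$(head -n 1 "$file")
        printf '%s active=%s\n' "$(date +%H:%M:%S)" "$active"
        i=$((i + 1))
        # sleep only between samples, not after the last one
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

Something like `sample_stripe_cache /sys/block/md4/md/stripe_cache_active 3 5` would take three samples five seconds apart; a count that never moves across samples matches the stuck state shown above.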

Thanks.  This narrows it down quite a bit... too much in fact:  I can
now say for sure that this cannot possibly happen :-)

heheh.  fwiw the box has traditionally been rock solid.. it's ancient 
though... dual p3 750 w/440bx chipset and pc100 ecc memory... 3ware 7508 
w/seagate 400GB disks... i really don't suspect the hardware all that much 
because the freeze seems to be rather consistent as to time of day 
(overnight while i've got 3x rdiff-backup, plus bittorrent, plus updatedb 
going).  unfortunately it doesn't happen every time... but every time i've 
unstuck the box i've noticed those processes going.

other tidbits... md4 is a lvm2 PV ... there are two LVs, one with ext3
and one with xfs.

Two things that might be helpful:
  1/ Do you have any other patches on 2.6.16.18 other than the 3 I
    sent you?  If you do I'd like to see them, just in case.

it was just 2.6.16.18 plus the 3 you sent... i attached the .config
(it's rather full -- based off debian kernel .config).

maybe there's a compiler bug:

gcc version 4.0.4 20060507 (prerelease) (Debian 4.0.3-3)

  2/ The message.gz you sent earlier with the
          echo t > /proc/sysrq-trigger
     trace in it didn't contain information about md4_raid5 - the 
     controlling thread for that array.  It must have missed out
     due to a buffer overflowing.  Next time it happens, could you
     get this trace again and see if you can find out what
     md4_raid5 is doing.  Maybe do the 'echo t' several times.
     I think that you need a kernel recompile to make the dmesg
     buffer larger.
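
The request above (several task dumps, then a check that md4_raid5 actually made it into the log) could be scripted roughly as follows. This is a sketch under assumptions: both function names are hypothetical, the dump step needs root, and the check only looks for the thread name as a whole word anywhere in the saved log, since the exact trace line format is not shown in this thread.

```shell
#!/bin/sh
# capture_task_dumps OUTFILE [COUNT]
# Trigger COUNT sysrq task dumps (needs root), then save the kernel log.
capture_task_dumps() {
    out=$1
    count=${2:-3}
    i=0
    while [ "$i" -lt "$count" ]; do
        echo t > /proc/sysrq-trigger
        sleep 1
        i=$((i + 1))
    done
    dmesg > "$out"
}

# has_thread_trace LOGFILE THREAD
# Succeed if THREAD appears as a whole word somewhere in LOGFILE.
has_thread_trace() {
    grep -qwF -- "$2" "$1"
}
```

For example, `capture_task_dumps /tmp/trace.txt 3 && has_thread_trace /tmp/trace.txt md4_raid5 || echo "md4_raid5 missing, buffer probably too small"`.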

ok i'll set CONFIG_LOG_BUF_SHIFT=18 and rebuild ...
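
For reference, CONFIG_LOG_BUF_SHIFT is the base-2 logarithm of the kernel log buffer size in bytes, so the arithmetic for a given shift can be checked quickly. The helper name `log_buf_bytes` is just for illustration.

```shell
#!/bin/sh
# log_buf_bytes SHIFT: print the log buffer size in bytes for a given
# CONFIG_LOG_BUF_SHIFT value (size = 2^SHIFT).
log_buf_bytes() {
    echo $((1 << $1))
}

log_buf_bytes 18   # prints 262144, i.e. 256 KiB
```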

note that i'm going to include two more patches in this next kernel:

http://lkml.org/lkml/2006/5/23/42
http://arctic.org/~dean/patches/linux-2.6.16.5-no-treason.patch

the first was the Jens Axboe patch you mentioned here recently (for
accounting with i/o barriers)... and the second gets rid of the tcp
treason uncloaked messages.

Thanks for your patience - this must be very frustrating for you.

fortunately i'm the primary user of this box... and the bug doesn't
corrupt anything... and i can unstick it easily :)  so it's not all that
frustrating actually.

-dean

Attachments

config.gz [application/octet-stream] 17866 bytes
