Re: raid5 hang on get_active_stripe
From: dean gaudet <hidden>
Date: 2006-05-27 00:28:25
On Sat, 27 May 2006, Neil Brown wrote:
On Friday May 26, dean@arctic.org wrote:
On Tue, 23 May 2006, Neil Brown wrote:
i applied them against 2.6.16.18 and two days later i got my first hang... below is the stripe_cache foo.
thanks
-dean
neemlark:~# cd /sys/block/md4/md/
neemlark:/sys/block/md4/md# cat stripe_cache_active
255 0 preread bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active
255 0 preread bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active
255 0 preread bitlist=0 delaylist=255
Thanks. This narrows it down quite a bit... too much in fact: I can now say for sure that this cannot possibly happen :-)
heheh. fwiw the box has traditionally been rock solid.. it's ancient though... dual p3 750 w/440bx chipset and pc100 ecc memory... 3ware 7508 w/seagate 400GB disks...
i really don't suspect the hardware all that much because the freeze seems to be rather consistent as to time of day (overnight while i've got 3x rdiff-backup, plus bittorrent, plus updatedb going). unfortunately it doesn't happen every time... but every time i've unstuck the box i've noticed those processes going.
other tidbits... md4 is an lvm2 PV ... there are two LVs, one with ext3 and one with xfs.
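The three identical reads of stripe_cache_active quoted above are what show the array is stuck. A minimal sketch of automating that check follows; the function name sample_unchanged and its arguments are hypothetical, and the extra preread/bitlist/delaylist fields appear to come from the debugging patches rather than a stock kernel, so the script compares the whole file contents instead of parsing fields.

```shell
#!/bin/sh
# Hypothetical helper (not part of the kernel or mdadm): read a file several
# times and report whether its contents ever changed between reads.
#   $1 = file to sample, e.g. /sys/block/md4/md/stripe_cache_active
#   $2 = total number of reads
#   $3 = delay between reads, in seconds
# Returns 0 if every read matched the first, 1 if a read differed,
# 2 if the file could not be read.
sample_unchanged() {
    file=$1
    count=$2
    delay=$3
    first=$(cat "$file") || return 2
    i=1
    while [ "$i" -lt "$count" ]; do
        sleep "$delay"
        cur=$(cat "$file") || return 2
        [ "$cur" = "$first" ] || return 1
        i=$((i + 1))
    done
    return 0
}
```

For example, `sample_unchanged /sys/block/md4/md/stripe_cache_active 3 1 && echo "no change across 3 reads"` would repeat the manual check above. Unchanged values alone do not prove a hang on an idle array, so this is only a hint that it is time to capture a trace.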
Two things that might be helpful:
1/ Do you have any other patches on 2.6.16.18 other than the 3 I
sent you? If you do I'd like to see them, just in case.
it was just 2.6.16.18 plus the 3 you sent... i attached the .config (it's rather full -- based off debian kernel .config).
maybe there's a compiler bug:
gcc version 4.0.4 20060507 (prerelease) (Debian 4.0.3-3)
2/ The message.gz you sent earlier with the
echo t > /proc/sysrq-trigger
trace in it didn't contain information about md4_raid5 - the
controlling thread for that array. It must have missed out
due to a buffer overflowing. Next time it happens, could you
try to get this trace again and see if you can find out
what md4_raid5 is doing. Maybe do the 'echo t' several times.
I think that you need a kernel recompile to make the dmesg
buffer larger.
ok i'll set CONFIG_LOG_BUF_SHIFT=18 and rebuild ...
note that i'm going to include two more patches in this next kernel:
http://lkml.org/lkml/2006/5/23/42
http://arctic.org/~dean/patches/linux-2.6.16.5-no-treason.patch
the first was the Jens Axboe patch you mentioned here recently (for accounting with i/o barriers)... and the second gets rid of the tcp treason uncloaked messages.
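A rough sketch of the capture step discussed above, for the next hang. CONFIG_LOG_BUF_SHIFT is the base-2 logarithm of the kernel log buffer size in bytes, so 18 gives 256 KiB. The function names log_buf_bytes, has_thread and capture_traces, the retry count, and the sysrq-t.N.log file names are all hypothetical; the only interfaces assumed are `echo t > /proc/sysrq-trigger` and `dmesg`. capture_traces needs root and is only defined here, not called.

```shell
#!/bin/sh
# Kernel log buffer size in bytes for a given CONFIG_LOG_BUF_SHIFT value.
log_buf_bytes() {
    echo $((1 << $1))
}

# Succeeds if the log text on stdin mentions the given thread name.
has_thread() {
    grep -q -F -- "$1"
}

# Trigger the task dump up to $2 times and save the first dmesg output that
# mentions thread $1 (e.g. md4_raid5). Returns 0 on success, 1 if the thread
# never showed up, 2 if the trigger could not be written.
capture_traces() {
    thread=$1
    tries=$2
    i=0
    while [ "$i" -lt "$tries" ]; do
        echo t > /proc/sysrq-trigger || return 2
        sleep 1
        if dmesg | has_thread "$thread"; then
            dmesg > "sysrq-t.$i.log"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}
```

Something like `capture_traces md4_raid5 5` would then repeat the 'echo t' until the md4_raid5 entry survives in the buffer. A plain substring match can also hit unrelated lines that mention the name, so the saved log still needs to be read by hand.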
Thanks for your patience - this must be very frustrating for you.
fortunately i'm the primary user of this box... and the bug doesn't corrupt anything... and i can unstick it easily :) so it's not all that frustrating actually.
-dean
Attachments
- config.gz [application/octet-stream] 17866 bytes