Re: RAID 5 performance issue.
From: Andrew Clayton <hidden>
Date: 2007-10-04 18:26:53
On Thu, 4 Oct 2007 12:20:25 -0400 (EDT), Justin Piszcz wrote:
On Thu, 4 Oct 2007, Andrew Clayton wrote:
On Thu, 4 Oct 2007 10:10:02 -0400 (EDT), Justin Piszcz wrote:
Also, did performance just go to crap one day or was it gradual?
IIRC I just noticed one day that firefox and vim were stalling. That was back in February/March I think. At the time the server was running a 2.6.18 kernel; since then I've tried a few kernels in between that and the current 2.6.23-rc9. Something seems to be periodically causing a lot of activity that maxes out the stripe_cache for a few seconds (when I was trying to look with blktrace, it seemed pdflush was doing a lot of activity during this time). What I had noticed just recently was that when I was the only one doing IO on the server (no NFS running and I was logged in at the console), even just patching the kernel was crawling to a halt.
Justin.
Cheers, Andrew
Besides the NCQ issue your problem is a bit perplexing. Just out of curiosity have you run memtest86 for at least one pass to make sure there were no problems with the memory?
No I haven't.
Do you have a script showing all of the parameters that you use to optimize the array?
No script. Nothing that I change really seems to make any difference. Currently I have /sys/block/md0/md/stripe_cache_size set to 16384. It doesn't really seem to matter what I set it to, as stripe_cache_active will periodically reach that value and take a few seconds to come back down. I have also set /sys/block/sd[bcd]/queue/nr_requests to 512 and readahead to 8192 on sd[bcd]. But none of that really seems to make any difference.
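For reference, the settings described above could be collected into something like the following. This is only a rough sketch, not a script that was actually run on this server: the function names are made up for illustration, the member disks are taken to be sdb, sdc and sdd, and the readahead value of 8192 is assumed to be in 512-byte sectors (as with `blockdev --setra`), which is converted to KiB for the `read_ahead_kb` sysfs attribute.

```python
from pathlib import Path

# Values quoted in the message; everything else here is illustrative.
STRIPE_CACHE_SIZE = 16384
NR_REQUESTS = 512
READAHEAD_SECTORS = 8192  # ASSUMPTION: 512-byte sectors, as with blockdev --setra


def build_settings(array="md0", members=("sdb", "sdc", "sdd")):
    """Return a {relative sysfs path: value} mapping for the array and its disks."""
    settings = {f"block/{array}/md/stripe_cache_size": STRIPE_CACHE_SIZE}
    for disk in members:
        settings[f"block/{disk}/queue/nr_requests"] = NR_REQUESTS
        # read_ahead_kb is expressed in KiB, so convert from 512-byte sectors.
        settings[f"block/{disk}/queue/read_ahead_kb"] = READAHEAD_SECTORS * 512 // 1024
    return settings


def apply_settings(settings, sysfs_root="/sys", dry_run=True):
    """Write each value under sysfs_root (needs root); in dry-run mode only print."""
    for rel_path, value in settings.items():
        target = Path(sysfs_root) / rel_path
        if dry_run:
            print(f"echo {value} > {target}")
        else:
            target.write_text(f"{value}\n")


def stripe_cache_saturated(array="md0", sysfs_root="/sys"):
    """Return True when stripe_cache_active has reached stripe_cache_size."""
    md_dir = Path(sysfs_root) / "block" / array / "md"
    active = int((md_dir / "stripe_cache_active").read_text())
    size = int((md_dir / "stripe_cache_size").read_text())
    return active >= size


if __name__ == "__main__":
    # Dry run by default: shows what would be written without touching sysfs.
    apply_settings(build_settings())
```

Polling `stripe_cache_saturated()` once a second while the stall happens would show how long the stripe cache stays pinned at its limit.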
Also mdadm -D /dev/md0 output please?
http://digital-domain.net/kernel/sw-raid5-issue/mdadm-D
What distribution are you running? (not that it should matter, but just curious)
Fedora Core 6 (though I'm fairly sure it was happening before upgrading from Fedora Core 5).
The iostat output of the drives when the problem occurs has the same profile as when the backup is going onto the USB 1.1 hard drive: the IO wait goes up, the CPU % hits 100% and we see multi-second await times. That is why I thought the on-board controller might be a bottleneck, in the same way that USB 1.1 is really slow, and moved the disks onto the PCI card. But when I saw that even patching the kernel was going really slowly, I thought that can't really be the problem, as it didn't use to go that slow. It's a tricky one...
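To make the await figure concrete: it is roughly the time spent on reads and writes divided by the number of requests completed over the sampling interval. A minimal sketch of that calculation from two snapshots of /proc/diskstats follows. The function names are hypothetical, and this approximates what iostat reports rather than reproducing its implementation.

```python
def read_diskstats(text):
    """Parse /proc/diskstats text into {device: (ios_completed, ms_spent)}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 11:
            continue  # skip short or malformed lines
        name = fields[2]
        reads, read_ms = int(fields[3]), int(fields[6])
        writes, write_ms = int(fields[7]), int(fields[10])
        stats[name] = (reads + writes, read_ms + write_ms)
    return stats


def average_wait_ms(before, after):
    """Average milliseconds per completed request between two samples."""
    result = {}
    for name, (ios_after, ms_after) in after.items():
        ios_before, ms_before = before.get(name, (0, 0))
        completed = ios_after - ios_before
        if completed > 0:  # avoid dividing by zero for idle devices
            result[name] = (ms_after - ms_before) / completed
    return result
```

Reading /proc/diskstats twice, a few seconds apart, and passing both parsed snapshots to `average_wait_ms()` gives a per-device figure. Values in the thousands for sdb, sdc and sdd would correspond to the multi-second await times described above.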
Justin.
Cheers, Andrew