Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering | linux-xfs

On Tue, Jul 05, 2011 at 03:10:16PM +0100, Mel Gorman wrote:
On Sat, Jul 02, 2011 at 12:42:19PM +1000, Dave Chinner wrote:
On Fri, Jul 01, 2011 at 03:59:35PM +0100, Mel Gorman wrote:
BTW, calling a workload "fsmark" tells us nothing about the workload
being tested - fsmark can do a lot of interesting things. IOWs, you
need to quote the command line for it to be meaningful to anyone...
My bad.

./fs_mark -d /tmp/fsmark-14880 -D 225  -N  22500  -n  3125  -L  15 -t  16  -S0  -s  131072
Ok, so 16 threads, 3125 files per thread, 128k per file, all created
in the same directory which rolls over when it gets to 22500
files in the directory. Yeah, it generates a bit of memory pressure,
but I think the file sizes are too small to really stress writeback
much. You need to use files that are at least 10MB in size to really
start to mix up the writeback lists and the way they juggle new and
old inodes to try not to starve any particular inode of writeback
bandwidth....
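
For a sense of scale, the arithmetic behind that command line can be sketched as below. This is only a back-of-the-envelope calculation: it takes -n as files per thread (as read above) and assumes -L is the number of loop iterations, which the thread itself does not spell out.

```python
# Rough data volume for the quoted fs_mark command line.
THREADS = 16              # -t 16
FILES_PER_THREAD = 3125   # -n 3125
FILE_SIZE = 131072        # -s 131072 (128 KiB)
LOOPS = 15                # -L 15 (assumed here to be the loop count)


def workload_bytes(threads, files_per_thread, file_size, loops=1):
    """Total bytes written for the given fs_mark-style parameters."""
    return threads * files_per_thread * file_size * loops


files_per_loop = THREADS * FILES_PER_THREAD
per_loop = workload_bytes(THREADS, FILES_PER_THREAD, FILE_SIZE)
total = workload_bytes(THREADS, FILES_PER_THREAD, FILE_SIZE, LOOPS)

print(f"files per loop: {files_per_loop}")
print(f"data per loop:  {per_loop / 2**30:.2f} GiB")
print(f"data in total:  {total / 2**30:.2f} GiB")
```

So each loop creates 50000 small files, which is a lot of inodes but very little dirty data per inode.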

Also, I don't use the "-t <num>" threading mechanism because all it
does is bash on the directory mutex without really improving
parallelism for creates. perf top on my system shows:

           samples  pcnt function                           DSO
             _______ _____ __________________________________ __________________________________

             2799.00  9.3% mutex_spin_on_owner                [kernel.kallsyms]
             2049.00  6.8% copy_user_generic_string           [kernel.kallsyms]
             1912.00  6.3% _raw_spin_unlock_irqrestore        [kernel.kallsyms]

A contended mutex as the prime CPU consumer. That's more CPU than
copying 750MB/s of data.

Hence I normally drive parallelism with fsmark by using multiple "-d
<dir>" options, which runs a thread per directory and a workload
unit per directory and so you don't get directory mutex contention
causing serialisation and interference with what you are really
trying to measure....
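
A minimal sketch of what that looks like, written as a small Python helper that builds the argument list. The base path /mnt/scratch and the helper name are made up for illustration, and only options already quoted in this thread (-d, -n, -s, -S0, -L) are used:

```python
def fs_mark_argv(base_dir, nr_dirs, files_per_dir, file_size, loops):
    """Build an fs_mark command line that uses one "-d <dir>" per worker
    instead of "-t <num>", so each thread works in its own directory."""
    argv = [
        "./fs_mark",
        "-S0",
        "-n", str(files_per_dir),
        "-s", str(file_size),
        "-L", str(loops),
    ]
    for i in range(nr_dirs):
        # Hypothetical layout: <base_dir>/0, <base_dir>/1, ...
        argv += ["-d", f"{base_dir}/{i}"]
    return argv


# Same file count and size as the quoted run, spread over 16 directories.
argv = fs_mark_argv("/mnt/scratch", 16, 3125, 131072, 15)
print(" ".join(argv))
```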

As I look through the results I have at the moment, the number of
pages written back was simply really low which is why the problem fell
off my radar.
It doesn't take many to completely screw up writeback IO patterns.
Write a few random pages to a 10MB file well before writeback would
get to the file, and instead of getting optimal sequential writeback
patterns when writeback gets to it, we get multiple disjoint IOs
that require multiple seeks to complete.
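
To illustrate the effect, here is a toy model (not kernel code) that counts how many contiguous dirty runs are left in a file once a few pages have been written out early. It assumes 4 KiB pages, so a 10MB file is 2560 pages; the page indices are arbitrary examples.

```python
def dirty_runs(nr_pages, cleaned):
    """Return (start, length) for each contiguous run of still-dirty pages,
    given the set of page indices that were already written back."""
    runs = []
    start = None
    for page in range(nr_pages):
        if page in cleaned:
            if start is not None:
                runs.append((start, page - start))
                start = None
        elif start is None:
            start = page
    if start is not None:
        runs.append((start, nr_pages - start))
    return runs


NR_PAGES = 10 * 1024 * 1024 // 4096   # 10MB file, assuming 4 KiB pages

untouched = dirty_runs(NR_PAGES, set())
after_early_writes = dirty_runs(NR_PAGES, {100, 900, 1700})

print(len(untouched), "run if nothing was written early")
print(len(after_early_writes), "runs after three early single-page writes")
```

In this model three early single-page writes turn one sequential IO into four disjoint ones, on top of the three single-page IOs already issued.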

Slower, less efficient writeback IO causes memory pressure to last
longer and hence more likely to result in kswapd writeback, and it's
just a downward spiral from there....
Yes, I see the negative feedback loop. This has always been a struggle
in that kswapd needs pages from a particular zone to be cleaned and
freed but calling writepage can make things slower. There were
prototypes in the past to give hints to the flusher threads on what
inode and pages to be freed and they were never met with any degree of
satisfaction.

The consensus (among VM people at least) was that as long as that number was
low, it wasn't much of a problem.
Therein lies the problem. You've got storage people telling you
there is an IO problem with memory reclaim, but the mm community
then put their heads together somewhere private, decide it isn't
a problem worth fixing and do nothing. Rinse, lather, repeat.

I expect memory reclaim to play nicely with writeback that is
already in progress. These subsystems do not work in isolation, yet
memory reclaim treats it that way - as though it is the most
important IO submitter and everything else can suffer while memory
reclaim does its stuff.  Memory reclaim needs to co-ordinate with
writeback effectively for the system as a whole to work well
together.

I know you disagree.
Right, that's because it doesn't have to be a very high number to be
a problem. IO is orders of magnitude slower than the CPU time it
takes to flush a page, so the cost of making a bad flush decision is
very high. And single page writeback from the LRU is almost always a
bad flush decision.

Oh, now that is too close to just be a co-incidence. We're getting
significant amounts of random page writeback from the ends of
the LRUs done by the VM.

<sigh>
Does the value for nr_vmscan_write in /proc/vmstat correlate? It must,
but let's be sure because I'm using that figure rather than ftrace to
count writebacks at the moment.
The number in /proc/vmstat is higher. Much higher.  I just ran the
test at 1000 files (only collapsed to ~3000 iops this time because I
ran it on a plain 3.0-rc4 kernel that still has the .writepage
clustering in XFS), and I see:

nr_vmscan_write 6723

after the test. The event trace only captured ~1400 writepage events
from kswapd, but it tends to miss a lot of events as the system is
quite unresponsive at times under this workload - it's not uncommon
to have ssh sessions not echo a character for 10s... e.g: I started
the workload ~11:08:22:
Ok, I'll be looking at nr_vmscan_write as the basis for "badness".
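
For anyone wanting to track the same counter, a minimal sketch of reading it follows. It relies only on /proc/vmstat being "name value" pairs, one per line; the helper name is made up, and the text argument exists so the parser can be tried on a saved copy.

```python
def read_vmstat_counter(name, text=None):
    """Return the integer value of one /proc/vmstat counter, or None if
    the counter is not present. Reads /proc/vmstat when text is None."""
    if text is None:
        with open("/proc/vmstat") as f:
            text = f.read()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == name:
            return int(fields[1])
    return None


# Example on a saved snippet; the nr_dirty value here is made up.
sample = "nr_dirty 10\nnr_vmscan_write 6723\n"
print(read_vmstat_counter("nr_vmscan_write", sample))
```

Sampling the counter before and after a run and taking the difference gives the number of pages written from reclaim during that run.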
Perhaps you should look at my other reply (and two line "fix") in
the thread about stopping dirty page writeback until after waiting
on pages under writeback.....

A more relevant question is this -
how many pages were reclaimed by kswapd and what percentage is 799
pages of that? What do you consider an acceptable percentage?
I don't care what the percentage is or what the number is. kswapd is
reclaiming pages most of the time without affecting IO patterns, and
when that happens I just don't care because it is working just fine.
I do care. I'm looking at some early XFS results here based on a laptop
(4G). For fsmark with the command line above, the number of pages
written back by kswapd was 0. The worst test by far was sysbench using a
particularly large database. The number of writes was 48745 which is
0.27% of pages scanned or 0.28% of pages reclaimed. Ordinarily I would
ignore that.

If I run this at 1G and get a similar ratio, I will assume that I
am not reproducing your problem at all unless I know what ratio you
are seeing.
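
The ratio being discussed is simple to compute once the totals are known. In the sketch below, the scanned and reclaimed totals are back-computed from the rounded percentages quoted above, so they are approximations and not reported figures.

```python
def percent(part, whole):
    """part as a percentage of whole."""
    return 100.0 * part / whole


def implied_total(part, pct):
    """Approximate total for which part is pct percent."""
    return part / (pct / 100.0)


KSWAPD_WRITES = 48745   # sysbench figure quoted above

approx_scanned = implied_total(KSWAPD_WRITES, 0.27)
approx_reclaimed = implied_total(KSWAPD_WRITES, 0.28)

print(f"pages scanned   ~ {approx_scanned / 1e6:.1f} million")
print(f"pages reclaimed ~ {approx_reclaimed / 1e6:.1f} million")
```

The same calculation cannot be done for the 799 page case without knowing how many pages kswapd reclaimed in that run.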
Single threaded writing of files should -never- cause writeback from
the LRUs. If that is happening, then the memory reclaim throttling
is broken. See my other email.

So .... How many pages were reclaimed by kswapd and what percentage
is 799 pages of that?
No idea. That information is long gone....

You answered my second question. You consider 0% to be the acceptable
percentage.
No, I expect memory reclaim to behave nicely with writeback that is
already in progress. These subsystems do not work in isolation - they
need to co-ordinate 

What I care about is what kswapd is doing when it finds dirty pages
and it decides they need to be written back. It's not a problem that
they are found or need to be written, the problem is the utterly
crap way that memory reclaim is throwing the pages at the filesystem.

I'm not sure how to get through to you guys that single, random page
writeback is *BAD*.
It got through. The feedback during discussions on the VM side was
that as long as the percentage was sufficiently low it wasn't a problem
because on occasion, the VM really needs pages from a particular zone.
A solution that addresses both problems has never been agreed on, and
energy and time run out before it gets fixed each time.
<sigh>

And while I'm ranting, when on earth is the issue-writeback-from-
direct-reclaim problem going to be fixed so we can remove the hacks
in the filesystem .writepage implementations to prevent this from
occurring?
Prototyped that too, same thread. Same type of problem, writeback
from direct reclaim should happen so rarely that it should not be
optimised for. See https://lkml.org/lkml/2010/6/11/32
Writeback from direct reclaim crashes systems by causing stack
overruns - that's why we've disabled it. It's not an "optimisation"
problem - it's a _memory corruption_ bug that needs to be fixed.....

At the risk of pissing you off, this isn't new information so I'll
consider myself duly nudged into revisiting.
No, I've had a rant to express my displeasure at the lack of
progress on this front.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
