Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache

From: Lee Schermerhorn <hidden>
Date: 2007-08-13 15:47:25
Also in: lkml

On Mon, 2007-08-13 at 09:43 +0200, Nick Piggin wrote:

On Fri, Aug 10, 2007 at 05:08:18PM -0400, Lee Schermerhorn wrote:

On Wed, 2007-08-08 at 16:25 -0400, Lee Schermerhorn wrote:

On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:

Hi,

Just got a bit of time to take another look at the replicated pagecache
patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
give me more confidence in the locking now; the new ->fault API makes
MAP_SHARED write faults much more efficient; and a few bugs were found
and fixed.

More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
tests...

<snip>

Hi Lee,

I've been sick with the flu for the past few days, so I haven't done much more
work here, but I'll just add some (not very useful) comments....

The get_page_from_freelist hang is quite strange. It would be zone->lock,
which shouldn't have too much contention...

Replication may be putting more stress on some locks. It will cause more
tlb flushing that can not be batched well, which could cause the call_lock
to get hotter. Then i_mmap_lock is held over tlb flushing, so it will
inherit the latency from call_lock. (If this is the case, we could
potentially extend the tlb flushing API slightly to cope better with
unmapping of pages from multiple mm's, but that comes way down the track
when/if replication proves itself!).

tlb flushing AFAIKS should not do the IPI unless it is dealing with a
multithreaded mm... does usex use threads?

Yes.  Apparently, there are some tests, perhaps some of the /usr/bin
apps that get run repeatedly, that are multi-threaded.  This job mix
caught a number of races in my auto-migration patches when
multi-threaded tasks race in the page fault paths.

More below...

I should note that I was trying to unmap all mappings to the file backed pages
on internode task migration, instead of just the current task's pte's.  However,
I was only attempting this on pages with mapcount <= 4.  So, I don't think I
was looping trying to unmap pages with mapcounts of several tens--such as I see
on some page cache pages in my traces.

Replication teardown would still have to unmap all... but that shouldn't
particularly be any worse than, say, page reclaim (except I guess that it
could occur more often).

Today, after rebasing to 23-rc2-mm2, I added a patch to unmap only the current
task's ptes for ALL !anon pages, regardless of mapcount.  I've started the test
again and will let it run over the weekend--or as long as it stays up, whichever
is shorter :-).

Ah, so it does eventually die? Any hints of why?

No, doesn't die--as in panic.  I was just commenting that I'd leave it
running ...  However [:-(], it DID hang again.  The test window said
that the tests ran for 62h:28m before the screen stopped updating.  In
another window, I was running a script to snapshot the replication counters and
the file page count from vmstat, along with a timestamp, every 10 minutes.  That
stopped reporting stats at about 7:30am on Saturday--about 14h:30m into
the test.  It still wrote the timestamps [date command] until around 7am
this morning [Monday]--or ~62 hours into test.

So, I do have ~14 hours of replication stats that I can send you or plot
up...
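
A periodic snapshot like the one described above could look roughly like
the following.  This is only a sketch, not the script that was actually
run: it assumes the replication counters are the /proc/vmstat lines whose
names contain "repl" (their exact names are not given in this thread), and
that the file page count is the nr_file_pages line.  The function names
are made up for illustration.

```python
import time

def parse_vmstat(text, needles=("repl", "nr_file_pages")):
    """Return {name: value} for vmstat lines whose name contains any needle.

    The needles are assumptions: "repl" stands in for the replication
    counters, "nr_file_pages" for the file page count.
    """
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, value = parts
        if value.isdigit() and any(n in name for n in needles):
            counters[name] = int(value)
    return counters

def snapshot(path="/proc/vmstat"):
    """Read the file once; return (unix timestamp, matching counters)."""
    with open(path) as f:
        return time.time(), parse_vmstat(f.read())

def monitor(interval_s=600, path="/proc/vmstat"):
    """Print one timestamped snapshot every interval_s seconds (10 min default)."""
    while True:
        stamp, counters = snapshot(path)
        when = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(stamp))
        print(when, counters, flush=True)
        time.sleep(interval_s)
```

Calling monitor() loops until interrupted.  Printing the timestamp and the
counters in a single line means a stalled /proc read shows up as a missing
line rather than as a timestamp with no stats.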

Re: the hang:  again, console was scrolling soft lockups continuously.
Checking the messages file, I see hangs in copy_process(),
smp_call_function [as in prev test], vma_link [from mmap], ...

I also see a number of NaT ["not a thing"] consumptions--ia64 specific
error, probably invalid pointer deref--in swapin_readahead, which my
patches hack.  These might be the cause of the fork/mmap hangs.

Didn't see that in the 8-9Aug runs, so it might be a result of continued
operation after other hangs/problems; or a botch in the rebase to
rc2-mm2.  In any case, I have some work to do there...

I put a tarball with the rebased series in the Replication directory linked
above, in case you're interested.  I haven't added the patch description for
the new patch yet, but it's pretty simple.  Maybe even correct.

----

Unrelated to the lockups  [I think]:

I forgot to look before I rebooted, but earlier the previous evening, I checked
the vmstats and at that point [~1.5 hours into the test] we had done ~4.88 million
replications and ~4.8 million "zaps" [collapse of replicated page].  That's around
98% zaps.  Do we need some filter in the fault path to reduce the "thrashing"--if
that's what I'm seeing.
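
For reference, the ratio quoted above works out as follows.  The helper
name is invented for this illustration; the two inputs are the approximate
counts given in the paragraph above.

```python
def zap_ratio(replications, zaps):
    """Fraction of replications that have since been collapsed ("zapped")."""
    if replications <= 0:
        raise ValueError("need at least one replication")
    return zaps / replications

# ~4.88 million replications, ~4.8 million zaps, ~1.5 hours into the test
ratio = zap_ratio(4_880_000, 4_800_000)
print(f"{ratio:.1%}")  # prints 98.4%
```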

Yep. The current replication patch is very much only infrastructure at
this stage (and is good for stress testing). I feel sure that heuristics
and perhaps tunables would be needed to make the most of it.

Yeah.  I have some ideas to try...

At the end of the 14.5 hours when it stopped dumping vmstats, we were at
~95% zaps.

A while back I took a look at the Virtual Iron page replication patch.  They had
set VM_DENY_WRITE when mapping shared executable segments, and only replicated pages
in those VMAs.  Maybe VM_DENY_WRITE isn't exactly what we want.  Possibly set another
flag for shared executables, if we can detect them, and any shared mapping that has
no writable mappings?

mapping_writably_mapped would be a good one to try. That may be too
broad in some corner cases where we do want occasionally-written files
or even parts of files to be replicated, but if we were ever to enable
CONFIG_REPLICATION by default, I imagine mapping_writably_mapped would
be the default heuristic.

Still, I appreciate the testing of the "thrashing" case, because with
the mapping_writably_mapped heuristic, it is likely that bugs could
remain lurking even in production workloads on huge systems (because
they will hardly ever get unreplicated).
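
As a rough user-space model of the heuristic discussed above (not kernel
code and not part of the patch): the kernel's mapping_writably_mapped()
tests, as far as I understand it, whether the address_space's
i_mmap_writable count is non-zero.  ToyMapping and should_replicate() are
hypothetical names used only for this sketch.

```python
from dataclasses import dataclass

@dataclass
class ToyMapping:
    """Stand-in for a file's address_space; models only one field."""
    i_mmap_writable: int = 0  # number of writable shared mappings of the file

def mapping_writably_mapped(mapping):
    """True if at least one writable shared mapping of the file exists."""
    return mapping.i_mmap_writable != 0

def should_replicate(mapping):
    """Hypothetical fault-path filter: replicate only read-mostly files."""
    return not mapping_writably_mapped(mapping)

libc_like = ToyMapping(i_mmap_writable=0)    # shared executable text
shared_db = ToyMapping(i_mmap_writable=3)    # file with MAP_SHARED writers
print(should_replicate(libc_like), should_replicate(shared_db))
```

A filter like this would rarely tear a replica down once it is created,
which is why the unfiltered "thrashing" runs remain useful as a stress test
of the collapse path.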

I'll try to remember to check the replication statistics after the currently
running test.  If the system stays up, that is.  A quick look < 10 minutes into
the test shows that zaps are now ~84% of replications.  Also, ~47K replicated pages
out of ~287K file pages.

Yeah I guess it can be a little misleading: as time approaches infinity,
zaps will probably approach replications. But that doesn't tell you how
long a replica stayed around and usefully fed CPUs with local memory...

May be able to capture that info with a more invasive patch -- e.g., add
a timestamp to the page struct.  I'll think about it.
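
To illustrate what such a timestamp would buy, here is a small user-space
model, not a kernel patch.  It records when a replica is created and how
long it lived when it is zapped; the class and method names are
hypothetical, and timestamps are passed in explicitly so the example is
deterministic.

```python
class ReplicaLifetimes:
    """Toy tracker: creation time per replica, lifetime recorded at zap."""

    def __init__(self):
        self._created = {}    # page id -> creation timestamp (seconds)
        self.lifetimes = []   # lifetimes of replicas that have been zapped

    def replicate(self, page_id, now):
        # Keep the first timestamp if the page is already replicated.
        self._created.setdefault(page_id, now)

    def zap(self, page_id, now):
        created = self._created.pop(page_id, None)
        if created is not None:
            self.lifetimes.append(now - created)

    def mean_lifetime(self):
        if not self.lifetimes:
            return None
        return sum(self.lifetimes) / len(self.lifetimes)

t = ReplicaLifetimes()
t.replicate("page-a", now=0.0)
t.replicate("page-b", now=10.0)
t.zap("page-a", now=4.0)     # lived 4 seconds
t.zap("page-b", now=40.0)    # lived 30 seconds
print(t.mean_lifetime())     # prints 17.0
```

Two workloads with the same zap/replication ratio can have very different
mean lifetimes, which is the distinction the raw counters cannot show.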

And, I'll keep you posted.  Not sure how much time I'll be able to
dedicate to this patch stream.  Got a few others I need to get back
to...

Later,
Lee


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
