Re: [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation

From: Andi Kleen <hidden>
Date: 2004-09-21 09:15:25
Also in: lkml

On Mon, Sep 20, 2004 at 08:30:12PM -0500, Ray Bryant wrote:

Andi Kleen wrote:

quoted

On Mon, Sep 20, 2004 at 05:13:34PM -0500, Ray Bryant wrote:

quoted

system wide memory allocation policy issue.  It seems cleaner to me to 
keep that all within the scope of the NUMA API rather than hiding details 
of it here and there in /proc.  And we need the full generality of the 
NUMA API, to, for example:


True for cpuset you will need it.

quoted

Well, you just have to change the callers to pass it in. I think
computing the interleaving on an offset and perhaps another file
identifier is better than having the global counter.

In our case that means changing each and every call to page_cache_alloc()
to include an appropriate offset.  This is a change that ricochets
through the machine independent code and makes this harder to contain in
the NUMA subsystem.


I count two callers of page_cache_alloc in 2.6.9-rc2 (filemap.c and
XFS pagebuf). Hardly seems like a big issue to change them both. 
Of course getting the offset there might be tricky, but should be 
doable.

Fair enough.  Another option I was thinking of was hiding a global counter
in page_cache_alloc itself and using it to provide a value for the offset
there.

Please don't. Just use an offset and a hash on (dev_t, inode number).
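
As a rough illustration of the offset-plus-hash idea, a minimal userspace
sketch might look like the following. The names file_hash() and
interleave_node(), the argument types, and the multiplicative hash constant
are assumptions made for the example; none of them come from the patch or
from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: mix (device, inode number) into a per-file value.
 * The 64-bit golden-ratio multiplier is just one reasonable choice.
 */
static uint32_t file_hash(uint32_t dev, uint64_t ino)
{
    uint64_t h = ((uint64_t)dev << 32) ^ ino;

    h *= 0x9E3779B97F4A7C15ULL;
    return (uint32_t)(h >> 32);
}

/*
 * Hypothetical helper: pick the node for page `index` of the file
 * identified by (dev, ino), interleaving over nr_nodes nodes.
 * The result depends only on the arguments, so there is no shared
 * counter to update and no lock to take.
 */
static unsigned interleave_node(uint32_t dev, uint64_t ino,
                                uint64_t index, unsigned nr_nodes)
{
    assert(nr_nodes > 0);
    return (unsigned)((file_hash(dev, ino) + index) % nr_nodes);
}
```

Calling interleave_node(dev, ino, index, nr_nodes) for consecutive page
offsets of one file walks the nodes in order, and the hash gives different
files different starting nodes, so the same file always lands on the same
nodes from run to run.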

quoted

Any allocation algorithm will have such a worst case, so I'm not
too worried. Given a hash function that is not too bad, it should
be bearable.

The nice advantage of the static offset is that it makes benchmarks
actually repeatable and is completely lockless.

I can see the advantages of that.  But the state of the page cache is still
something we have to deal with for benchmarks.

Unmounting the file systems with the data files usually works pretty
well.

Or longer term if it's a real issue one could write a workload manager
that can actually change policies for existing pages. But I'm not 
sure how such a beast would really work.

quoted

be MPOL_INTERLEAVE or potentially MPOL_ROUNDROBIN depending on the 
workload that the system is running.


I think I'm still a bit confused by your terminology.
I thought the page cache policy was per process? Now you
are talking about another global unrelated policy?


I'm sorry if this is confusing, personal terminology usually gets in the 
way.

The idea is that just like for the page allocation policy (your current 
code), if you wanted, you would have a global, default page cache

Having both a per process page cache and a global page cache policy
would seem like overkill to me.

And having both doesn't make much sense anyways, because when the 
system admin wants to change the global policy to free memory
on nodes he would still need to worry about conflicting per process policies 
anyways. So as soon as you have process policy you cannot easily
change global anymore.

allocation policy, probably set at boot time or shortly thereafter, 
probably before any (or at least most) page cache pages have been 
allocated.  You could also have a per process policy setting that would 
override the global policy, for processes that needed it, but I honestly 
don't have a good case for this except for symmetry with the existing code.

cpusets was the good case for it that you mentioned. 
Or did I misunderstand you?

quoted

Anyways, I guess you could just add a high flag bit to the 
mode argument of set_mempolicy. Something like

set_mempolicy(MPOL_PAGECACHE | MPOL_INTERLEAVE, nodemask, len)

That would work for setting the page cache policy of the current
process.

That's an idea.  Not the way I was planning on doing it, but that would
work.  I was thinking along the lines of:

set_mempolicy(MPOL_INTERLEAVE, nodemask, len, POLICY_PAGECACHE);

but either way can be made to work.

That would be set_mempolicy2() essentially because the existing
users don't pass this additional argument. I think passing the flags
in the first argument is more compatible.
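
To make the compatibility point concrete, here is a minimal sketch of how a
flag carried in the mode argument could be separated from the mode itself.
MPOL_PAGECACHE is only the name proposed in this thread; its value (0x8000),
the decoded_mode struct, and decode_mode() are assumptions made for the
example. The MPOL_* mode numbers are defined locally to keep the sketch
self-contained.

```c
#include <assert.h>

/* Mode values, defined locally so the sketch stands alone. */
enum { MPOL_DEFAULT = 0, MPOL_PREFERRED = 1, MPOL_BIND = 2, MPOL_INTERLEAVE = 3 };

/* Hypothetical high flag bit; the value is an assumption for this sketch. */
#define MPOL_PAGECACHE 0x8000

/* Hypothetical result type: the plain mode plus the page cache flag. */
struct decoded_mode {
    int mode;       /* MPOL_* value with the flag bit stripped */
    int pagecache;  /* nonzero if the policy targets the page cache */
};

/* Split a combined mode argument into its mode and its flag. */
static struct decoded_mode decode_mode(int arg)
{
    struct decoded_mode d;

    /* The flag bit must not overlap any existing mode value. */
    assert((MPOL_INTERLEAVE & MPOL_PAGECACHE) == 0);

    d.pagecache = (arg & MPOL_PAGECACHE) != 0;
    d.mode = arg & ~MPOL_PAGECACHE;
    return d;
}
```

An existing caller that passes plain MPOL_INTERLEAVE decodes to the same
mode with the flag clear, so it behaves as before; only callers that OR in
the new bit select the page cache policy.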

-andi
