Re: [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation
From: Andi Kleen <hidden>
Date: 2004-09-21 09:15:25
Also in:
lkml
On Mon, Sep 20, 2004 at 08:30:12PM -0500, Ray Bryant wrote:
> Andi Kleen wrote:
>> On Mon, Sep 20, 2004 at 05:13:34PM -0500, Ray Bryant wrote:
>>> system wide memory allocation policy issue. It seems cleaner to me to keep that all within the scope of the NUMA API rather than hiding details of it here and there in /proc. And we need the full generality of the NUMA API, to, for example:
>> True for cpuset you will need it.
>>>> Well, you just have to change the callers to pass it in. I think computing the interleaving on an offset and perhaps another file identifier is better than having the global counter.
>>> In our case that means changing each and every call to page_cache_alloc() to include an appropriate offset. This is a change that ricochets through the machine independent code and makes this harder to contain in the NUMA subsystem.
>> I count two callers of page_cache_alloc in 2.6.9rc2 (filemap.c and XFS pagebuf). Hardly seems like a big issue to change them both. Of course getting the offset there might be tricky, but should be doable.
> Fair enough. Another option I was thinking of was hiding a global counter in page_cache_alloc itself and using it to provide a value for the offset there.
Please don't. Just use an offset and a hash on (dev_t, inode number)
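
A minimal sketch of the offset-plus-hash idea, for illustration only. The names file_hash() and interleave_node(), the mixing constants, and the use of plain integers in place of dev_t and the inode number are assumptions made here; none of this is code from the patch. The point it shows is that the node depends only on (dev, inode, page offset), so no shared counter or lock is needed and the placement repeats from run to run.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mix the file identity (dev, ino) into one value. */
static uint64_t file_hash(uint64_t dev, uint64_t ino)
{
    uint64_t h = dev * 0x9E3779B97F4A7C15ULL;   /* multiplicative mixing constant */
    h ^= ino + 0x9E3779B97F4A7C15ULL + (h << 6) + (h >> 2);
    return h;
}

/*
 * Pick the interleave node for page `pgoff` of file (dev, ino).
 * Consecutive pages of one file land on consecutive nodes; the hash
 * only shifts the starting node from file to file.
 */
static unsigned int interleave_node(uint64_t dev, uint64_t ino,
                                    uint64_t pgoff, unsigned int nr_nodes)
{
    assert(nr_nodes > 0);
    /* Reduce both terms first so the sum cannot wrap around. */
    return (unsigned int)((file_hash(dev, ino) % nr_nodes +
                           pgoff % nr_nodes) % nr_nodes);
}
```

Because the result is a pure function of its arguments, two runs of a benchmark over the same files get the same placement, which a global round-robin counter cannot guarantee.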
>> Any allocation algorithm will have such a worst case, so I'm not too worried. Given a hash function that is not too bad, it should be bearable. The nice advantage of the static offset is that it makes benchmarks actually repeatable and is completely lockless
> I can see the advantages of that. But the state of the page cache is still something we have to deal with for benchmarks.
Umounting the file systems with the data files usually works pretty well. Or longer term if it's a real issue one could write a workload manager that can actually change policies for existing pages. But I'm not sure how such a beast would really work.
>>> be MPOL_INTERLEAVE or potentially MPOL_ROUNDROBIN depending on the workload that the system is running.
>> I think I'm still a bit confused by your terminology. I thought the page cache policy was per process? Now you are talking about another global unrelated policy?
> I'm sorry if this is confusing, personal terminology usually gets in the way. The idea is that just like for the page allocation policy (your current code), if you wanted, you would have a global, default page cache
Having both a per process page cache and a global page cache policy would seem like overkill to me. And having both doesn't make much sense anyways, because when the system admin wants to change the global policy to free memory on nodes he would still need to worry about conflicting per process policies anyways. So as soon as you have process policy you cannot easily change global anymore.
> allocation policy, probably set at boot time or shortly thereafter, probably before any (or at least most) page cache pages have been allocated. You could also have a per process policy setting that would override the global policy, for processes that needed it, but I honestly don't have a good case for this except for symmetry with the existing code.
cpusets was the good case for it that you mentioned. Or did I misunderstand you?
>> Anyways, I guess you could just add a high flag bit to the mode argument of set_mempolicy. Something like
>>   set_mempolicy(MPOL_PAGECACHE | MPOL_INTERLEAVE, nodemask, len)
>> That would work for setting the page cache policy of the current process.
> That's an idea. Not the way I was planning on doing it, but that would work. I was thinking along the lines of:
>   set_mempolicy(MPOL_INTERLEAVE, nodemask, len, POLICY_PAGECACHE);
> but either way can be made to work.
That would be set_mempolicy2() essentially because the existing users don't pass this additional argument. I think passing the flags in the first argument is more compatible.

-andi
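
A sketch of how a flag carried in the mode argument could be separated from the base policy, for illustration only. MPOL_PAGECACHE is the name proposed in this thread and is not an existing constant; the bit chosen for it and the decode_mode() helper are assumptions made here. The MPOL_* mode values match the NUMA API headers.

```c
/* Base policy modes, as defined by the NUMA API. */
#define MPOL_DEFAULT     0
#define MPOL_PREFERRED   1
#define MPOL_BIND        2
#define MPOL_INTERLEAVE  3

/* Hypothetical flag from this thread; the bit position is an assumption. */
#define MPOL_PAGECACHE   (1 << 30)

struct decoded_mode {
    int mode;       /* base policy with the flag bit stripped */
    int pagecache;  /* 1 if the call targets the page cache policy */
};

/* Split the single mode argument into base policy and flag. */
static struct decoded_mode decode_mode(int mode)
{
    struct decoded_mode d;

    d.pagecache = (mode & MPOL_PAGECACHE) != 0;
    d.mode = mode & ~MPOL_PAGECACHE;
    return d;
}
```

Existing callers never set the high bit, so they decode to pagecache == 0 and keep their current behaviour with the same three-argument call.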