Re: [PATCH] Avoiding fragmentation through different allocator

From: Marcelo Tosatti <hidden>
Date: 2005-01-23 01:34:19
Also in: lkml

On Sat, Jan 22, 2005 at 09:48:20PM +0000, Mel Gorman wrote:

On Fri, 21 Jan 2005, Marcelo Tosatti wrote:

quoted

On Thu, Jan 20, 2005 at 10:13:00AM +0000, Mel Gorman wrote:

quoted

<Changelog snipped>

Hi Mel,

I was thinking that it would be nice to have a set of high-order
intensive workloads, and I wonder what are the most common high-order
allocation paths which fail.

Agreed. As I am not fully sure what workloads require high-order
allocations, I updated VMRegress to keep track of the count of
allocations and released 0.11
(http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To
use it to track allocations, do the following

1. Download and unpack vmregress
2. Patch a kernel with kernel_patches/v2.6/trace_pagealloc-count.diff .
The patch currently requires the modified allocator but I can fix that up
if people want it. Build and deploy the kernel
3. Build vmregress by
  ./configure --with-linux=/usr/src/linux-2.6.11-rc1-mbuddy
  (or whatever path is appropriate)
  make
4. Load the modules with;
  insmod src/code/vmregress_core.ko
  insmod src/sense/trace_alloccount.ko

This will create a proc entry /proc/vmregress/trace_alloccount that looks
something like;

Allocations (V1)
-----------
KernNoRclm   997453      370       50        0        0        0        0        0        0        0        0
KernRclm      35279        0        0        0        0        0        0        0        0        0        0
UserRclm    9870808        0        0        0        0        0        0        0        0        0        0
Total      10903540      370       50        0        0        0        0        0        0        0        0

Frees
-----
KernNoRclm   590965      244       28        0        0        0        0        0        0        0        0
KernRclm     227100       60        5        0        0        0        0        0        0        0        0
UserRclm    7974200       73       17        0        0        0        0        0        0        0        0
Total      19695805      747      100        0        0        0        0        0        0        0        0

To blank the counters, use

echo 0 > /proc/vmregress/trace_alloccount

Whatever workload we come up with, this proc entry will tell us if it is
exercising high-order allocations right now.
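
For anyone who wants to script the check, here is a minimal sketch of a parser for the text format shown above. It is not part of VMRegress. It assumes the numeric columns are per-order counts starting at order 0 (the message does not say so explicitly), and the function names parse_alloccount and high_order_total are made up for illustration. The sample text is copied from the output above; on a live system you would read /proc/vmregress/trace_alloccount instead.

```python
# Sketch only: parse the trace_alloccount text format quoted in this thread.
# Assumption: each numeric column is one allocation order, starting at order 0.

SAMPLE = """\
Allocations (V1)
-----------
KernNoRclm 997453 370 50 0 0 0 0 0 0 0 0
KernRclm 35279 0 0 0 0 0 0 0 0 0 0
UserRclm 9870808 0 0 0 0 0 0 0 0 0 0
Total 10903540 370 50 0 0 0 0 0 0 0 0

Frees
-----
KernNoRclm 590965 244 28 0 0 0 0 0 0 0 0
KernRclm 227100 60 5 0 0 0 0 0 0 0 0
UserRclm 7974200 73 17 0 0 0 0 0 0 0 0
Total 19695805 747 100 0 0 0 0 0 0 0 0
"""


def parse_alloccount(text):
    """Return {"Allocations": {row: [counts]}, "Frees": {row: [counts]}}."""
    sections = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the dashed underline rows.
        if not fields or set(line.strip()) == {"-"}:
            continue
        if fields[0] in ("Allocations", "Frees"):
            current = sections.setdefault(fields[0], {})
        elif current is not None:
            current[fields[0]] = [int(x) for x in fields[1:]]
    return sections


def high_order_total(rows, min_order=1):
    """Sum the counts at or above min_order, ignoring the Total row."""
    return sum(
        sum(counts[min_order:])
        for name, counts in rows.items()
        if name != "Total"
    )


if __name__ == "__main__":
    parsed = parse_alloccount(SAMPLE)
    print("high-order allocations:", high_order_total(parsed["Allocations"]))
    print("high-order frees:", high_order_total(parsed["Frees"]))
```

A workload exercises high-order allocations if the first number rises between two reads of the proc entry.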

Great, excellent! Thanks.

I plan to spend some time testing and trying to understand the vmregress package 
this week.

quoted

Does it mostly depend on hardware, because most high-order allocations happen
inside device drivers? What are the kernel codepaths which try to do
high-order allocations and fall back on failure?

I'm not sure. I think that the paths we exercise right now will be largely
artificial. For example, you can force order-2 allocations by scping a
large file through localhost (because of the large MTU in that interface).
I have not come up with another meaningful workload that guarantees
high-order allocations yet.
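
As a reminder of what "order-2" means in bytes, here is a small sketch of the size-to-order arithmetic. It assumes 4 KiB pages, which is architecture dependent, and the helper name order_for is hypothetical. It mirrors the idea of the kernel's size-to-order rounding but is not kernel code.

```python
# Sketch only: smallest buddy-allocator order whose block holds `size` bytes.
# Assumption: 4 KiB pages (architecture dependent).

PAGE_SIZE = 4096


def order_for(size, page_size=PAGE_SIZE):
    """Return the smallest order such that (page_size << order) >= size."""
    pages = -(-size // page_size)  # ceiling division
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


if __name__ == "__main__":
    for size in (4096, 8192, 16384, 16385):
        print(size, "bytes -> order", order_for(size))
```

Under that assumption an order-2 block is four contiguous pages, i.e. 16 KiB, and a buffer even one byte larger needs an order-3 block.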

Thoughts and criticism of the following ideas are very much appreciated:

In private conversation with wli (who helped me by providing this information), we
conjectured the following:

Modern IO devices are capable of doing scatter/gather IO.

There is overhead associated with setting up and managing the scatter/gather tables. 

The benefit of large physically contiguous blocks is the ability to avoid the SG 
management overhead. 

Now the question is: does the added overhead of allocating high-order blocks through migration
offset the overhead of SG IO? Quantifying that would be interesting.

This depends on the driver implementation (how efficiently it is able to manage the SG IO tables) and
device/IO subsystem characteristics.

Also filesystems benefit from big physically contiguous blocks. Quoting wli
"they want bigger blocks and contiguous memory to match bigger blocks..."

I completely agree that your simplified allocator decreases fragmentation, which in turn
benefits the system overall.

This is an area which can be further improved - ie efficiency in reducing fragmentation 
is excellent. 
I sincerely appreciate the work you are doing!

quoted

To measure whether the cost of page migration offsets the ability to be
able to deliver high-order allocations we want a set of meaningful
performance tests?

Bear in mind, there are more considerations. The allocator potentially
makes hotplug problems easier and could be easily tied into any
page-zeroing system. Some of your own benchmarks also implied that the
modified allocator helped some types of workloads, which is beneficial in
itself. The last consideration is HugeTLB pages, on which I am hoping William
will weigh in.

Right now, I believe that the pool of huge pages is of a fixed size
because of fragmentation difficulties. If we knew we could allocate huge
pages, this pool would not have to be fixed. Some applications will
heavily benefit from this. While databases are the obvious one,
applications with large heaps will also benefit like Java Virtual
Machines. I can dig up papers that measured this on Solaris although I
don't have them at hand right now.
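
For reference, the size of the fixed pool can be read from the HugePages_Total, HugePages_Free and Hugepagesize fields of /proc/meminfo. The sketch below parses those fields from a sample string. The numbers in the sample are made up for illustration and are not measurements from any system, and the helper name hugepage_pool is hypothetical.

```python
# Sketch only: extract the huge page pool fields from /proc/meminfo-style text.
# The sample values below are illustrative, not measured.

SAMPLE_MEMINFO = """\
MemTotal:      1034512 kB
HugePages_Total:    16
HugePages_Free:     12
Hugepagesize:     4096 kB
"""

FIELDS = ("HugePages_Total", "HugePages_Free", "Hugepagesize")


def hugepage_pool(text):
    """Return the huge page fields as integers (Hugepagesize is in kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in FIELDS:
            info[key] = int(rest.split()[0])
    return info


if __name__ == "__main__":
    pool = hugepage_pool(SAMPLE_MEMINFO)
    reserved_kb = pool["HugePages_Total"] * pool["Hugepagesize"]
    print("pool:", pool, "reserved kB:", reserved_kb)
```

On a real machine, pass the contents of /proc/meminfo instead of the sample string.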

Please.

We know right now that the overhead of this allocator is fairly low
(anyone got benchmarks to disagree?) but I understand that page migration
is relatively expensive. The allocator also does not have adverse
CPU+cache effects like migration and the concept is fairly simple.

Agreed.

quoted

It's quite possible that not all unsatisfiable high-order allocations
want to force page migration (which is quite expensive in terms of
CPU/cache). Only migrate on __GFP_NOFAIL?

I still believe with the allocator, we will only have to migrate in
exceptional circumstances.

Agreed - best scenario is the guaranteed availability of high-order blocks, where 
migration is not necessary.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
