Re: Page allocator bottleneck

From: Aaron Lu <hidden>
Date: 2017-09-19 07:24:02
Also in: linux-mm

On Mon, Sep 18, 2017 at 06:33:20PM +0300, Tariq Toukan wrote:


On 18/09/2017 10:44 AM, Aaron Lu wrote:

quoted

On Mon, Sep 18, 2017 at 03:34:47PM +0800, Aaron Lu wrote:

quoted

On Sun, Sep 17, 2017 at 07:16:15PM +0300, Tariq Toukan wrote:

quoted

It's nice to have the option to dynamically play with the parameter.
But maybe we should also think of changing the default fraction guaranteed
to the PCP, so that unaware admins of networking servers would also benefit.

I collected some performance data with will-it-scale/page_fault1 in process
mode on different machines with different pcp->batch sizes, starting
from the default of 31 (calculated by zone_batchsize(); 31 is the standard
value for any zone that has more than 512MiB of memory), then incrementing
by 31 up to 527. PCP's upper limit (pcp->high) is 6*batch.
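
For readers who want to see where the default of 31 comes from, here is a
rough user-space sketch of the arithmetic. It is modelled on my reading of
zone_batchsize() in kernels of that era and is not the kernel code itself;
the 4 KiB page size and the 4 GiB example zone are assumptions chosen for
illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def rounddown_pow_of_two(n: int) -> int:
    """Largest power of two that is <= n (n must be >= 1)."""
    return 1 << (n.bit_length() - 1)


def zone_batchsize(managed_pages: int, page_size: int = PAGE_SIZE) -> int:
    """Approximate the default pcp->batch for a zone of managed_pages pages."""
    # Aim for roughly 1/1024 of the zone, capped at 512 KiB worth of pages.
    batch = managed_pages // 1024
    if batch * page_size > 512 * 1024:
        batch = (512 * 1024) // page_size
    batch //= 4
    if batch < 1:
        batch = 1
    # Round to a (power of two - 1) value.
    return rounddown_pow_of_two(batch + batch // 2) - 1


def pcp_high(batch: int) -> int:
    """Default upper limit of the per-CPU list: 6 * batch."""
    return 6 * batch


# Example zone: 4 GiB of 4 KiB pages (hypothetical size).
default_batch = zone_batchsize(4 * 1024 * 1024 * 1024 // PAGE_SIZE)

# The batch sizes covered by the experiment: 31, 62, ..., 527.
sweep = list(range(31, 528, 31))

if __name__ == "__main__":
    print(default_batch, pcp_high(default_batch))
    print(sweep)
```

With these assumptions the sketch reproduces the default of 31 and the
17 batch sizes used in the sweep.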

An image is plotted and attached: batch_full.png (full here means the
number of processes started equals the number of CPUs).

To be clear: X-axis is the value of batch size(31, 62, 93, ..., 527),
Y-axis is the value of per_process_ops, generated by will-it-scale,

One correction here, Y-axis isn't per_process_ops but per_process_ops *
nr_processes. Still, higher is better.

quoted

higher is better.

quoted

From the image:
- EX machines all see throughput increase with batch size, peaking at
   around batch_size=310 and then falling;
- Among EP machines, Haswell-EP and Broadwell-EP also see throughput
   increase with batch size, peaking at batch_size=279 and then
   falling; batch_size=310 also delivers a pretty good result. Skylake-EP
   is quite different: it sees no obvious throughput increase after
   batch_size=93. The trend is still upward, but only slightly; it
   finally peaks at batch_size=403 and then falls.
   Ivybridge-EP behaves much like the desktop machines.
- Desktop machines do not see any obvious change with increased
   batch_size.

So the default batch size (31) doesn't deliver a good enough result; we
probably should change the default value.

Thanks Aaron for sharing your experiment results.
That's a good analysis of the effect of the batch value.
I agree with your conclusion.

From a networking perspective, we should reconsider the defaults so that
we can keep up with increasing NIC line rates.
Not only for pcp->batch, but also for pcp->high.

I guess I didn't make it clear in my last email: when pcp->batch is
changed, pcp->high is also changed. Their relationship is:
pcp->high = pcp->batch * 6.

Manipulating percpu_pagelist_fraction could increase pcp->high, but not
pcp->batch (it currently has an upper limit of 96).
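
To make the difference between the two paths concrete, here is a small
sketch of how high and batch end up being set in each case. It follows my
understanding of the pageset setup logic and is only an illustration; the
4 KiB page size (PAGE_SHIFT = 12) and the zone size in the example are
assumptions.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, so the batch cap is 12 * 8 = 96


def pageset_from_batch(batch: int) -> dict:
    """Batch-driven path: high is always 6 * batch."""
    return {"high": 6 * batch, "batch": max(1, batch)}


def pageset_from_fraction(managed_pages: int, fraction: int) -> dict:
    """percpu_pagelist_fraction path: high scales with the zone size,
    but batch is capped at PAGE_SHIFT * 8."""
    high = managed_pages // fraction
    batch = max(1, high // 4)
    if high // 4 > PAGE_SHIFT * 8:
        batch = PAGE_SHIFT * 8
    return {"high": high, "batch": batch}


if __name__ == "__main__":
    # Same high (1860) reached two ways, with very different batch values.
    print(pageset_from_batch(310))
    # Hypothetical zone size chosen so that high comes out as 1860.
    print(pageset_from_fraction(1860 * 8, 8))
```

Both calls give high=1860, but only the batch-driven one lets batch go
above 96.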

My test shows that even with pcp->high being the same, changing pcp->batch
can further improve will-it-scale's performance. E.g. in the two cases
below, pcp->high is set to 1860 in both, but with different pcp->batch:

                 will-it-scale    native_queued_spin_lock_slowpath(perf)
pcp->batch=96    15762348         79.95%
pcp->batch=310   19291492 +22.3%  74.87% -5.1%
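
The deltas in the table can be re-derived from the raw numbers. A minimal
sketch, assuming the throughput delta is a relative change and the
spinlock delta is a difference in percentage points (the helper names are
made up for this example):

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


def point_change(before_pct: float, after_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return after_pct - before_pct


# Numbers taken from the table: batch=96 versus batch=310, high=1860 in both.
throughput_delta = relative_change_pct(15762348, 19291492)
lock_delta = point_change(79.95, 74.87)

if __name__ == "__main__":
    print(f"will-it-scale: {throughput_delta:+.2f}%")
    print(f"native_queued_spin_lock_slowpath: {lock_delta:+.2f} points")
```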

Granted, this is the case for will-it-scale and may not apply to your
case. I have a small patch that adds a batch interface for debugging
purposes: echoing a value sets batch, and high becomes batch * 6. You
are welcome to give it a try if you think it's worthwhile (attached).

Regards,
Aaron

Attachments

0001-percpu_pagelist_batch-add-a-batch-interface.patch [text/plain] 3764 bytes
