Re: Page allocator bottleneck
From: Aaron Lu <hidden>
Date: 2017-09-19 07:24:02
Also in: linux-mm
On Mon, Sep 18, 2017 at 06:33:20PM +0300, Tariq Toukan wrote:
> On 18/09/2017 10:44 AM, Aaron Lu wrote:
> > On Mon, Sep 18, 2017 at 03:34:47PM +0800, Aaron Lu wrote:
> > > On Sun, Sep 17, 2017 at 07:16:15PM +0300, Tariq Toukan wrote:
> > > > It's nice to have the option to dynamically play with the
> > > > parameter. But maybe we should also think of changing the
> > > > default fraction guaranteed to the PCP, so that unaware admins
> > > > of networking servers would also benefit.
> > >
> > > I collected some performance data with will-it-scale/page_fault1
> > > process mode on different machines with different pcp->batch
> > > sizes, starting from the default 31 (calculated by
> > > zone_batchsize(), 31 is the standard value for any zone that has
> > > more than 1/2MiB memory), then incremented by 31 upwards till 527.
> > > PCP's upper limit is 6*batch.
> > >
> > > An image is plotted and attached: batch_full.png (full here means
> > > the number of processes started equals the number of CPUs).
> > >
> > > To be clear: X-axis is the value of batch size (31, 62, 93, ...,
> > > 527), Y-axis is the value of per_process_ops, generated by
> > > will-it-scale,
> >
> > One correction here, Y-axis isn't per_process_ops but
> > per_process_ops * nr_processes. Still, higher is better.
> >
> > > higher is better.
> > >
> > > From the image:
> > > - For EX machines, they all see throughput increase with increased
> > >   batch size and peak at around batch_size=310, then fall;
> > > - For EP machines, Haswell-EP and Broadwell-EP also see throughput
> > >   increase with increased batch size and peak at batch_size=279,
> > >   then fall; batch_size=310 also delivers a pretty good result.
> > >   Skylake-EP is quite different in that it doesn't see any obvious
> > >   throughput increase after batch_size=93, though the trend is
> > >   still increasing, but in a very small way, and it finally peaks
> > >   at batch_size=403, then falls. Ivybridge EP behaves much like
> > >   desktop ones.
> > > - For Desktop machines, they do not see any obvious changes with
> > >   increased batch_size.
> > >
> > > So the default batch size (31) doesn't deliver a good enough
> > > result, we probably should change the default value.
>
> Thanks Aaron for sharing your experiment results.
> That's a good analysis of the effect of the batch value.
> I agree with your conclusion.
>
> From networking perspective, we should reconsider the defaults to be
> able to reach the increasing NICs line rates.
> Not only for pcp->batch, but also for pcp->high.

I guess I didn't make it clear in my last email: when pcp->batch is
changed, pcp->high is also changed. Their relationship is:
    pcp->high = pcp->batch * 6.
Manipulating percpu_pagelist_fraction can increase pcp->high, but it
cannot raise pcp->batch past its current upper limit of 96.
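
To make the difference between the two knobs concrete, here is a rough
userspace model in Python. It is a sketch, not kernel code. The function
names from_batch() and from_high() are made up for illustration. The
factor of 6 and the cap of 96 come from the text above; the rule that the
fraction path derives batch as high / 4, and the 4 KiB page size behind
the cap, are assumptions about the current mm/page_alloc.c and should be
checked against the tree you are running.

```python
# Userspace model only; none of this is kernel code.
PAGE_SHIFT = 12               # assumption: 4 KiB pages
BATCH_CAP = PAGE_SHIFT * 8    # 96, the upper limit on batch mentioned above
HIGH_PER_BATCH = 6            # pcp->high = pcp->batch * 6


def from_batch(batch):
    """Batch-driven path: high simply follows batch."""
    batch = max(1, batch)
    return {"batch": batch, "high": HIGH_PER_BATCH * batch}


def from_high(high):
    """Fraction-driven path: high is given, batch is derived and capped.

    Assumption: batch = high / 4, clamped to [1, BATCH_CAP].
    """
    batch = max(1, min(high // 4, BATCH_CAP))
    return {"batch": batch, "high": high}


if __name__ == "__main__":
    # The sweep used in the experiment: 31, 62, ..., 527.
    for batch in range(31, 528, 31):
        print(from_batch(batch))
    # Same high reached through the fraction path: batch stays capped.
    print(from_high(1860))
```

Under these assumptions, the same pcp->high can be paired with two quite
different pcp->batch values depending on which path set it.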
My test shows that even with pcp->high unchanged, changing pcp->batch
can further improve will-it-scale's performance. E.g., in the two cases
below, pcp->high is set to 1860 in both, but pcp->batch differs:
                 will-it-scale        native_queued_spin_lock_slowpath(perf)
pcp->batch=96    15762348             79.95%
pcp->batch=310   19291492 +22.3%      74.87% -5.1%
Granted, this is the case for will-it-scale and may not apply to your
case. I have a small patch that adds a batch interface for debugging
purposes: echoing a value into it sets batch, and high becomes
batch * 6. You are welcome to give it a try if you think it's
worthwhile (attached).
Regards,
Aaron

Attachments:
- 0001-percpu_pagelist_batch-add-a-batch-interface.patch [text/plain] 3764 bytes