Re: [PATCH 6/6] mm/page_alloc: Introduce vm.percpu_pagelist_high_fraction

From: Vlastimil Babka <hidden>
Date: 2021-05-28 14:38:09
Also in: lkml

On 5/28/21 2:53 PM, Mel Gorman wrote:

On Fri, May 28, 2021 at 01:59:37PM +0200, Vlastimil Babka wrote:

On 5/25/21 10:01 AM, Mel Gorman wrote:

This introduces a new sysctl vm.percpu_pagelist_high_fraction. It is
similar to the old vm.percpu_pagelist_fraction. The old sysctl increased
both pcp->batch and pcp->high with the higher pcp->high potentially
reducing zone->lock contention. However, the higher pcp->batch value also
potentially increased allocation latency while the PCP was refilled.
This sysctl only adjusts pcp->high so that zone->lock contention is
potentially reduced but allocation latency during a PCP refill remains
the same.

  # grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  649
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=8
  # grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  35071
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=64
              high:  4383
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=0
              high:  649
              batch: 63

Signed-off-by: Mel Gorman <redacted>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

Acked-by: Vlastimil Babka <redacted>

Thanks.

Documentation nit below:

@@ -789,6 +790,25 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
 why oom happens. You can get snapshot.
 
 
+percpu_pagelist_high_fraction
+=============================
+
+This is the fraction of pages in each zone that are allocated for each
+per cpu page list.  The min value for this is 8.  It means that we do
+not allow more than 1/8th of pages in each zone to be allocated in any
+single per_cpu_pagelist.

This, while technically correct (as an upper limit), is somewhat misleading, as
the limit for a single per_cpu_pagelist also considers the number of local cpus.

 This entry only changes the value of hot per
+cpu pagelists. User can specify a number like 100 to allocate 1/100th
+of each zone to each per cpu page list.

This is worse. Anyone trying to reproduce this example on a system with multiple
cpus per node and checking the result will be puzzled.
So I think the part about the number of local cpus should be mentioned to avoid
confusion.

Is this any better?

Ack, thanks

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index e85c2f21d209..2da25735a629 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst

@@ -793,15 +793,16 @@ why oom happens. You can get snapshot.
 percpu_pagelist_high_fraction
 =============================
 
-This is the fraction of pages in each zone that are allocated for each
-per cpu page list.  The min value for this is 8.  It means that we do
-not allow more than 1/8th of pages in each zone to be allocated in any
-single per_cpu_pagelist.  This entry only changes the value of hot per
-cpu pagelists. User can specify a number like 100 to allocate 1/100th
-of each zone to each per cpu page list.
-
-The batch value of each per cpu pagelist remains the same regardless of the
-value of the high fraction so allocation latencies are unaffected.
+This is the fraction of pages in each zone that can be stored on
+per-cpu page lists. It is an upper boundary that is divided depending
+on the number of online CPUs. The min value for this is 8 which means
+that we do not allow more than 1/8th of pages in each zone to be stored
+on per-cpu page lists. This entry only changes the value of hot per-cpu
+page lists. A user can specify a number like 100 to allocate 1/100th of
+each zone between per-cpu lists.
+
+The batch value of each per-cpu page list remains the same regardless of
+the value of the high fraction so allocation latencies are unaffected.
 
 The initial value is zero. Kernel uses this value to set the high pcp->high
 mark based on the low watermark for the zone and the number of local
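
As a rough illustration of the revised wording, the sketch below models how a
zone-wide budget might be split into a per-CPU pcp->high. It is not the kernel
implementation: the function name, its parameters and the sample numbers are
hypothetical, and it assumes only what the text states (zero means "derive the
value from the low watermark", any other value must be at least 8, and the
budget is divided by the number of local CPUs).

```python
def estimate_pcp_high(managed_pages, low_wmark_pages, high_fraction, nr_local_cpus):
    """Hypothetical model of the per-CPU high mark; not the kernel code.

    managed_pages   -- pages managed by the zone (assumed input)
    low_wmark_pages -- low watermark of the zone, used when the fraction is 0
    high_fraction   -- value of vm.percpu_pagelist_high_fraction (0 or >= 8)
    nr_local_cpus   -- CPUs local to the zone that share the budget
    """
    if high_fraction != 0 and high_fraction < 8:
        raise ValueError("high_fraction must be 0 or at least 8")
    if nr_local_cpus < 1:
        raise ValueError("nr_local_cpus must be at least 1")

    if high_fraction == 0:
        # Default: the budget is based on the zone's low watermark.
        zone_budget = low_wmark_pages
    else:
        # Upper boundary for all per-cpu lists of the zone together.
        zone_budget = managed_pages // high_fraction

    # The zone-wide budget is divided between the local CPUs.
    return zone_budget // nr_local_cpus


if __name__ == "__main__":
    # Made-up zone: 1,000,000 managed pages, low watermark of 5,000 pages,
    # four local CPUs.
    for fraction in (0, 8, 64, 100):
        print(fraction, estimate_pcp_high(1_000_000, 5_000, fraction, 4))
```

With four local CPUs, a fraction of 100 in this model gives each list 1/400th
of the zone rather than 1/100th, which is the confusion the new text tries to
avoid. The figures in the changelog fit the same pattern: 35071 * 8 = 280568
and 4383 * 64 = 280512 differ by less than 64, as expected from integer
division if high scales inversely with the fraction.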
