Re: [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node... | linux-mm

Re: [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list

From: Anshuman Khandual <hidden>
Date: 2021-08-31 09:57:27
Also in: lkml


On 8/30/21 5:46 PM, Bharata B Rao wrote:

As an example, consider a 4 node system with the following distance
matrix.

Node 0  1  2  3
----------------
0    10 12 32 32
1    12 10 32 32
2    32 32 10 12
3    32 32 12 10

For this case, the node fallback list gets built like this:

Node	Fallback list
---------------------
0	0 1 2 3
1	1 0 3 2
2	2 3 0 1
3	3 2 0 1 <-- Unexpected fallback order

In the fallback list for nodes 2 and 3, the nodes 0 and 1
appear in the same order which results in more allocations
getting satisfied from node 0 compared to node 1.
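
The ordering above can be reproduced with a small model. This is a rough sketch, not the kernel code: the scoring (distance, a penalty of 1 for lower-numbered nodes, and the per-node load as a tie-breaker) is loosely modelled on how the allocator picks the next best node, and the names `build_fallback_lists`, `accumulate` and `scale` are made up for illustration. Setting `accumulate=True` models the "accumulated load" idea from the subject line.

```python
# Distance matrix from the example above.
DISTANCE = [
    [10, 12, 32, 32],
    [12, 10, 32, 32],
    [32, 32, 10, 12],
    [32, 32, 12, 10],
]


def build_fallback_lists(distance, accumulate):
    """Build one fallback list per node (simplified, hypothetical model).

    accumulate=False overwrites a node's load each time it is set;
    accumulate=True adds to the load recorded by earlier lists.
    """
    n = len(distance)
    # Assumption: scale is larger than any load, so load only breaks
    # ties between equally distant candidates.
    scale = n * n
    node_load = [0] * n
    lists = []
    for local in range(n):
        used = set()
        order = []
        load = n
        prev = local
        while len(used) < n:
            if local not in used:
                best = local  # the local node always comes first
            else:
                # Lowest score wins; min() keeps the lowest-numbered
                # candidate when scores are equal.
                best = min(
                    (c for c in range(n) if c not in used),
                    key=lambda c: (distance[local][c] + int(c < local)) * scale
                    + node_load[c],
                )
            used.add(best)
            # Load is recorded only for the first node of each new
            # distance group.
            if distance[local][best] != distance[local][prev]:
                if accumulate:
                    node_load[best] += load
                else:
                    node_load[best] = load
            order.append(best)
            prev = best
            load -= 1
        lists.append(order)
    return lists


for label, flag in (("overwrite", False), ("accumulate", True)):
    print(label)
    for node, order in enumerate(build_fallback_lists(DISTANCE, flag)):
        print(f"  {node}\t{' '.join(map(str, order))}")
```

In this model, overwriting the load gives the table above, including `3 2 0 1` for node 3, while accumulating it makes node 3 fall back to node 1 before node 0.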

The effect of this on remote memory bandwidth as seen by stream
benchmark is shown below:

Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
	(numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>)
Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
	(numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)

----------------------------------------
		BANDWIDTH (MB/s)
    TEST	Case 1		Case 2
----------------------------------------
    COPY	57479.6		110791.8
   SCALE	55372.9		105685.9
     ADD	50460.6		96734.2
  TRIADD	50397.6		97119.1
----------------------------------------

The bandwidth drop in Case 1 occurs because most of the allocations
get satisfied by node 0 as it appears first in the fallback order
for both nodes 2 and 3.
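
To put the table above in proportion, the following sketch computes Case 1 bandwidth as a fraction of Case 2 for each test. The figures are copied from the table; `BANDWIDTH_MBPS` and `case1_fraction` are illustrative names only.

```python
# Bandwidth in MB/s, copied from the table above: (Case 1, Case 2).
BANDWIDTH_MBPS = {
    "COPY": (57479.6, 110791.8),
    "SCALE": (55372.9, 105685.9),
    "ADD": (50460.6, 96734.2),
    "TRIADD": (50397.6, 97119.1),
}


def case1_fraction(table):
    """Return Case 1 bandwidth divided by Case 2 bandwidth per test."""
    return {test: case1 / case2 for test, (case1, case2) in table.items()}


for test, fraction in case1_fraction(BANDWIDTH_MBPS).items():
    print(f"{test:>6}: {fraction:.1%} of Case 2")
```

For every test, Case 1 comes out at roughly half of the Case 2 bandwidth.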

I am wondering what causes the performance drop here. Would the memory
access latency not be similar between {2, 3} ---> { 0 } and {2, 3} ---> { 1 },
given that both nodes {0, 1} are at the same distance from {2, 3}, i.e. 32 in
the distance matrix above? Even if the preferred node changes from { 0 } to
{ 1 } for the accessing node { 3 }, that should not change the latency as such.

Is the performance drop here instead caused by page allocation latency,
resulting from excessive allocation on node { 0 }?
