Re: [PATCH] powerpc/mm: Fix RECLAIM_DISTANCE

From: Anton Blanchard <hidden>
Date: 2017-01-31 04:58:29

Hi,

Anton, I think the behaviour looks good. Actually, it's not very
relevant to the issue addressed by the patch. I will explain the reason
in my reply to Michael. There are two nodes in your system and the
memory is expected to be allocated from node-0. If node-0 doesn't have
enough free memory, the allocator switches to node-1. It means we need
more stress.

Did you try setting zone_reclaim_mode? Surely we should reclaim local
clean pagecache if enabled?

Anton
--

zone_reclaim_mode:

Zone_reclaim_mode allows someone to set more or less aggressive approaches to
reclaim memory when a zone runs out of memory. If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.

The value is the bitwise OR of the following:

1       = Zone reclaim on
2       = Zone reclaim writes dirty pages out
4       = Zone reclaim swaps pages
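
As a minimal sketch of how the bit values above combine, the following
Python helper decodes a zone_reclaim_mode value into the flags it
enables. The function name and the label strings are illustrative
assumptions, not kernel identifiers; only the bit values 1, 2 and 4 come
from the list above.

```python
# Bit values as listed above; the label strings are illustrative only.
ZONE_RECLAIM_BITS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def decode_zone_reclaim_mode(value):
    """Return the labels for every bit set in a zone_reclaim_mode value."""
    return [label for bit, label in ZONE_RECLAIM_BITS.items() if value & bit]
```

A value of 0 yields an empty list (no zone reclaim), while 3 yields the
first two labels, since 3 = 1 | 2.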

zone_reclaim_mode is disabled by default.  For file servers or workloads
that benefit from having their data cached, zone_reclaim_mode should be
left disabled as the caching effect is likely to be more important than
data locality.

zone_reclaim may be enabled if it's known that the workload is partitioned
such that each partition fits within a NUMA node and that accessing remote
memory would cause a measurable performance reduction.  The page allocator
will then reclaim easily reusable pages (those page cache pages that are
currently not used) before allocating off node pages.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore, but it preserves the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.

Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.

In the experiment, 38GB is allocated: 16GB for pagecache and 24GB for
heap. That doesn't exceed the memory capacity (64GB), so page reclaim
wasn't triggered in either the fast or the slow path. That's why the
pagecache wasn't dropped. I think __GFP_THISNODE isn't specified when
the page-fault handler tries to allocate pages to accommodate the VMA
for the heap.

*Without* the patch applied, I got the results below on a system with
two NUMA nodes, each with 64GB of memory. Also, I don't think the patch
is going to change the behaviour:

# cat /proc/sys/vm/zone_reclaim_mode 
0

Drop pagecache
Read 8GB file, for pagecache to consume 8GB memory.
Node 0 FilePages:       8496960 kB
taskset -c 0 ./alloc 137438953472       <- 128GB sized heap
Node 0 FilePages:        503424 kB
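
The two FilePages readings above can be compared mechanically. What
follows is a rough sketch, assuming the lines follow the
"Node N Field: value kB" layout shown above (as found in, for example,
/sys/devices/system/node/node0/meminfo; the mail does not say where the
figures were read from). The helper name is hypothetical.

```python
import re

# Matches lines such as "Node 0 FilePages:       8496960 kB".
# Fields whose names contain parentheses are deliberately not handled here.
_NODE_LINE = re.compile(r"^Node\s+(\d+)\s+(\w+):\s+(\d+)\s+kB$")


def parse_node_meminfo_line(line):
    """Return (node, field, kilobytes) for one per-node meminfo line."""
    match = _NODE_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    node, field, kb = match.groups()
    return int(node), field, int(kb)


before = parse_node_meminfo_line("Node 0 FilePages:       8496960 kB")
after = parse_node_meminfo_line("Node 0 FilePages:        503424 kB")

# Pagecache that left node 0 while the heap was being allocated.
reclaimed_kb = before[2] - after[2]
print(f"node {before[0]}: {reclaimed_kb} kB of FilePages reclaimed")
```

The difference between the two readings is the amount of pagecache
that was dropped from node 0 during the allocation.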

Eventually, some swap clusters were used as well:

# free -m
              total        used        free      shared  buff/cache   available
Mem:         130583      129203         861          10         518         297
Swap:         10987        3145        7842

Thanks,
Gavin
