Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

From: Joonsoo Kim <hidden>
Date: 2014-02-07 08:03:14
Also in: linux-mm

Possibly related (same subject, not in this thread)

2014-02-06 · Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory · Christoph Lameter <hidden>
2014-02-06 · Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory · Joonsoo Kim <hidden>
2014-02-06 · Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory · Nishanth Aravamudan <hidden>
2014-02-06 · Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory · Nishanth Aravamudan <hidden>
2014-02-05 · Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory · Christoph Lameter <hidden>

On Thu, Feb 06, 2014 at 11:28:12AM -0800, Nishanth Aravamudan wrote:

On 06.02.2014 [10:59:55 -0800], Nishanth Aravamudan wrote:

quoted

On 06.02.2014 [17:04:18 +0900], Joonsoo Kim wrote:

quoted

On Wed, Feb 05, 2014 at 06:07:57PM -0800, Nishanth Aravamudan wrote:

quoted

On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:

quoted

On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:

quoted

Thank you for clarifying and providing  a test patch. I ran with this on
the system showing the original problem, configured to have 15GB of
memory.

With your patch after boot:

MemTotal:       15604736 kB
MemFree:         8768192 kB
Slab:            3882560 kB
SReclaimable:     105408 kB
SUnreclaim:      3777152 kB

With Anton's patch after boot:

MemTotal:       15604736 kB
MemFree:        11195008 kB
Slab:            1427968 kB
SReclaimable:     109184 kB
SUnreclaim:      1318784 kB


I know that's fairly unscientific, but the numbers are reproducible.

I don't think the goal of the discussion is to reduce the amount of slab 
allocated, but rather get the most local slab memory possible by use of 
kmalloc_node().  When a memoryless node is being passed to kmalloc_node(), 
which is probably cpu_to_node() for a cpu bound to a node without memory, 
my patch is allocating it on the most local node; Anton's patch is 
allocating it on whatever happened to be the cpu slab.

quoted

diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c

@@ -2278,10 +2278,14 @@ redo:

 	if (unlikely(!node_match(page, node))) {
 		stat(s, ALLOC_NODE_MISMATCH);
-		deactivate_slab(s, page, c->freelist);
-		c->page = NULL;
-		c->freelist = NULL;
-		goto new_slab;
+		if (unlikely(!node_present_pages(node)))
+			node = numa_mem_id();
+		if (!node_match(page, node)) {
+			deactivate_slab(s, page, c->freelist);
+			c->page = NULL;
+			c->freelist = NULL;
+			goto new_slab;
+		}

Semantically, and please correct me if I'm wrong, this patch is saying
if we have a memoryless node, we expect the page's locality to be that
of numa_mem_id(), and we still deactivate the slab if that isn't true.
Just wanting to make sure I understand the intent.

Yeah, the default policy should be to fallback to local memory if the node 
passed is memoryless.

quoted

What I find odd is that there are only 2 nodes on this system, node 0
(empty) and node 1. So won't numa_mem_id() always be 1? And every page
should be coming from node 1 (thus node_match() should always be true?)

The nice thing about slub is its debugging ability, what is 
/sys/kernel/slab/cache/objects showing in comparison between the two 
patches?

Ok, I finally got around to writing a script that compares the objects
output from both kernels.

log1 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
and Joonsoo's patch.

log2 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
and Anton's patch.

slab                           objects    objects   percent
                               log1       log2      change
-----------------------------------------------------------
:t-0000104                     71190      85680      20.353982 %
UDP                            4352       3392       22.058824 %
inode_cache                    54302      41923      22.796582 %
fscache_cookie_jar             3276       2457       25.000000 %
:t-0000896                     438        292        33.333333 %
:t-0000080                     310401     195323     37.073978 %
ext4_inode_cache               335        201        40.000000 %
:t-0000192                     89408      128898     44.168307 %
:t-0000184                     151300     81880      45.882353 %
:t-0000512                     49698      73648      48.191074 %
:at-0000192                    242867     120948     50.199904 %
xfs_inode                      34350      15221      55.688501 %
:t-0016384                     11005      17257      56.810541 %
proc_inode_cache               103868     34717      66.575846 %
tw_sock_TCP                    768        256        66.666667 %
:t-0004096                     15240      25672      68.451444 %
nfs_inode_cache                1008       315        68.750000 %
:t-0001024                     14528      24720      70.154185 %
:t-0032768                     655        1312       100.305344%
:t-0002048                     14242      30720      115.700042%
:t-0000640                     1020       2550       150.000000%
:t-0008192                     10005      27905      178.910545%

FWIW, the configuration of this LPAR has slightly changed. It is now configured
for maximally 400 CPUs, of which 200 are present. The result is that even with
Joonsoo's patch (log1 above), we OOM pretty easily and Anton's slab usage
script reports:

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-512                        1182 MB    2.03%  100.00%
kmalloc-192                        1182 MB    1.38%  100.00%
kmalloc-16384                       966 MB   17.66%  100.00%
kmalloc-4096                        353 MB   15.92%  100.00%
kmalloc-8192                        259 MB   27.28%  100.00%
kmalloc-32768                       207 MB    9.86%  100.00%

In comparison (log2 above):

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       273 MB   98.76%  100.00%
kmalloc-8192                        225 MB   98.67%  100.00%
pgtable-2^11                        114 MB  100.00%  100.00%
pgtable-2^12                        109 MB  100.00%  100.00%
kmalloc-4096                        104 MB   98.59%  100.00%

I appreciate all the help so far, if anyone has any ideas how best to
proceed further, or what they'd like debugged more, I'm happy to get
this fixed. We're hitting this on a couple of different systems and I'd
like to find a good resolution to the problem.

Hello,

I have no memoryless system, so, to debug it, I need your help. :)
First, please let me know node information on your system.

[    0.000000] Node 0 Memory:
[    0.000000] Node 1 Memory: 0x0-0x200000000

[    0.000000] On node 0 totalpages: 0
[    0.000000] On node 1 totalpages: 131072
[    0.000000]   DMA zone: 112 pages used for memmap
[    0.000000]   DMA zone: 0 pages reserved
[    0.000000]   DMA zone: 131072 pages, LIFO batch:1

[    0.638391] Node 0 CPUs: 0-199
[    0.638394] Node 1 CPUs:

Do you need anything else?

quoted

I'm preparing 3 another patches which are nearly same with previous patch,
but slightly different approach. Could you test them on your system?
I will send them soon.

Test results are in the attached tarball [1].

quoted

And I think that same problem exists if CONFIG_SLAB is enabled. Could you
confirm that?

I will test and let you know.

Ok, with your patches applied and CONFIG_SLAB enabled:

MemTotal:        8264640 kB
MemFree:         7119680 kB
Slab:             207232 kB
SReclaimable:      32896 kB
SUnreclaim:       174336 kB

For reference, same kernel with CONFIG_SLUB:

MemTotal:        8264640 kB
MemFree:         4264000 kB
Slab:            3065408 kB
SReclaimable:     104704 kB
SUnreclaim:      2960704 kB


Hello,

First of all, thanks for testing!

My patch only affects CONFIG_SLUB. Request to test on CONFIG_SLAB is just
for reference. It seems that my patches doesn't have any effect to your case.
Could you check that numa_mem_id() and get_numa_mem() returns correctly?
I think that numa_mem_id() for all cpus and get_numa_mem() for all nodes
should return 1 on your system.

I will investigate further on my side.

Thanks!

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help