Re: [PATCH 00/49] Automatic NUMA Balancing v10

From: Ingo Molnar <mingo@kernel.org>
Date: 2012-12-10 11:39:54
Also in: lkml

* Mel Gorman [off-list ref] wrote:

On Fri, Dec 07, 2012 at 12:01:13PM +0100, Ingo Molnar wrote:

quoted

* Mel Gorman [off-list ref] wrote:

quoted

This is a full release of all the patches so apologies for the 
flood. [...]

I have yet to process all your mails, but assuming I address all 
your review feedback and the latest unified tree in tip:master 
shows no regression in your testing, would you be willing to 
start using it for ongoing work?

Ingo,

If you had read the second paragraph of the mail you just responded to or
the results at the end then you would have seen that I had problems with
the performance. [...]

I've posted a (NUMA-placement sensitive workload centric) 
performance comparison between "balancenuma", AutoNUMA and 
numa/core unified-v3 to:

   https://lkml.org/lkml/2012/12/7/331

I tried to address all performance regressions you and others 
have reported.

Here's the direct [bandwidth] comparison of 'balancenuma v10' to 
my -v3 tree:

                            balancenuma  | NUMA-tip
 [test unit]            :          -v10  |    -v3
------------------------------------------------------------
 2x1-bw-process         :         6.136  |  9.647:  57.2%
 3x1-bw-process         :         7.250  | 14.528: 100.4%
 4x1-bw-process         :         6.867  | 18.903: 175.3%
 8x1-bw-process         :         7.974  | 26.829: 236.5%
 8x1-bw-process-NOTHP   :         5.937  | 22.237: 274.5%
 16x1-bw-process        :         5.592  | 29.294: 423.9%
 4x1-bw-thread          :        13.598  | 19.290:  41.9%
 8x1-bw-thread          :        16.356  | 26.391:  61.4%
 16x1-bw-thread         :        24.608  | 29.557:  20.1%
 32x1-bw-thread         :        25.477  | 30.232:  18.7%
 2x3-bw-thread          :         8.785  | 15.327:  74.5%
 4x4-bw-thread          :         6.366  | 27.957: 339.2%
 4x6-bw-thread          :         6.287  | 27.877: 343.4%
 4x8-bw-thread          :         5.860  | 28.439: 385.3%
 4x8-bw-thread-NOTHP    :         6.167  | 25.067: 306.5%
 3x3-bw-thread          :         8.235  | 21.560: 161.8%
 5x5-bw-thread          :         5.762  | 26.081: 352.6%
 2x16-bw-thread         :         5.920  | 23.269: 293.1%
 1x32-bw-thread         :         5.828  | 18.985: 225.8%
 numa02-bw              :        29.054  | 31.431:   8.2%
 numa02-bw-NOTHP        :        27.064  | 29.104:   7.5%
 numa01-bw-thread       :        20.338  | 28.607:  40.7%
 numa01-bw-thread-NOTHP :        18.528  | 21.119:  14.0%
------------------------------------------------------------
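
The percentage column is not defined in the mail, but the figures are consistent with the relative gain of the -v3 number over the -v10 number, i.e. (v3 / v10 - 1) * 100. A minimal sketch under that assumption (the function name is illustrative, not taken from any benchmark tool):

```c
#include <assert.h>

/*
 * Relative gain of `candidate` over `baseline`, in percent.
 * Assumed reading of the table: baseline is the balancenuma -v10
 * column, candidate is the NUMA-tip -v3 column.
 */
double relative_gain_percent(double baseline, double candidate)
{
	assert(baseline > 0.0);	/* a zero baseline has no meaningful ratio */
	return (candidate / baseline - 1.0) * 100.0;
}
```

Calling relative_gain_percent(6.136, 9.647) for the 2x1-bw-process row gives roughly 57.2, and relative_gain_percent(5.592, 29.294) for the 16x1-bw-process row gives roughly 423.9, matching the table.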

I also tried to reproduce and fix as many of the bugs you 
reported as possible - but my point is that it would be _much_ 
better if we actually joined forces.

[...] You would also know that tip/master testing for the last 
week was failing due to a boot problem (issue was in mainline 
not tip and has been already fixed) and would have known that 
since the -v18 release that numacore was effectively disabled 
on my test machine.

I'm glad it's fixed.

Clearly you are not reading the bug reports you are receiving 
and you're not seeing the small bit of review feedback or 
answering the review questions you have received either. Why 
would I be more forthcoming when I feel that it'll simply be 
ignored? [...]

I am reading the bug reports and addressing bugs as I can.

[...]  You simply assume that each batch of patches you place 
on top must be fixing all known regressions and ignoring any 
evidence to the contrary.

If you had read my mail from last Tuesday you would even know 
which patch was causing the problem that effectively disabled 
numacore although not why. The comment about p->numa_faults 
was completely off the mark (long journey, was tired, assumed 
numa_faults was a counter and not a pointer which was 
careless).  If you had called me on it then I would have 
spotted the actual problem sooner. The problem was indeed with 
the nr_cpus_allowed == num_online_cpus() check which I had 
pointed out was a suspicious check although for different 
reasons. As it turns out, a printk() bodge showed that 
nr_cpus_allowed == 80 set in sched_init_smp() while 
num_online_cpus() == 48. This effectively disabled numacore. 
If you had responded to the bug report, this would likely have 
been found last Wednesday.

Does changing it from num_online_cpus() to num_possible_cpus() 
help? (Can send a patch if you want.)
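
To make the failure mode concrete, here is a userspace sketch of the comparison being discussed. It is not the numa/core code: the function name and the enum are hypothetical, the values 80 and 48 are the ones reported above, and the possible-CPU count of 80 is an assumption (the mail does not state it).

```c
#include <assert.h>
#include <stdbool.h>

enum {
	REPORTED_NR_CPUS_ALLOWED = 80,	/* from the printk() output quoted above */
	REPORTED_ONLINE_CPUS     = 48,	/* from the printk() output quoted above */
	ASSUMED_POSSIBLE_CPUS    = 80	/* assumption, not stated in the mail */
};

/*
 * Hypothetical stand-in for the check: treat a task as unrestricted
 * only if its allowed-CPU count equals the reference CPU count.
 */
bool task_spans_all_cpus(unsigned int nr_cpus_allowed,
			 unsigned int nr_cpus_reference)
{
	assert(nr_cpus_reference > 0);	/* a system has at least one CPU */
	return nr_cpus_allowed == nr_cpus_reference;
}
```

With the reported values, task_spans_all_cpus(REPORTED_NR_CPUS_ALLOWED, REPORTED_ONLINE_CPUS) is false, so every task looks restricted and balancing never engages. Comparing against ASSUMED_POSSIBLE_CPUS instead makes the check pass, which is the effect the proposed switch to num_possible_cpus() is after, provided the assumption about the possible-CPU count holds on that machine.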

quoted

It would make it much easier for me to pick up your 
enhancements, fixes, etc.

quoted

Changelog since V9
  o Migration scalability                                             (mingo)

To *really* see migration scalability bottlenecks you need to 
remove the migration-bandwidth throttling kludge from your tree 
(or configure it up very high if you want to keep it simple).

Why is it a kludge? I already explained what the rationale 
behind the rate limiting was. It's not about scalability, it's 
about mitigating worst-case behaviour and the amount of time 
the kernel spends moving data around which a deliberately 
adverse workload can trigger.  It is unacceptable if during a 
phase change that a process would stall potentially for 
milliseconds (seconds if the node is large enough I guess) 
while the data is being migrated. Here it is again -- 
http://www.spinics.net/lists/linux-mm/msg47440.html . You 
either ignored the mail or simply could not be bothered 
explaining why you thought this was the incorrect decision or 
why the concerns about an adverse workload were unimportant.
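
For readers unfamiliar with the idea under dispute, a windowed migration budget can be sketched as below. This is a generic illustration only, not the code from the balancenuma series: the struct, its field names, and the function name are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-node state; names are illustrative, not the kernel's. */
struct node_migrate_budget {
	unsigned long window_start_ms;	/* start of the current accounting window */
	unsigned long window_len_ms;	/* length of each window */
	unsigned long pages_in_window;	/* pages already migrated in this window */
	unsigned long max_pages;	/* page budget per window */
};

/*
 * Return true and charge the budget if nr_pages may be migrated to this
 * node at time now_ms; return false if the window's budget is exhausted.
 */
bool migrate_budget_charge(struct node_migrate_budget *b,
			   unsigned long now_ms, unsigned long nr_pages)
{
	assert(b && b->window_len_ms > 0);

	/* Open a fresh window once the old one has expired. */
	if (now_ms - b->window_start_ms >= b->window_len_ms) {
		b->window_start_ms = now_ms;
		b->pages_in_window = 0;
	}

	if (b->pages_in_window + nr_pages > b->max_pages)
		return false;	/* over budget: leave the pages where they are */

	b->pages_in_window += nr_pages;
	return true;
}
```

A caller would consult migrate_budget_charge() before each migration and skip the move when it returns false. That bounds the time spent copying per window, which is the stated goal; it also caps the migration rate, which is why it can mask bottlenecks that only appear at higher rates.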

I think the stalls could have been at least in part due to the 
scalability bottlenecks that the rate-limiting code has hidden.

If you think of the NUMA migration as a natural part of the 
workload, as a sort of extended cache-miss, and if you assume 
that the scheduler is intelligent about not flip-flopping tasks 
between nodes (which the latest code certainly is), then I don't 
see why the rate of migration should be rate-limited in the VM.

Note that I tried to quantify this effect: the perf bench numa 
testcases start from a practical 'worst-case adverse' workload 
in essence: all pages concentrated on the wrong node, and the 
workload having to migrate all of them over.

We could add a new 'absolutely worst case' testcase, to make 
sure it behaves sanely?

I have a vague suspicion actually that when you are modelling 
the task->data relationship that you make an implicit 
assumption that moving data has zero or near-zero cost. In 
such a model it would always make sense to move quickly and 
immediately but in practice the cost of moving can exceed the 
performance benefit of accessing local data and lead to 
regressions. It becomes more pronounced if the nodes are not 
fully connected.
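
The trade-off described here can be written as a simple break-even test. This is a deliberately crude model with hypothetical parameter names; neither tree is known to compute anything like it, and real costs depend on topology, contention, and page size.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Crude break-even model: migrating a page is worthwhile only if the
 * total latency saved by future local accesses exceeds the one-off
 * cost of moving it. All parameters are hypothetical inputs.
 */
bool migration_pays_off(double move_cost_ns, double remote_access_ns,
			double local_access_ns, double expected_accesses)
{
	assert(expected_accesses >= 0.0);

	double saving_per_access = remote_access_ns - local_access_ns;

	return saving_per_access * expected_accesses > move_cost_ns;
}
```

With a zero move cost the test passes whenever remote access is slower than local access, which is the "always move immediately" model being questioned. A larger move cost, for example across nodes that are not directly connected, raises the number of accesses needed before migration wins.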

I make no such assumption - convergence costs were part of my 
measurements.

quoted

Some (certainly not all) of the performance regressions you 
reported were certainly due to numa/core code hitting the 
migration codepaths as aggressively as the workload demanded 
- and hitting scalability bottlenecks.

How are you so certain? [...]

Hm, I don't think my "some (certainly not all)" statement 
reflected any sort of certainty. So we violently agree about:

[...] How do you not know it's because your code is migrating 
excessively for no good reason because the algorithm has a 
flaw in it? [...]

That's another source - but again not something we should fix by 
hiding it under the carpet via migration bandwidth rate limits, 
right?

[...] Or that the cost of excessive migration is not being 
offset by local data accesses? [...]

That's another possibility.

The _real_ fix is to avoid excessive migration on the CPU and 
memory placement side, not to throttle the basic mechanism 
itself!

I don't exclude the possibility that bandwidth limits might be 
needed - but only if everything else fails. Meanwhile, the 
bandwidth limits were actively hiding scalability bottlenecks, 
which bottlenecks only trigger at higher migration rates.

[...] The critical point to note is that if it really was only 
scalability problems then autonuma would suffer the same 
problems and it would be impossible for autonuma's performance 
to exceed numacore's. This isn't the case, making it unlikely 
that scalability is your only problem.

The scheduling patterns are different - so they can hit 
different bottlenecks.

Either way, last night I applied a patch on top of latest 
tip/master to remove the nr_cpus_allowed check so that 
numacore would be enabled again and tested that. In some 
places it has indeed much improved. In others it is still 
regressing badly and in two cases, it's corrupting memory -- 
specjbb when THP is enabled crashes when running for single or 
multiple JVMs. It is likely that a zero page is being inserted 
due to a race with migration and causes the JVM to throw a 
null pointer exception. Here is the comparison on the rough 
off-chance you actually read it this time.

Can you still see the JVM crash with the unified -v3 tree?

Thanks,

	Ingo

