Thread: 84 messages, 6 authors, 2013-01-08

Re: [PATCH 00/49] Automatic NUMA Balancing v10

From: Ingo Molnar <mingo@kernel.org>
Date: 2012-12-10 11:39:54
Also in: lkml

* Mel Gorman [off-list ref] wrote:
On Fri, Dec 07, 2012 at 12:01:13PM +0100, Ingo Molnar wrote:
* Mel Gorman [off-list ref] wrote:
This is a full release of all the patches so apologies for the 
flood. [...]
I have yet to process all your mails, but assuming I address all 
your review feedback and the latest unified tree in tip:master 
shows no regression in your testing, would you be willing to 
start using it for ongoing work?
Ingo,

If you had read the second paragraph of the mail you just responded to or
the results at the end then you would have seen that I had problems with
the performance. [...]
I've posted performance comparisons (centred on NUMA-placement 
sensitive workloads) between "balancenuma", AutoNUMA and 
numa/core unified-v3 to:

   https://lkml.org/lkml/2012/12/7/331

I tried to address all performance regressions you and others 
have reported.

Here's the direct [bandwidth] comparison of 'balancenuma v10' to 
my -v3 tree:

                            balancenuma  | NUMA-tip
 [test unit]            :          -v10  |    -v3
------------------------------------------------------------
 2x1-bw-process         :         6.136  |  9.647:  57.2%
 3x1-bw-process         :         7.250  | 14.528: 100.4%
 4x1-bw-process         :         6.867  | 18.903: 175.3%
 8x1-bw-process         :         7.974  | 26.829: 236.5%
 8x1-bw-process-NOTHP   :         5.937  | 22.237: 274.5%
 16x1-bw-process        :         5.592  | 29.294: 423.9%
 4x1-bw-thread          :        13.598  | 19.290:  41.9%
 8x1-bw-thread          :        16.356  | 26.391:  61.4%
 16x1-bw-thread         :        24.608  | 29.557:  20.1%
 32x1-bw-thread         :        25.477  | 30.232:  18.7%
 2x3-bw-thread          :         8.785  | 15.327:  74.5%
 4x4-bw-thread          :         6.366  | 27.957: 339.2%
 4x6-bw-thread          :         6.287  | 27.877: 343.4%
 4x8-bw-thread          :         5.860  | 28.439: 385.3%
 4x8-bw-thread-NOTHP    :         6.167  | 25.067: 306.5%
 3x3-bw-thread          :         8.235  | 21.560: 161.8%
 5x5-bw-thread          :         5.762  | 26.081: 352.6%
 2x16-bw-thread         :         5.920  | 23.269: 293.1%
 1x32-bw-thread         :         5.828  | 18.985: 225.8%
 numa02-bw              :        29.054  | 31.431:   8.2%
 numa02-bw-NOTHP        :        27.064  | 29.104:   7.5%
 numa01-bw-thread       :        20.338  | 28.607:  40.7%
 numa01-bw-thread-NOTHP :        18.528  | 21.119:  14.0%
------------------------------------------------------------
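
The last column of the table is consistent with the relative gain of the -v3 figure over the -v10 figure. A minimal sketch of that arithmetic follows; the function name is made up for illustration, and the three sample rows are copied from the table above:

```python
def improvement_pct(baseline, candidate):
    """Relative gain of candidate over baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


# (balancenuma -v10, NUMA-tip -v3) pairs copied from the table above.
rows = {
    "2x1-bw-process": (6.136, 9.647),
    "16x1-bw-thread": (24.608, 29.557),
    "numa02-bw": (29.054, 31.431),
}

for name, (v10, v3) in rows.items():
    print(f"{name:<16}: {improvement_pct(v10, v3):6.1f}%")
```

Rounded to one decimal place, these reproduce the 57.2%, 20.1% and 8.2% entries.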

I also tried to reproduce and fix as many bugs you reported as 
possible - but my point is that it would be _much_ better if we 
actually joined forces.
[...] You would also know that tip/master testing for the last 
week was failing due to a boot problem (issue was in mainline 
not tip and has been already fixed) and would have known that 
since the -v18 release that numacore was effectively disabled 
on my test machine.
I'm glad it's fixed.
Clearly you are not reading the bug reports you are receiving 
and you're not seeing the small bit of review feedback or 
answering the review questions you have received either. Why 
would I be more forthcoming when I feel that it'll simply be 
ignored? [...]
I am reading the bug reports and addressing bugs as I can.
[...]  You simply assume that each batch of patches you place 
on top must be fixing all known regressions and ignoring any 
evidence to the contrary.

If you had read my mail from last Tuesday you would even know 
which patch was causing the problem that effectively disabled 
numacore although not why. The comment about p->numa_faults 
was completely off the mark (long journey, was tired, assumed 
numa_faults was a counter and not a pointer which was 
careless).  If you had called me on it then I would have 
spotted the actual problem sooner. The problem was indeed with 
the nr_cpus_allowed == num_online_cpus() check which I had 
pointed out was a suspicious check although for different 
reasons. As it turns out, a printk() bodge showed that 
nr_cpus_allowed == 80 was set in sched_init_smp() while 
num_online_cpus() == 48. This effectively disabled numacore. 
If you had responded to the bug report, this would likely have 
been found last Wednesday.
Does changing it from num_online_cpus() to num_possible_cpus() 
help? (Can send a patch if you want.)
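
To make the failure mode concrete, here is a toy model of the equality check described above. It is not the kernel code: the function name is hypothetical, and the possible-CPU count of 80 is an assumption (the thread only reports nr_cpus_allowed == 80 and num_online_cpus() == 48).

```python
def passes_cpu_count_check(nr_cpus_allowed, reference_cpu_count):
    # Toy stand-in for the check discussed above: the task is treated as
    # unrestricted only if its allowed-CPU count equals the reference count.
    return nr_cpus_allowed == reference_cpu_count


NR_CPUS_ALLOWED = 80  # reported in the thread (printk() output)
NUM_ONLINE = 48       # reported in the thread
NUM_POSSIBLE = 80     # ASSUMPTION: not stated anywhere in the thread

# Compared against the online count the check never passes on this machine.
print(passes_cpu_count_check(NR_CPUS_ALLOWED, NUM_ONLINE))

# Compared against the (assumed) possible count it would pass.
print(passes_cpu_count_check(NR_CPUS_ALLOWED, NUM_POSSIBLE))
```

Whether the second comparison would actually hold on the test machine depends on its real possible-CPU count, which is exactly what the question above is asking.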
It would make it much easier for me to pick up your 
enhancements, fixes, etc.
Changelog since V9
  o Migration scalability                                             (mingo)
To *really* see migration scalability bottlenecks you need to 
remove the migration-bandwidth throttling kludge from your tree 
(or configure it very high if you want to keep it simple).
Why is it a kludge? I already explained what the rationale 
behind the rate limiting was. It's not about scalability, it's 
about mitigating worst-case behaviour and the amount of time 
the kernel spends moving data around which a deliberately 
adverse workload can trigger.  It is unacceptable for a 
process to stall during a phase change, potentially for 
milliseconds (seconds if the node is large enough I guess), 
while the data is being migrated. Here it is again -- 
http://www.spinics.net/lists/linux-mm/msg47440.html . You 
either ignored the mail or simply could not be bothered 
explaining why you thought this was the incorrect decision or 
why the concerns about an adverse workload were unimportant.
I think the stalls could have been at least in part due to the 
scalability bottlenecks that the rate-limiting code has hidden.

If you think of the NUMA migration as a natural part of the 
workload, as a sort of extended cache-miss, and if you assume 
that the scheduler is intelligent about not flip-flopping tasks 
between nodes (which the latest code certainly is), then I don't 
see why the rate of migration should be rate-limited in the VM.
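
For readers following the rate-limiting argument, a windowed migration budget can be sketched roughly as below. This is an illustration of the general technique only, not the code in either tree; the class name, parameters and the numbers in the usage lines are all hypothetical.

```python
class MigrationRateLimiter:
    """Allow at most max_pages page migrations per window_ms window."""

    def __init__(self, max_pages, window_ms):
        self.max_pages = max_pages
        self.window_ms = window_ms
        self.window_start_ms = 0
        self.pages_used = 0

    def try_migrate(self, now_ms, nr_pages):
        # Start a fresh window (and a fresh budget) once the old one expires.
        if now_ms - self.window_start_ms >= self.window_ms:
            self.window_start_ms = now_ms
            self.pages_used = 0
        # Refuse the request if it would exceed the current window's budget.
        if self.pages_used + nr_pages > self.max_pages:
            return False
        self.pages_used += nr_pages
        return True


# Hypothetical budget: 512 pages per 100 ms.
limiter = MigrationRateLimiter(max_pages=512, window_ms=100)
print(limiter.try_migrate(now_ms=0, nr_pages=512))   # budget consumed
print(limiter.try_migrate(now_ms=50, nr_pages=1))    # throttled
print(limiter.try_migrate(now_ms=100, nr_pages=1))   # new window
```

A limiter of this shape caps the time spent migrating in any one window, which is the worst-case argument; it also caps the migration rate a benchmark can reach, which is the concern about hidden bottlenecks.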

Note that I tried to quantify this effect: the perf bench numa 
testcases start from a practical 'worst-case adverse' workload 
in essence: all pages concentrated on the wrong node, and the 
workload having to migrate all of them over.

We could add a new 'absolutely worst case' testcase, to make 
sure it behaves sanely?
I have a vague suspicion actually that when you are modelling 
the task->data relationship that you make an implicit 
assumption that moving data has zero or near-zero cost. In 
such a model it would always make sense to move quickly and 
immediately but in practice the cost of moving can exceed the 
performance benefit of accessing local data and lead to 
regressions. It becomes more pronounced if the nodes are not 
fully connected.
I make no such assumption - convergence costs were part of my 
measurements.
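
The trade-off being argued here can be written as a simple break-even calculation. The sketch below is a deliberately crude model with hypothetical latencies and a hypothetical migration cost; neither side of the thread supplied such figures.

```python
import math


def net_benefit_ns(migrate_cost_ns, remote_ns, local_ns, future_accesses):
    """Time saved by migrating, minus the one-off cost of the migration."""
    return (remote_ns - local_ns) * future_accesses - migrate_cost_ns


def break_even_accesses(migrate_cost_ns, remote_ns, local_ns):
    """Smallest number of later accesses that pays back the migration."""
    saving = remote_ns - local_ns
    if saving <= 0:
        return None  # local access is no faster, so migration never pays
    return math.ceil(migrate_cost_ns / saving)


# HYPOTHETICAL numbers, for illustration only.
COST_NS, REMOTE_NS, LOCAL_NS = 10_000, 150, 100

print(break_even_accesses(COST_NS, REMOTE_NS, LOCAL_NS))
print(net_benefit_ns(COST_NS, REMOTE_NS, LOCAL_NS, future_accesses=100))
print(net_benefit_ns(COST_NS, REMOTE_NS, LOCAL_NS, future_accesses=400))
```

In this model a page touched only a few times after it moves is a net loss, and a larger remote/local gap (for example on nodes that are not fully connected) changes both the saving and, in practice, the migration cost.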
Some (certainly not all) of the performance regressions you 
reported were certainly due to numa/core code hitting the 
migration codepaths as aggressively as the workload demanded 
- and hitting scalability bottlenecks.
How are you so certain? [...]
Hm, I don't think my "some (certainly not all)" statement 
reflected any sort of certainty. So we violently agree about:
[...] How do you not know it's because your code is migrating 
excessively for no good reason because the algorithm has a 
flaw in it? [...]
That's another source - but again not something we should fix by 
hiding it under the carpet via migration bandwidth rate limits, 
right?
[...] Or that the cost of excessive migration is not being 
offset by local data accesses? [...]
That's another possibility.

The _real_ fix is to avoid excessive migration on the CPU and 
memory placement side, not to throttle the basic mechanism 
itself!

I don't exclude the possibility that bandwidth limits might be 
needed - but only if everything else fails. Meanwhile, the 
bandwidth limits were actively hiding scalability bottlenecks 
that only trigger at higher migration rates.
[...] The critical point to note is that if it really was only 
scalability problems then autonuma would suffer the same 
problems and it would be impossible for autonuma's performance 
to exceed numacore's. This isn't the case, making it unlikely 
that scalability is your only problem.
The scheduling patterns are different - so they can hit 
different bottlenecks.
Either way, last night I applied a patch on top of latest 
tip/master to remove the nr_cpus_allowed check so that 
numacore would be enabled again and tested that. In some 
places it has indeed much improved. In others it is still 
regressing badly and in two cases, it's corrupting memory -- 
specjbb when THP is enabled crashes when running for single or 
multiple JVMs. It is likely that a zero page is being inserted 
due to a race with migration and causes the JVM to throw a 
null pointer exception. Here is the comparison on the rough 
off-chance you actually read it this time.
Can you still see the JVM crash with the unified -v3 tree?

Thanks,

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>