Re: numa/core regressions fixed - more testers wanted
From: Andrew Theurer <hidden>
Date: 2012-11-21 01:54:24
Also in:
lkml
On Tue, 2012-11-20 at 18:56 +0100, Ingo Molnar wrote:
> * Ingo Molnar [off-list ref] wrote:
>
> > ( The 4x JVM regression is still an open bug I think - I'll
> >   re-check and fix that one next, no need to re-report it,
> >   I'm on it. )
>
> So I tested this on !THP too and the combined numbers are now:
>
>                                           |
>   [ SPECjbb multi-4x8 ]                   |
>   [ tx/sec            ]            v3.7   |  numa/core-v16
>   [ higher is better  ]           -----   |  -------------
>                                           |
>               +THP:                639k   |       655k            +2.5%
>               -THP:                510k   |       517k            +1.3%
>
> So it's not a regression anymore, regardless of whether THP is
> enabled or disabled.
>
> The current updated table of performance results is:
>
> -------------------------------------------------------------------------
>   [ seconds         ]    v3.7  AutoNUMA   |  numa/core-v16    [ vs. v3.7]
>   [ lower is better ]   -----  --------   |  -------------    -----------
>                                           |
>   numa01                340.3    192.3    |      139.4          +144.1%
>   numa01_THREAD_ALLOC   425.1    135.1    |      121.1          +251.0%
>   numa02                 56.1     25.3    |       17.5          +220.5%
>                                           |
>   [ SPECjbb transactions/sec ]            |
>   [ higher is better         ]            |
>                                           |
>   SPECjbb 1x32 +THP      524k     507k    |       638k           +21.7%
>   SPECjbb 1x32 !THP      395k             |       512k           +29.6%
>                                           |
> -----------------------------------------------------------------------
>                                           |
>   [ SPECjbb multi-4x8 ]                   |
>   [ tx/sec            ]            v3.7   |  numa/core-v16
>   [ higher is better  ]           -----   |  -------------
>                                           |
>               +THP:                639k   |       655k            +2.5%
>               -THP:                510k   |       517k            +1.3%
>
> So I think I've addressed all regressions reported so far - if
> anyone can still see something odd, please let me know so I
> can reproduce and fix it ASAP.

I can confirm single JVM JBB is working well for me. I see a 30%
improvement over autoNUMA. What I can't make sense of is some perf
stats (taken at 80 warehouses on 4 x WST-EX, 512GB memory):
tip's numa/core:
5,429,632,865 node-loads
3,806,419,082 node-load-misses(70.1%)
2,486,756,884 node-stores
2,042,557,277 node-store-misses(82.1%)
2,878,655,372 node-prefetches
2,201,441,900 node-prefetch-misses
autoNUMA:
4,538,975,144 node-loads
2,666,374,830 node-load-misses(58.7%)
2,148,950,354 node-stores
1,682,942,931 node-store-misses(78.3%)
2,191,139,475 node-prefetches
1,633,752,109 node-prefetch-misses
The percentage of misses is higher for numa/core. I would have expected
the performance increase to be due to fewer "node-misses", but perhaps I
am misinterpreting the perf data.
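
The percentages quoted next to the miss counters above are misses divided by total accesses. As a minimal sketch for re-deriving them from the raw counts (the helper name `miss_rate` and the `counts` layout are illustrative assumptions, not anything perf provides):

```python
def miss_rate(accesses, misses):
    """Return misses as a percentage of total accesses."""
    if accesses <= 0:
        raise ValueError("accesses must be positive")
    return 100.0 * misses / accesses


# (accesses, misses) pairs copied from the single-JVM JBB perf output above.
counts = {
    "numa/core": {
        "loads": (5_429_632_865, 3_806_419_082),
        "stores": (2_486_756_884, 2_042_557_277),
        "prefetches": (2_878_655_372, 2_201_441_900),
    },
    "autoNUMA": {
        "loads": (4_538_975_144, 2_666_374_830),
        "stores": (2_148_950_354, 1_682_942_931),
        "prefetches": (2_191_139_475, 1_633_752_109),
    },
}

# Compute the miss rate for every kernel/counter combination.
rates = {
    kernel: {kind: miss_rate(*pair) for kind, pair in kinds.items()}
    for kernel, kinds in counts.items()
}

for kernel, kinds in rates.items():
    for kind, pct in kinds.items():
        print(f"{kernel:10s} node-{kind:10s} miss rate: {pct:5.1f}%")
```

The same calculation also gives a miss rate for the prefetch counters, which the output above lists without a percentage.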
One other thing I noticed is that neither test is even using all of the
CPU (75-80%), so I suspect there's a JVM scalability issue with this
workload at this number of cpu threads (80). This is an IBM JVM, so
there may be some differences. I am curious if any of the others
testing JBB are getting 100% cpu utilization at their warehouse peak.
So, while the performance results are encouraging, I would like to
correlate them with some kind of perf data that confirms why we think
it's better.

> Next I'll work on making multi-JVM more of an improvement, and
> I'll also address any incoming regression reports.

I have issues with multiple KVM VMs running either JBB or
dbench-in-tmpfs, and I suspect whatever I am seeing is similar to
what multi-JVM on bare metal shows. What I typically see is no real
convergence on a single node for resource usage for any of the VMs. For
example, when running 8 VMs, 10 vCPUs each, a VM may have the following
resource usage:
host cpu usage from cpuacct cgroup:
/cgroup/cpuacct/libvirt/qemu/at-vm01
node00              node01              node02               node03
199056918180|005%   752455339099|020%   1811704146176|049%   888803723722|024%

And VM memory placement in host (in pages):
node00        node01        node02        node03
107566|023%   115245|025%   117807|025%   119414|025%

Conversely, autoNUMA usually has 98+% for cpu and memory in one of the
host nodes for each of these VMs. AutoNUMA is about 30% better in these
tests.
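
To make "converged" concrete, the per-node figures above can be reduced to a share per node and compared against a cutoff. The following is only a sketch: the function names are hypothetical, and the 98% threshold is an assumption borrowed from the autoNUMA observation above, not a definition used by either patch set.

```python
from typing import List, Sequence


def node_shares(per_node: Sequence[int]) -> List[float]:
    """Return each node's share of the total, in percent."""
    total = sum(per_node)
    if total <= 0:
        raise ValueError("total usage must be positive")
    return [100.0 * value / total for value in per_node]


def is_converged(per_node: Sequence[int], threshold: float = 98.0) -> bool:
    """True if a single node holds at least `threshold` percent of the total.

    The 98% default is an assumed cutoff, chosen only for illustration.
    """
    return max(node_shares(per_node)) >= threshold


# Per-node values for at-vm01 copied from the tables above.
cpu_usage = [199056918180, 752455339099, 1811704146176, 888803723722]
mem_pages = [107566, 115245, 117807, 119414]

for label, values in (("cpu", cpu_usage), ("memory", mem_pages)):
    shares = ", ".join(f"{s:.1f}%" for s in node_shares(values))
    print(f"{label:6s} shares: {shares}  converged: {is_converged(values)}")
```

With the at-vm01 numbers, neither cpu nor memory comes close to the cutoff, which matches the "no real convergence" description.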
That is data for the entire run time, and "not converged" could possibly
mean "converged but moved around", but I doubt that's what's happening.
Here's perf data for the dbench VMs:
numa/core:
468,634,508 node-loads
210,598,643 node-load-misses(44.9%)
172,735,053 node-stores
107,535,553 node-store-misses(51.1%)
208,064,103 node-prefetches
160,858,933 node-prefetch-misses
autoNUMA:
666,498,425 node-loads
222,643,141 node-load-misses(33.4%)
219,003,566 node-stores
99,243,370 node-store-misses(45.3%)
315,439,315 node-prefetches
254,888,403 node-prefetch-misses
These seem to make a little more sense to me, but the percentages for
autoNUMA still seem a little high (but at least lower than numa/core).
I need to take a manually pinned measurement to compare.

> Those of you who would like to test all the latest patches are
> welcome to pick up latest bits at tip:master:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

I've been running on numa/core, but I'll switch to master and try these
again.

Thanks,

-Andrew Theurer