Thread (84 messages) 84 messages, 6 authors, 2013-01-08

Re: [PATCH 00/49] Automatic NUMA Balancing v10

From: Ingo Molnar <mingo@kernel.org>
Date: 2012-12-17 10:33:59
Also in: lkml
Subsystem: memory management, memory management - memory policy and migration, the rest · Maintainers: Andrew Morton, David Hildenbrand, Linus Torvalds

* Mel Gorman [off-list ref] wrote:
quoted
quoted
[...] Holding PTL across task_numa_fault is bad, but not 
the bad we're looking for.
No, holding the PTL across task_numa_fault() is fine, 
because this bit got reworked in my tree rather 
significantly, see:

 6030a23a1c66 sched: Move the NUMA placement logic to a 
 worklet

and followup patches.
I believe I see your point. After that patch is applied 
task_numa_fault() is a relatively small function and is no 
longer calling task_numa_placement. Sure, PTL is held longer 
than necessary but not enough to cause real scalability 
issues.
Yes - my motivation for that was three-fold:

1) to push rebalancing into process context and thus make it
   essentially lockless and also potentially preemptable.

2) enable the flip-tasks logic, which relies on taking a
   balancing decision and acting on it immediately. If you are
   in process context then this is doable. If you are in a
   balancing irq context then not so much.

3) to simplify the 2M-emu loop was extra dressing on the cake:
   instead of taking and dropping the PTL 512 times (possibly
   interleaving two threads on the same pmd, both of them
   taking/dropping the same set of locks?), it only takes the
   ptl once.

I'll revive this aspect, it has many positives.
quoted
quoted
If the bug is indeed here, it's not obvious. I don't know 
why I'm triggering it or why it only triggers for specjbb 
as I cannot imagine what the JVM would be doing that is 
that weird or that would not have triggered before. Maybe 
we both suffer this type of problem but that numacores 
rate of migration is able to trigger it.
Agreed.
I spent some more time on this today and the bug is *really* 
hard to trigger or at least I have been unable to trigger it 
today. This begs the question why it triggered three times in 
relatively quick succession separated by a few hours when 
testing numacore on Dec 9th. Other tests ran between the 
failures. The first failure results were discarded. I deleted 
them to see if the same test reproduced it a second time (it 
did).

Of the three times this bug triggered in the last week, two 
were unclear where they crashed but one showed that the bug 
was triggered by the JVMs garbage collector. That at least is 
a corner case and might explain why it's hard to trigger.

I feel extremely bad about how I reported this because even 
though we differ in how we handle faults, I really cannot see 
any difference that would explain this and I've looked long 
enough. Triggering this by the kernel would *have* to be 
something like missing a cache or TLB flush after page tables 
have been modified or during migration but in most way that 
matters we share that logic. Where we differ, it shouldn't 
matter.
Don't worry, I really think you reported a genuine bug, even if 
it's hard to hit.
FWIW, numacore pulled yesterday completed the same tests 
without any error this time but none of the commits since Dec 
9th would account for fixing it.
Correct. I think chances are that it's still latent. Either 
fixed in your version of the code, which will be hard to 
reconstruct - or it's an active upstream bug.

I'd not blame it on the JVM for a good while - JVMs are one of 
the most abused pieces of code on the planet, literally running 
millions of applications on thousands of kernel variants.

Could you try the patch below on latest upstream with 
CONFIG_NUMA_BALANCING=y, it increases migration bandwidth 
10-fold - does it make it easier to trigger the bug on the now 
upstream NUMA-balancing feature?

It will kill throughput on a number of your tests, but it should 
make all the NUMA-specific activities during the JVM test a lot 
more frequent.

Thanks,

	Ingo
diff --git a/mm/migrate.c b/mm/migrate.c
index 32efd80..8699e8f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1511,7 +1511,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
  */
 static unsigned int migrate_interval_millisecs __read_mostly = 100;
 static unsigned int pteupdate_interval_millisecs __read_mostly = 1000;
-static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
+static unsigned int ratelimit_pages __read_mostly = 1280 << (20 - PAGE_SHIFT);
 
 /* Returns true if NUMA migration is currently rate limited */
 bool migrate_ratelimited(int node)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help