Re: [PATCH 00/49] Automatic NUMA Balancing v10

From: Ingo Molnar <mingo@kernel.org>
Date: 2012-12-17 10:33:59
Also in: lkml
Subsystem: memory management, memory management - memory policy and migration, the rest · Maintainers: Andrew Morton, David Hildenbrand, Linus Torvalds

* Mel Gorman [off-list ref] wrote:

quoted

[...] Holding PTL across task_numa_fault is bad, but not 
the bad we're looking for.

No, holding the PTL across task_numa_fault() is fine, 
because this bit got reworked in my tree rather 
significantly, see:

 6030a23a1c66 sched: Move the NUMA placement logic to a 
 worklet

and followup patches.

I believe I see your point. After that patch is applied 
task_numa_fault() is a relatively small function and is no 
longer calling task_numa_placement. Sure, PTL is held longer 
than necessary but not enough to cause real scalability 
issues.

Yes - my motivation for that was three-fold:

1) to push rebalancing into process context and thus make it
   essentially lockless and also potentially preemptable.

2) enable the flip-tasks logic, which relies on taking a
   balancing decision and acting on it immediately. If you are
   in process context then this is doable. If you are in a
   balancing irq context then not so much.

3) simplifying the 2M-emu loop was extra icing on the cake:
   instead of taking and dropping the PTL 512 times (possibly
   interleaving two threads on the same pmd, both of them
   taking/dropping the same set of locks?), it only takes the
   PTL once.
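
For illustration only, the locking difference in point 3 can be sketched
in userspace C. This is not the kernel code: struct fake_pmd, the pthread
mutex standing in for the PTL, and both helper functions are hypothetical
names invented for this sketch. The 512 is the number of 4K ptes covered
by one 2M range.

```c
#include <pthread.h>

#define PTES_PER_PMD 512 /* 2M range / 4K pages */

/* Hypothetical stand-in for one pmd's worth of ptes and its lock. */
struct fake_pmd {
	pthread_mutex_t ptl;                   /* stands in for the page table lock */
	unsigned long lock_round_trips;        /* how often ptl was taken and dropped */
	unsigned char numa_hint[PTES_PER_PMD]; /* 1 = entry carries a NUMA hint */
};

void fake_pmd_init(struct fake_pmd *pmd)
{
	pthread_mutex_init(&pmd->ptl, NULL);
	pmd->lock_round_trips = 0;
	for (int i = 0; i < PTES_PER_PMD; i++)
		pmd->numa_hint[i] = 1;
}

/* Per-pte variant: one lock/unlock pair for every entry in the range. */
unsigned int clear_hints_per_pte(struct fake_pmd *pmd)
{
	unsigned int cleared = 0;

	for (int i = 0; i < PTES_PER_PMD; i++) {
		pthread_mutex_lock(&pmd->ptl);
		pmd->lock_round_trips++;
		if (pmd->numa_hint[i]) {
			pmd->numa_hint[i] = 0;
			cleared++;
		}
		pthread_mutex_unlock(&pmd->ptl);
	}
	return cleared;
}

/* Batched variant: take the lock once and walk the whole range under it. */
unsigned int clear_hints_batched(struct fake_pmd *pmd)
{
	unsigned int cleared = 0;

	pthread_mutex_lock(&pmd->ptl);
	pmd->lock_round_trips++;
	for (int i = 0; i < PTES_PER_PMD; i++) {
		if (pmd->numa_hint[i]) {
			pmd->numa_hint[i] = 0;
			cleared++;
		}
	}
	pthread_mutex_unlock(&pmd->ptl);
	return cleared;
}
```

Both variants clear the same 512 entries. The first does 512 lock round
trips and lets another thread slip in between any two entries; the second
does one round trip but holds the lock for the whole walk, which is the
"PTL is held longer than necessary" trade-off mentioned above.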

I'll revive this aspect, it has many positives.

quoted

If the bug is indeed here, it's not obvious. I don't know 
why I'm triggering it or why it only triggers for specjbb, 
as I cannot imagine what the JVM would be doing that is 
that weird or that would not have triggered before. Maybe 
we both suffer from this type of problem but only numacore's 
rate of migration is able to trigger it.

Agreed.

I spent some more time on this today and the bug is *really* 
hard to trigger, or at least I have been unable to trigger it 
today. This raises the question of why it triggered three times 
in relatively quick succession, separated by a few hours, when 
testing numacore on Dec 9th. Other tests ran between the 
failures. The first failure results were discarded. I deleted 
them to see if the same test reproduced it a second time (it 
did).

Of the three times this bug triggered in the last week, two 
were unclear about where they crashed but one showed that the 
bug was triggered by the JVM's garbage collector. That at least 
is a corner case and might explain why it's hard to trigger.

I feel extremely bad about how I reported this because even 
though we differ in how we handle faults, I really cannot see 
any difference that would explain this and I've looked long 
enough. Triggering this by the kernel would *have* to be 
something like missing a cache or TLB flush after page tables 
have been modified or during migration, but in most ways that 
matter we share that logic. Where we differ, it shouldn't 
matter.

Don't worry, I really think you reported a genuine bug, even if 
it's hard to hit.

FWIW, numacore pulled yesterday completed the same tests 
without any error this time but none of the commits since Dec 
9th would account for fixing it.

Correct. I think chances are that it's still latent. Either 
fixed in your version of the code, which will be hard to 
reconstruct - or it's an active upstream bug.

I'd not blame it on the JVM for a good while - JVMs are one of 
the most abused pieces of code on the planet, literally running 
millions of applications on thousands of kernel variants.

Could you try the patch below on latest upstream with 
CONFIG_NUMA_BALANCING=y? It increases migration bandwidth 
10-fold - does it make it easier to trigger the bug on the 
now-upstream NUMA-balancing feature?

It will kill throughput on a number of your tests, but it should 
make all the NUMA-specific activities during the JVM test a lot 
more frequent.

Thanks,

	Ingo

diff --git a/mm/migrate.c b/mm/migrate.c
index 32efd80..8699e8f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c

@@ -1511,7 +1511,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
  */
 static unsigned int migrate_interval_millisecs __read_mostly = 100;
 static unsigned int pteupdate_interval_millisecs __read_mostly = 1000;
-static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
+static unsigned int ratelimit_pages __read_mostly = 1280 << (20 - PAGE_SHIFT);
 
 /* Returns true if NUMA migration is currently rate limited */
 bool migrate_ratelimited(int node)
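
As a rough check of the "10-fold" figure, the arithmetic behind the
changed line can be sketched as below. This is a standalone illustration,
not kernel code. It assumes 4K pages (PAGE_SHIFT == 12) and assumes, going
by the variable names, that ratelimit_pages is the budget per
migrate_interval_millisecs window. SKETCH_PAGE_SHIFT and both helper
functions are hypothetical names.

```c
/* Assumption: 4K pages, i.e. PAGE_SHIFT == 12 as on x86-64. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Pages allowed per window for a budget given in megabytes.
 * Mirrors the expression in the patch: mb << (20 - PAGE_SHIFT).
 */
unsigned long ratelimit_pages_for(unsigned long mb)
{
	return mb << (20 - SKETCH_PAGE_SHIFT);
}

/*
 * Megabytes per second if `pages` may be migrated in every window of
 * `interval_ms` milliseconds.
 */
unsigned long mb_per_second(unsigned long pages, unsigned long interval_ms)
{
	unsigned long mb_per_window = pages >> (20 - SKETCH_PAGE_SHIFT);

	return mb_per_window * 1000 / interval_ms;
}
```

Under those assumptions, ratelimit_pages_for(128) is 32768 pages and
ratelimit_pages_for(1280) is 327680 pages. With the 100 ms interval that
works out to 1280 MB/s before the patch and 12800 MB/s after it.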
