[PATCH v2] mm: limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT

From: Michal Hocko <hidden>
Date: 2012-12-19 15:04:56
Also in: lkml
Subsystem: generic include/asm header files, memory management, memory management - core, mmu gather and tlb invalidation, the rest · Maintainers: Arnd Bergmann, Andrew Morton, David Hildenbrand, Will Deacon, "Aneesh Kumar K.V", Nick Piggin, Peter Zijlstra, Linus Torvalds

On Tue 18-12-12 16:00:30, Andrew Morton wrote:

On Wed, 19 Dec 2012 00:50:42 +0100
Michal Hocko [off-list ref] wrote:

quoted

On Tue 18-12-12 14:02:19, Andrew Morton wrote:

quoted

On Tue, 18 Dec 2012 17:11:28 +0100
Michal Hocko [off-list ref] wrote:

quoted

Since e303297 (mm: extended batches for generic mmu_gather) we are batching
pages to be freed until either tlb_next_batch cannot allocate a new batch or we
are done.

This works just fine most of the time but we can get in troubles with
non-preemptible kernel (CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY) on
large machines where too aggressive batching might lead to soft lockups during
process exit path (exit_mmap) because there are no scheduling points down the
free_pages_and_swap_cache path and so the freeing can take long enough to
trigger the soft lockup.

The lockup is harmless except when the system is setup to panic on
softlockup which is not that unusual.

The simplest way to work around this issue is to explicitly cond_resched per
batch in tlb_flush_mmu (1020 pages on x86_64).

...

--- a/mm/memory.c
+++ b/mm/memory.c

@@ -239,6 +239,7 @@ void tlb_flush_mmu(struct mmu_gather *tlb)
 	for (batch = &tlb->local; batch; batch = batch->next) {
 		free_pages_and_swap_cache(batch->pages, batch->nr);
 		batch->nr = 0;
+		cond_resched();
 	}
 	tlb->active = &tlb->local;
 }

tlb_flush_mmu() has a large number of callsites (or callsites which
call callers, etc), many in arch code.  It's not at all obvious that
tlb_flush_mmu() is never called from under spinlock?

free_pages_and_swap_cache calls lru_add_drain which in turn calls
put_cpu (aka preempt_enable) which is a scheduling point for
CONFIG_PREEMPT.

No, that inference doesn't work.  Because preempt_enable() inside
spinlock is OK - it will not call schedule() because
current->preempt_count is still elevated (by spin_lock).

Bahh, you are right. I was checking the callsites when patching our
internal kernel and it was really tedious so I thought this would be
easier to show.
Now when thinking about it some more it would be much safer to not
cond_resched unconditionally because this has a potential to blow up at
random places/archs. It sounds much more appropriate to kill the problem
where it started - an unbounded amount of batches. What do you think
about the following?
---

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help