Re: [PATCH RFC] Ext4: fix deadlock on dirty pages between fault and writeback

From: Michal Hocko <mhocko@kernel.org>
Date: 2018-12-07 07:16:15
Also in: linux-fsdevel, linux-mm, linux-xfs
Subsystem: memory management, memory management - core, the rest · Maintainers: Andrew Morton, David Hildenbrand, Linus Torvalds

On Fri 07-12-18 16:20:51, Dave Chinner wrote:

On Wed, Dec 05, 2018 at 06:06:56PM +0100, Jan Kara wrote:

quoted

Added MM people to CC since this starts to be relevant for them.

On Fri 30-11-18 07:40:19, Dave Chinner wrote:

quoted

On Thu, Nov 29, 2018 at 02:00:02PM +0100, Jan Kara wrote:

quoted

On Thu 29-11-18 23:02:53, Dave Chinner wrote:

quoted

As it is, this sort of lock vs reclaim inversion should be caught by
lockdep - allocations and reclaim contexts are recorded by lockdep
we get reports if we do lock A - alloc and then do reclaim - lock A.
We've always had problems with false positives from lockdep for
these situations where common XFS code can be called from GFP_KERNEL
valid contexts as well as reclaim or GFP_NOFS-only contexts, but I
don't recall ever seeing such a report for the writeback path....

I think for A == page lock, XFS may have the problem (and lockdep won't
notice because it does not track page locks). There are some parts of
kernel which do GFP_KERNEL allocations under page lock - pte_alloc_one() is
one such function which allocates page tables with GFP_KERNEL and gets
called with the faulted page locked. And I believe there are others.

Where in direct reclaim are we doing writeback to XFS?

It doesn't happen, and I've recently proposed we remove ->writepage
support from XFS altogether so that memory reclaim never, ever
tries to write pages to XFS filesystems, even from kswapd.

Direct reclaim will never do writeback but it may still wait for writeback
that has been started by someone else. That is enough for the deadlock to
happen. But from what you write below you seem to understand that so I just
write this comment here so that others don't get confused.

quoted

So direct reclaim from pte_alloc_one() can wait for writeback on page B
while holding lock on page A. And if B is just prepared (added to bio,
under writeback, unlocked) but not submitted in xfs_writepages() and we
block on lock_page(A), we have a deadlock.

Fundamentally, doing GFP_KERNEL allocations with a page lock
held violates any ordering rules we might have for multiple page
locking order. This is asking for random ABBA reclaim deadlocks to
occur, and it's not a filesystem bug - that's a bug in the page
table code. e.g if we are doing this in a filesystem/page cache
context, it's always in ascending page->index order for pages
referenced by the inode's mapping. Memory reclaim provides none of
these lock ordering guarantees.

So this is where I'd like MM people to tell their opinion. Reclaim code
tries to avoid possible deadlocks on page lock by always doing trylock on
the page. But as this example shows it is not enough once is blocks in
wait_on_page_writeback().

I think it only does this in a "legacy memcg" case, according to the
comment in shrink_page_list. Which is, apparently, a hack around the
fact that memcgs didn't used to have dirty page throttling. AFAIA,
balance_dirty_pages() has had memcg-based throttling for some time
now, so that kinda points to stale reclaim algorithms, right?

Memcg v1 indeed doesn't have any dirty IO throttling and this is a
poor's man workaround. We still do not have that AFAIK and I do not know
of an elegant way around that. Fortunatelly we shouldn't have that many
GFP_KERNEL | __GFP_ACCOUNT allocations under page lock and we can work
around this specific one quite easily. I haven't tested this yet but the
following should work

diff --git a/mm/memory.c b/mm/memory.c
index 4ad2d293ddc2..59c98eeb0260 100644
--- a/mm/memory.c
+++ b/mm/memory.c

@@ -2993,6 +2993,16 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback.
+	 */
+	if (!vmf->prealloc_pte) {
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
+		if (!vmf->prealloc_pte)
+			return VM_FAULT_OOM;
+	}
+
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))

Is there any reliable reproducer?
-- 
Michal Hocko
SUSE Labs

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help