Re: [PATCH RFC] Ext4: fix deadlock on dirty pages between fault and writeback
From: Michal Hocko <mhocko@kernel.org>
Date: 2018-12-07 07:16:15
Also in:
linux-fsdevel, linux-mm, linux-xfs
Subsystem:
memory management, memory management - core, the rest · Maintainers:
Andrew Morton, David Hildenbrand, Linus Torvalds
On Fri 07-12-18 16:20:51, Dave Chinner wrote:
On Wed, Dec 05, 2018 at 06:06:56PM +0100, Jan Kara wrote:quoted
Added MM people to CC since this starts to be relevant for them. On Fri 30-11-18 07:40:19, Dave Chinner wrote:quoted
On Thu, Nov 29, 2018 at 02:00:02PM +0100, Jan Kara wrote:quoted
On Thu 29-11-18 23:02:53, Dave Chinner wrote:quoted
As it is, this sort of lock vs reclaim inversion should be caught by lockdep - allocations and reclaim contexts are recorded by lockdep we get reports if we do lock A - alloc and then do reclaim - lock A. We've always had problems with false positives from lockdep for these situations where common XFS code can be called from GFP_KERNEL valid contexts as well as reclaim or GFP_NOFS-only contexts, but I don't recall ever seeing such a report for the writeback path....I think for A == page lock, XFS may have the problem (and lockdep won't notice because it does not track page locks). There are some parts of kernel which do GFP_KERNEL allocations under page lock - pte_alloc_one() is one such function which allocates page tables with GFP_KERNEL and gets called with the faulted page locked. And I believe there are others.Where in direct reclaim are we doing writeback to XFS? It doesn't happen, and I've recently proposed we remove ->writepage support from XFS altogether so that memory reclaim never, ever tries to write pages to XFS filesystems, even from kswapd.Direct reclaim will never do writeback but it may still wait for writeback that has been started by someone else. That is enough for the deadlock to happen. But from what you write below you seem to understand that so I just write this comment here so that others don't get confused.quoted
quoted
So direct reclaim from pte_alloc_one() can wait for writeback on page B while holding lock on page A. And if B is just prepared (added to bio, under writeback, unlocked) but not submitted in xfs_writepages() and we block on lock_page(A), we have a deadlock.Fundamentally, doing GFP_KERNEL allocations with a page lock held violates any ordering rules we might have for multiple page locking order. This is asking for random ABBA reclaim deadlocks to occur, and it's not a filesystem bug - that's a bug in the page table code. e.g if we are doing this in a filesystem/page cache context, it's always in ascending page->index order for pages referenced by the inode's mapping. Memory reclaim provides none of these lock ordering guarantees.So this is where I'd like MM people to tell their opinion. Reclaim code tries to avoid possible deadlocks on page lock by always doing trylock on the page. But as this example shows it is not enough once is blocks in wait_on_page_writeback().I think it only does this in a "legacy memcg" case, according to the comment in shrink_page_list. Which is, apparently, a hack around the fact that memcgs didn't used to have dirty page throttling. AFAIA, balance_dirty_pages() has had memcg-based throttling for some time now, so that kinda points to stale reclaim algorithms, right?
Memcg v1 indeed doesn't have any dirty IO throttling and this is a poor's man workaround. We still do not have that AFAIK and I do not know of an elegant way around that. Fortunatelly we shouldn't have that many GFP_KERNEL | __GFP_ACCOUNT allocations under page lock and we can work around this specific one quite easily. I haven't tested this yet but the following should work
diff --git a/mm/memory.c b/mm/memory.c
index 4ad2d293ddc2..59c98eeb0260 100644
--- a/mm/memory.c
+++ b/mm/memory.c@@ -2993,6 +2993,16 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; vm_fault_t ret; + /* + * Preallocate pte before we take page_lock because this might lead to + * deadlocks for memcg reclaim which waits for pages under writeback. + */ + if (!vmf->prealloc_pte) { + vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address); + if (!vmf->prealloc_pte) + return VM_FAULT_OOM; + } + ret = vma->vm_ops->fault(vmf); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY | VM_FAULT_DONE_COW)))
Is there any reliable reproducer? -- Michal Hocko SUSE Labs