Re: [PATCH 6/6] gfs2: Fix mmap + page fault deadlocks (part 2)
From: Andreas Gruenbacher <agruenba@redhat.com>
Date: 2021-05-21 15:46:21
Also in:
linux-fsdevel
On Fri, May 21, 2021 at 5:23 PM Jan Kara [off-list ref] wrote:
On Thu 20-05-21 16:07:56, Andreas Gruenbacher wrote:quoted
On Thu, May 20, 2021 at 3:30 PM Jan Kara [off-list ref] wrote:quoted
On Thu 20-05-21 14:25:36, Andreas Gruenbacher wrote:quoted
Now that we handle self-recursion on the inode glock in gfs2_fault and gfs2_page_mkwrite, we need to take care of more complex deadlock scenarios like the following (example by Jan Kara): Two independent processes P1, P2. Two files F1, F2, and two mappings M1, M2 where M1 is a mapping of F1, M2 is a mapping of F2. Now P1 does DIO to F1 with M2 as a buffer, P2 does DIO to F2 with M1 as a buffer. They can race like: P1 P2 read() read() gfs2_file_read_iter() gfs2_file_read_iter() gfs2_file_direct_read() gfs2_file_direct_read() locks glock of F1 locks glock of F2 iomap_dio_rw() iomap_dio_rw() bio_iov_iter_get_pages() bio_iov_iter_get_pages() <fault in M2> <fault in M1> gfs2_fault() gfs2_fault() tries to grab glock of F2 tries to grab glock of F1 Those kinds of scenarios are much harder to reproduce than self-recursion. We deal with such situations by using the LM_FLAG_OUTER flag to mark "outer" glock taking. Then, when taking an "inner" glock, we use the LM_FLAG_TRY flag so that locking attempts that don't immediately succeed will be aborted. In case of a failed locking attempt, we "unroll" to where the "outer" glock was taken, drop the "outer" glock, and fault in the first offending user page. This will re-trigger the "inner" locking attempt but without the LM_FLAG_TRY flag. Once that has happened, we re-acquire the "outer" glock and retry the original operation. Reported-by: Jan Kara <jack@suse.cz> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>...quoted
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c index 7d88abb4629b..8b26893f8dc6 100644 --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c@@ -431,21 +431,30 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf) vm_fault_t ret = VM_FAULT_LOCKED; struct gfs2_holder gh; unsigned int length; + u16 flags = 0; loff_t size; int err; sb_start_pagefault(inode->i_sb); - gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh); + if (current_holds_glock()) + flags |= LM_FLAG_TRY; + + gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, flags, &gh); if (likely(!outer_gh)) { err = gfs2_glock_nq(&gh); if (err) { ret = block_page_mkwrite_return(err); + if (err == GLR_TRYFAILED) { + set_current_needs_retry(true); + ret = VM_FAULT_SIGBUS; + }I've checked to make sure but do_user_addr_fault() indeed calls do_sigbus() which raises the SIGBUS signal. So if the application does not ignore SIGBUS, your retry will be visible to the application and can cause all sorts of interesting results...I would have noticed that, but no SIGBUS signals were actually delivered. So we probably end up in kernelmode_fixup_or_oops() when in kernel mode, which just does nothing in that case.Hum, but how would we get there? I don't think fatal_signal_pending() would return true yet...
Hmm, right ...
quoted
quoted
So you probably need to add a new VM_FAULT_ return code that will behave like VM_FAULT_SIGBUS except it will not raise the signal.A new VM_FAULT_* flag might make the code easier to read, but I don't know if we can have one.Well, this is kernel-internal API and there's still plenty of space in vm_fault_reason.
That's in the context of the page fault. The other issue is how to propagate that out through iov_iter_fault_in_readable -> fault_in_pages_readable -> __get_user, for example. I don't think there's much of a chance to get an additional error code out of __get_user and __put_user. Thanks, Andreas