Re: [PATCH 6/6] gfs2: Fix mmap + page fault deadlocks (part 2)

Andreas Gruenbacher · 2021-05-21T17:46:04+0200

On Fri, May 21, 2021 at 5:23 PM Jan Kara wrote: > On Thu 20-05-21 16:07:56, Andreas Gruenbacher wrote: > > On Thu, May 20, 2021 at 3:30 PM Jan Kara wrote: > > > On Thu 20-05-21 14:25:36, Andreas Gruenbacher wrote: > > > > Now that we handle self-recursion on the inode glock in gfs2_fault and > > > > gfs2_page_mkwrite, we need to take care of more complex deadlock > > > > scenarios like the following (example by Jan Kara): > > > > > > > > Two independent processes P1, P2. Two files F1, F2, and two mappings M1, > > > > M2 where M1 is a mapping of F1, M2 is a mapping of F2. Now P1 does DIO > > > > to F1 with M2 as a buffer, P2 does DIO to F2 with M1 as a buffer. They > > > > can race like: > > > > > > > > P1 P2 > > > > read() read() > > > > gfs2_file_read_iter() gfs2_file_read_iter() > > > > gfs2_file_direct_read() gfs2_file_direct_read() > > > > locks glock of F1 locks glock of F2 > > > > iomap_dio_rw() iomap_dio_rw() > > > > bio_iov_iter_get_pages() bio_iov_iter_get_pages() > > > > > > > > gfs2_fault() gfs2_fault() > > > > tries to grab glock of F2 tries to grab glock of F1 > > > > > > > > Those kinds of scenarios are much harder to reproduce than > > > > self-recursion. > > > > > > > > We deal with such situations by using the LM_FLAG_OUTER flag to mark > > > > "outer" glock taking. Then, when taking an "inner" glock, we use the > > > > LM_FLAG_TRY flag so that locking attempts that don't immediately succeed > > > > will be aborted. In case of a failed locking attempt, we "unroll" to > > > > where the "outer" glock was taken, drop the "outer" glock, and fault in > > > > the first offending user page. This will re-trigger the "inner" locking > > > > attempt but without the LM_FLAG_TRY flag. Once that has happened, we > > > > re-acquire the "outer" glock and retry the original operation. > > > > > > > > Reported-by: Jan Kara > > > > Signed-off-by: Andreas Gruenbacher > > >...

From: Andreas Gruenbacher <agruenba@redhat.com>
Date: 2021-05-21 15:46:21
Also in: linux-fsdevel

On Fri, May 21, 2021 at 5:23 PM Jan Kara [off-list ref] wrote:

On Thu 20-05-21 16:07:56, Andreas Gruenbacher wrote:

quoted

On Thu, May 20, 2021 at 3:30 PM Jan Kara [off-list ref] wrote:

quoted

On Thu 20-05-21 14:25:36, Andreas Gruenbacher wrote:

quoted

Now that we handle self-recursion on the inode glock in gfs2_fault and
gfs2_page_mkwrite, we need to take care of more complex deadlock
scenarios like the following (example by Jan Kara):

Two independent processes P1, P2. Two files F1, F2, and two mappings M1,
M2 where M1 is a mapping of F1, M2 is a mapping of F2. Now P1 does DIO
to F1 with M2 as a buffer, P2 does DIO to F2 with M1 as a buffer. They
can race like:

P1                                      P2
read()                                  read()
  gfs2_file_read_iter()                   gfs2_file_read_iter()
    gfs2_file_direct_read()                 gfs2_file_direct_read()
      locks glock of F1                       locks glock of F2
      iomap_dio_rw()                          iomap_dio_rw()
        bio_iov_iter_get_pages()                bio_iov_iter_get_pages()
          <fault in M2>                           <fault in M1>
            gfs2_fault()                            gfs2_fault()
              tries to grab glock of F2               tries to grab glock of F1

Those kinds of scenarios are much harder to reproduce than
self-recursion.

We deal with such situations by using the LM_FLAG_OUTER flag to mark
"outer" glock taking.  Then, when taking an "inner" glock, we use the
LM_FLAG_TRY flag so that locking attempts that don't immediately succeed
will be aborted.  In case of a failed locking attempt, we "unroll" to
where the "outer" glock was taken, drop the "outer" glock, and fault in
the first offending user page.  This will re-trigger the "inner" locking
attempt but without the LM_FLAG_TRY flag.  Once that has happened, we
re-acquire the "outer" glock and retry the original operation.

Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

...

quoted

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 7d88abb4629b..8b26893f8dc6 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c

@@ -431,21 +431,30 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
      vm_fault_t ret = VM_FAULT_LOCKED;
      struct gfs2_holder gh;
      unsigned int length;
+     u16 flags = 0;
      loff_t size;
      int err;

      sb_start_pagefault(inode->i_sb);

-     gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
+     if (current_holds_glock())
+             flags |= LM_FLAG_TRY;
+
+     gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, flags, &gh);
      if (likely(!outer_gh)) {
              err = gfs2_glock_nq(&gh);
              if (err) {
                      ret = block_page_mkwrite_return(err);
+                     if (err == GLR_TRYFAILED) {
+                             set_current_needs_retry(true);
+                             ret = VM_FAULT_SIGBUS;
+                     }

I've checked to make sure but do_user_addr_fault() indeed calls do_sigbus()
which raises the SIGBUS signal. So if the application does not ignore
SIGBUS, your retry will be visible to the application and can cause all
sorts of interesting results...

I would have noticed that, but no SIGBUS signals were actually
delivered. So we probably end up in kernelmode_fixup_or_oops() when in
kernel mode, which just does nothing in that case.

Hum, but how would we get there? I don't think fatal_signal_pending() would
return true yet...

Hmm, right ...

quoted

So you probably need to add a new VM_FAULT_
return code that will behave like VM_FAULT_SIGBUS except it will not raise
the signal.

A new VM_FAULT_* flag might make the code easier to read, but I don't
know if we can have one.

Well, this is kernel-internal API and there's still plenty of space in
vm_fault_reason.

That's in the context of the page fault. The other issue is how to
propagate that out through iov_iter_fault_in_readable ->
fault_in_pages_readable -> __get_user, for example. I don't think
there's much of a chance to get an additional error code out of
__get_user and __put_user.

Thanks,
Andreas

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help