Re: [PATCH 3/3] btrfs: Avoid live-lock in search_ioctl() on hardware with... | linux-btrfs

Re: [PATCH 3/3] btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page faults

From: Andreas Gruenbacher <agruenba@redhat.com>
Date: 2021-11-29 13:36:07
Also in: linux-arm-kernel, linux-fsdevel, lkml

On Mon, Nov 29, 2021 at 1:22 PM Catalin Marinas [off-list ref] wrote:

On Sat, Nov 27, 2021 at 07:05:39PM +0100, Andreas Gruenbacher wrote:

quoted

On Sat, Nov 27, 2021 at 4:21 PM Catalin Marinas [off-list ref] wrote:

quoted

That's similar, somehow, to the arch-specific probing in one of my
patches: [1]. We could do the above if we can guarantee that the maximum
error margin in copy_to_user() is smaller than SUBPAGE_FAULT_SIZE. For
arm64 copy_to_user(), it is fine, but for copy_from_user(), if we ever
need to handle fault_in_readable(), it isn't (on arm64 up to 64 bytes
even if aligned: reads of large blocks are done in 4 * 16 loads, and if
one of them fails e.g. because of the 16-byte sub-page fault, no write
is done, hence such larger than 16 delta).

If you want something in the generic fault_in_writeable(), we probably
need a loop over UACCESS_MAX_WRITE_ERROR in SUBPAGE_FAULT_SIZE
increments. But I thought I'd rather keep this in the arch-specific code.

I see, that's even crazier than I'd thought. The looping / probing is
still pretty generic, so I'd still consider putting it in the generic
code.

In the arm64 probe_subpage_user_writeable(), the loop is skipped if
!system_supports_mte() (static label). It doesn't make much difference
for search_ioctl() in terms of performance but I'd like the arch code to
dynamically decide when to probe. An arch_has_subpage_faults() static
inline function would solve this.

However, the above assumes that the only way of probing is by doing a
get_user/put_user(). A non-destructive probe with MTE would be to read
the actual tags in memory and compare them with the top byte of the
pointer.

There's the CHERI architecture as well. Although very early days for
arm64, we do have an incipient port (https://www.morello-project.org/).
The void __user * pointers are propagated inside the kernel as 128-bit
capabilities. A fault_in() would check whether the address (bottom
64-bit) is within the range and permissions specified in the upper
64-bit of the capability. There is no notion of sub-page fault
granularity here and no need to do a put_user() as the check is just
done on the pointer/capability.

Given the above, my preference is to keep the probing arch-specific.

quoted

We also still have fault_in_safe_writeable which is more difficult to
fix, and fault_in_readable which we don't want to leave behind broken,
either.

fault_in_safe_writeable() can be done by using get_user() instead of
put_user() for arm64 MTE and probably SPARC ADI (an alternative is to
read the in-memory tags and compare them with the pointer).

So we'd keep the existing fault_in_safe_writeable() logic for the
actual fault-in and use get_user() to check for sub-page faults? If
so, then that should probably also be hidden in arch code.

For CHERI, that's different again since the fault_in_safe_writeable capability
encodes the read/write permissions independently.

However, do we actually want to change the fault_in_safe_writeable() and
fault_in_readable() functions at this stage? I could not get any of them
to live-lock, though I only tried btrfs, ext4 and gfs2. As per the
earlier discussion, normal files accesses are guaranteed to make
progress. The only problematic one was O_DIRECT which seems to be
alright for the above filesystems (the fs either bails out after several
attempts or uses GUP to read which skips the uaccess altogether).

Only gfs2 uses fault_in_safe_writeable(). For buffered reads, progress
is guaranteed because failures are at a byte granularity.

O_DIRECT reads and writes happen in device block size granularity, but
the pages are grabbed with get_user_pages() before the copying
happens. So by the time the copying happens, the pages are guaranteed
to be resident, and we don't need to loop around fault_in_*().

You've mentioned before that copying to/from struct page bypasses
sub-page fault checking. If that is the case, then the checking
probably needs to happen in iomap_dio_bio_iter and dio_refill_pages
instead.

Happy to address them if there is a real concern, I just couldn't trigger it.

Hopefully it should now be clear why you couldn't. One way of
reproducing with fault_in_safe_writeable() would be to use that in
btrfs instead of fault_in_writeable(), of course.

We're not doing any chunked reads from user space with page faults
disabled as far as I'm aware right now, so we probably don't have a
reproducer for fault_in_readable(). It would still be worth fixing
fault_in_readable() to prevent things from blowing up very
unexpectedly later, though.

Thanks,
Andreas

quoted

Of course, the above fault_in_writeable() still needs the btrfs
search_ioctl() counterpart toget_user_pages change the probing on the actual fault
address or offset.

Yes, but that change is relatively simple and it eliminates the need
for probing the entire buffer, so it's a good thing. Maybe you want to
add this though:

--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c

@@ -2202,3 +2202,3 @@ static noinline int search_ioctl(struct inode *inode,
        unsigned long sk_offset = 0;
-       char __user *fault_in_addr;
+       char __user *fault_in_addr, *end;

@@ -2230,6 +2230,6 @@ static noinline int search_ioctl(struct inode *inode,
        fault_in_addr = ubuf;
+       end = ubuf + *buf_size;
        while (1) {
                ret = -EFAULT;
-               if (fault_in_writeable(fault_in_addr,
-                                      *buf_size - (fault_in_addr - ubuf)))
+               if (fault_in_writeable(fault_in_addr, end - fault_in_addr))
                        break;

Thanks, I'll add it.

--
Catalin

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help