Thread (34 messages) 34 messages, 6 authors, 2021-11-29

Re: [PATCH 3/3] btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page faults

From: Andreas Gruenbacher <agruenba@redhat.com>
Date: 2021-11-27 12:42:12
Also in: linux-btrfs, linux-fsdevel, lkml
Subsystem: memory management, memory management - gup (get user pages), the rest · Maintainers: Andrew Morton, David Hildenbrand, Linus Torvalds

On Sat, Nov 27, 2021 at 4:52 AM Andreas Gruenbacher [off-list ref] wrote:
On Sat, Nov 27, 2021 at 12:06 AM Catalin Marinas [off-list ref] wrote:
quoted
On Fri, Nov 26, 2021 at 11:29:45PM +0100, Andreas Gruenbacher wrote:
quoted
On Thu, Nov 25, 2021 at 11:42 PM Catalin Marinas [off-list ref] wrote:
quoted
As per Linus' reply, we can work around this by doing
a sub-page fault_in_writable(point_of_failure, align) where 'align'
should cover the copy_to_user() impreciseness.

(of course, fault_in_writable() takes the full size argument but behind
the scene it probes the 'align' prefix at sub-page fault granularity)
That doesn't make sense; we don't want fault_in_writable() to fail or
succeed depending on the alignment of the address range passed to it.
If we know that the arch copy_to_user() has an error of say maximum 16
bytes (or 15 rather on arm64), we can instead get fault_in_writeable()
to probe the first 16 bytes rather than 1.
That isn't going to help one bit: [raw_]copy_to_user() is allowed to
copy as little or as much as it wants as long as it follows the rules
documented in include/linux/uaccess.h:

[] If copying succeeds, the return value must be 0.  If some data cannot be
[] fetched, it is permitted to copy less than had been fetched; the only
[] hard requirement is that not storing anything at all (i.e. returning size)
[] should happen only when nothing could be copied.  In other words, you don't
[] have to squeeze as much as possible - it is allowed, but not necessary.

When fault_in_writeable() tells us that an address range is accessible
in principle, that doesn't mean that copy_to_user() will allow us to
access it in arbitrary chunks. It's also not the case that
fault_in_writeable(addr, size) is always followed by
copy_to_user(addr, ..., size) for the exact same address range, not
even in this case.

These alignment restrictions have nothing to do with page or sub-page faults.

I'm also fairly sure that passing in an unaligned buffer will send
search_ioctl into an endless loop on architectures with copy_to_user()
alignment restrictions; there don't seem to be any buffer alignment
checks.
Let me retract that ...

The description in include/linux/uaccess.h leaves out permissible
reasons for fetching/storing less than requested. Thinking about it, if
the address range passed to one of the copy functions includes an
address that faults, it kind of makes sense to allow the copy function
to stop short instead of copying every last byte right up to the address
that fails.

If that's the only reason, then it would be great to have that included
in the description.  And then we can indeed deal with the alignment
effects in fault_in_writeable().
quoted
quoted
copy_to_user_nofault_unaligned() should be further optimized, maybe as
mm/maccess.c:copy_from_kernel_nofault() and/or per architecture
depending on the actual alignment rules; I'm not sure.
[...]
quoted
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2051,13 +2051,30 @@ static noinline int key_in_sk(struct btrfs_key *key,
      return 1;
 }

+size_t copy_to_user_nofault_unaligned(void __user *to, void *from, size_t size)
+{
+     size_t rest = copy_to_user_nofault(to, from, size);
+
+     if (rest) {
+             size_t n;
+
+             for (n = size - rest; n < size; n++) {
+                     if (copy_to_user_nofault(to + n, from + n, 1))
+                             break;
+             }
+             rest = size - n;
+     }
+     return rest;
That's what I was trying to avoid. That's basically a fall-back to byte
at a time copy (we do this in copy_mount_options(); at some point we
even had a copy_from_user_exact() IIRC).
We could try 8/4/2 byte chunks if both buffers are 8/4/2-byte aligned.
It's just not clear that it's worth it.
quoted
Linus' idea (if I got it correctly) was instead to slightly extend the
probing in fault_in_writeable() for the beginning of the buffer from 1
byte to some per-arch range.

I attempted the above here and works ok:

https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-live-lock-fix

but too late to post it this evening, I'll do it in the next day or so
as an alternative to this series.
I've taken a quick look.  Under the assumption that alignment effects
are tied to page / sub-page faults, I think we can really solve this
generically as Willy has proposed.  Maybe as shown below; no need for
arch-specific code.

Thanks,
Andreas
diff --git a/mm/gup.c b/mm/gup.c
index 2c51e9748a6a..a9b3d916b625 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1658,6 +1658,8 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
 }
 #endif /* !CONFIG_MMU */
 
+#define SUBPAGE_FAULT_SIZE 16
+
 /**
  * fault_in_writeable - fault in userspace address range for writing
  * @uaddr: start of address range
@@ -1673,8 +1675,19 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)
 	if (unlikely(size == 0))
 		return 0;
 	if (!PAGE_ALIGNED(uaddr)) {
+		if (SUBPAGE_FAULT_SIZE &&
+		    !IS_ALIGNED((unsigned long)uaddr, SUBPAGE_FAULT_SIZE)) {
+			end = PTR_ALIGN(uaddr, SUBPAGE_FAULT_SIZE);
+			if (end - uaddr < size) {
+				if (unlikely(__put_user(0, uaddr) != 0))
+					return size;
+				uaddr = end;
+				if (unlikely(!end))
+					goto out;
+			}
+		}
 		if (unlikely(__put_user(0, uaddr) != 0))
-			return size;
+			goto out;
 		uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
 	}
 	end = (char __user *)PAGE_ALIGN((unsigned long)start + size);

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help