Re: [PATCH 3/3] btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page faults
From: Catalin Marinas <catalin.marinas@arm.com>
Date: 2021-11-25 22:44:08
Also in:
linux-btrfs, linux-fsdevel, lkml
On Thu, Nov 25, 2021 at 11:25:54PM +0100, Andreas Gruenbacher wrote:
> On Wed, Nov 24, 2021 at 9:37 PM Catalin Marinas [off-list ref] wrote:
> > On Wed, Nov 24, 2021 at 08:03:58PM +0000, Matthew Wilcox wrote:
> > > On Wed, Nov 24, 2021 at 07:20:24PM +0000, Catalin Marinas wrote:
> > > > +++ b/fs/btrfs/ioctl.c
> > > > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode,
> > > >  	while (1) {
> > > >  		ret = -EFAULT;
> > > > -		if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
> > > > +		if (fault_in_exact_writeable(ubuf + sk_offset,
> > > > +					     *buf_size - sk_offset))
> > > >  			break;
> > > >  		ret = btrfs_search_forward(root, &key, path, sk->min_transid);
> > >
> > > Couldn't we avoid all of this nastiness by doing ...
> >
> > I had a similar attempt initially but I concluded that it doesn't work:
> >
> > https://lore.kernel.org/r/YS40qqmXL7CMFLGq@arm.com
> >
> > > @@ -2121,10 +2121,9 @@ static noinline int copy_to_sk(struct btrfs_path *path,
> > >  		 * problem. Otherwise we'll fault and then copy the buffer in
> > >  		 * properly this next time through
> > >  		 */
> > > -		if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
> > > -			ret = 0;
> > > +		ret = __copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh));
> > > +		if (ret)
> >
> > There is no requirement for the arch implementation to be exact and copy
> > the maximum number of bytes possible. It can fail early while there are
> > still some bytes left that would not fault. The only requirement is that
> > if it is restarted from where it faulted, it makes some progress (on
> > arm64 there is one extra byte).
> >
> > >  			goto out;
> > > -		}
> > >  		*sk_offset += sizeof(sh);
> > > @@ -2196,6 +2195,7 @@ static noinline int search_ioctl(struct inode *inode,
> > >  	int ret;
> > >  	int num_found = 0;
> > >  	unsigned long sk_offset = 0;
> > > +	unsigned long next_offset = 0;
> > >  	if (*buf_size < sizeof(struct btrfs_ioctl_search_header)) {
> > >  		*buf_size = sizeof(struct btrfs_ioctl_search_header);
> > > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode,
> > >  	while (1) {
> > >  		ret = -EFAULT;
> > > -		if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
> > > +		if (fault_in_writeable(ubuf + sk_offset + next_offset,
> > > +					*buf_size - sk_offset - next_offset))
> > >  			break;
> > >  		ret = btrfs_search_forward(root, &key, path, sk->min_transid);
> > > @@ -2235,11 +2236,12 @@ static noinline int search_ioctl(struct inode *inode,
> > >  		ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
> > >  				 &sk_offset, &num_found);
> > >  		btrfs_release_path(path);
> > > -		if (ret)
> > > +		if (ret > 0)
> > > +			next_offset = ret;
> >
> > So after this point, ubuf+sk_offset+next_offset is writeable by
> > fault_in_writeable(). If copy_to_user() was attempted on
> > ubuf+sk_offset+next_offset, all would be fine, but copy_to_sk() restarts
> > the copy from ubuf+sk_offset, so it returns exactly the same ret as in
> > the previous iteration.
>
> So this means that after a short copy_to_user_nofault(), copy_to_sk()
> needs to figure out the actual point of failure. We'll have the same
> problem elsewhere, so this should probably be a generic helper. The
> alignment hacks are arch specific, so maybe we can have a generic
> version that assumes no alignment restrictions, with arch-specific
> overrides.
>
> Once we know the exact point of failure, a
> fault_in_writeable(point_of_failure, 1) in search_ioctl() will tell if
> the failure is pertinent. Once we know that the failure isn't
> pertinent, we're safe to retry the original fault_in_writeable().
The "exact point of failure" is problematic since copy_to_user() may
fail a few bytes before the actual fault point (e.g. by doing an
unaligned store). As per Linus' reply, we can work around this by doing
a sub-page fault_in_writeable(point_of_failure, align) where 'align'
should cover the copy_to_user() imprecision.

(Of course, fault_in_writeable() takes the full size argument but behind
the scenes it probes the 'align' prefix at sub-page fault granularity.)

--
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel