Re: [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
From: Catalin Marinas <catalin.marinas@arm.com>
Date: 2021-10-27 19:13:09
Also in:
linux-fsdevel, lkml, ocfs2-devel
On Tue, Oct 26, 2021 at 11:50:04AM -0700, Linus Torvalds wrote:
On Tue, Oct 26, 2021 at 11:24 AM Catalin Marinas [off-list ref] wrote:quoted
While more intrusive, I'd rather change copy_page_from_iter_atomic() etc. to take a pointer where to write back an error code.
[...]
That said, the fact that these sub-page faults are always non-recoverable might be a hint to a solution to the problem: maybe we could extend the existing return code with actual negative error numbers. Because for _most_ cases of "copy_to/from_user()" and friends by far, the only thing we look for is "zero for success". We could extend the "number of bytes _not_ copied" semantics to say "negative means fatal", and because there are fairly few places that actually look at non-zero values, we could have a coccinelle script that actually marks those places.
As you already replied, there are some odd places where the returned uncopied of bytes is used. Also for some valid cases like copy_mount_options(), it's likely that it will fall back to byte-at-a-time with MTE since it's a good chance it would hit a fault in a 4K page (not a fast path though). I'd have to go through all the cases and check whether the return value is meaningful. The iter_iov.c functions and their callers also seem to make use of the bytes copied in case they need to call iov_iter_revert() (though I suppose the iov_iter_iovec_advance() would skip the update in case of an error). As an alternative, you mentioned earlier that a per-thread fault status was not feasible on x86 due to races. Was this only for the hw poison case? I think the uaccess is slightly different. We can add a current->non_recoverable_uaccess variable cleared on pagefault_disable(), only set by uaccess faults and checked by the fs code before re-attempting the fault_in(). An interrupt shouldn't do a uaccess (well, if it does a _nofault one, we can detect in_interrupt() in the MTE exception handler). Last time I looked at io_uring it was running in a separate kernel thread, not sure whether this was changed. I don't see what else would be racing with such current->non_recoverable_uaccess variable. If that's doable, I think it's the least intrusive approach. -- Catalin