Re: [PATCH 3/3] btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page faults
From: Catalin Marinas <catalin.marinas@arm.com>
Date: 2021-11-29 23:00:24
Also in:
linux-arm-kernel, linux-fsdevel, lkml
On Mon, Nov 29, 2021 at 10:40:38AM -0800, Linus Torvalds wrote:
On Mon, Nov 29, 2021 at 7:36 AM Catalin Marinas [off-list ref] wrote:quoted
That's what this series does when it probes the whole range in fault_in_writeable(). The main reason was that it's more efficient to do a read than a write on a large range (the latter dirtying the cache lines).The more this thread goes on, the more I'm starting to think that we should just make "fault_in_writable()" (and readable, of course) only really work on the beginning of the area. Not just for the finer-granularity pointer color probing, but for the page probing too.
I have patches for the finer-granularity checking of the beginning of the buffer. They need a bit of testing, so probably posting them tomorrow.
I'm looking at our current fault_in_writeable(), and I'm going (a) it uses __put_user() without range checks, which is really not great
For arm64 at least __put_user() does the access_ok() check. I thought only unsafe_put_user() should skip the checks. If __put_user() can write arbitrary memory, we may have a bigger problem.
(b) it looks like a disaster from another standpoint: essentially user-controlled loop size with no limit checking, no preemption, and no check for fatal signals.
Indeed, the fault_in_*() loop can get pretty long, bounded by how much memory can be faulted in the user process. My patches for now only address the outer loop doing the copy_to_user() as that can be unbounded.
Now, (a) should be fixed with a access_ok() or similar. And (b) can easily be fixed multiple ways, with one option simply just being adding a can_resched() call and checking for fatal signals. But faulting in the whole region is actually fundamentally wrong in low-memory situations - the beginning of the region might be swapped out by the time we get to the end. That's unlikely to be a problem in real life, but it's an example of how it's simply not conceptually sensible. So I do wonder why we don't just say "fault_in_writable will fault in _at_most_ X bytes", and simply limit the actual fault-in size to something reasonable. That solves _all_ the problems. It solves the lack of preemption and fatal signals (by virtue of just limiting the amount of work we do). It solves the low memory situation. And it solves the "excessive dirty cachelines" case too.
I think that would be useful, though it doesn't solve the potential livelock with sub-page faults. We still need the outer loop to handle the copy_to_user() for the whole user buffer and the sub-page probing of the beginning of such buffer (or whenever copy_to_user() failed). IOW, you still fault in the whole buffer eventually. Anyway, I think the sub-page probing and limiting the fault-in are complementary improvements. -- Catalin