Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
From: Nicholas Piggin <npiggin@gmail.com>
Date: 2021-06-14 05:23:05
Also in:
linux-mm, linuxppc-dev, lkml
Excerpts from Nicholas Piggin's message of June 14, 2021 2:47 pm:
Excerpts from Nicholas Piggin's message of June 14, 2021 2:14 pm:quoted
Excerpts from Andy Lutomirski's message of June 14, 2021 1:52 pm:quoted
On 6/13/21 5:45 PM, Nicholas Piggin wrote:quoted
Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:quoted
On 6/4/21 6:42 PM, Nicholas Piggin wrote:quoted
Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm when it is context switched. This can be disabled by architectures that don't require this refcounting if they clean up lazy tlb mms when the last refcount is dropped. Currently this is always enabled, which is what existing code does, so the patch is effectively a no-op. Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.I am in favor of this approach, but I would be a lot more comfortable with the resulting code if task->active_mm were at least better documented and possibly even guarded by ifdefs.active_mm is fairly well documented in Documentation/active_mm.rst IMO. I don't think anything has changed in 20 years, I don't know what more is needed, but if you can add to documentation that would be nice. Maybe moving a bit of that into .c and .h files?Quoting from that file: - however, we obviously need to keep track of which address space we "stole" for such an anonymous user. For that, we have "tsk->active_mm", which shows what the currently active address space is. This isn't even true right now on x86.From the perspective of core code, it is. x86 might do something crazy with it, but it has to make it appear this way to non-arch code that uses active_mm. Is x86's scheme documented?quoted
With your patch applied: To support all that, the "struct mm_struct" now has two counters: a "mm_users" counter that is how many "real address space users" there are, and a "mm_count" counter that is the number of "lazy" users (ie anonymous users) plus one if there are any real users. isn't even true any more.Well yeah but the active_mm concept hasn't changed. The refcounting change is hopefully reasonably documented?quoted
quoted
quoted
x86 bare metal currently does not need the core lazy mm refcounting, and x86 bare metal *also* does not need ->active_mm. Under the x86 scheme, if lazy mm refcounting were configured out, ->active_mm could become a dangling pointer, and this makes me extremely uncomfortable. So I tend to think that, depending on config, the core code should either keep ->active_mm [1] alive or get rid of it entirely.I don't actually know what you mean. core code needs the concept of an "active_mm". This is the mm that your kernel threads are using, even in the unmerged CONFIG_LAZY_TLB=n patch, active_mm still points to init_mm for kernel threads.Core code does *not* need this concept. First, it's wrong on x86 since at least 4.15. Any core code that actually assumes that ->active_mm is "active" for any sensible definition of the word active is wrong. Fortunately there is no such code. I looked through all active_mm references in core code. We have: kernel/sched/core.c: it's all refcounting, although it's a bit tangled with membarrier. kernel/kthread.c: same. refcounting and membarrier stuff. kernel/exit.c: exit_mm() a BUG_ON(). kernel/fork.c: initialization code and a warning. kernel/cpu.c: cpu offline stuff. wouldn't be needed if active_mm went away. fs/exec.c: nothing of interestI might not have been clear. Core code doesn't need active_mm if active_mm somehow goes away. I'm saying active_mm can't go away because it's needed to support (most) archs that do lazy tlb mm switching. The part I don't understand is when you say it can just go away. How?quoted
I didn't go through drivers, but I maintain my point. active_mm is there for refcounting. So please don't just make it even more confusing -- do your performance improvement, but improve the code at the same time: get rid of active_mm, at least on architectures that opt out of the refcounting.powerpc opts out of the refcounting and can not "get rid of active_mm". Not even in theory.That is to say, it does do a type of reference management that requires active_mm so you can argue it has not entirely opted out of refcounting. But we're not just doing refcounting for the sake of refcounting! That would make no sense. active_mm is required because that's the mm that we have switched to (from core code's perspective), and it is integral to know when to switch to a different mm. See how active_mm is a fundamental concept in core code? It's part of the contract between core code and the arch mm context management calls. reference counting follows from there but it's not the _reason_ for this code. Pretend the reference problem does not exit (whether by refcounting or shootdown or garbage collection or whatever). We still can't remove active_mm! We need it to know how to call into arch functions like switch_mm. I don't know if you just forgot that critical requirement in your above list, or you actually are entirely using x86's mental model for this code which is doing something entirely different that does not need it at all. If that is the case I really don't mind some cleanup or wrapper functions for x86 do entirely do its own thing, but if that's the case you can't criticize core code's use of active_mm due to the current state of x86. It's x86 that needs documentation and cleaning up.
Ah, that must be where your confusion is coming from: x86's switch_mm doesn't use prev anywhere, and the reference scheme it is using appears to be under-documented, although vague references in changelogs suggest it has not actually "opted out" of active_mm refcounting. That's understandable, but please redirect your objections to the proper place. git blame suggests 3d28ebceaffab. Thanks, Nick