Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Nicholas Piggin <npiggin@gmail.com>
Date: 2020-07-16 06:06:19
Also in:
linux-arch, linux-mm, lkml
Excerpts from Andy Lutomirski's message of July 16, 2020 3:18 pm:
quoted
On Jul 15, 2020, at 9:15 PM, Nicholas Piggin [off-list ref] wrote: Excerpts from Mathieu Desnoyers's message of July 14, 2020 12:13 am:quoted
----- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npiggin@gmail.com wrote:quoted
Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm:quoted
Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am:quoted
Also, as it stands, I can easily see in_irq() ceasing to promise to serialize. There are older kernels for which it does not promise to serialize. And I have plans to make it stop serializing in the nearish future.You mean x86's return from interrupt? Sounds fun... you'll konw where to update the membarrier sync code, at least :)Oh, I should actually say Mathieu recently clarified a return from interrupt doesn't fundamentally need to serialize in order to support membarrier sync core.Clarification to your statement: Return from interrupt to kernel code does not need to be context serializing as long as kernel serializes before returning to user-space. However, return from interrupt to user-space needs to be context serializing.Hmm, I'm not sure it's enough even with the sync in the exit_lazy_tlb in the right places. A kernel thread does a use_mm, then it blocks and the user process with the same mm runs on that CPU, and then it calls into the kernel, blocks, the kernel thread runs again, another CPU issues a membarrier which does not IPI this one because it's running a kthread, and then the kthread switches back to the user process (still without having unused the mm), and then the user process returns from syscall without having done a core synchronising instruction. The cause of the problem is you want to avoid IPI'ing kthreads. Why? I'm guessing it really only matters as an optimisation in case of idle threads. Idle thread is easy (well, easier) because it won't use_mm, so you could check for rq->curr == rq->idle in your loop (in a suitable sched accessor function). But... I'm not really liking this subtlety in the scheduler for all this (the scheduler still needs the barriers when switching out of idle). Can it be improved somehow? Let me forget x86 core sync problem for now (that _may_ be a bit harder), and step back and look at what we're doing. The memory barrier case would actually suffer from the same problem as core sync, because in the same situation it has no implicit mmdrop in the scheduler switch code either. So what are we doing with membarrier? We want any activity caused by the set of CPUs/threads specified that can be observed by this thread before calling membarrier is appropriately fenced from activity that can be observed to happen after the call returns. CPU0 CPU1 1. user stuff a. membarrier() 2. enter kernel b. read rq->curr 3. rq->curr switched to kthread c. is kthread, skip IPI 4. switch_to kthread d. return to user 5. rq->curr switched to user thread 6. switch_to user thread 7. exit kernel 8. more user stuff As far as I can see, the problem is CPU1 might reorder step 5 and step 8, so you have mmdrop of lazy mm be a mb after step 6. But why? The membarrier call only cares that there is a full barrier between 1 and 8, right? Which it will get from the previous context switch to the kthread. I must say the memory barrier comments in membarrier could be improved a bit (unless I'm missing where the main comment is). It's fine to know what barriers pair with one another, but we need to know which exact memory accesses it is ordering /* * Matches memory barriers around rq->curr modification in * scheduler. */ Sure, but it doesn't say what else is being ordered. I think it's just the user memory accesses, but would be nice to make that a bit more explicit. If we had such comments then we might know this case is safe. I think the funny powerpc barrier is a similar case of this. If we ever see remote_rq->curr->flags & PF_KTHREAD, then we _know_ that CPU has or will have issued a memory barrier between running user code. So AFAIKS all this membarrier stuff in kernel/sched/core.c could just go away. Except x86 because thread switch doesn't imply core sync, so CPU1 between 1 and 8 may never issue a core sync instruction the same way a context switch must be a full mb. Before getting to x86 -- Am I right, or way off track here?I find it hard to believe that this is x86 only. Why would thread switch imply core sync on any architecture? Is x86 unique in having a stupid expensive core sync that is heavier than smp_mb()?
It's not the thread switch but the return from kernel to user -- at least of architectures that implement membarrier SYNC_CORE, x86 can do that without serializing. The thread switch is muddying the waters a bit, it's not the actual thread switch we care about, that just happens to be used as a point where we try to catch the membarrier IPIs that were skipped due to the PF_KTHREAD optimisation. I think that doing said check in the lazy tlb exit code is both unnecessary for the memory ordering and insufficient for pipeline serialization.
But I’m wondering if all this deferred sync stuff is wrong. In the brave new world of io_uring and such, perhaps kernel access matter too. Heck, even: int a[2]; Thread A: a[0] = 1; a[1] = 2: Thread B: write(fd, a, sizeof(a)); Doesn’t do what thread A is expecting. Admittedly this particular example is nonsense, but maybe there are sensible cases that matter to someone.
I think kernel accesses probably do matter (or at least they should by principle of least surprise). And so I was doubly misleading by labeling it as "user stuff". I should have distinguished between previous user or kernel accesses, as opposed to the kernel accesses specifically for the implementation of the membarrier call. So I think the membarrier code gets *that* part right (modulo what we have seen already) if the kernel access is being done from process context. But yes if the access is coming from io_uring that has done kthread_use_mm or some other random code running in a kernel thread working on get_user_pages memory or any similar shared vm_insert_pfn memory, then it goes completely to hell. So good catch, PF_KTHREAD check is problematic there even if no actual users exist today. rq->curr == rq->idle test might be better, but can we have interrupts writing completions into user memory? For performance I would hope so, so that makes even that test problematic. Maybe membarrier should close that gap entirely, and work around performance issue by adding _USER_ONLY flags which explicitly only order user mode accesess vs other user accesses. Thanks, Nick