Re: [PATCH] sched/membarrier: Fix redundant load of membarrier_state
From: Nysal Jan K.A. <hidden>
Date: 2024-10-25 18:31:16
Also in:
lkml, llvm
On Fri, Oct 25, 2024 at 11:29:38AM +1100, Michael Ellerman wrote:
> [To += Mathieu]
>
> "Nysal Jan K.A." [off-list ref] writes:
> > From: "Nysal Jan K.A" <redacted>
> >
> > On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> > is not selected, sync_core_before_usermode() is a no-op.
> > In membarrier_mm_sync_core_before_usermode() the compiler does not
> > eliminate redundant branches and the load of mm->membarrier_state
> > for this case as the atomic_read() cannot be optimized away.
>
> I was wondering if this was caused by powerpc's arch_atomic_read() which
> uses asm volatile.
Yes, that's my understanding as well
> But replacing arch_atomic_read() with READ_ONCE() makes no difference,
> presumably because the compiler still can't see that the READ_ONCE() is
> unnecessary (which is kind of by design).
READ_ONCE() casts to a volatile pointer, so I think the compiler cannot
eliminate the load in that case.
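For illustration only (not part of the patch): a minimal userspace sketch of the effect discussed above. All names ending in `_sketch` are hypothetical stand-ins, and the macro only mimics the volatile cast. It is not the kernel's READ_ONCE(). Comparing the output of `cc -O2 -S` for the two functions should show the plain load dropped and the volatile load kept, though the exact code depends on the compiler.

```c
/* Assumed flag value, standing in for the SYNC_CORE membarrier state bit. */
#define SYNC_CORE_FLAG_SKETCH 0x1

/* Userspace imitation of the volatile cast; int-only for simplicity. */
#define READ_ONCE_INT_SKETCH(x) (*(const volatile int *)&(x))

struct mm_volatile_sketch {
	int membarrier_state;
};

/* Stand-in for a hook that is a no-op on this (hypothetical) architecture. */
static inline void sync_core_noop_sketch(void)
{
}

/* Plain load: the result only guards a no-op, so the load may be removed. */
static void discard_plain_load_sketch(const struct mm_volatile_sketch *mm)
{
	if (mm->membarrier_state & SYNC_CORE_FLAG_SKETCH)
		sync_core_noop_sketch();
}

/* Volatile load: the access itself must be emitted even though it is unused. */
static void discard_volatile_load_sketch(const struct mm_volatile_sketch *mm)
{
	if (READ_ONCE_INT_SKETCH(mm->membarrier_state) & SYNC_CORE_FLAG_SKETCH)
		sync_core_noop_sketch();
}
```

Both functions behave identically at run time. The difference is only visible in the generated code.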
> > Here's a snippet of the code generated for finish_task_switch() on powerpc:
> >
> > 1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
> > .......
> > 1b78c8:   cmpdi   cr7,r26,0
> > 1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
> > 1b78d0:   ld      r9,2312(r13)    # current
> > 1b78d4:   ld      r9,1888(r9)     # current->mm
> > 1b78d8:   cmpd    cr7,r26,r9
> > 1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
> > 1b78e0:   hwsync
> > 1b78e4:   cmplwi  cr7,r27,128
> > .......
> > 1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
> > 1b7a74:   b       1b78e0 <finish_task_switch+0xcc>
> >
> > This was found while analyzing "perf c2c" reports on kernels prior to
> > commit c1753fd02a00 ("mm: move mm_count into its own cache line") where
> > mm_count was false sharing with membarrier_state.
>
> So it was causing a noticeable performance blip? But isn't anymore?
It was noticeable in that it showed up amongst the top entries in perf c2c reports. There was similar false sharing with other fields that share the cache line with mm_count, so the gains were minimal with just this patch. c1753fd02a00 addresses these cases too.
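For illustration only: a small userspace sketch of the false-sharing layout issue mentioned above. The struct names, the two-field layout, and the 64-byte line size are all assumptions made for the example (real line sizes vary by CPU), and this is not the actual mm_struct layout or the change made in c1753fd02a00.

```c
#include <stddef.h>

/* Assumed cache line size for the sketch; real hardware may differ. */
#define CACHE_LINE_SKETCH 64

/* Both fields land in the same line: writers of mm_count disturb readers
 * of membarrier_state even though the two fields are unrelated. */
struct mm_shared_line_sketch {
	_Alignas(CACHE_LINE_SKETCH) int mm_count;
	int membarrier_state;
};

/* Each field starts its own line, so the two no longer share one. */
struct mm_split_line_sketch {
	_Alignas(CACHE_LINE_SKETCH) int mm_count;
	_Alignas(CACHE_LINE_SKETCH) int membarrier_state;
};

/* Index of the cache line (relative to the struct start) holding an offset. */
static size_t cache_line_index_sketch(size_t offset)
{
	return offset / CACHE_LINE_SKETCH;
}
```

Because both structs are aligned to the assumed line size, a line index relative to the struct start matches the real line boundary for any instance.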
> > There is a minor improvement in the size of finish_task_switch().
> > The following are results from bloat-o-meter:
> >
> > GCC 7.5.0:
> > ----------
> > add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> > Function                                     old     new   delta
> > finish_task_switch                           884     852     -32
> >
> > GCC 12.2.1:
> > -----------
> > add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> > Function                                     old     new   delta
> > finish_task_switch.isra                      852     820     -32
>
> GCC 12 is a couple of years old, I assume GCC 14 behaves similarly?
I cross compiled for aarch64 with gcc 14.1.1 and see similar results:

add/remove: 0/2 grow/shrink: 1/1 up/down: 4/-60 (-56)
Function                                     old     new   delta
get_nohz_timer_target                        352     356      +4
e843419@0b02_0000d7e7_408                      8       -      -8
e843419@01bb_000021d2_868                      8       -      -8
finish_task_switch.isra                      592     548     -44
Total: Before=31013792, After=31013736, chg -0.00%
> > LLVM 17.0.6:
> > ------------
> > add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
> > Function                                     old     new   delta
> > rt_mutex_schedule                            120     104     -16
> > finish_task_switch                           792     772     -20
> >
> > Signed-off-by: Nysal Jan K.A <redacted>
> > ---
> >  include/linux/sched/mm.h | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > index 07bb8d4181d7..042e60ab853a 100644
> > --- a/include/linux/sched/mm.h
> > +++ b/include/linux/sched/mm.h
> > @@ -540,6 +540,8 @@ enum {
> >  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
> >  {
> > +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> > +		return;
> >  	if (current->mm != mm)
> >  		return;
> >  	if (likely(!(atomic_read(&mm->membarrier_state) &
>
> The other option would be to have a completely separate stub, eg:
>
>   #ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>   {
>           if (current->mm != mm)
>                   return;
>           if (likely(!(atomic_read(&mm->membarrier_state) &
>                        MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
>                   return;
>           sync_core_before_usermode();
>   }
>   #else
>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) { }
>   #endif
>
> Not sure what folks prefer.
>
> In either case I think it's probably worth a short comment explaining
> why it's worth the trouble (ie. that the atomic_read() prevents the
> compiler from doing DCE).
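For illustration only: a userspace sketch of the early-return approach with the kind of comment suggested above. Everything suffixed `_sketch` is a hypothetical stand-in (a plain macro instead of IS_ENABLED(), a volatile cast instead of atomic_read(), an explicit `current_mm` argument instead of `current->mm`), so this is not the v2 patch, only a rough model of its shape.

```c
/* Assumed stand-in for the Kconfig symbol; 0 models "not selected". */
#define HAS_SYNC_CORE_SKETCH 0
/* Assumed flag value for the sketch. */
#define SYNC_CORE_FLAG_SKETCH 0x1

struct mm_guard_sketch {
	int membarrier_state;
};

/* Counts how often the (hypothetical) arch hook did real work. */
static int sync_core_calls_sketch;

static inline void sync_core_hook_sketch(void)
{
#if HAS_SYNC_CORE_SKETCH
	sync_core_calls_sketch++;
#endif
}

static void sync_core_guarded_sketch(const struct mm_guard_sketch *current_mm,
				     const struct mm_guard_sketch *mm)
{
	/*
	 * The volatile load below cannot be removed by the compiler, even
	 * when the hook is empty. Return early so that no load or branch
	 * is generated when the hook would do nothing anyway.
	 */
	if (!HAS_SYNC_CORE_SKETCH)
		return;
	if (current_mm != mm)
		return;
	if (!((*(const volatile int *)&mm->membarrier_state) &
	      SYNC_CORE_FLAG_SKETCH))
		return;
	sync_core_hook_sketch();
}
```

With the macro set to 0 the function body reduces to a bare return. Setting it to 1 restores the full check, which is the same trade-off as the separate #ifdef stub but with a single definition.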
I'll send a v2 with a comment added in there. Thanks for the review.

--Nysal