Re: [PATCH 5/6] powerpc/mm: Optimize detection of thread local mm's
From: Nicholas Piggin <npiggin@gmail.com>
Date: 2017-07-24 11:25:50
On Mon, 24 Jul 2017 14:28:02 +1000 Benjamin Herrenschmidt [off-list ref] wrote:
> Instead of comparing the whole CPU mask every time, let's
> keep a counter of how many bits are set in the mask. Thus
> testing for a local mm only requires testing if that counter
> is 1 and the current CPU bit is set in the mask.
>
> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> ---
>  arch/powerpc/include/asm/book3s/64/mmu.h |  3 +++
>  arch/powerpc/include/asm/mmu_context.h   |  9 +++++++++
>  arch/powerpc/include/asm/tlb.h           | 11 ++++++++++-
>  arch/powerpc/mm/mmu_context_book3s64.c   |  2 ++
>  4 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
> index 1a220cdff923..c3b00e8ff791 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> @@ -83,6 +83,9 @@ typedef struct {
>  	mm_context_id_t id;
>  	u16 user_psize;		/* page size index */
>
> +	/* Number of bits in the mm_cpumask */
> +	atomic_t active_cpus;
> +
>  	/* NPU NMMU context */
>  	struct npu_context *npu_context;
>
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index ff1aeb2cd19f..cf8f50cd4030 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -96,6 +96,14 @@ static inline void switch_mm_pgdir(struct task_struct *tsk,
>  					   struct mm_struct *mm) { }
>  #endif
>
> +#ifdef CONFIG_PPC_BOOK3S_64
> +static inline void inc_mm_active_cpus(struct mm_struct *mm)
> +{
> +	atomic_inc(&mm->context.active_cpus);
> +}
> +#else
> +static inline void inc_mm_active_cpus(struct mm_struct *mm) { }
> +#endif

This is a bit awkward. Can we just move the entire function to test
cpumask and set / increment into helper functions and define them
together with mm_is_thread_local, so it's all in one place?

The extra atomic does not need to be defined when it's not used either.

Also does it make sense to define it based on NR_CPUS > BITS_PER_LONG?
If it's <= then it should be similar load and compare, no?

Looks like a good optimisation though.

Thanks,
Nick
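
For illustration only, here is a rough userspace sketch of the two ideas above: a helper that tests, sets and increments in one place, and a plain load-and-compare for the case where the whole mask fits in one word. It is not the patch and not kernel code. struct mm_model, mm_model_init, mm_mark_cpu_active and the two mm_is_thread_local_* variants are hypothetical names, NR_CPUS is fixed at 8 as an assumption, and C11 atomics stand in for the kernel's atomic_t and cpumask operations.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumption: few enough CPUs that the mask fits in one unsigned long. */
#define NR_CPUS 8
_Static_assert(NR_CPUS <= sizeof(unsigned long) * CHAR_BIT,
	       "this model only covers the single-word mask case");

/* Hypothetical stand-in for the mm context; not the kernel's layout. */
struct mm_model {
	atomic_ulong cpumask;	/* one bit per CPU that has used this mm */
	atomic_int active_cpus;	/* number of bits set in cpumask */
};

static inline void mm_model_init(struct mm_model *mm)
{
	atomic_init(&mm->cpumask, 0);
	atomic_init(&mm->active_cpus, 0);
}

/* Set this CPU's bit; bump the counter only if the bit was clear before. */
static inline void mm_mark_cpu_active(struct mm_model *mm, unsigned int cpu)
{
	unsigned long bit;

	assert(cpu < NR_CPUS);
	bit = 1UL << cpu;
	if (!(atomic_fetch_or(&mm->cpumask, bit) & bit))
		atomic_fetch_add(&mm->active_cpus, 1);
}

/* Counter-based test: exactly one CPU in the mask, and it is this one. */
static inline bool mm_is_thread_local_counter(struct mm_model *mm,
					      unsigned int cpu)
{
	return atomic_load(&mm->active_cpus) == 1 &&
	       (atomic_load(&mm->cpumask) & (1UL << cpu));
}

/* Single-word test: one load of the mask compared against this CPU's bit. */
static inline bool mm_is_thread_local_mask(struct mm_model *mm,
					   unsigned int cpu)
{
	return atomic_load(&mm->cpumask) == (1UL << cpu);
}
```

In this model both tests give the same answer whenever the counter matches the number of set bits, which is the reason the counter might only be worth defining when NR_CPUS > BITS_PER_LONG.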