Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown... | linuxppc-dev

Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration

From: Yicong Yang <hidden>
Date: 2023-07-05 10:24:23
Also in: linux-arm-kernel, linux-doc, linux-mips, linux-mm, linux-riscv, linux-s390, lkml
Subsystem: memory management, memory management - rmap (reverse mapping), the rest · Maintainers: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Linus Torvalds

On 2023/7/5 16:43, Barry Song wrote:

On Tue, Jul 4, 2023 at 10:36 PM Yicong Yang [off-list ref] wrote:

quoted

On 2023/6/30 1:26, Catalin Marinas wrote:

quoted

On Thu, Jun 29, 2023 at 05:31:36PM +0100, Catalin Marinas wrote:

quoted

On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote:

quoted

From: Barry Song <redacted>

on x86, batched and deferred tlb shootdown has lead to 90%
performance increase on tlb shootdown. on arm64, HW can do
tlb shootdown without software IPI. But sync tlbi is still
quite expensive.

[...]

quoted

 .../features/vm/TLB/arch-support.txt          |  2 +-
 arch/arm64/Kconfig                            |  1 +
 arch/arm64/include/asm/tlbbatch.h             | 12 ++++
 arch/arm64/include/asm/tlbflush.h             | 33 ++++++++-
 arch/arm64/mm/flush.c                         | 69 +++++++++++++++++++
 arch/x86/include/asm/tlbflush.h               |  5 +-
 include/linux/mm_types_task.h                 |  4 +-
 mm/rmap.c                                     | 12 ++--

First of all, this patch needs to be split in some preparatory patches
introducing/renaming functions with no functional change for x86. Once
done, you can add the arm64-only changes.

got it. will try to split this patch as suggested.

quoted

Now, on the implementation, I had some comments on v7 but we didn't get
to a conclusion and the thread eventually died:

https://lore.kernel.org/linux-mm/Y7cToj5mWd1ZbMyQ@arm.com/ (local)

I know I said a command line argument is better than Kconfig or some
random number of CPUs heuristics but it would be even better if we don't
bother with any, just make this always on.

ok, will make this always on.

quoted

Barry had some comments
around mprotect() being racy and that's why we have
flush_tlb_batched_pending() but I don't think it's needed (or, for
arm64, it can be a DSB since this patch issues the TLBIs but without the
DVM Sync). So we need to clarify this (see Barry's last email on the
above thread) and before attempting new versions of this patchset. With
flush_tlb_batched_pending() removed (or DSB), I have a suspicion such
implementation would be faster on any SoC irrespective of the number of
CPUs.

I think I got the need for flush_tlb_batched_pending(). If
try_to_unmap() marks the pte !present and we have a pending TLBI,
change_pte_range() will skip the TLB maintenance altogether since it did
not change the pte. So we could be left with stale TLB entries after
mprotect() before TTU does the batch flushing.

Good catch.
This could be also true for MADV_DONTNEED. after try_to_unmap, we run
MADV_DONTNEED on this area, as pte is not present, we don't do anything
on this PTE in zap_pte_range afterwards.

quoted

We can have an arch-specific flush_tlb_batched_pending() that can be a
DSB only on arm64 and a full mm flush on x86.

We need to do a flush/dsb in flush_tlb_batched_pending() only in a race
condition so we first check whether there's a pended batched flush and
if so do the tlb flush. The pending checking is common and the differences
among the archs is how to flush the TLB here within the flush_tlb_batched_pending(),
on arm64 it should only be a dsb.

As we only needs to maintain the TLBs already pended in batched flush,
does it make sense to only handle those TLBs in flush_tlb_batched_pending()?
Then we can use the arch_tlbbatch_flush() rather than flush_tlb_mm() in
flush_tlb_batched_pending() and no arch specific function needed.

as we have issued no-sync tlbi on those pending addresses , that means
our hardware
has already "recorded" what should be flushed in the specific mm. so
DSB only will flush
them correctly. right?

yes it's right. I was just thought something like below. arch_tlbbatch_flush()
will only be a dsb on arm64 so this will match what Catalin wants. But as you
told that this maybe incorrect on x86 so we'd better have arch specific
implementation for flush_tlb_batched_pending() as suggested.

diff --git a/mm/rmap.c b/mm/rmap.c
index 9699c6011b0e..afa3571503a0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c

@@ -717,7 +717,7 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
        int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;

        if (pending != flushed) {
-               flush_tlb_mm(mm);
+               arch_tlbbatch_flush(&current->tlb_ubc.arch);
                /*
                 * If the new TLB flushing is pending during flushing, leave
                 * mm->tlb_flush_batched as is, to avoid losing flushing.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help