Re: [RFC 00/20] TLB batching consolidation and enhancements
From: Nadav Amit <hidden>
Date: 2021-01-31 08:10:10
Also in:
linux-mm, linux-s390, lkml
On Jan 30, 2021, at 7:30 PM, Nicholas Piggin [off-list ref] wrote:
Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
quoted
From: Nadav Amit <redacted>

There are currently (at least?) 5 different TLB batching schemes in the kernel:

1. Using mmu_gather (e.g., zap_page_range()).
2. Using {inc|dec}_tlb_flush_pending() to inform other threads of the ongoing deferred TLB flush and flushing the entire range eventually (e.g., change_protection_range()).
3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
4. Batching per-table flushes (move_ptes()).
5. By setting a flag on that a deferred TLB flush operation takes place, flushing when (try_to_unmap_one() on x86).

It seems that (1)-(4) can be consolidated. In addition, it seems that (5) is racy. It also seems there can be many redundant TLB flushes, and potentially TLB-shootdown storms, for instance during batched reclamation (using try_to_unmap_one()) if at the same time mmu_gather defers TLB flushes.

More aggressive TLB batching may be possible, but this patch-set does not add such batching. The proposed changes would enable such batching at a later time.

Admittedly, I do not understand how things are not broken today, which frightens me to make further batching before getting things in order. For instance, why is it ok for zap_pte_range() to batch dirty-PTE flushes for each page-table (but not in greater granularity)? Can't ClearPageDirty() be called before the flush, causing writes after ClearPageDirty() and before the flush to be lost?

Because it's holding the page table lock which stops page_mkclean from cleaning the page. Or am I misunderstanding the question?
Thanks. I understood this part. Looking again at the code, I now understand my confusion: I forgot that the reverse mapping is removed after the PTE is zapped. Makes me wonder whether it is ok to defer the TLB flush to tlb_finish_mmu(), by performing set_page_dirty() for the batched pages when needed in tlb_finish_mmu() [ i.e., by marking for each batched page whether set_page_dirty() should be issued for that page while collecting them ].
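To make the idea in brackets concrete, here is a minimal userspace sketch of "mark while collecting, dirty at finish". Everything in it is a hypothetical stand-in (toy_page, toy_gather, toy_gather_add(), toy_gather_finish()); none of it is the kernel's mmu_gather API, and it says nothing about whether deferring the flush this way is actually safe against the races discussed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; these are NOT the kernel's struct page or mmu_gather. */
struct toy_page {
	bool dirty;
};

struct toy_batch_entry {
	struct toy_page *page;
	bool needs_dirty;	/* the zapped PTE was dirty */
};

#define TOY_BATCH_MAX 8

struct toy_gather {
	struct toy_batch_entry entries[TOY_BATCH_MAX];
	size_t nr;
	bool flushed;		/* stands in for "the TLB flush was issued" */
};

static void toy_gather_init(struct toy_gather *tlb)
{
	tlb->nr = 0;
	tlb->flushed = false;
}

/*
 * Collect a page while its PTE is being zapped and remember whether
 * the page has to be marked dirty later. Returns false if the batch
 * is full and the caller must finish it first.
 */
static bool toy_gather_add(struct toy_gather *tlb, struct toy_page *page,
			   bool pte_was_dirty)
{
	if (tlb->nr == TOY_BATCH_MAX)
		return false;
	tlb->entries[tlb->nr].page = page;
	tlb->entries[tlb->nr].needs_dirty = pte_was_dirty;
	tlb->nr++;
	return true;
}

/* Issue the (pretend) flush first, then propagate the recorded dirty bits. */
static void toy_gather_finish(struct toy_gather *tlb)
{
	tlb->flushed = true;
	for (size_t i = 0; i < tlb->nr; i++) {
		if (tlb->entries[i].needs_dirty)
			tlb->entries[i].page->dirty = true;
	}
	tlb->nr = 0;
}
```

A caller would invoke toy_gather_add() once per zapped PTE and toy_gather_finish() once at the end; the per-entry flag is the only extra state compared to a plain page batch.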
I'll go through the patches a bit more closely when they all come through. Sparc and powerpc of course need the arch lazy mode to get per-page/pte information for operations that are not freeing pages, which is what mmu gather is designed for.
IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE where no previous PTE was set, right?
I wouldn't mind using a similar API so it's less of a black box when reading generic code, but it might not quite fit the mmu gather API exactly (most of these paths don't want a full mmu_gather on stack).
I see your point. It may be possible to create two mmu_gather structs: a small one that only holds the flush information and another that also holds the pages.
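Roughly, the split could look like the sketch below. The struct and function names (toy_flush_gather, toy_page_gather, toy_flush_track(), ...) and the fixed-size page array are made up for illustration only and do not reflect the layout of the real mmu_gather.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical small struct: only the range that still needs flushing. */
struct toy_flush_gather {
	unsigned long start;
	unsigned long end;	/* exclusive; start >= end means "nothing pending" */
};

/* Hypothetical larger struct: embeds the small one and adds a page batch. */
struct toy_page_gather {
	struct toy_flush_gather flush;
	void *pages[16];
	size_t nr_pages;
};

static void toy_flush_init(struct toy_flush_gather *fg)
{
	fg->start = ~0UL;
	fg->end = 0;
}

/* Widen the pending range so that it covers [addr, addr + size). */
static void toy_flush_track(struct toy_flush_gather *fg,
			    unsigned long addr, unsigned long size)
{
	if (addr < fg->start)
		fg->start = addr;
	if (addr + size > fg->end)
		fg->end = addr + size;
}

static bool toy_flush_pending(const struct toy_flush_gather *fg)
{
	return fg->start < fg->end;
}
```

Paths that only change PTEs would put a toy_flush_gather on the stack, while paths that free pages would use toy_page_gather and pass &pg.flush to the same range-tracking helpers.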
quoted
This patch-set therefore performs the following changes:

1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather instead of {inc|dec}_tlb_flush_pending().
2. Avoid TLB flushes if PTE permission is not demoted.
3. Cleans up mmu_gather to be less arch-dependent.
4. Uses mm's generations to track in finer granularity, either per-VMA or per page-table, whether a pending mmu_gather operation is outstanding. This should allow to avoid some TLB flushes when KSM or memory reclamation takes place while another operation such as munmap() or mprotect() is running.
5. Changes the try_to_unmap_one() flushing scheme, as the current one seems broken, to track in a bitmap which CPUs have outstanding TLB flushes instead of having a flag.

Putting fixes first, and cleanups and independent patches (like #2) next would help with getting stuff merged and backported.
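For readers skimming the list, item 2 (skipping the flush when permissions are not demoted) boils down to a check along these lines. This is a toy model with made-up TOY_PERM_* bits in which every bit grants a permission; real PTE formats differ (inverted bits such as NX, accessed/dirty bits and so on), so the actual test in the patches is necessarily arch-specific.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical permission bits; a set bit always grants something. */
#define TOY_PERM_READ	0x1u
#define TOY_PERM_WRITE	0x2u
#define TOY_PERM_EXEC	0x4u

/*
 * A change is a demotion if it removes at least one permission the old
 * PTE granted. Only then could a stale TLB entry allow an access that
 * the new PTE forbids, so only then is a flush required in this model.
 */
static bool toy_perm_demoted(unsigned int old_perm, unsigned int new_perm)
{
	return (old_perm & ~new_perm) != 0;
}
```

With this helper, going from read/write to read-only reports a demotion, while going from read-only to read/write does not. The underlying assumption is that a stale, more restrictive entry presumably costs at most a spurious fault.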
I tried to do it mostly this way. There are some theoretical races which I did not manage (or try hard enough) to create, so I did not include them in the “fixes” section. I will restructure the patch-set according to the feedback.

Thanks,
Nadav