Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
From: Hugh Dickins <hughd@google.com>
Date: 2023-06-30 19:23:03
Also in:
linux-mm, linux-s390, lkml, sparclinux
On Fri, 30 Jun 2023, Claudio Imbrenda wrote:
On Fri, 30 Jun 2023 08:28:54 -0700 (PDT) Hugh Dickins [off-list ref] wrote:quoted
On Fri, 30 Jun 2023, Claudio Imbrenda wrote:quoted
On Tue, 20 Jun 2023 00:51:19 -0700 (PDT) Hugh Dickins [off-list ref] wrote: [...]quoted
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable) +{ + unsigned int bit, mask; + struct page *page; + + page = virt_to_page(pgtable); + if (mm_alloc_pgste(mm)) { + call_rcu(&page->rcu_head, pte_free_pgste);so is this now going to be used to free page tables instead of page_table_free_rcu?No. All pte_free_defer() is being used for (in this series; and any future use beyond this series will have to undertake its own evaluations) is for the case of removing an empty page table, which used to map a group of PTE mappings of a file, in order to make way for one PMD mapping of the huge page which those scattered pages have now been gathered into. You're worried by that mm_alloc_pgste() block: it's something I didn'tactually no, but thanks for bringing it up :Dquoted
have at all in my first draft, then I thought that perhaps the pgste case might be able to come this way, so it seemed stupid to leave out the handling for it. I hope that you're implying that should be dead code here? Perhaps, that the pgste case corresponds to the case in s390 where THPs are absolutely forbidden? That would be good news for us. Gerald, in his version of this block, added a comment asking: /* * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in * page_table_free_rcu()? * If yes -> need addr parameter here, like in pte_free_tlb(). */ Do you have the answer to that? Neither of us could work it out.this is the thing I'm worried about; removing a page table that was used to map a guest will leave dangling pointers in the gmap that will cause memory corruption (I actually ran into that problem myself for another patchseries). gmap_unlink() is needed to clean up the pointers before they become dangling (and also potentially do some TLB purging as needed)
That's something I would have expected to be handled already via mmu_notifiers, rather than buried inside the page table freeing. If s390 is the only architecture to go that way, and could instead do it via mmu_notifiers, then I think that will be more easily supported in the long term. But I'm writing from a position of very great ignorance: advising KVM on s390 is many dimensions away from what I'm capable of.
the point here is: we need that only for page_table_free_rcu(); all other users of page_table_free() cannot act on guest page tables
I might be wrong, but I think that most users of page_table_free() are merely freeing a page table which had to be allocated up front, but was then found unnecessary (maybe a racing task already inserted one): page tables which were never exposed to actual use.
(because we don't allow THP for KVM guests). and that is why page_table_free() does not do gmap_unlink() currently.
But THP collapse does (or did before this series) use it to free a page table which had been exposed to use. The fact that s390 does not allow THP for KVM guests makes page_table_free(), and this new pte_free_defer(), safe for that; but it feels dangerously coincidental. It's easy to imagine a future change being made, which would stumble over this issue. I have imagined that pte_free_defer() will be useful in future, in the freeing of empty page tables: but s390 may pose a problem there - though perhaps no more of a problem than additionally needing to pass a virtual address down the stack.
quoted
quoted
or will it be used instead of page_table_free?Not always; but yes, this case of removing a page table used page_table_free() before; but now, with the lighter locking, needs to keep the page table valid until the RCU grace period expires.so if I understand correctly your code will, sometimes, under some circumstances, replace what page_table_free() does, but it will never replace page_table_free_rcu()? because in that case there would be no issues
Yes, thanks for confirming: we have no issue here at present, but may do if use of pte_free_defer() is extended to other contexts in future. Would it be appropriate to add a WARN_ON_ONCE around that
quoted
quoted
quoted
+ if (mm_alloc_pgste(mm)) {
in pte_free_defer()? I ask that somewhat rhetorically: that block disappears in the later version I was working on last night (and will return to shortly), in which pte_free_defer() just sets a bit and calls page_table_free(). But I'd like to understand the possibilities better: does mm_alloc_pgste() correspond 1:1 to KVM guest on s390, or does it cover several different possibilities of which KVM guest is one, or am I just confused to be thinking there's any relationship? Thanks, Hugh
quoted
quoted
this is actually quite important for KVM on s390None of us are wanting to break KVM on s390: your guidance appreciated! Thanks, Hugh