Thread (47 messages) 47 messages, 5 authors, 2023-07-10

Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

From: Hugh Dickins <hughd@google.com>
Date: 2023-06-30 19:23:03
Also in: linux-mm, linux-s390, lkml, sparclinux

On Fri, 30 Jun 2023, Claudio Imbrenda wrote:
On Fri, 30 Jun 2023 08:28:54 -0700 (PDT)
Hugh Dickins [off-list ref] wrote:
quoted
On Fri, 30 Jun 2023, Claudio Imbrenda wrote:
quoted
On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
Hugh Dickins [off-list ref] wrote:

[...]
  
quoted
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	unsigned int bit, mask;
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	if (mm_alloc_pgste(mm)) {
+		call_rcu(&page->rcu_head, pte_free_pgste);  
so is this now going to be used to free page tables
instead of page_table_free_rcu?  
No.

All pte_free_defer() is being used for (in this series; and any future
use beyond this series will have to undertake its own evaluations) is
for the case of removing an empty page table, which used to map a group
of PTE mappings of a file, in order to make way for one PMD mapping of
the huge page which those scattered pages have now been gathered into.

You're worried by that mm_alloc_pgste() block: it's something I didn't
actually no, but thanks for bringing it up :D
quoted
have at all in my first draft, then I thought that perhaps the pgste
case might be able to come this way, so it seemed stupid to leave out
the handling for it.

I hope that you're implying that should be dead code here?  Perhaps,
that the pgste case corresponds to the case in s390 where THPs are
absolutely forbidden?  That would be good news for us.

Gerald, in his version of this block, added a comment asking:
	/*
	 * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
	 * page_table_free_rcu()?
	 * If yes -> need addr parameter here, like in pte_free_tlb().
	 */
Do you have the answer to that?  Neither of us could work it out.
this is the thing I'm worried about; removing a page table that was
used to map a guest will leave dangling pointers in the gmap that will
cause memory corruption (I actually ran into that problem myself for
another patchseries).

gmap_unlink() is needed to clean up the pointers before they become
dangling (and also potentially do some TLB purging as needed)
That's something I would have expected to be handled already via
mmu_notifiers, rather than buried inside the page table freeing.

If s390 is the only architecture to go that way, and could instead do
it via mmu_notifiers, then I think that will be more easily supported
in the long term.

But I'm writing from a position of very great ignorance: advising
KVM on s390 is many dimensions away from what I'm capable of.
the point here is: we need that only for page_table_free_rcu(); all
other users of page_table_free() cannot act on guest page tables
I might be wrong, but I think that most users of page_table_free()
are merely freeing a page table which had to be allocated up front,
but was then found unnecessary (maybe a racing task already inserted
one): page tables which were never exposed to actual use.
(because we don't allow THP for KVM guests). and that is why
page_table_free() does not do gmap_unlink() currently.
But THP collapse does (or did before this series) use it to free a
page table which had been exposed to use.  The fact that s390 does
not allow THP for KVM guests makes page_table_free(), and this new
pte_free_defer(), safe for that; but it feels dangerously coincidental.

It's easy to imagine a future change being made, which would stumble
over this issue.  I have imagined that pte_free_defer() will be useful
in future, in the freeing of empty page tables: but s390 may pose a
problem there - though perhaps no more of a problem than additionally
needing to pass a virtual address down the stack.
quoted
quoted
or will it be used instead of page_table_free?  
Not always; but yes, this case of removing a page table used
page_table_free() before; but now, with the lighter locking, needs
to keep the page table valid until the RCU grace period expires.
so if I understand correctly your code will, sometimes, under some
circumstances, replace what page_table_free() does, but it will never
replace page_table_free_rcu()?

because in that case there would be no issues 
Yes, thanks for confirming: we have no issue here at present, but may
do if use of pte_free_defer() is extended to other contexts in future.

Would it be appropriate to add a WARN_ON_ONCE around that
quoted
quoted
quoted
+	if (mm_alloc_pgste(mm)) {
in pte_free_defer()?

I ask that somewhat rhetorically: that block disappears in the later
version I was working on last night (and will return to shortly), in
which pte_free_defer() just sets a bit and calls page_table_free().

But I'd like to understand the possibilities better: does mm_alloc_pgste()
correspond 1:1 to KVM guest on s390, or does it cover several different
possibilities of which KVM guest is one, or am I just confused to be
thinking there's any relationship?

Thanks,
Hugh
quoted
quoted
this is actually quite important for KVM on s390  
None of us are wanting to break KVM on s390: your guidance appreciated!

Thanks,
Hugh
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help