Re: [PATCH v6 21/23] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD

From: "Nicholas Piggin" <npiggin@gmail.com>
Date: 2024-06-26 01:23:18
Also in: linux-mm, lkml

On Tue Jun 25, 2024 at 3:20 PM AEST, LEROY Christophe wrote:


On 25/06/2024 at 06:49, Nicholas Piggin wrote:

On Tue Jun 25, 2024 at 12:45 AM AEST, Christophe Leroy wrote:

On book3s/64, the only user of hugepd is hash in 4k mode.

All other setups (hash-64, radix-4, radix-64) use leaf PMD/PUD.

Rework hash-4k to use contiguous PMD and PUD instead.

In that setup there are only two huge page sizes: 16M and 16G.

16M sits at PMD level and 16G at PUD level.
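As a rough illustration of what "contiguous" means here, the number of page-table entries covered by one huge page is the huge page size divided by the size mapped by one entry at that level. The sketch below assumes the usual hash-4k geometry (4K base pages, 9-bit PTE index, 7-bit PMD index, giving a 2M PMD_SIZE and a 256M PUD_SIZE); those shifts and all MODEL_* names are assumptions for illustration, not taken from the patch.

```c
#include <assert.h>

/* Assumed hash-4k geometry; check the real headers before relying on it. */
#define MODEL_PAGE_SHIFT	12	/* 4K base pages */
#define MODEL_PTE_INDEX_SIZE	9	/* assumption */
#define MODEL_PMD_INDEX_SIZE	7	/* assumption */
#define MODEL_PMD_SHIFT		(MODEL_PAGE_SHIFT + MODEL_PTE_INDEX_SIZE)
#define MODEL_PUD_SHIFT		(MODEL_PMD_SHIFT + MODEL_PMD_INDEX_SIZE)

#define MODEL_SZ_16M		(16ULL << 20)
#define MODEL_SZ_16G		(16ULL << 30)

/*
 * Number of contiguous entries at a given level needed to map one huge
 * page, mirroring the "SZ_16M / PMD_SIZE" style of computation.
 */
static inline unsigned long long model_contig_entries(unsigned long long huge_size,
						      unsigned int level_shift)
{
	unsigned long long entry_size = 1ULL << level_shift;

	/* A huge page smaller than one entry would not be "contiguous". */
	assert(huge_size >= entry_size);
	return huge_size / entry_size;
}
```

Under these assumptions a 16M page spans 8 PMD entries and a 16G page spans 64 PUD entries.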

pte_update doesn't know page size, let's use the same trick as
hpte_need_flush() to get page size from segment properties. That's
not the most efficient way but let's do that until callers of
pte_update() provide page size instead of just a huge flag.

Signed-off-by: Christophe Leroy <redacted>

[snip]

+static inline unsigned long hash__pte_update(struct mm_struct *mm,
+					 unsigned long addr,
+					 pte_t *ptep, unsigned long clr,
+					 unsigned long set,
+					 int huge)
+{
+	unsigned long old;
+
+	old = hash__pte_update_one(ptep, clr, set);
+
+	if (IS_ENABLED(CONFIG_PPC_4K_PAGES) && huge) {
+		unsigned int psize = get_slice_psize(mm, addr);
+		int nb, i;
+
+		if (psize == MMU_PAGE_16M)
+			nb = SZ_16M / PMD_SIZE;
+		else if (psize == MMU_PAGE_16G)
+			nb = SZ_16G / PUD_SIZE;
+		else
+			nb = 1;
+
+		WARN_ON_ONCE(nb == 1);	/* Should never happen */
+
+		for (i = 1; i < nb; i++)
+			hash__pte_update_one(ptep + i, clr, set);
+	}
  	/* huge pages use the old page table lock */
  	if (!huge)
  		assert_pte_locked(mm, addr);
  
-	old = be64_to_cpu(old_be);
  	if (old & H_PAGE_HASHPTE)
  		hpte_need_flush(mm, addr, ptep, old, huge);

We definitely need a bit more comment and changelog about the atomicity
issues here. I think the plan should be all hash-side access just
operates on PTE[0], which should avoid that whole race. There could be
some cases that don't follow that. Adding some warnings to catch such
things could be good too.
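One possible shape for such debug checks, as a userspace model only: verify that a caller is really looking at the first entry of a contiguous group, and that the replicated entries agree with PTE[0] on the bits that are supposed to be kept in sync. The flag values and every model_* name below are hypothetical and do not correspond to kernel definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for this model; not the kernel's values. */
#define MODEL_ACCESSED	(UINT64_C(1) << 0)
#define MODEL_DIRTY	(UINT64_C(1) << 1)

/*
 * True when entry number 'index' within its table is the first entry
 * of a group of 'nb' contiguous entries, i.e. it plays the role of PTE[0].
 */
static inline bool model_is_head_entry(size_t index, size_t nb)
{
	return nb != 0 && index % nb == 0;
}

/*
 * True when entries 1..nb-1 agree with entry 0 on every bit in 'mask'.
 * A real check would have to tolerate bits that legitimately differ.
 */
static inline bool model_contig_consistent(const uint64_t *ptep, size_t nb,
					   uint64_t mask)
{
	for (size_t i = 1; i < nb; i++)
		if ((ptep[i] ^ ptep[0]) & mask)
			return false;
	return true;
}
```

In a kernel context these would presumably feed a WARN_ON_ONCE() in a debug build; where exactly such a check would be safe to call is left open here.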

That seems to be the case indeed, as we have the following in 
hash_page_mm():

#ifndef CONFIG_PPC_64K_PAGES
	/*
	 * If we use 4K pages and our psize is not 4K, then we might
	 * be hitting a special driver mapping, and need to align the
	 * address before we fetch the PTE.
	 *
	 * It could also be a hugepage mapping, in which case this is
	 * not necessary, but it's not harmful, either.
	 */
	if (psize != MMU_PAGE_4K)
		ea &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
#endif /* CONFIG_PPC_64K_PAGES */

Yeah, for that one it works (comment needs updating to say that it
*is* necessary). I think that's the main thing but there's other
possible places where it might not hold -- KVM too, not just the
hash refill.

I'd been meaning to do more on this sooner, sorry. I've started
tinkering with adding a bit of debug code. I'll see if I can help with
adding a bit of comments.

Yes, that would be very welcome. I guess you'll send it as a followup/fixup
patch to the series?

Yeah, the basic approach I think is good, so it wouldn't be a
big rework.

[snip]

diff --git a/arch/powerpc/mm/book3s64/hugetlbpage.c b/arch/powerpc/mm/book3s64/hugetlbpage.c
index 5a2e512e96db..83c3361b358b 100644
--- a/arch/powerpc/mm/book3s64/hugetlbpage.c
+++ b/arch/powerpc/mm/book3s64/hugetlbpage.c

@@ -53,6 +53,16 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
  		/* If PTE permissions don't match, take page fault */
  		if (unlikely(!check_pte_access(access, old_pte)))
  			return 1;
+		/*
+		 * If hash-4k, hugepages use several contiguous PxD entries
+		 * so bail out and let mm make the page young or dirty
+		 */
+		if (IS_ENABLED(CONFIG_PPC_4K_PAGES)) {
+			if (!(old_pte & _PAGE_ACCESSED))
+				return 1;
+			if ((access & _PAGE_WRITE) && !(old_pte & _PAGE_DIRTY))
+				return 1;
+		}
  
  		/*
  		 * Try to lock the PTE, add ACCESSED and DIRTY if it was

I'm hoping we wouldn't have to do this, if we follow the PTE[0] rule.

But don't we still need all entries to be updated, so that page walkers which
don't know they must use PTE[0] get the right information?

Ah yeah. Maybe for ACCESSED|DIRTY we can slightly adjust that rule
and apply it to all PTEs. If we can do that then it takes care of
a few other cases too.

But what is the consequence of two pte_update racing? Let's say
page_vma_mkclean_one vs setting dirty. Can you end up with some
PTEs dirty and some not?
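To make the question concrete, here is a deterministic userspace model of the interleaving in question. It assumes each entry is updated atomically on its own but nothing serializes the two walkers across the whole group; whether that can actually happen in the kernel depends on the locking, which this sketch does not model. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_DIRTY	(UINT64_C(1) << 1)	/* hypothetical bit */
#define MODEL_NB	8			/* entries per huge page */

/* Per-entry update: clear 'clr', then set 'set'. */
static inline void model_update_one(uint64_t *pte, uint64_t clr, uint64_t set)
{
	*pte = (*pte & ~clr) | set;
}

/*
 * A "cleaner" clears DIRTY on all entries while a "dirtier" sets it.
 * The cleaner handles entries [0, split), the dirtier then runs over
 * all entries, and the cleaner finishes [split, MODEL_NB).
 * Returns how many entries end up dirty.
 */
static inline size_t model_race(size_t split)
{
	uint64_t pte[MODEL_NB];
	size_t i, dirty = 0;

	for (i = 0; i < MODEL_NB; i++)
		pte[i] = MODEL_DIRTY;

	for (i = 0; i < split; i++)		/* cleaner, first part */
		model_update_one(&pte[i], MODEL_DIRTY, 0);
	for (i = 0; i < MODEL_NB; i++)		/* dirtier, all entries */
		model_update_one(&pte[i], 0, MODEL_DIRTY);
	for (i = split; i < MODEL_NB; i++)	/* cleaner, remainder */
		model_update_one(&pte[i], MODEL_DIRTY, 0);

	for (i = 0; i < MODEL_NB; i++)
		if (pte[i] & MODEL_DIRTY)
			dirty++;
	return dirty;
}
```

In this model, any split strictly between 0 and MODEL_NB leaves a mix: the first 'split' entries dirty and the rest clean.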

Thanks,
Nick
