Re: [PATCH v4] arm64: Enable vmalloc-huge with ptdump

From: Dev Jain <dev.jain@arm.com>
Date: 2025-07-14 09:27:07
Also in: lkml

On 07/07/25 6:14 am, Catalin Marinas wrote:

On Fri, Jul 04, 2025 at 05:12:13PM +0530, Dev Jain wrote:

quoted

On 04/07/25 4:52 pm, Catalin Marinas wrote:

quoted

On Thu, Jun 26, 2025 at 10:55:24AM +0530, Dev Jain wrote:

quoted

@@ -1301,16 +1314,39 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
   	}
   	table = pmd_offset(pudp, addr);
+	/*
+	 * Isolate the PMD table; in case of race with ptdump, this helps
+	 * us to avoid taking the lock in __pmd_free_pte_page().
+	 *
+	 * Static key logic:
+	 *
+	 * Case 1: If ptdump does static_branch_enable(), and after that we
+	 * execute the if block, then this patches in the read lock, ptdump has
+	 * the write lock patched in, therefore ptdump will never read from
+	 * a potentially freed PMD table.
+	 *
+	 * Case 2: If the if block starts executing before ptdump's
+	 * static_branch_enable(), then no locking synchronization
+	 * will be done. However, pud_clear() + the dsb() in
+	 * __flush_tlb_kernel_pgtable will ensure that ptdump observes an

[...]

quoted

I don't get case 2. You want to ensure pud_clear() is observed by the
ptdump code before the pmd_free(). The DSB in the TLB flushing code
ensures some ordering between the pud_clear() and presumably something
that the ptdump code can observe as well. Would that be the mmap
semaphore? However, the read_lock would only be attempted if this code
is seeing the static branch update, which is not guaranteed. I don't
think it even matters since the lock may be released anyway before the
write_lock in ptdump.

For example, you do a pud_clear() above, skip the whole static branch.
The ptdump comes along on another CPU but does not observe the
pud_clear() since there's no synchronisation. It goes ahead and
dereferences it while this CPU does a pmd_free().

The objective is: ptdump must not dereference a freed pagetable.
So for your example, if the static branch is not observed by
pud_free_pmd_page, this means that ptdump will take the write lock
after the execution of flush_tlb_kernel_pagetable completes (for if ptdump takes
the write lock before execution of flush_tlb_kernel_pagetable completes, we have
executed static_branch_enable(), contradiction).

I don't see why the write lock matters since pud_free_pmd_page() doesn't

True.

take the read lock in the second scenario. What we need is acquire
semantics after the static branch update on the ptdump path but we get
it before we even attempt the write lock.

For simplicity, ignoring the observability of instruction writes and
considering the static branch a variable, if pud_free_pmd_page() did not
observe the static branch update, is the ptdump guaranteed to see the
cleared pud subsequently?

With initial state pud=1 (non-zero), stb=0 (static branch):

P0 (pud_free_pmd_page)		P1 (ptdump)

     W_pud=0			   W_stb=1
     DSB				   barrier/acq
     R_stb=0			   R_pud=?

The write to the static branch on P1 will be ordered after the read of
the branch on P0, so the pud will be seen as 0. It's not even worth
mentioning the semaphore here as the static branch update has enough
barriers for cache flushing and kick_all_cpus_sync().


The other scenario is P0 (pud_free_pmd_page) observing the write to the
static branch (that's case 1 in your comment). This doesn't say anything
about whether P1 (ptdump) sees a clear or valid pud. What we do know is
that P0 will try to acquire (and release) the lock. If P1 already
acquired the write lock, P0 will wait and the state of the pud is
irrelevant (no freeing). Similarly if P1 already completed by the time
P0 takes the lock.

If P0 takes the lock first, the lock release guarantees that the
pud_clear() is seen by the ptdump code _after_ it acquired the lock.


W.r.t. the visibility of the branch update vs pud access, the
combinations of DSB+ISB (part of the TLB flushing) on P0 and cache
maintenance to PoU together with kick_all_cpus_sync() on P1 should
suffice.

I think the logic works (though I'm heavily jetlagged and writing from a
plane) but the comments need to be improved. As described above, case 1
has two sub-cases depending on when P0 runs in relation to the write
lock (before or during/after). And the write lock doesn't matter for
case 2.

quoted

And I can't get my head around memory ordering, it doesn't look sound.
static_branch_enable() I don't think has acquire semantics, at least not
in relation to the actual branch update. The static_branch_unlikely()
test, again, not sure what guarantees it has (I don't think it has any
in relation to memory loads). Maybe you have worked it all out and is
fine but it needs a better explanation and ideally some simple formal
model. Show it's correct with a global variable first and then we can
optimise with static branches.

What do you suggest? As in what do you mean by showing its correct with
a global variable first...and, for the formal model thingy, do you
want mathematical rigor similar to what you explain in [1] :P, because unfortunately
(and quite embarrassingly) I didn't have formal methods in college : )

Neither did I ;) (mostly analogue electronics). I was thinking something
like our cat/herd tools where you can write some simple assembly. It's a
good investment if you want to give it a try.


Will the following proof work -

Proof of correctness: The below diagram represents pud_free_pmd_page
executing on P0 and ptdump executing on P1. Note that, we can ignore
the situation when processes migrate to another CPU, since we will
have extra barriers because of switch_to(), and all of the embedded
barriers that are used in the reasoning of the proof below apply
to the inner shareable domain, and therefore will be observed by
all CPUs, therefore it suffices to prove for the case where
pud_free_pmd_page executes completely on P0 and ptdump
executes completely on P1.

Let t_i, 0 <= i <= 8 denote the *global* timestamp taken for the corresponding
instruction to complete. Therefore from here on we do not need to use the term
"observe" in a relative context. Let t_i' (t_i dash) denote the global timestamp
for the corresponding instruction to start. That is, an instruction labelled
with t_i implies that it started at t_i' and finished at t_i.


P0				P1:

W_PUD = 0: t0			x = 1: t2

if (x == 1) {: t7			write lock: t3
	read lock: t6		R_PUD = 1: t4
	read unlock: t8		write unlock: t5
}

Free PUD: t1

We need to prove that ptdump completely finishes before
we free the PUD. Since write unlock has release semantics,
if the write unlock finishes, it is guaranteed that ptdump
has finished => it suffices to prove that t5 < t1'.


R_PUD = 1 => t4 < t0 .... (i)

Because of acquire semantics of down_write(rw_semaphore lock),
t3 < t4' .... (ii)

(i) and (ii) (and t4' < t4) => t3 < t0 ... (iii)

ptdump is executed on a single kernel thread, which implies
that the transition x = 1 -> x = 1 will never happen; that is,
when static_branch_enable() is executed, then x was 0, which
means that the call chain static_key_enable -> static_key_enable_cpuslocked
-> jump_label_update -> jump_label_can_update/ arch_jump_label_transform_apply
-> kick_all_cpus_sync -> smp_mb -> __smp_mb -> dmb(ish) will always be followed.
The emission of dmb(ish) => t2 < t3 ... (iv)

(iii) and (iv) => t2 < t0, also, flush_tlb_kernel_pgtable -> dsb(ish) ensures that t0 < t7'
=> t2 < t7' => the static branch is observed by P0 always => t6 and t8 exist.

Now, t0 < t6' because of flush_tlb_kernel_pgtable; combining with (iii), this gives
us t3 < t6' => the write lock is observed first => t5 < t6 (the read lock cannot
be taken before the write unlock finishes) ...(v)

Release semantics of read unlock => t8 < t1' ...(vi)
Also, trivially t6 < t8...(vii)

Combining v, vi and vii, t5 < t1'. Q.E.D

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help