Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
From: Catalin Marinas <catalin.marinas@arm.com>
Date: 2025-08-18 17:55:08
Also in: bpf, linux-arch, lkml
On Sun, Aug 17, 2025 at 03:14:26PM -0700, Ankur Arora wrote:
> So, I tried to pare back the code and the following (untested) is
> what I came up with. Given the straight-forward rate-limiting, and the
> current users not needing accurate timekeeping, this uses a
> bool time_check_expr. Figured I'd keep it simple until someone actually
> needs greater complexity as you suggested.
>
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index d4f581c1e21d..e8793347a395 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -273,6 +273,34 @@ do {						\
>  })
>  #endif
> +
> +#ifndef SMP_TIMEWAIT_SPIN_COUNT
> +#define SMP_TIMEWAIT_SPIN_COUNT		200
> +#endif
> +
> +#ifndef smp_cond_load_relaxed_timewait
> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr,	\
> +					time_check_expr)	\
> +({								\
> +	typeof(ptr) __PTR = (ptr);				\
> +	__unqual_scalar_typeof(*ptr) VAL;			\
> +	u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_COUNT;		\
> +								\
> +	for (;;) {						\
> +		VAL = READ_ONCE(*__PTR);			\
> +		if (cond_expr)					\
> +			break;					\
> +		cpu_relax();					\
> +		if (++__n < __spin)				\
> +			continue;				\
> +		if ((time_check_expr))				\
> +			break;					\
> +		__n = 0;					\
> +	}							\
> +	(typeof(*ptr))VAL;					\
> +})
> +#endif

This looks fine, at least as it would be used by poll_idle(). The only reason for not folding time_check_expr into cond_expr is the poll_idle() requirement to avoid calling time_check_expr too often.
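
To make the rate limiting concrete, here is a minimal userspace sketch of the generic loop quoted above. It is not the kernel macro: the callbacks, `struct poll_stats`, and `SPIN_COUNT` are hypothetical stand-ins for `cond_expr`, `time_check_expr`, and `SMP_TIMEWAIT_SPIN_COUNT`, and C11 atomics replace `READ_ONCE()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for SMP_TIMEWAIT_SPIN_COUNT. */
#define SPIN_COUNT 200u

/* Hypothetical callback types; the kernel macro takes expressions instead. */
typedef bool (*cond_fn)(uint32_t val, void *arg);
typedef bool (*time_check_fn)(void *arg);

/* Bookkeeping used only by the example callbacks below. */
struct poll_stats {
	uint32_t polls;           /* how often the condition was evaluated */
	uint32_t time_checks;     /* how often the time check was evaluated */
	uint32_t max_time_checks; /* time check "expires" at this many calls */
};

/*
 * Poll *ptr until cond() holds or time_check() reports expiry.
 * time_check() is only evaluated once per SPIN_COUNT polls.
 */
static uint32_t cond_load_relaxed_timewait(_Atomic uint32_t *ptr, cond_fn cond,
					   time_check_fn time_check, void *arg)
{
	uint32_t val, n = 0;

	assert(ptr && cond && time_check);
	for (;;) {
		val = atomic_load_explicit(ptr, memory_order_relaxed);
		if (cond(val, arg))
			break;
		/* the kernel version calls cpu_relax() here */
		if (++n < SPIN_COUNT)
			continue;
		if (time_check(arg))
			break;
		n = 0;
	}
	return val;
}

/* Example condition: satisfied once the polled word is non-zero. */
static bool val_nonzero(uint32_t val, void *arg)
{
	struct poll_stats *s = arg;

	s->polls++;
	return val != 0;
}

/* Example time check: reports expiry on the max_time_checks-th call. */
static bool expire_after(void *arg)
{
	struct poll_stats *s = arg;

	return ++s->time_checks >= s->max_time_checks;
}
```

With a word that never changes and a time check that expires on its third call, the loop evaluates the condition 600 times but the time check only 3 times, which is the property poll_idle() cares about.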
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index f5801b0ba9e9..c9934ab68da2 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -219,6 +219,43 @@ do {						\
>  	(typeof(*ptr))VAL;					\
>  })
> +extern bool arch_timer_evtstrm_available(void);
> +
> +#ifndef SMP_TIMEWAIT_SPIN_COUNT
> +#define SMP_TIMEWAIT_SPIN_COUNT		200
> +#endif
> +
> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr,	\
> +					time_check_expr)	\
> +({								\
> +	typeof(ptr) __PTR = (ptr);				\
> +	__unqual_scalar_typeof(*ptr) VAL;			\
> +	u32 __n = 0, __spin = 0;				\
> +	bool __wfet = alternative_has_cap_unlikely(ARM64_HAS_WFXT); \
> +	bool __wfe = arch_timer_evtstrm_available();		\
> +	bool __wait = false;					\
> +								\
> +	if (__wfet || __wfe)					\
> +		__wait = true;					\
> +	else							\
> +		__spin = SMP_TIMEWAIT_SPIN_COUNT;		\
> +								\
> +	for (;;) {						\
> +		VAL = READ_ONCE(*__PTR);			\
> +		if (cond_expr)					\
> +			break;					\
> +		cpu_relax();					\
> +		if (++__n < __spin)				\
> +			continue;				\
> +		if ((time_check_expr))				\
> +			break;					\
> +		if (__wait)					\
> +			__cmpwait_relaxed(__PTR, VAL);		\
> +		__n = 0;					\
> +	}							\
> +	(typeof(*ptr))VAL;					\
> +})

For arm64, I wouldn't bother with the spin count. Since cpu_relax() doesn't do anything, I doubt it makes any difference, especially as we are likely to use WFE anyway. If we do add one, I'd like it backed by some numbers to show it makes a difference in practice. The question is whether 100us granularity is good enough for poll_idle() (I came to the conclusion it's fine for rqspinlock, given their 1ms deadlock check).
>  #include <asm-generic/barrier.h>
>
> __cmpwait_relaxed() will need adjustment to set a deadline for WFET.

Yeah, __cmpwait_relaxed() doesn't use WFET as it doesn't need a timeout
(it just happens to have one with the event stream).
We could extend this or create a new one that uses WFET and takes an
argument. If extending this one, a timeout argument of 0 could mean WFE
and a non-zero value could mean a WFET deadline in cycles, for example.
This adds a couple more instructions.
What I had in mind, if time_expr were a ktime_t, would be something like:

	for (;;) {
		VAL = READ_ONCE(*__PTR);
		if (cond_expr)
			break;

		cycles = some_func_of(time_expr);	// see __udelay()
		if (cycles <= 0)
			break;

		if (__wfet) {
			__cmpwait_relaxed(__PTR, VAL, get_cycles() + cycles);
		} else if (__wfe && cycles >= timer_evt_period) {
			__cmpwait_relaxed(__PTR, VAL, 0);
		} else {
			cpu_relax();
		}
	}

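
As a rough illustration of the `some_func_of(time_expr)` step and the branch selection above, the following userspace sketch converts the remaining nanoseconds to timer cycles and picks a wait primitive. All names here (`remaining_cycles()`, `pick_wait()`, `enum wait_kind`) are hypothetical, and the 24 MHz timer frequency used in the usage note is only an assumed example value.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Which primitive the loop above would use for the next wait. */
enum wait_kind { WAIT_EXPIRED, WAIT_WFET, WAIT_WFE, WAIT_SPIN };

/*
 * Remaining time until deadline_ns, expressed in timer cycles (rounded
 * down). Returns 0 once the deadline has passed. Seconds and the
 * sub-second remainder are scaled separately to avoid 64-bit overflow.
 */
static uint64_t remaining_cycles(uint64_t deadline_ns, uint64_t now_ns,
				 uint64_t timer_hz)
{
	uint64_t left_ns;

	if (now_ns >= deadline_ns)
		return 0;

	left_ns = deadline_ns - now_ns;
	return (left_ns / NSEC_PER_SEC) * timer_hz +
	       (left_ns % NSEC_PER_SEC) * timer_hz / NSEC_PER_SEC;
}

/*
 * Mirror the branch order of the loop: WFET whenever it is available,
 * WFE only if at least one event-stream period is left (otherwise the
 * wakeup could overshoot the deadline), plain spinning as the fallback.
 */
static enum wait_kind pick_wait(uint64_t cycles, bool has_wfet, bool has_wfe,
				uint64_t evt_period_cycles)
{
	if (cycles == 0)
		return WAIT_EXPIRED;
	if (has_wfet)
		return WAIT_WFET;
	if (has_wfe && cycles >= evt_period_cycles)
		return WAIT_WFE;
	return WAIT_SPIN;
}
```

With an assumed 24 MHz timer, 100us corresponds to 2400 cycles, so `pick_wait(2400, false, true, 2400)` selects WFE while `pick_wait(2399, false, true, 2400)` falls back to spinning.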
Now, if we don't care about the time check granularity (for now) and
time_check_expr is a bool (this seems to work better for rqspinlock), I
think we could do something like:

	for (;;) {
		VAL = READ_ONCE(*__PTR);
		if (cond_expr)
			break;
		if (time_check_expr)
			break;

		if (__wfe) {
			__cmpwait_relaxed(__PTR, VAL, 0);
		} else if (__wfet) {
			__cmpwait_relaxed(__PTR, VAL, get_cycles() + timer_evt_period);
		} else {
			cpu_relax();
		}
	}

We go with WFE first in this case to avoid calling get_cycles()
unnecessarily.
I'd suggest we add the WFET support in __cmpwait_relaxed() (or a
different function) as a separate patch, doesn't even need to be part of
this series. WFE is good enough to get things moving. WFET will only
make a difference if (1) we disable the event stream or (2) we need
better accuracy of the timeout.

> AFAICT the rqspinlock code should be able to work by specifying something
> like:
>   ((ktime_get_mono_fast_ns() > tval) ||
>    (deadlock_check(&lock_context)))
> as the time_check_expr.

Why not the whole RES_CHECK_TIMEOUT(...) as in rqspinlock.c? It does the deadlock check only after a timeout over a millisecond. Just follow the res_atomic_cond_read_acquire() calls but replace '||' with a comma.
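
To spell out the "replace '||' with a comma" suggestion: the timeout check stops being part of the condition and becomes its own macro argument. The sketch below is a userspace model only. `COND_LOAD()`, `COND_LOAD_TIMEWAIT()`, `CHECK_TIMEOUT()` and `struct timeout_state` are hypothetical stand-ins, not the rqspinlock macros, whose arguments and internals differ.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical timeout state: "expires" after a fixed number of checks. */
struct timeout_state {
	uint64_t calls;
	uint64_t limit;
};

static int check_timeout(struct timeout_state *ts)
{
	return ++ts->calls >= ts->limit ? -ETIMEDOUT : 0;
}

/* Stand-in for a timeout check: stores the result in ret and yields it. */
#define CHECK_TIMEOUT(ts, ret)	((ret) = check_timeout(&(ts)))

/* Existing style: one expression, the timeout check is OR-ed into it. */
#define COND_LOAD(out, ptr, cond_expr)					\
do {									\
	unsigned int VAL;						\
	for (;;) {							\
		VAL = *(volatile unsigned int *)(ptr);			\
		if (cond_expr)						\
			break;						\
	}								\
	(out) = VAL;							\
} while (0)

/* Timewait style: the timeout check is a separate argument. */
#define COND_LOAD_TIMEWAIT(out, ptr, cond_expr, time_check_expr)	\
do {									\
	unsigned int VAL;						\
	for (;;) {							\
		VAL = *(volatile unsigned int *)(ptr);			\
		if (cond_expr)						\
			break;						\
		if (time_check_expr)					\
			break;						\
	}								\
	(out) = VAL;							\
} while (0)
```

A call such as `COND_LOAD(v, &lock, !VAL || CHECK_TIMEOUT(ts, ret))` then becomes `COND_LOAD_TIMEWAIT(v, &lock, !VAL, CHECK_TIMEOUT(ts, ret))`. The separate argument is what lets an architecture decide how often to evaluate the check and where to wait between polls.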

> I think they also want to rate limit how often deadlock_check() is
> called, so they can redefine SMP_TIMEWAIT_SPIN_COUNT to some large value
> for arm64.

Everyone would want a different rate of checking other stuff, so I think
this needs to go in their time_check_expr.

-- 
Catalin