Re: Inlining migrate_disable/enable. Was: [PATCH bpf-next v2 02/18] x86,bpf:... | netdev

Re: Inlining migrate_disable/enable. Was: [PATCH bpf-next v2 02/18] x86,bpf: add bpf_global_caller for global trampoline

From: Alexei Starovoitov <hidden>
Date: 2025-08-08 00:58:35
Also in: bpf, lkml

On Wed, Aug 6, 2025 at 1:44 AM Menglong Dong [off-list ref] wrote:

On Fri, Aug 1, 2025 at 12:15 AM Alexei Starovoitov
[off-list ref] wrote:

quoted

On Mon, Jul 28, 2025 at 2:20 AM Menglong Dong [off-list ref] wrote:

quoted

On Thu, Jul 17, 2025 at 6:35 AM Alexei Starovoitov
[off-list ref] wrote:

quoted

On Wed, Jul 16, 2025 at 11:24 AM Peter Zijlstra [off-list ref] wrote:

quoted

On Wed, Jul 16, 2025 at 09:56:11AM -0700, Alexei Starovoitov wrote:

quoted

Maybe Peter has better ideas ?

Is it possible to express runqueues::nr_pinned as an alias?

extern unsigned int __attribute__((alias("runqueues.nr_pinned"))) this_nr_pinned;

And use:

        __this_cpu_inc(&this_nr_pinned);


This syntax doesn't actually seem to work; but can we construct
something like that?

Yeah. Iant is right. It's a string and not a pointer dereference.
It never worked.

Few options:

1.
 struct rq {
+#ifdef CONFIG_SMP
+       unsigned int            nr_pinned;
+#endif
        /* runqueue lock: */
        raw_spinlock_t          __lock;

@@ -1271,9 +1274,6 @@ struct rq {
        struct cpuidle_state    *idle_state;
 #endif

-#ifdef CONFIG_SMP
-       unsigned int            nr_pinned;
-#endif

but ugly...

2.
static unsigned int nr_pinned_offset __ro_after_init __used;
RUNTIME_CONST(nr_pinned_offset, nr_pinned_offset)

overkill for what's needed

3.
OFFSET(RQ_nr_pinned, rq, nr_pinned);
then
#include <generated/asm-offsets.h>

imo the best.

I had a try. The struct rq is not visible to asm-offsets.c, so we
can't define it in arch/xx/kernel/asm-offsets.c. Do you mean
to define a similar rq-offsets.c in kernel/sched/ ? It will be more
complex than the way 2, and I think the second way 2 is
easier :/

2 maybe easier, but it's an overkill.
I still think asm-offset is cleaner.
arch/xx shouldn't be used, of course, since this nr_pinned should
be generic for all archs.
We can do something similar to drivers/memory/emif-asm-offsets.c
and do that within kernel/sched/.
rq-offsets.c as you said.
It will generate rq-offsets.h in a build dir that can be #include-d.

I thought about another alternative (as a derivative of 1):
split nr_pinned from 'struct rq' into its own per-cpu variable,
but I don't think that will work, since rq_has_pinned_tasks()
doesn't always operate on this_rq().
So the acceptable choices are realistically 1 and 3 and
rq-offsets.c seems cleaner.
Pls give it another try.

Generally speaking, the way 3 works. The only problem is how
we handle this_rq(). I introduced following code in
include/linux/sched.h:

struct rq;
DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
#define this_rq_ptr() arch_raw_cpu_ptr(&runqueues)

The this_rq_ptr() is used in migrate_enable(). I have to use the
arch_raw_cpu_ptr() for it. this_cpu_ptr() can't be used here, as
it will fail on this_cpu_ptr -> raw_cpu_ptr -> __verify_pcpu_ptr:

#define __verify_pcpu_ptr(ptr)                        \
do {                                    \
    const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL;    \
    (void)__vpp_verify;                        \
} while (0)

The struct rq is not available here, which makes the typeof((ptr) + 0)
fail during compiling. What can we do here?

Interesting.
The comment says:
 * + 0 is required in order to convert the pointer type from a
 * potential array type to a pointer to a single item of the array.

so maybe we can do some macro magic to avoid '+ 0'
when type is already pointer,
but for now let's proceed with arch_raw_cpu_ptr().

According to my testing, the performance of fentry increased from
111M/s to 121M/s with migrate_enable/disable inlined.

Very nice.

quoted hunk ↗ jump to hunk

Following is the whole patch:
-------------------------------------------------------------------------------------------

diff --git a/Kbuild b/Kbuild
index f327ca86990c..13324b4bbe23 100644
--- a/Kbuild
+++ b/Kbuild

@@ -34,13 +34,24 @@ arch/$(SRCARCH)/kernel/asm-offsets.s:

$(timeconst-file) $(bounds-file)
 $(offsets-file): arch/$(SRCARCH)/kernel/asm-offsets.s FORCE
     $(call filechk,offsets,__ASM_OFFSETS_H__)

+# Generate rq-offsets.h
+
+rq-offsets-file := include/generated/rq-offsets.h
+
+targets += kernel/sched/rq-offsets.s
+
+kernel/sched/rq-offsets.s: $(offsets-file)
+
+$(rq-offsets-file): kernel/sched/rq-offsets.s FORCE
+    $(call filechk,offsets,__RQ_OFFSETS_H__)
+
 # Check for missing system calls

 quiet_cmd_syscalls = CALL    $<
       cmd_syscalls = $(CONFIG_SHELL) $< $(CC) $(c_flags)
$(missing_syscalls_flags)

 PHONY += missing-syscalls
-missing-syscalls: scripts/checksyscalls.sh $(offsets-file)
+missing-syscalls: scripts/checksyscalls.sh $(rq-offsets-file)
     $(call cmd,syscalls)

 # Check the manual modification of atomic headers

diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 1fad1c8a4c76..3a1c08a75c09 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h

@@ -369,64 +369,6 @@ static inline void preempt_notifier_init(struct

preempt_notifier *notifier,

 #endif

-/*
- * Migrate-Disable and why it is undesired.

Keep the comment where it is. It will keep the diff smaller.
There is really no need to move it.

- *
- * When a preempted task becomes elegible to run under the ideal model (IOW it

but fix the typos.

- * becomes one of the M highest priority tasks), it might still have to wait
- * for the preemptee's migrate_disable() section to complete. Thereby suffering
- * a reduction in bandwidth in the exact duration of the migrate_disable()
- * section.
- *
- * Per this argument, the change from preempt_disable() to migrate_disable()
- * gets us:
- *
- * - a higher priority tasks gains reduced wake-up latency; with
preempt_disable()
- *   it would have had to wait for the lower priority task.
- *
- * - a lower priority tasks; which under preempt_disable() could've instantly
- *   migrated away when another CPU becomes available, is now constrained
- *   by the ability to push the higher priority task away, which
might itself be
- *   in a migrate_disable() section, reducing it's available bandwidth.
- *
- * IOW it trades latency / moves the interference term, but it stays in the
- * system, and as long as it remains unbounded, the system is not fully
- * deterministic.
- *
- *
- * The reason we have it anyway.
- *
- * PREEMPT_RT breaks a number of assumptions traditionally held. By forcing a
- * number of primitives into becoming preemptible, they would also allow
- * migration. This turns out to break a bunch of per-cpu usage. To this end,
- * all these primitives employ migirate_disable() to restore this implicit
- * assumption.
- *
- * This is a 'temporary' work-around at best. The correct solution is getting
- * rid of the above assumptions and reworking the code to employ explicit
- * per-cpu locking or short preempt-disable regions.
- *
- * The end goal must be to get rid of migrate_disable(), alternatively we need
- * a schedulability theory that does not depend on abritrary migration.

and this one.

quoted hunk ↗ jump to hunk

- *
- *
- * Notes on the implementation.
- *
- * The implementation is particularly tricky since existing code patterns
- * dictate neither migrate_disable() nor migrate_enable() is allowed to block.
- * This means that it cannot use cpus_read_lock() to serialize against hotplug,
- * nor can it easily migrate itself into a pending affinity mask change on
- * migrate_enable().
- *
- *
- * Note: even non-work-conserving schedulers like semi-partitioned depends on
- *       migration, so migrate_disable() is not only a problem for
- *       work-conserving schedulers.
- *
- */
-extern void migrate_disable(void);
-extern void migrate_enable(void);
-
 /**
  * preempt_disable_nested - Disable preemption inside a normally
preempt disabled section
  *

@@ -471,7 +413,6 @@ static __always_inline void preempt_enable_nested(void)

 DEFINE_LOCK_GUARD_0(preempt, preempt_disable(), preempt_enable())
 DEFINE_LOCK_GUARD_0(preempt_notrace, preempt_disable_notrace(),

preempt_enable_notrace())
-DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())

hmm. why?

quoted hunk ↗ jump to hunk

 #ifdef CONFIG_PREEMPT_DYNAMIC

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 40d2fa90df42..365ac6d17504 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h

@@ -48,6 +48,9 @@
 #include <linux/uidgid_types.h>
 #include <linux/tracepoint-defs.h>
 #include <asm/kmap_size.h>
+#ifndef COMPILE_OFFSETS
+#include <generated/rq-offsets.h>
+#endif

 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;

@@ -2299,4 +2302,127 @@ static __always_inline void

alloc_tag_restore(struct alloc_tag *tag, struct allo
 #define alloc_tag_restore(_tag, _old)        do {} while (0)
 #endif

+#if defined(CONFIG_SMP) && !defined(COMPILE_OFFSETS)
+
+extern void __migrate_enable(void);
+
+struct rq;
+DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+#define this_rq_ptr() arch_raw_cpu_ptr(&runqueues)
+
+/*
+ * Migrate-Disable and why it is undesired.
+ *
+ * When a preempted task becomes elegible to run under the ideal model (IOW it
+ * becomes one of the M highest priority tasks), it might still have to wait
+ * for the preemptee's migrate_disable() section to complete. Thereby suffering
+ * a reduction in bandwidth in the exact duration of the migrate_disable()
+ * section.
+ *
+ * Per this argument, the change from preempt_disable() to migrate_disable()
+ * gets us:
+ *
+ * - a higher priority tasks gains reduced wake-up latency; with
preempt_disable()
+ *   it would have had to wait for the lower priority task.
+ *
+ * - a lower priority tasks; which under preempt_disable() could've instantly
+ *   migrated away when another CPU becomes available, is now constrained
+ *   by the ability to push the higher priority task away, which
might itself be
+ *   in a migrate_disable() section, reducing it's available bandwidth.
+ *
+ * IOW it trades latency / moves the interference term, but it stays in the
+ * system, and as long as it remains unbounded, the system is not fully
+ * deterministic.
+ *
+ *
+ * The reason we have it anyway.
+ *
+ * PREEMPT_RT breaks a number of assumptions traditionally held. By forcing a
+ * number of primitives into becoming preemptible, they would also allow
+ * migration. This turns out to break a bunch of per-cpu usage. To this end,
+ * all these primitives employ migirate_disable() to restore this implicit
+ * assumption.
+ *
+ * This is a 'temporary' work-around at best. The correct solution is getting
+ * rid of the above assumptions and reworking the code to employ explicit
+ * per-cpu locking or short preempt-disable regions.
+ *
+ * The end goal must be to get rid of migrate_disable(), alternatively we need
+ * a schedulability theory that does not depend on abritrary migration.
+ *
+ *
+ * Notes on the implementation.
+ *
+ * The implementation is particularly tricky since existing code patterns
+ * dictate neither migrate_disable() nor migrate_enable() is allowed to block.
+ * This means that it cannot use cpus_read_lock() to serialize against hotplug,
+ * nor can it easily migrate itself into a pending affinity mask change on
+ * migrate_enable().
+ *
+ *
+ * Note: even non-work-conserving schedulers like semi-partitioned depends on
+ *       migration, so migrate_disable() is not only a problem for
+ *       work-conserving schedulers.
+ *
+ */
+static inline void migrate_enable(void)
+{
+    struct task_struct *p = current;
+
+#ifdef CONFIG_DEBUG_PREEMPT
+    /*
+     * Check both overflow from migrate_disable() and superfluous
+     * migrate_enable().
+     */
+    if (WARN_ON_ONCE((s16)p->migration_disabled <= 0))
+        return;
+#endif
+
+    if (p->migration_disabled > 1) {
+        p->migration_disabled--;
+        return;
+    }
+
+    /*
+     * Ensure stop_task runs either before or after this, and that
+     * __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule().
+     */
+    guard(preempt)();
+    __migrate_enable();

You're leaving performance on the table.
In many case bpf is one and only user of migrate_enable/disable
and it's not nested.
So this call is likely hot.
Move 'if (p->cpus_ptr != &p->cpus_mask)' check into .h
and only keep slow path of __set_cpus_allowed_ptr() in .c

Can probably wrap it with likely() too.

quoted hunk ↗ jump to hunk

+    /*
+     * Mustn't clear migration_disabled() until cpus_ptr points back at the
+     * regular cpus_mask, otherwise things that race (eg.
+     * select_fallback_rq) get confused.
+     */
+    barrier();
+    p->migration_disabled = 0;
+    (*(unsigned int *)((void *)this_rq_ptr() + RQ_nr_pinned))--;
+}
+
+static inline void migrate_disable(void)
+{
+    struct task_struct *p = current;
+
+    if (p->migration_disabled) {
+#ifdef CONFIG_DEBUG_PREEMPT
+        /*
+         *Warn about overflow half-way through the range.
+         */
+        WARN_ON_ONCE((s16)p->migration_disabled < 0);
+#endif
+        p->migration_disabled++;
+        return;
+    }
+
+    guard(preempt)();
+    (*(unsigned int *)((void *)this_rq_ptr() + RQ_nr_pinned))++;
+    p->migration_disabled = 1;
+}
+#else
+static inline void migrate_disable(void) { }
+static inline void migrate_enable(void) { }
+#endif
+
+DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
+
 #endif

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 399f03e62508..75d5f145ca60 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c

@@ -23853,8 +23853,7 @@ int bpf_check_attach_target(struct

bpf_verifier_log *log,
 BTF_SET_START(btf_id_deny)
 BTF_ID_UNUSED
 #ifdef CONFIG_SMP
-BTF_ID(func, migrate_disable)
-BTF_ID(func, migrate_enable)
+BTF_ID(func, __migrate_enable)
 #endif
 #if !defined CONFIG_PREEMPT_RCU && !defined CONFIG_TINY_RCU
 BTF_ID(func, rcu_read_unlock_strict)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3ec00d08d46a..b521024c99ed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c

@@ -119,6 +119,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);

 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+EXPORT_SYMBOL_GPL(runqueues);

why?

quoted hunk ↗ jump to hunk

 #ifdef CONFIG_SCHED_PROXY_EXEC
 DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);

@@ -2375,28 +2376,7 @@ static void migrate_disable_switch(struct rq

*rq, struct task_struct *p)
     __do_set_cpus_allowed(p, &ac);
 }

-void migrate_disable(void)
-{
-    struct task_struct *p = current;
-
-    if (p->migration_disabled) {
-#ifdef CONFIG_DEBUG_PREEMPT
-        /*
-         *Warn about overflow half-way through the range.
-         */
-        WARN_ON_ONCE((s16)p->migration_disabled < 0);
-#endif
-        p->migration_disabled++;
-        return;
-    }
-
-    guard(preempt)();
-    this_rq()->nr_pinned++;
-    p->migration_disabled = 1;
-}
-EXPORT_SYMBOL_GPL(migrate_disable);
-
-void migrate_enable(void)
+void __migrate_enable(void)
 {
     struct task_struct *p = current;
     struct affinity_context ac = {

@@ -2404,37 +2384,10 @@ void migrate_enable(void)
         .flags     = SCA_MIGRATE_ENABLE,
     };

-#ifdef CONFIG_DEBUG_PREEMPT
-    /*
-     * Check both overflow from migrate_disable() and superfluous
-     * migrate_enable().
-     */
-    if (WARN_ON_ONCE((s16)p->migration_disabled <= 0))
-        return;
-#endif
-
-    if (p->migration_disabled > 1) {
-        p->migration_disabled--;
-        return;
-    }
-
-    /*
-     * Ensure stop_task runs either before or after this, and that
-     * __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule().
-     */
-    guard(preempt)();
     if (p->cpus_ptr != &p->cpus_mask)
         __set_cpus_allowed_ptr(p, &ac);
-    /*
-     * Mustn't clear migration_disabled() until cpus_ptr points back at the
-     * regular cpus_mask, otherwise things that race (eg.
-     * select_fallback_rq) get confused.
-     */
-    barrier();
-    p->migration_disabled = 0;
-    this_rq()->nr_pinned--;
 }
-EXPORT_SYMBOL_GPL(migrate_enable);
+EXPORT_SYMBOL_GPL(__migrate_enable);

 static inline bool rq_has_pinned_tasks(struct rq *rq)
 {

diff --git a/kernel/sched/rq-offsets.c b/kernel/sched/rq-offsets.c
new file mode 100644
index 000000000000..a23747bbe25b
--- /dev/null
+++ b/kernel/sched/rq-offsets.c

@@ -0,0 +1,12 @@
+// SPDX-License-Identifier: GPL-2.0
+#define COMPILE_OFFSETS
+#include <linux/kbuild.h>
+#include <linux/types.h>
+#include "sched.h"
+
+int main(void)
+{
+    DEFINE(RQ_nr_pinned, offsetof(struct rq, nr_pinned));

This part looks nice and sweet. Not sure what you were concerned about.

Respin it as a proper patch targeting tip tree.

And explain the motivation in commit log with detailed
'perf report' before/after along with 111M/s to 121M/s speed up,

I suspect with my other __set_cpus_allowed_ptr() suggestion
the speed up should be even bigger.

+    return 0;
+}

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help