Re: [BUG] sched/cache: "Make LLC id continuous" causes NULL cpumask
From: Shrikanth Hegde <hidden>
Date: 2026-05-27 07:01:44
Also in:
lkml
Hi Chen, Prateek. I got back to work today, sorry for delay. I am trying to go through the mails. Apologies in case i have missed any bits. On 5/26/26 7:38 PM, Chen Yu wrote:
Hi Prateek, On Tue, 26 May 2026 11:23:59 +0530, K Prateek Nayak [off-list ref] wrote:quoted
Hello Srikar, On 5/26/2026 10:28 AM, Srikar Dronamraju wrote:quoted
L2 Cache reported here is for SMT8 Core aka CACHE domain.Apart for the scheduler, nothing in tree currently cares about cpu_coregroup_mask() except for drivers/base/arch_topology.c but Power doesn't select GENERIC_ARCH_TOPOLOGY. Why can't Power have an internal mask for MC domain (tl_mc_mask) and the scheduler can use cpu_coregroup_mask() for the actual LLc? (The L2 mask in this case.)
This seems wrong. there is no notion that coregroup_mask (MC domain) has to point at LLC domain. For example, on Shared LPAR, there is no MC domain and LLC is at SMT core level. In that case coregroup_mask has point at SMT mask is wrong. If we need a mask to point to the LLC mask which arch has to return, then we would need a new api say cpu_llc_mask ? that can point accordingly. I don't like mixing MC domain and LLC into one bit.
quoted hunk ↗ jump to hunk
quoted
Power anyways adds its own topology via set_sched_topology() so the default_topology from kernel/sched/topology.c remains unused. ...quoted
Shouldnt cache-aware scheduling be worried about cpuset partitions too. If a cpuset has subset of LLC cores, then we should Scheduler assume it can control complete LLC?Well, the scheduling takes care of partitions and the cache aware scheduling bits take care of looking at the full system perspective for stats aggregation and pointing to a particular LLc. We don't compare llc_id across cpusets so we keeping one unique llc_id per H/W LLC instance is feasible and it enables us to keep llc_id space limited for optimizing cache-aware scheduling. Now if we have threads of same process across partitions, we'll still aggregate the utilization numbers across the full LLC but the load balancers at individual partitions will make a call on the aggregation. -- Thanks and Regards, PrateekI suppose what you suggested looks like below: powerpc/smp: make cpu_coregroup_mask() return the LLC On pSeries shared LPARs(or coregroup_enabled is false on Power9 and earlier) the hemisphere map is not allocated, so build_sched_domains() dereferences a NULL cpumask and crashes. The generic scheduler expects cpu_coregroup_mask() to span the LLC. On powerpc the LLC is the L2. Return cpu_l2_cache_mask() instead of the hemisphere map. Use a coregroup_map() helper for the in-file hemisphere users, and a powerpc_tl_mc_mask() wrapper for the MC sched-domain level. Fixes: b5ea300a17e3 ("sched/cache: Make LLC id continuous") Reported-by: Venkat Rao Bagalkote <redacted> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> --- arch/powerpc/kernel/smp.c | 35 +++++++++++++++++++++++------------ 1 file changed, 23 insertions(+), 12 deletions(-)diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c@@ -1040,11 +1040,22 @@ static const struct cpumask *tl_smallcore_smt_mask(struct sched_domain_topology_ } #endif +static inline struct cpumask *coregroup_map(int cpu) +{ + return per_cpu(cpu_coregroup_map, cpu); +} + struct cpumask *cpu_coregroup_mask(int cpu) { - return per_cpu(cpu_coregroup_map, cpu); + return cpu_l2_cache_mask(cpu); +}
This looks wrong to me too. In different hardware topologies there maybe distinction between coregroup and l2 mask. Let me go through the code and see if there is better way.
quoted hunk ↗ jump to hunk
+ +static const struct cpumask * +powerpc_tl_mc_mask(struct sched_domain_topology_level *tl, int cpu) +{ + return coregroup_map(cpu); } static bool has_coregroup_support(void) { if (is_shared_processor())@@ -1155,7 +1166,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus) cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid)); if (has_coregroup_support()) - cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid)); + cpumask_set_cpu(boot_cpuid, coregroup_map(boot_cpuid)); init_big_cores(); if (has_big_cores) {@@ -1520,8 +1531,8 @@ static void remove_cpu_from_masks(int cpu) set_cpus_unrelated(cpu, i, cpu_core_mask); if (has_coregroup_support()) { - for_each_cpu(i, cpu_coregroup_mask(cpu)) - set_cpus_unrelated(cpu, i, cpu_coregroup_mask); + for_each_cpu(i, coregroup_map(cpu)) + set_cpus_unrelated(cpu, i, coregroup_map); } } #endif@@ -1553,7 +1564,7 @@ static void update_coregroup_mask(int cpu, cpumask_var_t *mask) if (!*mask) { /* Assume only siblings are part of this CPU's coregroup */ for_each_cpu(i, submask_fn(cpu)) - set_cpus_related(cpu, i, cpu_coregroup_mask); + set_cpus_related(cpu, i, coregroup_map); return; }@@ -1561,18 +1572,18 @@ static void update_coregroup_mask(int cpu, cpumask_var_t *mask) cpumask_and(*mask, cpu_online_mask, cpu_node_mask(cpu)); /* Update coregroup mask with all the CPUs that are part of submask */ - or_cpumasks_related(cpu, cpu, submask_fn, cpu_coregroup_mask); + or_cpumasks_related(cpu, cpu, submask_fn, coregroup_map); /* Skip all CPUs already part of coregroup mask */ - cpumask_andnot(*mask, *mask, cpu_coregroup_mask(cpu)); + cpumask_andnot(*mask, *mask, coregroup_map(cpu)); for_each_cpu(i, *mask) { /* Skip all CPUs not part of this coregroup */ if (coregroup_id == cpu_to_coregroup_id(i)) { - or_cpumasks_related(cpu, i, submask_fn, cpu_coregroup_mask); + or_cpumasks_related(cpu, i, submask_fn, coregroup_map); cpumask_andnot(*mask, *mask, submask_fn(i)); } else { - cpumask_andnot(*mask, *mask, cpu_coregroup_mask(i)); + cpumask_andnot(*mask, *mask, coregroup_map(i)); } } }@@ -1733,7 +1744,7 @@ static void __init build_sched_topology(void) if (has_coregroup_support()) { powerpc_topology[i++] = - SDTL_INIT(tl_mc_mask, powerpc_shared_proc_flags, MC);
I would prefer not do this rename. having tl_mc_mask helps to find the usage across the codebase.
+ SDTL_INIT(powerpc_tl_mc_mask, powerpc_shared_proc_flags, MC); } powerpc_topology[i++] = SDTL_INIT(tl_pkg_mask, powerpc_shared_proc_flags, PKG);