Re: [BUG] sched/cache: "Make LLC id continuous" causes NULL cpumask

From: Shrikanth Hegde <hidden>
Date: 2026-05-27 07:01:44
Also in: lkml

Hi Chen, Prateek.

I got back to work today, sorry for delay.
I am trying to go through the mails.
Apologies in case i have missed any bits.

On 5/26/26 7:38 PM, Chen Yu wrote:

Hi Prateek,

On Tue, 26 May 2026 11:23:59 +0530, K Prateek Nayak [off-list ref] wrote:

quoted

Hello Srikar,

On 5/26/2026 10:28 AM, Srikar Dronamraju wrote:

quoted

L2 Cache reported here is for SMT8 Core aka CACHE domain.

Apart for the scheduler, nothing in tree currently cares about
cpu_coregroup_mask() except for drivers/base/arch_topology.c but
Power doesn't select GENERIC_ARCH_TOPOLOGY.

Why can't Power have an internal mask for MC domain (tl_mc_mask) and
the scheduler can use cpu_coregroup_mask() for the actual LLc? (The L2
mask in this case.)

This seems wrong. there is no notion that coregroup_mask
(MC domain) has to point at LLC domain.

For example, on Shared LPAR, there is no MC domain and LLC is at SMT core level.
In that case coregroup_mask has point at SMT mask is wrong.

If we need a mask to point to the LLC mask which arch has to return, then we would
need a new api say cpu_llc_mask ? that can point accordingly.

I don't like mixing MC domain and LLC into one bit.

quoted hunk ↗ jump to hunk

quoted

Power anyways adds its own topology via set_sched_topology() so the
default_topology from kernel/sched/topology.c remains unused.

...

quoted

Shouldnt cache-aware scheduling be worried about cpuset partitions too.
If a cpuset has subset of LLC cores, then we should Scheduler assume it can
control complete LLC?

Well, the scheduling takes care of partitions and the cache aware
scheduling bits take care of looking at the full system perspective
for stats aggregation and pointing to a particular LLc.

We don't compare llc_id across cpusets so we keeping one unique llc_id
per H/W LLC instance is feasible and it enables us to keep llc_id space
limited for optimizing cache-aware scheduling.

Now if we have threads of same process across partitions, we'll
still aggregate the utilization numbers across the full LLC but
the load balancers at individual partitions will make a call on
the aggregation.

-- 
Thanks and Regards,
Prateek

I suppose what you suggested looks like below:

powerpc/smp: make cpu_coregroup_mask() return the LLC

On pSeries shared LPARs(or coregroup_enabled is false on
Power9 and earlier) the hemisphere map is not allocated, so
build_sched_domains() dereferences a NULL cpumask and crashes.

The generic scheduler expects cpu_coregroup_mask() to span the LLC.
On powerpc the LLC is the L2. Return cpu_l2_cache_mask() instead of
the hemisphere map. Use a coregroup_map() helper for the in-file
hemisphere users, and a powerpc_tl_mc_mask() wrapper for the MC
sched-domain level.

Fixes: b5ea300a17e3 ("sched/cache: Make LLC id continuous")
Reported-by: Venkat Rao Bagalkote <redacted>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
  arch/powerpc/kernel/smp.c | 35 +++++++++++++++++++++++------------
  1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c

@@ -1040,11 +1040,22 @@ static const struct cpumask *tl_smallcore_smt_mask(struct sched_domain_topology_
  }
  #endif
  
+static inline struct cpumask *coregroup_map(int cpu)
+{
+	return per_cpu(cpu_coregroup_map, cpu);
+}
+
  struct cpumask *cpu_coregroup_mask(int cpu)
  {
-	return per_cpu(cpu_coregroup_map, cpu);
+	return cpu_l2_cache_mask(cpu);
+}

This looks wrong to me too. In different hardware topologies
there maybe distinction between coregroup and l2 mask.

Let me go through the code and see if there is better way.

quoted hunk ↗ jump to hunk

+
+static const struct cpumask *
+powerpc_tl_mc_mask(struct sched_domain_topology_level *tl, int cpu)
+{
+	return coregroup_map(cpu);
  }
  
  static bool has_coregroup_support(void)
  {
  	if (is_shared_processor())

@@ -1155,7 +1166,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
  	cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));
  
  	if (has_coregroup_support())
-		cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
+		cpumask_set_cpu(boot_cpuid, coregroup_map(boot_cpuid));
  
  	init_big_cores();
  	if (has_big_cores) {
@@ -1520,8 +1531,8 @@ static void remove_cpu_from_masks(int cpu)
  		set_cpus_unrelated(cpu, i, cpu_core_mask);
  
  	if (has_coregroup_support()) {
-		for_each_cpu(i, cpu_coregroup_mask(cpu))
-			set_cpus_unrelated(cpu, i, cpu_coregroup_mask);
+		for_each_cpu(i, coregroup_map(cpu))
+			set_cpus_unrelated(cpu, i, coregroup_map);
  	}
  }
  #endif
@@ -1553,7 +1564,7 @@ static void update_coregroup_mask(int cpu, cpumask_var_t *mask)
  	if (!*mask) {
  		/* Assume only siblings are part of this CPU's coregroup */
  		for_each_cpu(i, submask_fn(cpu))
-			set_cpus_related(cpu, i, cpu_coregroup_mask);
+			set_cpus_related(cpu, i, coregroup_map);
  
  		return;
  	}
@@ -1561,18 +1572,18 @@ static void update_coregroup_mask(int cpu, cpumask_var_t *mask)
  	cpumask_and(*mask, cpu_online_mask, cpu_node_mask(cpu));
  
  	/* Update coregroup mask with all the CPUs that are part of submask */
-	or_cpumasks_related(cpu, cpu, submask_fn, cpu_coregroup_mask);
+	or_cpumasks_related(cpu, cpu, submask_fn, coregroup_map);
  
  	/* Skip all CPUs already part of coregroup mask */
-	cpumask_andnot(*mask, *mask, cpu_coregroup_mask(cpu));
+	cpumask_andnot(*mask, *mask, coregroup_map(cpu));
  
  	for_each_cpu(i, *mask) {
  		/* Skip all CPUs not part of this coregroup */
  		if (coregroup_id == cpu_to_coregroup_id(i)) {
-			or_cpumasks_related(cpu, i, submask_fn, cpu_coregroup_mask);
+			or_cpumasks_related(cpu, i, submask_fn, coregroup_map);
  			cpumask_andnot(*mask, *mask, submask_fn(i));
  		} else {
-			cpumask_andnot(*mask, *mask, cpu_coregroup_mask(i));
+			cpumask_andnot(*mask, *mask, coregroup_map(i));
  		}
  	}
  }
@@ -1733,7 +1744,7 @@ static void __init build_sched_topology(void)
  
  	if (has_coregroup_support()) {
  		powerpc_topology[i++] =
-			SDTL_INIT(tl_mc_mask, powerpc_shared_proc_flags, MC);

I would prefer not do this rename. having tl_mc_mask helps to find the usage across
the codebase.

+			SDTL_INIT(powerpc_tl_mc_mask, powerpc_shared_proc_flags, MC);
  	}
  
  	powerpc_topology[i++] = SDTL_INIT(tl_pkg_mask, powerpc_shared_proc_flags, PKG);

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help