RE: [PATCH net-next] net: mana: Assigning IRQ affinity on HT cores

From: Souradeep Chakrabarti <hidden>
Date: 2023-11-21 13:59:34
Also in: linux-rdma, lkml, netdev

-----Original Message-----
From: Michael Kelley <redacted>
Sent: Thursday, November 16, 2023 11:47 AM
To: Michael Kelley <redacted>; Souradeep Chakrabarti
[off-list ref]; KY Srinivasan [off-list ref];
Haiyang Zhang [off-list ref]; wei.liu@kernel.org; Dexuan Cui
[off-list ref]; davem@davemloft.net; edumazet@google.com;
kuba@kernel.org; pabeni@redhat.com; Long Li [off-list ref];
leon@kernel.org; cai.huoqing@linux.dev; ssengar@linux.microsoft.com;
vkuznets@redhat.com; tglx@linutronix.de; linux-hyperv@vger.kernel.org;
netdev@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
rdma@vger.kernel.org
Cc: Souradeep Chakrabarti <redacted>; Paul Rosswurm
[off-list ref]
Subject: [EXTERNAL] RE: [PATCH net-next] net: mana: Assigning IRQ affinity on HT
cores

[Some people who received this message don't often get email from
mhklinux@outlook.com. Learn why this is important at
https://aka.ms/LearnAboutSenderIdentification ]

From: Michael Kelley <redacted> Sent: Wednesday, November 15,
2023 9:26 PM

quoted

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c

b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 6367de0c2c2e..839be819d46e 100644

--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c

@@ -1243,13 +1243,115 @@ void mana_gd_free_res_map(struct

gdma_resource *r)
    r->size = 0;
 }

+static void cpu_mask_set(cpumask_var_t *filter_mask, cpumask_var_t
**filter_mask_list)
+{
+   unsigned int core_count = 0, cpu;
+   cpumask_var_t *filter_mask_list_tmp;
+
+   BUG_ON(!filter_mask || !filter_mask_list);

This check seems superfluous. The function is invoked from only one
call site in the code below, and that site call site can easily ensure
that it doesn't pass a NULL value for either parameter.

Fixed it in V2 patch.

quoted

+   filter_mask_list_tmp = *filter_mask_list;
+   cpumask_copy(*filter_mask, cpu_online_mask);
+   /* for each core create a cpumask lookup table,
+    * which stores all the corresponding siblings
+    */
+   for_each_cpu(cpu, *filter_mask) {
+

      BUG_ON(!alloc_cpumask_var(&filter_mask_list_tmp[core_count],
GFP_KERNEL));

Don't panic if a memory allocation fails.  Must back out, clean up,
and return an error.

quoted

+           cpumask_or(filter_mask_list_tmp[core_count],

filter_mask_list_tmp[core_count],

quoted

+                      topology_sibling_cpumask(cpu));

alloc_cpumask_var() does not zero out the returned cpumask.  So the
cpumask_or() is operating on uninitialized garbage.  Use
zalloc_cpumask_var() to get a cpumask initialized to zero.

But why is the cpumask_or() being done at all?  Couldn't this just be
a cpumask_copy()?  In that case, alloc_cpumask_var() is OK.

quoted

+           cpumask_andnot(*filter_mask, *filter_mask,

topology_sibling_cpumask(cpu));

quoted

+           core_count++;
+   }
+}
+
+static int irq_setup(int *irqs, int nvec) {
+   cpumask_var_t filter_mask;
+   cpumask_var_t *filter_mask_list;
+   unsigned int cpu_first, cpu, irq_start, cores = 0;
+   int i, core_count = 0, numa_node, cpu_count = 0, err = 0;
+
+   BUG_ON(!alloc_cpumask_var(&filter_mask, GFP_KERNEL));

Again, don't panic if a memory allocation fails.  Must back out and
return an error.

quoted

+   cpus_read_lock();
+   cpumask_copy(filter_mask, cpu_online_mask);

For readability, I'd suggest adding a blank line before any of the
multi-line comments below.

quoted

+   /* count the number of cores
+    */
+   for_each_cpu(cpu, filter_mask) {
+           cpumask_andnot(filter_mask, filter_mask,

topology_sibling_cpumask(cpu));

quoted

+           cores++;
+   }
+   filter_mask_list = kcalloc(cores, sizeof(cpumask_var_t), GFP_KERNEL);
+   if (!filter_mask_list) {
+           err = -ENOMEM;
+           goto free_irq;

One additional macro-level comment.  Creating and freeing the filter_mask_list is
a real pain.  But it is needed to identify which CPUs in a core are available to have
an IRQ assigned to them.  So let me suggest a modified approach to meeting that
need.

1) Count the number of cores just as before.

2) Then instead of allocating filter_mask_list, allocate an array of integers, with one
array entry per core.  Call the array core_id_list.
Somewhat like the code in cpu_mask_set(), populate the array with the CPU
number of the first CPU in each core.  The populating only needs to be done once,
so the code can be inline in irq_setup().
It doesn't need to be in a helper function like cpu_mask_set().

3) Allocate a single cpumask_var_t local variable that is initialized to a copy of
cpu_online_mask.  Call it avail_cpus.  This local variable indicates which CPUs
(across all cores) are available to have an IRQ assigned.

4) At the beginning of the "for" loop over nvec, current code has:

       cpu_first = cpumask_first(filter_mask_list[core_count]);

Instead, do:

       cpu_first = cpumask_first_and(&avail_cpus,
                       topology_sibling_cpumask(core_id_list[core_count]));

The above code effectively creates on-the-fly the cpumask previously stored in
filter_mask_list, and finds the first CPU as before.

5) When a CPU is assigned an IRQ, clear that CPU in avail_cpus, not in the
filter_mask_list entry.

6) When the NUMA node wraps around and current code calls cpu_mask_set(),
instead reinitialize avail_cpus to cpu_online_mask like in my #3 above.

In summary, with this approach filter_mask_list isn't needed at all.  The messiness
of allocating and freeing an array of cpumask's goes away.  I haven't coded it, but I
think the result will be simpler and less error prone.

Changed it in V2.

Michael

quoted

+   }
+   /* if number of cpus are equal to max_queues per port, then
+    * one extra interrupt for the hardware channel communication.
+    */

Note that higher level code may set nvec based on the # of online
CPUs, but until the cpus_read_lock is held, the # of online CPUs can
change. So nvec might have been determined when the # of CPUs was 8,
but 2 CPUs could go offline before the cpus_read_lock is obtained.  So
nvec could be bigger than just 1 more than num_online_cpus().  I'm not
sure how to handle the check below to special case the hardware
communication channel.  But realize that the main "for" loop below
could end up assigning multiple IRQs to the same CPU if nvec >
num_online_cpus().

quoted

+   if (nvec - 1 == num_online_cpus()) {
+           irq_start = 1;
+           cpu_first = cpumask_first(cpu_online_mask);
+           irq_set_affinity_and_hint(irqs[0], cpumask_of(cpu_first));
+   } else {
+           irq_start = 0;
+   }
+   /* reset the core_count and num_node to 0.
+    */
+   core_count = 0;
+   numa_node = 0;
+   cpu_mask_set(&filter_mask, &filter_mask_list);

FWIW, passing filter_mask as the first argument above was confusing to
me until I realized that the value of filter_mask doesn't matter --
it's just to use the memory allocated for filter_mask.  Maybe a
comment would help.  And ditto for the invocation of cpu_mask_set()
further down.

Fixed it in V2.

quoted

+   /* for each interrupt find the cpu of a particular
+    * sibling set and if it belongs to the specific numa
+    * then assign irq to it and clear the cpu bit from
+    * the corresponding sibling list from filter_mask_list.
+    * Increase the cpu_count for that node.
+    * Once all cpus for a numa node is assigned, then
+    * move to different numa node and continue the same.
+    */
+   for (i = irq_start; i < nvec; ) {
+           cpu_first = cpumask_first(filter_mask_list[core_count]);
+           if (cpu_first < nr_cpu_ids && cpu_to_node(cpu_first) ==
+ numa_node) {

Note that it's possible to have a NUMA node with zero online CPUs.
It could be a NUMA node that was originally a memory-only NUMA node
and never had any CPUs, or the NUMA node could have had CPUs, but they
were all later taken offline.  Such a NUMA node is still considered to
be online because it has memory, but it has no online CPUs.

If this code ever sets "numa_node" to such a NUMA node, the "for" loop
will become an infinite loop because the "if"
statement above will never find a CPU that is assigned to "numa_node".

quoted

+                   irq_set_affinity_and_hint(irqs[i], cpumask_of(cpu_first));
+                   cpumask_clear_cpu(cpu_first, filter_mask_list[core_count]);
+                   cpu_count = cpu_count + 1;
+                   i = i + 1;
+                   /* checking if all the cpus are used from the
+                    * particular node.
+                    */
+                   if (cpu_count == nr_cpus_node(numa_node)) {
+                           numa_node = numa_node + 1;
+                           if (numa_node == num_online_nodes()) {
+                                   cpu_mask_set(&filter_mask,
+ &filter_mask_list);

The second and subsequent calls to cpu_mask_set() will leak any memory
previously allocated by alloc_cpumask_var() in cpu_mask_set().

I agree with Haiyang's comment about starting with a NUMA node other
than NUMA node zero.  But if you do so, note that testing for
wrap-around at num_online_nodes() won't be equivalent to testing for
getting back to the starting NUMA node.
You really want to run cpu_mask_set() again only when you get back to
the starting NUMA node.

Fixed it in V2.

quoted

+                                   numa_node = 0;
+                           }
+                           cpu_count = 0;
+                           core_count = 0;
+                           continue;
+                   }
+           }
+           if ((core_count + 1) % cores == 0)
+                   core_count = 0;
+           else
+                   core_count++;

The if/else could be more compactly written as:

              if (++core_count == cores)
                      core_count = 0;

quoted

+   }
+free_irq:
+   cpus_read_unlock();
+   if (filter_mask)

This check for non-NULL filter_mask is incorrect and might not even
compile if CONFIG_CPUMASK_OFFSTACK=n.  For testing purposes, you
should make sure to build and run with each
option:  with CONFIG_CPUMASK_OFFSTACK=y and also
CONFIG_CPUMASK_OFFSTACK=n.

Fixed it in V2.

quoted

It's safe to just call free_cpumask_var() without the check.

quoted

+           free_cpumask_var(filter_mask);
+   if (filter_mask_list) {
+           for (core_count = 0; core_count < cores; core_count++)
+                   free_cpumask_var(filter_mask_list[core_count]);
+           kfree(filter_mask_list);
+   }
+   return err;
+}

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help