RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

From: Michael Kelley <hidden>
Date: 2024-01-09 19:22:43
Also in: linux-hyperv, linux-rdma, lkml

From: Souradeep Chakrabarti <redacted> Sent: Tuesday, January 9, 2024 2:51 AM

From: Yury Norov <yury.norov@gmail.com>

Souradeep found that the driver performs faster if IRQs are
spread across CPUs according to the following heuristics:

1. No more than one IRQ per CPU, if possible;
2. NUMA locality is the second priority;
3. Sibling dislocality is the last priority.

Let's consider this topology:

Node            0               1
Core        0       1       2       3
CPU       0   1   2   3   4   5   6   7

The most performant IRQ distribution based on the above topology
and heuristics may look like this:

IRQ     Nodes   Cores   CPUs
0       1       0       0-1
1       1       1       2-3
2       1       0       0-1
3       1       1       2-3
4       2       2       4-5
5       2       3       6-7
6       2       2       4-5
7       2       3       6-7

I didn't pay attention to the detailed discussion of this issue
over the past 2 to 3 weeks during the holidays in the U.S., but
the above doesn't align with the original problem as I understood
it.  I thought the original problem was to avoid putting IRQs on
both hyper-threads in the same core, and that the perf
improvements are based on that configuration.  At least that's
what the commit message for Patch 4/4 in this series says.

The above chart results in 8 IRQs being assigned to the 8 CPUs,
probably with 1 IRQ per CPU.  At least on x86, if the affinity
mask for an IRQ contains multiple CPUs, matrix_find_best_cpu()
should balance the IRQ assignments between the CPUs in the mask.
So the original problem is still present because both hyper-threads
in a core are likely to have an IRQ assigned.

Of course, this example has 8 IRQs and 8 CPUs, so assigning an
IRQ to every hyper-thread may be the only choice.  If that's the
case, maybe this just isn't a good example to illustrate the
original problem and solution.  But even with a better example
where the # of IRQs is <= half the # of CPUs in a NUMA node,
I don't think the code below accomplishes the original intent.

Maybe I've missed something along the way in getting to this
version of the patch.  Please feel free to set me straight. :-)

Michael

The irq_setup() routine introduced in this patch leverages the
for_each_numa_hop_mask() iterator and assigns IRQs to sibling groups
as described above.

According to [1], for a NUMA-aware but sibling-ignorant IRQ distribution
based on cpumask_local_spread(), the performance test results look like this:

/ntttcp -r -m 16
NTTTCP for Linux 1.4.0
---------------------------------------------------------
08:05:20 INFO: 17 threads created
08:05:28 INFO: Network activity progressing...
08:06:28 INFO: Test run completed.
08:06:28 INFO: Test cycle finished.
08:06:28 INFO: #####  Totals:  #####
08:06:28 INFO: test duration    :60.00 seconds
08:06:28 INFO: total bytes      :630292053310
08:06:28 INFO:   throughput     :84.04Gbps
08:06:28 INFO:   retrans segs   :4
08:06:28 INFO: cpu cores        :192
08:06:28 INFO:   cpu speed      :3799.725MHz
08:06:28 INFO:   user           :0.05%
08:06:28 INFO:   system         :1.60%
08:06:28 INFO:   idle           :96.41%
08:06:28 INFO:   iowait         :0.00%
08:06:28 INFO:   softirq        :1.94%
08:06:28 INFO:   cycles/byte    :2.50
08:06:28 INFO: cpu busy (all)   :534.41%

For NUMA- and sibling-aware IRQ distribution, the same test works
15% faster:

/ntttcp -r -m 16
NTTTCP for Linux 1.4.0
---------------------------------------------------------
08:08:51 INFO: 17 threads created
08:08:56 INFO: Network activity progressing...
08:09:56 INFO: Test run completed.
08:09:56 INFO: Test cycle finished.
08:09:56 INFO: #####  Totals:  #####
08:09:56 INFO: test duration    :60.00 seconds
08:09:56 INFO: total bytes      :741966608384
08:09:56 INFO:   throughput     :98.93Gbps
08:09:56 INFO:   retrans segs   :6
08:09:56 INFO: cpu cores        :192
08:09:56 INFO:   cpu speed      :3799.791MHz
08:09:56 INFO:   user           :0.06%
08:09:56 INFO:   system         :1.81%
08:09:56 INFO:   idle           :96.18%
08:09:56 INFO:   iowait         :0.00%
08:09:56 INFO:   softirq        :1.95%
08:09:56 INFO:   cycles/byte    :2.25
08:09:56 INFO: cpu busy (all)   :569.22%

[1] https://lore.kernel.org/all/20231211063726.GA4977@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Co-developed-by: Souradeep Chakrabarti
[off-list ref]
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 29 +++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 6367de0c2c2e..6a967d6be01e 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1243,6 +1243,35 @@ void mana_gd_free_res_map(struct gdma_resource *r)
 	r->size = 0;
 }

+static __maybe_unused int irq_setup(unsigned int *irqs, unsigned int len, int node)
+{
+	const struct cpumask *next, *prev = cpu_none_mask;
+	cpumask_var_t cpus __free(free_cpumask_var);
+	int cpu, weight;
+
+	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
+		return -ENOMEM;
+
+	rcu_read_lock();
+	for_each_numa_hop_mask(next, node) {
+		weight = cpumask_weight_andnot(next, prev);
+		while (weight > 0) {
+			cpumask_andnot(cpus, next, prev);
+			for_each_cpu(cpu, cpus) {
+				if (len-- == 0)
+					goto done;
+				irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
+				cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
+				--weight;
+			}
+		}
+		prev = next;
+	}
+done:
+	rcu_read_unlock();
+	return 0;
+}
+
 static int mana_gd_setup_irqs(struct pci_dev *pdev)
 {
 	unsigned int max_queues_per_port = num_online_cpus();
--

2.34.1
