Re: [EXTERNAL] [PATCH 3/3] net: mana: add a function to spread IRQs per CPUs

From: Yury Norov <yury.norov@gmail.com>
Date: 2023-12-19 14:03:52
Also in: linux-rdma, lkml, netdev

On Tue, Dec 19, 2023 at 10:18:49AM +0000, Souradeep Chakrabarti wrote:

quoted

-----Original Message-----
From: Yury Norov <yury.norov@gmail.com>
Sent: Monday, December 18, 2023 3:02 AM
To: Souradeep Chakrabarti <redacted>; KY Srinivasan
[off-list ref]; Haiyang Zhang [off-list ref];
wei.liu@kernel.org; Dexuan Cui [off-list ref]; davem@davemloft.net;
edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; Long Li
[off-list ref]; yury.norov@gmail.com; leon@kernel.org;
cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com;
tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org;
linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org
Cc: Souradeep Chakrabarti <redacted>; Paul Rosswurm
[off-list ref]
Subject: [EXTERNAL] [PATCH 3/3] net: mana: add a function to spread IRQs per
CPUs

Souradeep found that the driver performs faster if IRQs are spread across CPUs
with the following heuristics:

1. No more than one IRQ per CPU, if possible;
2. NUMA locality is the second priority;
3. Sibling dislocality is the last priority.

Let's consider this topology:

Node            0               1
Core        0       1       2       3
CPU       0   1   2   3   4   5   6   7

The most performant IRQ distribution based on the above topology and heuristics
may look like this:

IRQ     Nodes   Cores   CPUs
0       0       0       0-1
1       0       1       2-3
2       0       0       0-1
3       0       1       2-3
4       1       2       4-5
5       1       3       6-7
6       1       2       4-5
7       1       3       6-7

The irq_setup() routine introduced in this patch leverages the
for_each_numa_hop_mask() iterator and assigns IRQs to sibling groups as
described above.

According to [1], for a NUMA-aware but sibling-ignorant IRQ distribution based on
cpumask_local_spread(), performance test results look like this:

./ntttcp -r -m 16
NTTTCP for Linux 1.4.0
---------------------------------------------------------
08:05:20 INFO: 17 threads created
08:05:28 INFO: Network activity progressing...
08:06:28 INFO: Test run completed.
08:06:28 INFO: Test cycle finished.
08:06:28 INFO: #####  Totals:  #####
08:06:28 INFO: test duration    :60.00 seconds
08:06:28 INFO: total bytes      :630292053310
08:06:28 INFO:   throughput     :84.04Gbps
08:06:28 INFO:   retrans segs   :4
08:06:28 INFO: cpu cores        :192
08:06:28 INFO:   cpu speed      :3799.725MHz
08:06:28 INFO:   user           :0.05%
08:06:28 INFO:   system         :1.60%
08:06:28 INFO:   idle           :96.41%
08:06:28 INFO:   iowait         :0.00%
08:06:28 INFO:   softirq        :1.94%
08:06:28 INFO:   cycles/byte    :2.50
08:06:28 INFO: cpu busy (all)   :534.41%

For NUMA- and sibling-aware IRQ distribution, the same test works 15% faster:

./ntttcp -r -m 16
NTTTCP for Linux 1.4.0
---------------------------------------------------------
08:08:51 INFO: 17 threads created
08:08:56 INFO: Network activity progressing...
08:09:56 INFO: Test run completed.
08:09:56 INFO: Test cycle finished.
08:09:56 INFO: #####  Totals:  #####
08:09:56 INFO: test duration    :60.00 seconds
08:09:56 INFO: total bytes      :741966608384
08:09:56 INFO:   throughput     :98.93Gbps
08:09:56 INFO:   retrans segs   :6
08:09:56 INFO: cpu cores        :192
08:09:56 INFO:   cpu speed      :3799.791MHz
08:09:56 INFO:   user           :0.06%
08:09:56 INFO:   system         :1.81%
08:09:56 INFO:   idle           :96.18%
08:09:56 INFO:   iowait         :0.00%
08:09:56 INFO:   softirq        :1.95%
08:09:56 INFO:   cycles/byte    :2.25
08:09:56 INFO: cpu busy (all)   :569.22%

[1] https://lore.kernel.org/all/20231211063726.GA4977@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Co-developed-by: Souradeep Chakrabarti <redacted>
---
.../net/ethernet/microsoft/mana/gdma_main.c   | 28 +++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 6367de0c2c2e..11e64e42e3b2 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1243,6 +1243,34 @@ void mana_gd_free_res_map(struct gdma_resource *r)
        r->size = 0;
 }

+static __maybe_unused int irq_setup(unsigned int *irqs, unsigned int len, int node)
+{
+       const struct cpumask *next, *prev = cpu_none_mask;
+       cpumask_var_t cpus __free(free_cpumask_var);
+       int cpu, weight;
+
+       if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
+               return -ENOMEM;
+
+       rcu_read_lock();
+       for_each_numa_hop_mask(next, node) {
+               weight = cpumask_weight_andnot(next, prev);
+               while (weight-- > 0) {

Make it while (weight > 0) {

quoted

+                       cpumask_andnot(cpus, next, prev);
+                       for_each_cpu(cpu, cpus) {
+                               if (len-- == 0)
+                                       goto done;
+                               irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
+                               cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));

Do --weight here; otherwise this code will traverse the same node N^2 times,
where each node has N CPUs.

Sure.

When building your series on top of this, can you please fix it
in place?

Thanks,
Yury

quoted

+                       }
+               }
+               prev = next;
+       }
+done:
+       rcu_read_unlock();
+       return 0;
+}
+
 static int mana_gd_setup_irqs(struct pci_dev *pdev)
 {
       unsigned int max_queues_per_port = num_online_cpus();
--
2.40.1
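
For illustration only, here is a user-space model of the loop with both
review suggestions applied (while (weight > 0), and --weight after each
assignment). It is a sketch, not the driver code: plain bitmasks stand in
for struct cpumask, the 2-node/4-core/8-CPU topology from the commit message
is hard-coded, and the names hop_mask(), sibling_mask(), weight8() and
irq_setup_model() are hypothetical helpers invented for this example. The
affinity[] array records the sibling mask each IRQ would be given, in place
of the irq_set_affinity_and_hint() call.

```c
#include <stdio.h>

#define NR_CPUS 8	/* assumption: toy topology from the commit message */

/* CPUs 2k and 2k+1 are hyperthread siblings sharing core k. */
static unsigned int sibling_mask(int cpu)
{
	return 3u << (cpu & ~1);
}

/* Hop 0 covers the local node (4 CPUs per node), hop 1 covers every CPU. */
static unsigned int hop_mask(int node, int hop)
{
	return hop == 0 ? 0x0fu << (4 * node) : 0xffu;
}

/* Number of set bits in an 8-bit mask. */
static int weight8(unsigned int mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* Models irq_setup(): one sibling group per IRQ, nearest hop first. */
static int irq_setup_model(unsigned int *affinity, unsigned int len, int node)
{
	unsigned int prev = 0, next, cpus;
	int hop, cpu, weight;

	for (hop = 0; hop < 2; hop++) {
		next = hop_mask(node, hop);
		/* CPUs that are new at this hop distance */
		weight = weight8(next & ~prev);
		while (weight > 0) {
			cpus = next & ~prev;
			for (cpu = 0; cpu < NR_CPUS; cpu++) {
				if (!(cpus & (1u << cpu)))
					continue;
				if (len-- == 0)
					goto done;
				*affinity++ = sibling_mask(cpu);
				/* skip the sibling on this pass */
				cpus &= ~sibling_mask(cpu);
				/* one decrement per assigned IRQ */
				--weight;
			}
		}
		prev = next;
	}
done:
	return 0;
}
```

Calling irq_setup_model(affinity, 8, 0) fills affinity[] with 0x03, 0x0c,
0x03, 0x0c, 0x30, 0xc0, 0x30, 0xc0, i.e. CPUs 0-1, 2-3, 0-1, 2-3, 4-5, 6-7,
4-5, 6-7, which matches the CPUs column of the table in the commit message.
Because weight is decremented once per assigned IRQ, each hop is walked
twice in this topology (once per sibling in a core) rather than once per CPU.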
