Re: [EXTERNAL] [PATCH 3/3] net: mana: add a function to spread IRQs per CPUs
From: Yury Norov <yury.norov@gmail.com>
Date: 2023-12-19 14:03:52
Also in:
linux-rdma, lkml, netdev
On Tue, Dec 19, 2023 at 10:18:49AM +0000, Souradeep Chakrabarti wrote:
quoted
-----Original Message----- From: Yury Norov <yury.norov@gmail.com> Sent: Monday, December 18, 2023 3:02 AM To: Souradeep Chakrabarti <redacted>; KY Srinivasan [off-list ref]; Haiyang Zhang [off-list ref]; wei.liu@kernel.org; Dexuan Cui [off-list ref]; davem@davemloft.net; edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; Long Li [off-list ref]; yury.norov@gmail.com; leon@kernel.org; cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com; tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; linux- kernel@vger.kernel.org; linux-rdma@vger.kernel.org Cc: Souradeep Chakrabarti <redacted>; Paul Rosswurm [off-list ref] Subject: [EXTERNAL] [PATCH 3/3] net: mana: add a function to spread IRQs per CPUs [Some people who received this message don't often get email from yury.norov@gmail.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] Souradeep investigated that the driver performs faster if IRQs are spread on CPUs with the following heuristics: 1. No more than one IRQ per CPU, if possible; 2. NUMA locality is the second priority; 3. Sibling dislocality is the last priority. Let's consider this topology: Node 0 1 Core 0 1 2 3 CPU 0 1 2 3 4 5 6 7 The most performant IRQ distribution based on the above topology and heuristics may look like this: IRQ Nodes Cores CPUs 0 1 0 0-1 1 1 1 2-3 2 1 0 0-1 3 1 1 2-3 4 2 2 4-5 5 2 3 6-7 6 2 2 4-5 7 2 3 6-7 The irq_setup() routine introduced in this patch leverages the for_each_numa_hop_mask() iterator and assigns IRQs to sibling groups as described above. According to [1], for NUMA-aware but sibling-ignorant IRQ distribution based on cpumask_local_spread() performance test results look like this: ./ntttcp -r -m 16 NTTTCP for Linux 1.4.0 --------------------------------------------------------- 08:05:20 INFO: 17 threads created 08:05:28 INFO: Network activity progressing... 08:06:28 INFO: Test run completed. 08:06:28 INFO: Test cycle finished. 08:06:28 INFO: ##### Totals: ##### 08:06:28 INFO: test duration :60.00 seconds 08:06:28 INFO: total bytes :630292053310 08:06:28 INFO: throughput :84.04Gbps 08:06:28 INFO: retrans segs :4 08:06:28 INFO: cpu cores :192 08:06:28 INFO: cpu speed :3799.725MHz 08:06:28 INFO: user :0.05% 08:06:28 INFO: system :1.60% 08:06:28 INFO: idle :96.41% 08:06:28 INFO: iowait :0.00% 08:06:28 INFO: softirq :1.94% 08:06:28 INFO: cycles/byte :2.50 08:06:28 INFO: cpu busy (all) :534.41% For NUMA- and sibling-aware IRQ distribution, the same test works 15% faster: ./ntttcp -r -m 16 NTTTCP for Linux 1.4.0 --------------------------------------------------------- 08:08:51 INFO: 17 threads created 08:08:56 INFO: Network activity progressing... 08:09:56 INFO: Test run completed. 08:09:56 INFO: Test cycle finished. 08:09:56 INFO: ##### Totals: ##### 08:09:56 INFO: test duration :60.00 seconds 08:09:56 INFO: total bytes :741966608384 08:09:56 INFO: throughput :98.93Gbps 08:09:56 INFO: retrans segs :6 08:09:56 INFO: cpu cores :192 08:09:56 INFO: cpu speed :3799.791MHz 08:09:56 INFO: user :0.06% 08:09:56 INFO: system :1.81% 08:09:56 INFO: idle :96.18% 08:09:56 INFO: iowait :0.00% 08:09:56 INFO: softirq :1.95% 08:09:56 INFO: cycles/byte :2.25 08:09:56 INFO: cpu busy (all) :569.22% [1] https://lore.kernel/ .org%2Fall%2F20231211063726.GA4977%40linuxonhyperv3.guj3yctzbm1etfxqx2v ob5hsef.xx.internal.cloudapp.net%2F&data=05%7C02%7Cschakrabarti%40micros oft.com%7Ca385a5a5d661458219c208dbff47a7ab%7C72f988bf86f141af91ab2d7 cd011db47%7C1%7C0%7C638384455520036393%7CUnknown%7CTWFpbGZsb3d 8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D% 7C3000%7C%7C%7C&sdata=kzoalzSu6frB0GIaUM5VWsz04%2FsB%2FBdXwXKb26 IhqkE%3D&reserved=0 Signed-off-by: Yury Norov <yury.norov@gmail.com> Co-developed-by: Souradeep Chakrabarti <redacted> --- .../net/ethernet/microsoft/mana/gdma_main.c | 28 +++++++++++++++++++ 1 file changed, 28 insertions(+)diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.cb/drivers/net/ethernet/microsoft/mana/gdma_main.c index 6367de0c2c2e..11e64e42e3b2 100644--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c@@ -1243,6 +1243,34 @@ void mana_gd_free_res_map(struct gdma_resource*r) r->size = 0; } +static __maybe_unused int irq_setup(unsigned int *irqs, unsigned int +len, int node) { + const struct cpumask *next, *prev = cpu_none_mask; + cpumask_var_t cpus __free(free_cpumask_var); + int cpu, weight; + + if (!alloc_cpumask_var(&cpus, GFP_KERNEL)) + return -ENOMEM; + + rcu_read_lock(); + for_each_numa_hop_mask(next, node) { + weight = cpumask_weight_andnot(next, prev); + while (weight-- > 0) {Make it while (weight > 0) {quoted
+ cpumask_andnot(cpus, next, prev); + for_each_cpu(cpu, cpus) { + if (len-- == 0) + goto done; + irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu)); + cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));Here do --weight, else this code will traverse the same node N^2 times, where each node has N cpus .
Sure. When building your series on top of this, can you please fix it inplace? Thanks, Yury
quoted
+ } + } + prev = next; + } +done: + rcu_read_unlock(); + return 0; +} + static int mana_gd_setup_irqs(struct pci_dev *pdev) { unsigned int max_queues_per_port = num_online_cpus(); -- 2.40.1