RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs
From: Haiyang Zhang <haiyangz@microsoft.com>
Date: 2024-01-09 20:20:34
Also in:
linux-hyperv, linux-rdma, lkml
-----Original Message-----
From: Michael Kelley <redacted>
Sent: Tuesday, January 9, 2024 2:23 PM
To: Souradeep Chakrabarti <redacted>; KY Srinivasan [off-list ref]; Haiyang Zhang [off-list ref]; wei.liu@kernel.org; Dexuan Cui [off-list ref]; davem@davemloft.net; edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; Long Li [off-list ref]; yury.norov@gmail.com; leon@kernel.org; cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com; tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org
Cc: Souradeep Chakrabarti <redacted>; Paul Rosswurm [off-list ref]
Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

From: Souradeep Chakrabarti <redacted> Sent: Tuesday, January 9, 2024 2:51 AM
> From: Yury Norov <yury.norov@gmail.com>
>
> Souradeep investigated that the driver performs faster if IRQs are spread on CPUs with the following heuristics:
>
> 1. No more than one IRQ per CPU, if possible;
> 2. NUMA locality is the second priority;
> 3. Sibling dislocality is the last priority.
>
> Let's consider this topology:
>
> Node            0               1
> Core        0       1       2       3
> CPU       0   1   2   3   4   5   6   7
>
> The most performant IRQ distribution based on the above topology and heuristics may look like this:
>
> IRQ     Nodes   Cores   CPUs
> 0       1       0       0-1
> 1       1       1       2-3
> 2       1       0       0-1
> 3       1       1       2-3
> 4       2       2       4-5
> 5       2       3       6-7
> 6       2       2       4-5
> 7       2       3       6-7

I didn't pay attention to the detailed discussion of this issue over the past 2 to 3 weeks during the holidays in the U.S., but the above doesn't align with the original problem as I understood it. I thought the original problem was to avoid putting IRQs on both hyper-threads in the same core, and that the perf improvements are based on that configuration. At least that's what the commit message for Patch 4/4 in this series says.

The above chart results in 8 IRQs being assigned to the 8 CPUs, probably with 1 IRQ per CPU. At least on x86, if the affinity mask for an IRQ contains multiple CPUs, matrix_find_best_cpu() should balance the IRQ assignments between the CPUs in the mask. So the original problem is still present because both hyper-threads in a core are likely to have an IRQ assigned.

Of course, this example has 8 IRQs and 8 CPUs, so assigning an IRQ to every hyper-thread may be the only choice. If that's the case, maybe this just isn't a good example to illustrate the original problem and solution. But even with a better example where the # of IRQs is <= half the # of CPUs in a NUMA node, I don't think the code below accomplishes the original intent.

Maybe I've missed something along the way in getting to this version of the patch. Please feel free to set me straight. :-)

Michael
I have the same question as Michael.

Also, I'm asking Souradeep in another channel: the algorithm still uses up the whole current NUMA node before moving on to the next NUMA node, right? The only difference is that each IRQ is now affinitized to 2 CPUs. For example, on a system with 2 IRQs:

IRQ     Nodes   Cores   CPUs
0       1       0       0-1
1       1       1       2-3

Does this perform better than the algorithm in the earlier patches, which would give the layout below?

IRQ     Nodes   Cores   CPUs
0       1       0       0
1       1       1       2

Thanks,
- Haiyang
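
For illustration only, the two layouts compared in this thread can be reproduced with a short Python sketch. This is not code from the patch series: the topology list, the function names (per_core_masks, single_cpu_masks, place_least_loaded) and the tie-breaking rule are all assumptions made for the example. In particular, place_least_loaded() is only a toy stand-in for the balancing that matrix_find_best_cpu() is said to do on x86; it does not reproduce the kernel's actual logic.

```python
# Hypothetical topology from the example in this thread: (node, core, cpu).
# Nodes are numbered 0 and 1 as in the topology table.
TOPOLOGY = [
    (0, 0, 0), (0, 0, 1),
    (0, 1, 2), (0, 1, 3),
    (1, 2, 4), (1, 2, 5),
    (1, 3, 6), (1, 3, 7),
]


def cores_by_node(topology):
    """Group CPUs as node -> core -> [cpus], preserving input order."""
    nodes = {}
    for node, core, cpu in topology:
        nodes.setdefault(node, {}).setdefault(core, []).append(cpu)
    return nodes


def per_core_masks(topology, nr_irqs):
    """Each IRQ gets the full sibling mask of one core.

    A node is used up (one IRQ per CPU in the node, cycling over its
    cores) before moving on to the next node.
    """
    masks = []
    for cores in cores_by_node(topology).values():
        core_masks = list(cores.values())
        cpus_in_node = sum(len(m) for m in core_masks)
        for i in range(cpus_in_node):
            if len(masks) == nr_irqs:
                return masks
            masks.append(list(core_masks[i % len(core_masks)]))
    return masks


def single_cpu_masks(topology, nr_irqs):
    """Each IRQ gets exactly one CPU.

    Within a node, the first sibling of every core is used before any
    second sibling; assumed here as one way to get the single-CPU table.
    """
    masks = []
    for cores in cores_by_node(topology).values():
        core_masks = list(cores.values())
        max_siblings = max(len(m) for m in core_masks)
        for sib in range(max_siblings):
            for m in core_masks:
                if sib < len(m):
                    if len(masks) == nr_irqs:
                        return masks
                    masks.append([m[sib]])
    return masks


def place_least_loaded(masks):
    """Toy model (assumption): an IRQ lands on the least-loaded CPU in
    its mask, with ties going to the lowest CPU number."""
    load = {}
    placement = []
    for mask in masks:
        cpu = min(mask, key=lambda c: (load.get(c, 0), c))
        load[cpu] = load.get(cpu, 0) + 1
        placement.append(cpu)
    return placement
```

With nr_irqs=2, per_core_masks() yields the masks 0-1 and 2-3 and single_cpu_masks() yields CPUs 0 and 2, matching the two small tables. With nr_irqs=8 and the toy placement model, the per-core masks end up with one IRQ on every hyper-thread, which is the situation Michael describes. Whether the real kernel places IRQs this way, and which layout performs better, is exactly the open question and is not answered by this sketch.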