RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

From: Haiyang Zhang <haiyangz@microsoft.com>
Date: 2024-01-09 20:20:34
Also in: linux-hyperv, linux-rdma, lkml

-----Original Message-----
From: Michael Kelley <redacted>
Sent: Tuesday, January 9, 2024 2:23 PM
To: Souradeep Chakrabarti <redacted>; KY Srinivasan
[off-list ref]; Haiyang Zhang [off-list ref];
wei.liu@kernel.org; Dexuan Cui [off-list ref];
davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
pabeni@redhat.com; Long Li [off-list ref]; yury.norov@gmail.com;
leon@kernel.org; cai.huoqing@linux.dev; ssengar@linux.microsoft.com;
vkuznets@redhat.com; tglx@linutronix.de; linux-hyperv@vger.kernel.org;
netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
linux-rdma@vger.kernel.org
Cc: Souradeep Chakrabarti <redacted>; Paul Rosswurm
[off-list ref]
Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per
CPUs

From: Souradeep Chakrabarti <redacted> Sent: Tuesday, January 9, 2024 2:51 AM

From: Yury Norov <yury.norov@gmail.com>

Souradeep found that the driver performs better if IRQs are
spread across CPUs using the following heuristics:

1. No more than one IRQ per CPU, if possible;
2. NUMA locality is the second priority;
3. Sibling dislocality is the last priority.

Let's consider this topology:

Node            0               1
Core        0       1       2       3
CPU       0   1   2   3   4   5   6   7

The most performant IRQ distribution based on the above topology
and heuristics may look like this:

IRQ     Nodes   Cores   CPUs
0       1       0       0-1
1       1       1       2-3
2       1       0       0-1
3       1       1       2-3
4       2       2       4-5
5       2       3       6-7
6       2       2       4-5
7       2       3       6-7
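
As an illustrative aside, the table above can be reproduced with a small
user-space model. This is a sketch only, not the patch's code: the TOPOLOGY
map and spread_irqs() helper are hypothetical names, node IDs follow the
topology diagram (0 and 1), and the model simply stops once every CPU in
the system has been given one IRQ.

```python
from collections import defaultdict

# Hypothetical description of the example topology: cpu -> (node, core).
TOPOLOGY = {
    0: (0, 0), 1: (0, 0), 2: (0, 1), 3: (0, 1),
    4: (1, 2), 5: (1, 2), 6: (1, 3), 7: (1, 3),
}


def spread_irqs(topology, nr_irqs):
    """Return one (node, core, cpus) tuple per IRQ.

    Each IRQ gets the full sibling mask of a core. Within a node the
    cores are visited round-robin, and a node receives as many IRQs as
    it has CPUs before the next node is used.
    """
    # Group CPUs by node, then by core.
    nodes = defaultdict(lambda: defaultdict(list))
    for cpu in sorted(topology):
        node, core = topology[cpu]
        nodes[node][core].append(cpu)

    result = []
    for node in sorted(nodes):
        cores = nodes[node]
        core_ids = sorted(cores)
        cpus_in_node = sum(len(siblings) for siblings in cores.values())
        for i in range(cpus_in_node):
            if len(result) == nr_irqs:
                return result
            core = core_ids[i % len(core_ids)]
            result.append((node, core, tuple(cores[core])))
    return result


for irq, (node, core, cpus) in enumerate(spread_irqs(TOPOLOGY, 8)):
    print(irq, node, core, f"{cpus[0]}-{cpus[-1]}")
```

Asking for 8 IRQs yields the same core and CPU columns as the table;
asking for 2 IRQs yields the sibling masks 0-1 and 2-3.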

I didn't pay attention to the detailed discussion of this issue
over the past 2 to 3 weeks during the holidays in the U.S., but
the above doesn't align with the original problem as I understood
it.  I thought the original problem was to avoid putting IRQs on
both hyper-threads in the same core, and that the perf
improvements are based on that configuration.  At least that's
what the commit message for Patch 4/4 in this series says.

The above chart results in 8 IRQs being assigned to the 8 CPUs,
probably with 1 IRQ per CPU.   At least on x86, if the affinity
mask for an IRQ contains multiple CPUs, matrix_find_best_cpu()
should balance the IRQ assignments between the CPUs in the mask.
So the original problem is still present because both hyper-threads
in a core are likely to have an IRQ assigned.
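
To illustrate the balancing concern, here is a toy model. It assumes only
what is described above: an IRQ whose mask holds several CPUs lands on the
CPU in that mask that has the fewest IRQs so far. place_irqs() is a
hypothetical helper; it does not reproduce matrix_find_best_cpu(), which
works on the real per-CPU vector state rather than on these IRQs alone.

```python
def place_irqs(masks):
    """Pick a target CPU for each IRQ affinity mask, in order.

    Simplified assumption: the CPU in the mask with the fewest IRQs
    placed so far wins, and the lowest CPU number breaks ties.
    """
    load = {}
    placement = []
    for mask in masks:
        for cpu in mask:
            load.setdefault(cpu, 0)
        target = min(mask, key=lambda cpu: (load[cpu], cpu))
        load[target] += 1
        placement.append(target)
    return placement


# The eight sibling masks from the chart, IRQ 0 through IRQ 7.
masks = [(0, 1), (2, 3), (0, 1), (2, 3), (4, 5), (6, 7), (4, 5), (6, 7)]
print(place_irqs(masks))
```

Under this assumption the eight masks from the chart place one IRQ on each
of CPUs 0-7, so both hyper-threads of every core end up servicing an IRQ.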

Of course, this example has 8 IRQs and 8 CPUs, so assigning an
IRQ to every hyper-thread may be the only choice.  If that's the
case, maybe this just isn't a good example to illustrate the
original problem and solution.  But even with a better example
where the # of IRQs is <= half the # of CPUs in a NUMA node,
I don't think the code below accomplishes the original intent.

Maybe I've missed something along the way in getting to this
version of the patch.  Please feel free to set me straight. :-)

Michael

I have the same question as Michael. I'm also asking Souradeep
in another channel: the algorithm still uses up all CPUs in the
current NUMA node before moving on to the next NUMA node, right?

Except that each IRQ is now affinitized to 2 CPUs.
For example, on a system with 2 IRQs:
IRQ     Nodes   Cores  CPUs
0       1       0      0-1
1       1       1      2-3
 
Does this perform better than the algorithm in the earlier patches, shown below?
IRQ     Nodes   Cores  CPUs
0       1       0      0
1       1       1      2

Thanks,
- Haiyang
