Thread: 19 messages, 5 authors, 2024-01-16

Re: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

From: Souradeep Chakrabarti <hidden>
Date: 2024-01-16 06:13:45
Also in: linux-rdma, lkml, netdev

On Sat, Jan 13, 2024 at 11:11:50AM -0800, Yury Norov wrote:
On Sat, Jan 13, 2024 at 04:20:31PM +0000, Michael Kelley wrote:
quoted
From: Souradeep Chakrabarti <redacted> Sent: Friday, January 12, 2024 10:31 PM
quoted
On Fri, Jan 12, 2024 at 06:30:44PM +0000, Haiyang Zhang wrote:
quoted
quoted
-----Original Message-----
From: Michael Kelley <redacted> Sent: Friday, January 12, 2024 11:37 AM
quoted
From: Souradeep Chakrabarti <redacted> Sent:
Wednesday, January 10, 2024 10:13 PM
quoted
The test topology used to compare the performance of
cpu_local_spread() and the new approach is:
Case 1
IRQ     Nodes  Cores CPUs
0       1      0     0-1
1       1      1     2-3
2       1      2     4-5
3       1      3     6-7

and with existing cpu_local_spread()
Case 2
IRQ    Nodes  Cores CPUs
0      1      0     0
1      1      0     1
2      1      1     2
3      1      1     3

A total of 4 channels were used, set up with ethtool.
With ntttcp, case 1 gave 15 percent better performance than
case 2. irqbalance was disabled during the test.
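
The two layouts in the tables above can be reproduced with a small sketch. This is only an illustration, not the driver code: the function names are hypothetical, and it assumes 8 CPUs, SMT==2, and sibling threads numbered 2k and 2k+1 as the Case 1 table suggests.

```python
def spread_linear(num_irqs, num_cpus):
    """Case 2 layout: IRQ i is pinned to CPU i, wrapping around."""
    return {irq: [irq % num_cpus] for irq in range(num_irqs)}


def spread_per_core(num_irqs, num_cpus, smt=2):
    """Case 1 layout: IRQ i gets every SMT sibling of core i, wrapping around."""
    num_cores = num_cpus // smt
    layout = {}
    for irq in range(num_irqs):
        core = irq % num_cores
        layout[irq] = list(range(core * smt, (core + 1) * smt))
    return layout


def cores_used(layout, smt=2):
    """Number of distinct physical cores that carry at least one IRQ."""
    return len({cpu // smt for cpus in layout.values() for cpu in cpus})


# Assumed topology: 4 IRQs (channels) on 8 CPUs, 2 threads per core.
case1 = spread_per_core(4, 8)
case2 = spread_linear(4, 8)
print(case1, cores_used(case1))
print(case2, cores_used(case2))
```

With these assumptions the 4 IRQs occupy 4 cores in Case 1 and only 2 cores in Case 2, which is the difference the measurement is about.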

Also, you are right: on a 64-CPU system this approach will spread
the IRQs the same way cpu_local_spread() does. But in the future we
will offer MANA nodes with more than 64 CPUs, and there this new
design will give better performance.

I will add these performance details to the commit message of the
next version.
Here are my concerns:

1.  The most commonly used VMs these days have 64 or fewer
vCPUs and won't see any performance benefit.

2.  Larger VMs probably won't see the full 15% benefit because
all vCPUs in the local NUMA node will be assigned IRQs.  For
example, in a VM with 96 vCPUs and 2 NUMA nodes, all 48
vCPUs in NUMA node 0 will be assigned IRQs.  The remaining
16 IRQs will be spread out on the 48 CPUs in NUMA node 1
in a way that avoids sharing a core.  But overall this means
that 75% of the IRQs will still be sharing a core and
presumably not see any perf benefit.
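
The 75% figure in point 2 can be checked with a short calculation. This is a rough sketch of the assignment as described in that point, not of the patch itself: the helper name is hypothetical, 64 IRQs in total is inferred from the "remaining 16 IRQs", and SMT==2 is assumed.

```python
def irqs_sharing_a_core(num_irqs, cpus_per_node, num_nodes):
    """Count IRQs whose core also serves another IRQ (assumes SMT == 2).

    Within each node, IRQs first take one core each; once every core in
    the node has an IRQ, further IRQs go to the sibling threads.
    """
    cores_per_node = cpus_per_node // 2
    remaining = num_irqs
    sharing = 0
    for _ in range(num_nodes):
        placed = min(remaining, cpus_per_node)
        # Cores that end up with an IRQ on both of their threads.
        doubled_cores = max(0, placed - cores_per_node)
        sharing += 2 * doubled_cores
        remaining -= placed
    return sharing


# 96 vCPUs in 2 NUMA nodes (48 each), 64 IRQs assumed.
shared = irqs_sharing_a_core(64, 48, 2)
print(shared, shared / 64)
```

Under these assumptions 48 of the 64 IRQs share a core (all of node 0), and the 16 IRQs on node 1 each get a core of their own.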

3.  Your experiment was on a relatively small scale:   4 IRQs
spread across 2 cores vs. across 4 cores.  Have you run any
experiments on VMs with 128 vCPUs (for example) where
most of the IRQs are not sharing a core?  I'm wondering if
the results with 4 IRQs really scale up to 64 IRQs.  A lot can
be different in a VM with 64 cores and 2 NUMA nodes vs.
4 cores in a single node.

4.  The new algorithm prefers assigning to all vCPUs in
each NUMA hop over assigning to separate cores.  Are there
experiments showing that is the right tradeoff?  What
are the results if assigning to separate cores is preferred?
I remember that in a customer case, putting the IRQs on the same
NUMA node gave better perf. But I agree, this should be re-tested
on the MANA NIC.
1) and 2) The change will not decrease the existing performance, and
systems with a high number of CPUs will benefit from it.

3) The results showed around a 6 percent improvement.

4) The test results showed around a 10 percent difference when IRQs
are spread across multiple NUMA nodes.
OK, this looks pretty good.  Make clear in the commit messages what
the tradeoffs are, and what the real-world benefits are expected to be.
Some future developer who wants to understand why IRQs are assigned
this way will thank you. :-)
I agree with Michael, this needs to be spoken aloud.
quoted
From the above, is it correct that the best performance is achieved
when the # of IRQs is half the number of CPUs in the 1st node, because
this configuration allows spreading IRQs across cores in the most
optimal way?  And if we have more or less than that, it hurts
performance, at least for MANA networking?
It does not decrease the performance compared to the current
cpu_local_spread(), but optimum performance comes when the node has
twice as many CPUs as there are IRQs (considering SMT==2).

Only when the number of CPUs is the same as the number of IRQs
(that is, num of CPUs <= 64) do we see the same performance as the
existing design with cpu_local_spread().

If the node has more than 64 CPUs, then we get better performance
than cpu_local_spread().
So, the B|A performance chart may look like this, right?

  irq     nodes     cores     cpus      perf
  0       1 | 1     0 | 0     0 | 0-1      0%
  1       1 | 1     0 | 1     1 | 2-3     +5%
  2       1 | 1     1 | 2     2 | 4-5    +10%
  3       1 | 1     1 | 3     3 | 6-7    +15%
  4       1 | 1     0 | 4     3 | 0-1    +12%
  ...       |         |         |
  7       1 | 1     1 | 7     3 | 6-7      0%
  ...
 15       2 | 2     3 | 3    15 | 14-15    0%

Souradeep, can you please confirm that my understanding is correct?

In v5, can you add a table like the above with real performance
numbers for your driver? I think that it would help people to
configure their VMs better when networking is a bottleneck.
I will share a chart in the next version of patch 3.
Thanks for the suggestion.
Thanks,
Yury