Thread (16 messages) 16 messages, 4 authors, 2022-12-08

Re: [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality

From: Yury Norov <yury.norov@gmail.com>
Date: 2022-11-15 18:33:08
Also in: linux-crypto, linux-rdma, lkml

On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
Hi,

On 12/11/22 11:09, Yury Norov wrote:
quoted
cpumask_local_spread() currently checks local node for presence of i'th
CPU, and then if it finds nothing makes a flat search among all non-local
CPUs. We can do it better by checking CPUs per NUMA hops.

This series is inspired by Tariq Toukan and Valentin Schneider's "net/mlx5e:
Improve remote NUMA preferences used for the IRQ affinity hints"

https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/

According to their measurements, for mlx5e:

        Bottleneck in RX side is released, reached linerate (~1.8x speedup).
        ~30% less cpu util on TX.

This patch makes cpumask_local_spread() traversing CPUs based on NUMA
distance, just as well, and I expect comparabale improvement for its
users, as in case of mlx5e.

I tested new behavior on my VM with the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node   0   1   2   3
  0:  10  50  30  70
  1:  50  10  70  30
  2:  30  70  10  50
  3:  70  30  50  10

And the cpumask_local_spread() for each node and offset traversing looks
like this:

node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3
Is this meant as a replacement for [1]?
No. Your series adds an iterator, and in my experience the code that
uses iterators of that sort is almost always better and easier to
understand than cpumask_nth() or cpumask_next()-like users.

My series has the only advantage that it allows keep existing codebase
untouched.
 
I like that this is changing an existing interface so that all current
users directly benefit from the change. Now, about half of the users of
cpumask_local_spread() use it in a loop with incremental @i parameter,
which makes the repeated bsearch a bit of a shame, but then I'm tempted to
say the first point makes it worth it.

[1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/ (local)
In terms of very common case of sequential invocation of local_spread()
for cpus from 0 to nr_cpu_ids, the complexity of my approach is n * log n,
and your approach is amortized O(n), which is better. Not a big deal _now_,
as you mentioned in the other email. But we never know how things will
evolve, right?

So, I would take both and maybe in comment to cpumask_local_spread()
mention that there's a better alternative for those who call the
function for all CPUs incrementally.

Thanks,
Yury
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help