Re: [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
From: Yury Norov <yury.norov@gmail.com>
Date: 2022-11-15 18:33:08
Also in:
linux-crypto, linux-rdma, lkml
On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
Hi, On 12/11/22 11:09, Yury Norov wrote:quoted
cpumask_local_spread() currently checks local node for presence of i'th CPU, and then if it finds nothing makes a flat search among all non-local CPUs. We can do it better by checking CPUs per NUMA hops. This series is inspired by Tariq Toukan and Valentin Schneider's "net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints" https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/ According to their measurements, for mlx5e: Bottleneck in RX side is released, reached linerate (~1.8x speedup). ~30% less cpu util on TX. This patch makes cpumask_local_spread() traversing CPUs based on NUMA distance, just as well, and I expect comparabale improvement for its users, as in case of mlx5e. I tested new behavior on my VM with the following NUMA configuration: root@debian:~# numactl -H available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 node 0 size: 3869 MB node 0 free: 3740 MB node 1 cpus: 4 5 node 1 size: 1969 MB node 1 free: 1937 MB node 2 cpus: 6 7 node 2 size: 1967 MB node 2 free: 1873 MB node 3 cpus: 8 9 10 11 12 13 14 15 node 3 size: 7842 MB node 3 free: 7723 MB node distances: node 0 1 2 3 0: 10 50 30 70 1: 50 10 70 30 2: 30 70 10 50 3: 70 30 50 10 And the cpumask_local_spread() for each node and offset traversing looks like this: node 0: 0 1 2 3 6 7 4 5 8 9 10 11 12 13 14 15 node 1: 4 5 8 9 10 11 12 13 14 15 0 1 2 3 6 7 node 2: 6 7 0 1 2 3 8 9 10 11 12 13 14 15 4 5 node 3: 8 9 10 11 12 13 14 15 4 5 6 7 0 1 2 3Is this meant as a replacement for [1]?
No. Your series adds an iterator, and in my experience the code that uses iterators of that sort is almost always better and easier to understand than cpumask_nth() or cpumask_next()-like users. My series has the only advantage that it allows keep existing codebase untouched.
I like that this is changing an existing interface so that all current users directly benefit from the change. Now, about half of the users of cpumask_local_spread() use it in a loop with incremental @i parameter, which makes the repeated bsearch a bit of a shame, but then I'm tempted to say the first point makes it worth it. [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/ (local)
In terms of very common case of sequential invocation of local_spread() for cpus from 0 to nr_cpu_ids, the complexity of my approach is n * log n, and your approach is amortized O(n), which is better. Not a big deal _now_, as you mentioned in the other email. But we never know how things will evolve, right? So, I would take both and maybe in comment to cpumask_local_spread() mention that there's a better alternative for those who call the function for all CPUs incrementally. Thanks, Yury