Re: [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: 2024-08-13 09:51:52
Also in: bpf, lkml
On 13/08/2024 03.33, Jakub Kicinski wrote:
> On Fri, 9 Aug 2024 14:20:25 +0200 Alexander Lobakin wrote:
>> But I think one solution could be:
>>
>> 1. We create some generic structure for cpumap, like
>>
>> struct cpumap_meta {
>> 	u32 magic;
>> 	u32 hash;
>> }
>>
>> 2. We add such check in the cpumap code
>>
>> if (xdpf->metalen == sizeof(struct cpumap_meta) &&
>>     <here we check magic>)
>> 	skb->hash = meta->hash;
>>
>> 3. In XDP prog, you call Rx hints kfuncs when they're available,
>>    obtain RSS hash and then put it in the struct cpumap_meta as XDP
>>    frame metadata.
>
> I wonder what the overhead of skb metadata allocation is in practice.
> With Eric's "return skb to the CPU of origin" we can feed the lockless
> skb cache on the right CPU, and also feed the lockless page pool cache.
> I wonder if batched RFS wouldn't be faster than the XDP thing that
> requires all the groundwork.

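For readers following the quoted proposal, the three steps can be illustrated with a small userspace sketch. This is not kernel or BPF code. `CPUMAP_META_MAGIC`, `struct fake_xdp_frame`, `struct fake_skb`, `fill_cpumap_meta()` and `cpumap_apply_meta()` are hypothetical names invented for the illustration; only the `struct cpumap_meta` layout and the length-plus-magic check come from the quoted text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical magic value; the thread does not specify one. */
#define CPUMAP_META_MAGIC 0x43504d50u

/* Step 1: layout as proposed in the thread (u32 magic; u32 hash). */
struct cpumap_meta {
	uint32_t magic;
	uint32_t hash;
};

/* Simplified stand-ins for the kernel's xdp_frame and sk_buff. */
struct fake_xdp_frame {
	const void *meta;	/* start of the metadata area */
	size_t metalen;		/* number of metadata bytes */
};

struct fake_skb {
	uint32_t hash;
};

/*
 * Step 3 (XDP program side): store the RSS hash in the metadata.
 * In a real XDP program the hash would come from an Rx hints kfunc;
 * here it is simply passed in by the caller.
 */
static void fill_cpumap_meta(struct cpumap_meta *meta, uint32_t rss_hash)
{
	meta->magic = CPUMAP_META_MAGIC;
	meta->hash = rss_hash;
}

/*
 * Step 2 (cpumap side): accept the metadata only if both the length
 * and the magic match, then copy the hash into the skb stand-in.
 * Returns true when the hash was applied.
 */
static bool cpumap_apply_meta(const struct fake_xdp_frame *xdpf,
			      struct fake_skb *skb)
{
	struct cpumap_meta meta;

	if (xdpf->metalen != sizeof(meta))
		return false;

	/* memcpy avoids alignment assumptions about the metadata area. */
	memcpy(&meta, xdpf->meta, sizeof(meta));
	if (meta.magic != CPUMAP_META_MAGIC)
		return false;

	skb->hash = meta.hash;
	return true;
}
```

The length check alone cannot distinguish this layout from any other 8-byte metadata an XDP program might write, which is presumably why the proposal pairs it with a magic value.
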
I explicitly developed CPUMAP because I was benchmarking Receive Flow
Steering (RFS) and Receive Packet Steering (RPS), which I observed were
the bottleneck. The overhead on the RX-CPU was too large, and the
bottleneck was due to RFS and RPS maintaining data structures to avoid
Out-of-Order packets. The Flow Dissector step was also a limiting
factor.

By bottleneck I mean it didn't scale, as the RX-CPU packets-per-second
(pps) processing speed was too low compared to the remote-CPU pps.
Digging in my old notes, I can see that RPS was limited to around
4.8 Mpps (and I have a weird result, with part of it disabled, showing
7.5 Mpps). In [1] the remote-CPU could process (starting at) 2.7 Mpps
when dropping UDP packets due to UdpNoPorts (with a baseline of 3.3 Mpps
if not remote), thus it only scales up to 1.78 remote-CPUs (4.8 / 2.7).
[1] shows how optimizations bring the remote-CPU to handle 3.2 Mpps
(close to the non-remote 3.3 Mpps baseline). In [2] those optimizations
bring the remote-CPU to 4 Mpps (for the UdpNoPorts case). XDP
RX-redirect in [1]+[2] was around 19 Mpps (which might be lower today
due to perf paper cuts).

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap02-optimizations.org
[2] https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap03-optimizations.org

Eric's "return skb to the CPU of origin" should help improve the case
for the remote-CPU, as I was seeing some bottlenecks in how we returned
the memory.

--Jesper