Thread: 98 messages, 14 authors, 2024-08-21

Re: [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()

From: Daniel Xu <hidden>
Date: 2024-08-21 00:29:52
Also in: bpf, lkml
Subsystem: bpf [general] (safe dynamic programs and tools), the rest, xdp (express data path) · Maintainers: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi, Linus Torvalds, David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend

Hi Olek,

On Mon, Aug 19, 2024 at 04:50:52PM GMT, Alexander Lobakin wrote:
[..]
> > Thanks A LOT for doing this benchmarking!
>
> I optimized the code a bit and picked my old patches for bulk NAPI skb
> cache allocation and today I got 4.7 Mpps 🎉
> IOW, the result of the series (7 patches totally, but 2 are not
> networking-related) is 2.7 -> 4.7 Mpps == 75%!
>
> Daniel,
>
> if you want, you can pick my tree[0], either full or just up to
>
> "bpf: cpumap: switch to napi_skb_cache_get_bulk()"
>
> (13 patches total: 6 for netdev_feature_t and 7 for the cpumap)
>
> and test with your usecases. Would be nice to see some real world
> results, not my synthetic tests :D
>
> > --Jesper
>
> [0]
> https://github.com/alobakin/linux/compare/idpf-libie-new~52...idpf-libie-new/

So it turns out keeping the workload in place while I update and reboot
the kernel is a Hard Problem. I'll put in some more effort and see if I
can get one of the workloads to stay still, but it'll be a somewhat
noisy test even if it works. So the following are synthetic tests
(neper) but on a real prod setup as far as container networking and
configuration is concerned.

I cherry-picked 586be610~1..ca22ac8e9de onto our 6.9-ish branch. Had to
skip some of the flag refactors b/c of conflicts - I didn't know the
code well enough to do fixups. So I had to apply this diff (FWIW not sure
the struct_size() here was right anyhow):
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 089d19c62efe..359fbfaa43eb 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -110,7 +110,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 	if (!cmap->cpu_map)
 		goto free_cmap;
 
-	dev = bpf_map_area_alloc(struct_size(dev, priv, 0), NUMA_NO_NODE);
+	dev = bpf_map_area_alloc(sizeof(*dev), NUMA_NO_NODE);
 	if (!dev)
 		goto free_cpu_map;
 
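
For reference, a minimal userspace sketch of why the substitution above should not change the allocation size: with an element count of 0, a struct_size()-style computation reduces to sizeof(*dev). Everything here is hypothetical (struct fake_dev, fake_struct_size(), fake_dev_alloc_size()); it is not the kernel's struct net_device, and the macro leaves out the overflow saturation that the kernel's struct_size() performs.

```c
#include <stddef.h>

/* Hypothetical stand-in for a structure that ends in a flexible array
 * member. This is NOT the kernel's struct net_device. */
struct fake_dev {
	unsigned long state;
	unsigned int flags;
	unsigned char priv[];	/* flexible array member */
};

/* Userspace approximation of struct_size(p, member, count): size of the
 * fixed part plus count trailing elements. Unlike the kernel helper, it
 * does not saturate on overflow. */
#define fake_struct_size(p, member, count) \
	(sizeof(*(p)) + (size_t)(count) * sizeof((p)->member[0]))

/* Bytes needed for one fake_dev carrying priv_len bytes of private data. */
size_t fake_dev_alloc_size(size_t priv_len)
{
	struct fake_dev *dev = NULL;	/* only used inside sizeof, never dereferenced */

	return fake_struct_size(dev, priv, priv_len);
}
```

Under these assumptions, fake_dev_alloc_size(0) equals sizeof(struct fake_dev), so the two forms only differ once a non-zero count is passed.
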
=== Baseline ===
Transactions/latency: ./tcp_rr -c -H $SERVER -p 50,90,99 -T4 -F8 -l30
Throughput:           ./tcp_stream -c -H $SERVER -T8 -F16 -l30

         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)  Throughput (Mbit/s)
Run 1    2578189       0.00008831       0.00010623       0.00013439       15427.22
Run 2    2657923       0.00008575       0.00010239       0.00012927       15272.12
Run 3    2700402       0.00008447       0.00010111       0.00013183       14871.35
Run 4    2571739       0.00008575       0.00011519       0.00013823       15344.72
Run 5    2476427       0.00008703       0.00013055       0.00016895       15193.2
Average  2596936       0.000086262      0.000111094      0.000140534      15221.722

=== cpumap NAPI patches ===
         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)  Throughput (Mbit/s)
Run 1    2554598       0.00008703       0.00011263       0.00013055       17090.29
Run 2    2478905       0.00009087       0.00011391       0.00014463       16742.27
Run 3    2418599       0.00009471       0.00011007       0.00014207       17555.3
Run 4    2562463       0.00008959       0.00010367       0.00013055       17892.3
Run 5    2716551       0.00008127       0.00010879       0.00013439       17578.32
Average  2546223.2     0.000088694      0.000109814      0.000136438      17371.696
Delta    -1.95%        2.82%            -1.15%           -2.91%           14.12%
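
For anyone who wants to re-derive the Delta row, here is a small sketch of the arithmetic, assuming Delta is the relative change of the patched average over the baseline average, in percent. The helper names run_mean() and delta_pct() and the array names are made up for illustration; the throughput numbers are copied from the tcp_stream column above.

```c
#include <stddef.h>

/* Arithmetic mean of n samples (n must be non-zero). */
double run_mean(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Relative change from base to patched, in percent (base must be non-zero). */
double delta_pct(double base, double patched)
{
	return (patched - base) / base * 100.0;
}

/* tcp_stream throughput in Mbit/s, runs 1-5, copied from the tables. */
const double stream_base[5] = {
	15427.22, 15272.12, 14871.35, 15344.72, 15193.2
};
const double stream_napi[5] = {
	17090.29, 16742.27, 17555.3, 17892.3, 17578.32
};
```

Calling delta_pct(run_mean(stream_base, 5), run_mean(stream_napi, 5)) gives roughly 14.12, and delta_pct(2596936, 2546223.2) gives roughly -1.95 for the tcp_rr transactions, in line with the Delta row.
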


So it looks like the GRO patches work quite well out of the box. It's
curious that tcp_rr transactions go down a bit, though. I don't have any
intuition around that.

Lemme know if you wanna change some stuff and get a rerun.

Thanks,
Daniel