Re: [PATCH net-next] net: adopt SLUB sheaves for skbuff_small_head
From: Eric Dumazet <edumazet@google.com>
Date: 2026-03-01 16:30:34
On Sun, Mar 1, 2026 at 12:24 PM Vlastimil Babka [off-list ref] wrote:
> On 2/28/26 15:12, Eric Dumazet wrote:
>> skbuff_small_head is used on both the receive and send paths, serving potentially 80 million allocations and frees per second. Tuning it on large servers has been problematic, especially on AMD Turin platforms, where "lock cmpxchg16b" latency can be over 30,000 cycles.
>
> Huh, really? That sounds insane. Any pointers about that?
Yes, obviously on semi-contended cache lines.
>> Switching to SLUB sheaves fixes the issue nicely. The tcp_rr benchmark with 10,000 flows goes from 25 Mpps to 40 Mpps on AMD Turin. Other platforms show benefits with tcp_rr with more than 30,000 flows.
>
> That's nice, thanks! However I must point out some caveats. I assume you did this on 6.19, where sheaves are still opt-in. But when you opt in, the pre-existing per-cpu caching layer of percpu slab and percpu partial slabs is also still there, so effectively the amount of percpu cached slab objects increases, which can be the main performance difference for some workloads, rather than the difference between the sheaves and percpu (partial) slabs implementations.
Tests are on 6.18 LTS kernel, on which our latest production kernel is based.
> But hopefully for your workload it's really the implementation. "(lock) cmpxchg16b" should be avoided, until you start freeing NUMA-remote (to the freeing cpu) objects in significant volumes.
Right, __slab_free() is absolutely not 'slow path' when we have ~80,000 in-flight objects on a 512 cpu host.
> In 7.0-rc1 sheaves are enabled for every cache automatically, and cpu (partial) caches are gone completely. Sheaf capacity is calculated to roughly match the average amount of percpu caching the old scheme achieved (but that effectively depended on the workload too, so it can't be exactly translated), and the result is visible in /sys/kernel/slab/$cache/sheaf_capacity. The args.sheaf_capacity can override that automatic sizing, if the specified one is larger.
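A minimal sketch of how the capacity could be inspected from user space, assuming a kernel that exposes the sheaf_capacity attribute mentioned above. The helper names (read_sheaf_capacity, effective_capacity) are hypothetical, and effective_capacity only models the override rule as it is described in this thread; it is not kernel code.

```python
from pathlib import Path
from typing import Optional

# Default sysfs root for slab caches, as referenced in the thread.
SLAB_SYSFS = Path("/sys/kernel/slab")


def sheaf_capacity_path(cache: str, root: Path = SLAB_SYSFS) -> Path:
    """Return the path of the sheaf_capacity attribute for one cache."""
    return root / cache / "sheaf_capacity"


def read_sheaf_capacity(cache: str, root: Path = SLAB_SYSFS) -> Optional[int]:
    """Read the capacity; return None if the attribute is absent or unreadable."""
    try:
        return int(sheaf_capacity_path(cache, root).read_text().strip())
    except (OSError, ValueError):
        return None


def effective_capacity(auto: int, requested: int) -> int:
    """Model of the rule described above (an assumption, not kernel code):
    a requested capacity only takes effect if it is larger than the automatic one."""
    return max(auto, requested)


if __name__ == "__main__":
    cap = read_sheaf_capacity("skbuff_small_head")
    shown = cap if cap is not None else "unavailable"
    print("skbuff_small_head sheaf_capacity:", shown)
```

The root parameter exists only so the reader can be pointed at a scratch directory; on a kernel without the attribute the function returns None instead of raising.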
Nice, I did not know that (I am not following lkml traffic)
> So what I would suggest is checking the performance between 6.19 and 7.0-rc1 without this patch (hopefully there won't be any other factors in the upgrade influencing this much), noting the auto-calculated capacity. If it still looks good, you don't need to do anything; otherwise you can try making the capacity larger and see what happens.
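A small sketch of the bookkeeping for such a comparison, assuming throughput is recorded in Mpps for each kernel. The function name relative_change is hypothetical. The only figures used are the ones reported earlier in this thread (25 Mpps without the patch, 40 Mpps with it, on the 6.18-based kernel); no 7.0-rc1 numbers are implied.

```python
def relative_change(baseline_mpps: float, candidate_mpps: float) -> float:
    """Fractional throughput change of candidate relative to baseline."""
    if baseline_mpps <= 0:
        raise ValueError("baseline must be positive")
    return (candidate_mpps - baseline_mpps) / baseline_mpps


if __name__ == "__main__":
    # tcp_rr, 10,000 flows, AMD Turin: figures quoted in the patch description.
    gain = relative_change(25.0, 40.0)
    print(f"patched vs unpatched: {gain:+.0%}")
```

Running the same calculation on an unpatched kernel with automatic sizing, alongside the recorded sheaf_capacity, would show whether the automatic value already recovers a comparable gain.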
I cannot test this on 7.0-rc1 yet. I guess we will carry this patch privately, and come back in a few months when I can get our infra ready. Thanks.