Re: [PATCH net-next] net: adopt SLUB sheaves for skbuff_small_head

From: Eric Dumazet <edumazet@google.com>
Date: 2026-03-01 16:30:34

On Sun, Mar 1, 2026 at 12:24 PM Vlastimil Babka [off-list ref] wrote:

On 2/28/26 15:12, Eric Dumazet wrote:

quoted

skbuff_small_head is used both on receive and send paths,
serving potentially 80 million allocations and frees per second.

Tuning it on large servers has been problematic, especially
on AMD Turin platforms, where "lock cmpxchg16b" latency can
be over 30,000 cycles.

Huh, really? That sounds insane. Any pointers about that?

Yes, obviously on semi-contended cache lines.

quoted

Switching to SLUB sheaves fixes the issue nicely.

tcp_rr benchmark with 10,000 flows goes from 25 Mpps to 40 Mpps
on AMD Turin.

Other platforms show benefits with tcp_rr with more than 30,000
flows.

That's nice, thanks!

However I must point out some caveats. I assume you did this on 6.19, where
sheaves are still opt-in. But also, when you opt-in, the pre-existing
per-cpu caching layer of percpu slab and percpu partial slabs is also still
there, so effectively the amount of percpu cached slab objects increases,
which can be the main performance difference for some workloads, and not the
difference between sheaves and percpu (partial) slabs implementation.

Tests are on 6.18 LTS kernel, on which our latest production kernel is based.

Note: but hopefully for your workload it's really the implementation.
"(lock) cmpxchg16b" should be avoided, until you start freeing NUMA-remote
(to the freeing cpu) objects in significant volumes.

Right, __slab_free() is absolutely not 'slow path' when we have
~80,000 in-flight objects on a 512 cpu host.

In 7.0-rc1 sheaves are enabled for every cache automatically, and cpu
(partial) caches are gone completely. Their size is calculated to roughly
match the average amount of percpu caching the old scheme achieved (but that
effectively depended on the workload too, so can't be exactly translated)
and the result is visible in /sys/kernel/slab/$cache/sheaf_capacity.
The args.sheaf_capacity value can override that automatic sizing, if the
specified one is larger.
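
To make the capacity discussion concrete, here is a rough user-space sketch,
not kernel code. It models one CPU's sheaf as a plain counter and counts how
often an alloc or free has to leave the per-cpu fast path. All names
(toy_sheaf, toy_alloc, toy_free, toy_effective_capacity) are made up for
illustration, and the real SLUB implementation is considerably more involved.
The only rule taken from the text above is that a requested capacity applies
only when it is larger than the automatic one.

```c
#include <stddef.h>

/* Toy model of one CPU's sheaf: only the object count is tracked. */
struct toy_sheaf {
	unsigned int capacity;    /* how many objects the sheaf can cache */
	unsigned int size;        /* how many objects are cached right now */
	unsigned long shared_ops; /* times we left the per-cpu fast path */
};

/* A requested capacity only wins if it is larger than the automatic one. */
static unsigned int toy_effective_capacity(unsigned int auto_cap,
					   unsigned int requested)
{
	return requested > auto_cap ? requested : auto_cap;
}

static void toy_sheaf_init(struct toy_sheaf *s, unsigned int capacity)
{
	s->capacity = capacity;
	s->size = 0;
	s->shared_ops = 0;
}

/* Allocate one object: refill the whole sheaf at once when it is empty. */
static void toy_alloc(struct toy_sheaf *s)
{
	if (s->size == 0) {
		s->shared_ops++;        /* assumed cost: one shared operation */
		s->size = s->capacity;
	}
	s->size--;
}

/* Free one object: flush the whole sheaf at once when it is full. */
static void toy_free(struct toy_sheaf *s)
{
	if (s->size == s->capacity) {
		s->shared_ops++;        /* assumed cost: one shared operation */
		s->size = 0;
	}
	s->size++;
}
```

In this model the number of shared operations falls roughly as 1/capacity for
bursty alloc/free patterns, which is the intuition behind trying a larger
capacity when the automatic value is not enough.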

Nice, I did not know that (I am not following lkml traffic).

So what I would suggest is checking the performance between 6.19 and 7.0-rc1
without this patch (hope there won't be any other factors in the upgrade
influencing this much), noting the auto-calculated capacity. If it still
looks good, you don't need to do anything, otherwise you can try making the
capacity larger and see what happens.
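
For noting the auto-calculated capacity, a small user-space helper along
these lines might do. It is only a sketch: the sysfs path is the one quoted
above, the helper names are hypothetical, and it assumes the attribute file
holds a single decimal integer.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read one unsigned decimal value from a sysfs-style attribute file.
 * Returns 0 and stores the value in *out on success, -1 on any failure.
 */
static int read_sheaf_capacity_file(const char *path, unsigned long *out)
{
	char buf[64];
	char *end;
	unsigned long val;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	errno = 0;
	val = strtoul(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -1;      /* out of range or not a number */

	*out = val;
	return 0;
}

/* Build /sys/kernel/slab/<cache>/sheaf_capacity and read it. */
static int read_sheaf_capacity(const char *cache, unsigned long *out)
{
	char path[256];
	int n = snprintf(path, sizeof(path),
			 "/sys/kernel/slab/%s/sheaf_capacity", cache);

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;      /* cache name too long for the buffer */
	return read_sheaf_capacity_file(path, out);
}
```

Calling read_sheaf_capacity("skbuff_small_head", &cap) before and after a
kernel upgrade would record the value to compare against the benchmark
results.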

I cannot test this on 7.0-rc1 yet.

I guess we will carry this patch privately, and will come back in a
few months when I can get our infra ready.

Thanks.
