Re: [PATCH net-next v4] net: skb: introduce and use a single page frag cache
From: Eric Dumazet <edumazet@google.com>
Date: 2022-10-03 22:31:13
On Mon, Oct 3, 2022 at 2:40 PM Paolo Abeni [off-list ref] wrote:
On Sun, 2022-10-02 at 16:55 +0200, Paolo Abeni wrote:quoted
On Fri, 2022-09-30 at 10:45 -0700, Eric Dumazet wrote:quoted
On Fri, Sep 30, 2022 at 10:30 AM Paolo Abeni [off-list ref] wrote:quoted
On Fri, 2022-09-30 at 09:43 -0700, Eric Dumazet wrote:quoted
Paolo, this patch adds a regression for TCP RPC workloads (aka TCP_RR) Before the patch, cpus servicing NIC interrupts were allocating SLAB/SLUB objects for incoming packets, but they were also freeing skbs from TCP rtx queues when ACK packets were processed. SLAB/SLUB caches were efficient (hit ratio close to 100%)Thank you for the report. Is that reproducible with netperf TCP_RR and CONFIG_DEBUG_SLAB, I guess? Do I need specific request/response sizes?No CONFIG_DEBUG_SLAB, simply standard SLAB, and tcp_rr tests on an AMD host with 256 cpus...The most similar host I can easily grab is a 2 numa nodes AMD with 16 cores/32 threads each. I tried tcp_rr with different number of flows in (1-2048) range with both slub (I tried first that because is the default allocator in my builds) and slab but so far I can't reproduce it: results differences between pre-patch and post-patch kernel are within noise and numastat never show misses. I'm likely missing some incredient to the recipe. I'll try next to pin the netperf/netserver processes on a numa node different from the NIC's one and to increase the number of concurrent flows.I'm still stuck trying to reproduce the issue. I tried pinning and increasing the flows numbers, but I could not find a scenario with a clear regression. I see some contention, but it's related to the timer wheel spinlock, and independent from my patch.
I was using neper ( https://github.com/google/neper ), with 100 threads and 4000 flows. I suspect contention is hard to see unless you use a host with at least 128 cores. Also you need a lot of NIC receive queues (or use RPS to spread incoming packets to many cores)
Which kind of delta should I observe? Could you please share any additional setup hints?quoted
I'm also wondering, after commit 68822bdf76f10 ("net: generalize skb freeing deferral to per-cpu lists") the CPUs servicing the NIC interrupts both allocate and (defer) free the memory for the incoming packets, so they should not have to access remote caches ?!? Or at least isn't the allocator behavior always asymmetric, with rx alloc, rx free, tx free on the same core and tx alloc possibly on a different one?quoted
We could first try in tcp_stream_alloc_skb()I'll try to test something alike the following - after reproducing the issue. ---diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index f15d5b62539b..d5e9be98e8bd 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h@@ -1308,6 +1308,30 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size, return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE); } +#if PAGE_SIZE == SZ_4K + +#define NAPI_HAS_SMALL_PAGE_FRAG 1 + +struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t priority); + +static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority) +{ + if (size <= SKB_WITH_OVERHEAD(1024)) + return __alloc_skb_fclone_frag(size, priority); + + return alloc_skb_fclone(size, priority); +} + +#else +#define NAPI_HAS_SMALL_PAGE_FRAG 0 + +static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority) +{ + return alloc_skb_fclone(size, priority); +} + +#endif + struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src); void skb_headers_offset_update(struct sk_buff *skb, int off); int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 81d63f95e865..0c63653c9951 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c@@ -134,9 +134,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) #define NAPI_SKB_CACHE_BULK 16 #define NAPI_SKB_CACHE_HALF (NAPI_SKB_CACHE_SIZE / 2) -#if PAGE_SIZE == SZ_4K +#if NAPI_HAS_SMALL_PAGE_FRAG -#define NAPI_HAS_SMALL_PAGE_FRAG 1 #define NAPI_SMALL_PAGE_PFMEMALLOC(nc) ((nc).pfmemalloc) /* specialized page frag allocator using a single order 0 page@@ -173,12 +172,12 @@ static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp) nc->offset = offset; return nc->va + offset; } + #else /* the small page is actually unused in this build; add dummy helpers * to please the compiler and avoid later preprocessor's conditionals */ -#define NAPI_HAS_SMALL_PAGE_FRAG 0 #define NAPI_SMALL_PAGE_PFMEMALLOC(nc) false struct page_frag_1k {@@ -543,6 +542,52 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, } EXPORT_SYMBOL(__alloc_skb); +#if NAPI_HAS_SMALL_PAGE_FRAG +/* optimized skb fast-clone allocation variant, using the small + * page frag cache + */ +struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t gfp_mask) +{ + struct sk_buff_fclones *fclones; + struct napi_alloc_cache *nc; + struct sk_buff *skb; + bool pfmemalloc; + void *data; + + /* Get the HEAD */ + skb = kmem_cache_alloc_node(skbuff_fclone_cache, gfp_mask & ~GFP_DMA, NUMA_NO_NODE); + if (unlikely(!skb)) + return NULL; + prefetchw(skb); + + /* We can access the napi cache only in BH context */ + nc = this_cpu_ptr(&napi_alloc_cache); + local_bh_disable();For the record, the above is wrong, this_cpu_ptr must be after the local_bh_disable() call.
Can you send me (privately maybe) an updated patch ? Thanks.