Re: [PATCH net-next v4] net: skb: introduce and use a single page frag cache

From: Eric Dumazet <edumazet@google.com>
Date: 2022-10-03 22:31:13

On Mon, Oct 3, 2022 at 2:40 PM Paolo Abeni [off-list ref] wrote:

On Sun, 2022-10-02 at 16:55 +0200, Paolo Abeni wrote:

quoted

On Fri, 2022-09-30 at 10:45 -0700, Eric Dumazet wrote:

quoted

On Fri, Sep 30, 2022 at 10:30 AM Paolo Abeni [off-list ref] wrote:

quoted

On Fri, 2022-09-30 at 09:43 -0700, Eric Dumazet wrote:

quoted

Paolo, this patch adds a regression for TCP RPC workloads (aka TCP_RR)

Before the patch, cpus servicing NIC interrupts were allocating
SLAB/SLUB objects for incoming packets,
but they were also freeing skbs from TCP rtx queues when ACK packets
were processed. SLAB/SLUB caches
were efficient (hit ratio close to 100%)

Thank you for the report. Is that reproducible with netperf TCP_RR and
CONFIG_DEBUG_SLAB, I guess? Do I need specific request/response sizes?

No CONFIG_DEBUG_SLAB, simply standard SLAB, and tcp_rr tests on an AMD
host with 256 cpus...

The most similar host I can easily grab is a 2 numa nodes AMD with 16
cores/32 threads each.

I tried tcp_rr with different number of flows in (1-2048) range with
both slub (I tried first that because is the default allocator in my
builds) and slab but so far I can't reproduce it: results differences
between pre-patch and post-patch kernel are within noise and numastat
never show misses.

I'm likely missing some incredient to the recipe. I'll try next to pin
the netperf/netserver processes on a numa node different from the NIC's
one and to increase the number of concurrent flows.

I'm still stuck trying to reproduce the issue. I tried pinning and
increasing the flows numbers, but I could not find a scenario with a
clear regression. I see some contention, but it's related to the timer
wheel spinlock, and independent from my patch.

I was using neper ( https://github.com/google/neper ), with 100
threads and 4000 flows.

I suspect contention is hard to see unless you use a host with at
least 128 cores.
Also you need a lot of NIC receive queues (or use RPS to spread
incoming packets to many cores)

Which kind of delta should I observe? Could you please share any
additional setup hints?

quoted

I'm also wondering, after commit 68822bdf76f10 ("net: generalize skb
freeing deferral to per-cpu lists") the CPUs servicing the NIC
interrupts both allocate and (defer) free the memory for the incoming
packets,  so they should not have to access remote caches ?!? Or at
least isn't the allocator behavior always asymmetric, with rx alloc, rx
free, tx free on the same core and tx alloc possibly on a different
one?

quoted

We could first try in tcp_stream_alloc_skb()

I'll try to test something alike the following - after reproducing the
issue.
---

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f15d5b62539b..d5e9be98e8bd 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h

@@ -1308,6 +1308,30 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
      return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
 }

+#if PAGE_SIZE == SZ_4K
+
+#define NAPI_HAS_SMALL_PAGE_FRAG     1
+
+struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t priority);
+
+static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority)
+{
+     if (size <= SKB_WITH_OVERHEAD(1024))
+             return __alloc_skb_fclone_frag(size, priority);
+
+     return alloc_skb_fclone(size, priority);
+}
+
+#else
+#define NAPI_HAS_SMALL_PAGE_FRAG     0
+
+static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority)
+{
+     return alloc_skb_fclone(size, priority);
+}
+
+#endif
+
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
 void skb_headers_offset_update(struct sk_buff *skb, int off);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 81d63f95e865..0c63653c9951 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c

@@ -134,9 +134,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
 #define NAPI_SKB_CACHE_BULK  16
 #define NAPI_SKB_CACHE_HALF  (NAPI_SKB_CACHE_SIZE / 2)

-#if PAGE_SIZE == SZ_4K
+#if NAPI_HAS_SMALL_PAGE_FRAG

-#define NAPI_HAS_SMALL_PAGE_FRAG     1
 #define NAPI_SMALL_PAGE_PFMEMALLOC(nc)       ((nc).pfmemalloc)

 /* specialized page frag allocator using a single order 0 page

@@ -173,12 +172,12 @@ static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp)
      nc->offset = offset;
      return nc->va + offset;
 }
+
 #else

 /* the small page is actually unused in this build; add dummy helpers
  * to please the compiler and avoid later preprocessor's conditionals
  */
-#define NAPI_HAS_SMALL_PAGE_FRAG     0
 #define NAPI_SMALL_PAGE_PFMEMALLOC(nc)       false

 struct page_frag_1k {

@@ -543,6 +542,52 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 }
 EXPORT_SYMBOL(__alloc_skb);

+#if NAPI_HAS_SMALL_PAGE_FRAG
+/* optimized skb fast-clone allocation variant, using the small
+ * page frag cache
+ */
+struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t gfp_mask)
+{
+     struct sk_buff_fclones *fclones;
+     struct napi_alloc_cache *nc;
+     struct sk_buff *skb;
+     bool pfmemalloc;
+     void *data;
+
+     /* Get the HEAD */
+     skb = kmem_cache_alloc_node(skbuff_fclone_cache, gfp_mask & ~GFP_DMA, NUMA_NO_NODE);
+     if (unlikely(!skb))
+             return NULL;
+     prefetchw(skb);
+
+     /* We can access the napi cache only in BH context */
+     nc = this_cpu_ptr(&napi_alloc_cache);
+     local_bh_disable();

For the record, the above is wrong, this_cpu_ptr must be after the
local_bh_disable() call.

Can you send me (privately maybe) an updated patch ?
Thanks.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help