Re: [PATCH 2/2] net: skb: isolate skb data area allocations into a separate bucket
From: Eric Dumazet <edumazet@google.com>
Date: 2026-06-05 07:25:12
Also in:
linux-hardening, linux-mm, lkml
On Thu, Jun 4, 2026 at 10:45 PM Harry Yoo [off-list ref] wrote:
> On 6/5/26 4:12 AM, Pedro Falcato wrote:
> > On Thu, Jun 04, 2026 at 02:30:34PM +0900, Harry Yoo wrote:
> > > On 6/3/26 3:31 AM, Pedro Falcato wrote:
> > > > SKB data area allocations (as done from alloc_skb()) use kmalloc().
> > > > These allocations can be variably sized and their contents can be
> > > > more or less controlled from userspace, which makes them useful for
> > > > attackers that want to overwrite a use-after-free'd object from the
> > > > same kmalloc slab (which often just requires the sizes to roughly
> > > > match into the same kmalloc bucket). [0] is an easy example of an
> > > > exploit that uses netlink skb allocation to target another
> > > > similarly-sized accidentally freed object.
> > > >
> > > > While other mitigations like CONFIG_RANDOM_KMALLOC_CACHES exist,
> > > > these are probabilistic. Use the existing kmem buckets API to
> > > > further isolate these allocations in a guaranteed fashion, when
> > > > CONFIG_SLAB_BUCKETS=y.
> > > >
> > > > Link: https://github.com/google/security-research/blob/master/pocs/linux/kernelctf/CVE-2023-4207_lts_cos_mitigation_2/docs/exploit.md [0]
> > > > Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> > > > ---
> > > >  net/core/skbuff.c | 5 ++++-
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > index 44a7f8401468..1f6c6b531ece 100644
> > > > --- a/net/core/skbuff.c
> > > > +++ b/net/core/skbuff.c
> > > > @@ -594,6 +594,8 @@ static void *kmalloc_pfmemalloc(size_t obj_size, gfp_t flags, int node)
> > > >  	return kmalloc_node_track_caller(obj_size, flags, node);
> > > >  }
> > > >
> > > > +static kmem_buckets *skb_data_buckets __ro_after_init;
> > > > +
> > > >  /*
> > > >   * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells
> > > >   * the caller if emergency pfmemalloc reserves are being used. If it is and
> > > > @@ -632,7 +634,7 @@ static void *kmalloc_reserve(unsigned int *size, gfp_t flags, int node,
> > > >  	 * Try a regular allocation, when that fails and we're not entitled
> > > >  	 * to the reserves, fail.
> > > >  	 */
> > > > -	obj = kmalloc_node_track_caller(obj_size,
> > > > +	obj = kmem_buckets_alloc_node_track_caller(skb_data_buckets, obj_size,
> > > >  					flags | __GFP_NOMEMALLOC | __GFP_NOWARN,
> > > >  					node);
> > > >  	if (likely(obj))
> > >
> > > What about kmalloc_pfmemalloc()?
> >
> > Good point, that looks free as well.
> >
> > Sidenote: isolating kmem_cache_alloc for possibly-aliasing caches could
> > also be useful. skb allocation has net_hotdata.skb_small_head_cache. It
> > doesn't merge with anything for $raisins (odd size, plus I don't think
> > usercopy caches are getting merged?) but it feels too... accidental?
>
> Right, we never merge caches with useroffset/usersize.
>
> Hmm...
>
> /* SKB_SMALL_HEAD_CACHE_SIZE is the size used for the skbuff_small_head
>  * kmem_cache. The non-power-of-2 padding is kept for historical reasons and
>  * to avoid potential collisions with generic kmalloc bucket sizes.
>  */
> #define SKB_SMALL_HEAD_CACHE_SIZE \
> 	(is_power_of_2(SKB_SMALL_HEAD_SIZE) ? \
> 	 (SKB_SMALL_HEAD_SIZE + L1_CACHE_BYTES) : \
> 	 SKB_SMALL_HEAD_SIZE)
>
> What are "historical reasons" other than avoiding collisions with kmalloc
> caches?

git log/blame might help :)

commit 0f42e3f4fe2a58394e37241d02d9ca6ab7b7d516
Author: Jiayuan Chen [off-list ref]
Date:   Fri Apr 3 09:45:12 2026 +0800

    net: skb: fix cross-cache free of KFENCE-allocated skb head

Note that MAX_SKB_FRAGS can be tuned.

config MAX_SKB_FRAGS
	int "Maximum number of fragments per skb_shared_info"
	range 17 45
	default 17
	help
	  Having more fragments per skb_shared_info can help GRO efficiency.
	  This helps BIG TCP workloads, but might expose bugs in some
	  legacy drivers.
	  This also increases memory overhead of small packets,
	  and in drivers using build_skb().
	  If unsure, say 17.
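
To make the interaction between MAX_SKB_FRAGS and the SKB_SMALL_HEAD_CACHE_SIZE macro quoted earlier concrete, here is a small userspace sketch of the same selection logic. It is not kernel code. Every constant in it (64-byte cache lines, a 48-byte fixed part of skb_shared_info, 16 bytes per fragment, 384 bytes of header room) is an assumption, picked only so that the default of 17 fragments reproduces the 704-byte object size reported in this thread. The helper names are hypothetical as well.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed values for a typical 64-bit build; NOT read from a kernel config. */
#define L1_CACHE_BYTES   64u   /* assumed cache line size */
#define SHINFO_FIXED     48u   /* assumed skb_shared_info size without frags[] */
#define FRAG_SIZE        16u   /* assumed size of one fragment descriptor */
#define SMALL_HEAD_DATA  384u  /* assumed header room, already cache aligned */

static inline bool is_power_of_2(size_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Round x up to a multiple of the (assumed) cache line size. */
static inline size_t data_align(size_t x)
{
	return (x + L1_CACHE_BYTES - 1) & ~(size_t)(L1_CACHE_BYTES - 1);
}

/* Hypothetical stand-in for SKB_SMALL_HEAD_SIZE: data area plus shared info. */
static inline size_t small_head_size(unsigned int max_skb_frags)
{
	size_t shinfo = SHINFO_FIXED + (size_t)max_skb_frags * FRAG_SIZE;

	return data_align(SMALL_HEAD_DATA) + data_align(shinfo);
}

/* Same rule as the quoted macro: pad by one cache line if the size would
 * otherwise be a power of two, so it cannot match a generic kmalloc size. */
static inline size_t small_head_cache_size(unsigned int max_skb_frags)
{
	size_t size = small_head_size(max_skb_frags);

	return is_power_of_2(size) ? size + L1_CACHE_BYTES : size;
}
```

Under these assumptions, small_head_cache_size(17) gives 704, while a setting of 37 fragments would land exactly on 1024 and be padded to 1088. Real sizes depend on the architecture and config, so treat the numbers as illustrative only.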

> > Maybe passing something like SLAB_NO_MERGE and making the size
> > standard-looking would be nice. I have a size of 704 bytes per object,
> > and this probably causes some weird wastage for each slab.
>
> Yes, unless the "historical reasons" make that infeasible.
>
> And I wonder if net/core/skbuff.c intends to always prevent merging, or
> only with hardening configs like SLAB_BUCKETS.

We no longer care whether this cache gets merged or not.
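
For the slab wastage remark about 704-byte objects, a rough way to estimate the tail waste per slab is sketched below. This is a userspace approximation with hypothetical helper names. It assumes 4096-byte pages and ignores any per-object metadata, alignment, or debugging overhead the allocator may add, so the real layout can differ.

```c
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u /* assumed page size */

/* Number of objects that fit in a slab of 2^order pages (no metadata). */
static inline size_t objs_per_slab(size_t obj_size, unsigned int order)
{
	return ((size_t)PAGE_SIZE_BYTES << order) / obj_size;
}

/* Bytes left unused at the end of a slab of 2^order pages. */
static inline size_t slab_waste(size_t obj_size, unsigned int order)
{
	return ((size_t)PAGE_SIZE_BYTES << order) % obj_size;
}
```

With these assumptions, a single-page slab holds 5 objects of 704 bytes and leaves 576 bytes unused, and an order-3 slab holds 46 objects and leaves 384 bytes unused. The same helpers can be used to compare other candidate object sizes.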