Re: [PATCH net] net/sched: fix pedit partial COW leading to page cache corruption
From: Han Guidong <hidden>
Date: 2026-05-22 07:50:13
On Thu, May 21, 2026 at 6:15 PM Jamal Hadi Salim [off-list ref] wrote:
On Thu, May 21, 2026 at 5:53 AM Jamal Hadi Salim [off-list ref] wrote:quoted
:: On Wed, May 20, 2026 at 4:00 PM Jamal Hadi Salim [off-list ref] wrote:quoted
On Wed, May 20, 2026 at 5:23 AM Jamal Hadi Salim [off-list ref] wrote:quoted
On Mon, May 18, 2026 at 11:42 PM Rajat Gupta [off-list ref] wrote:quoted
tcf_pedit_act() computes the COW range for skb_ensure_writable() once before the key loop using tcfp_off_max_hint, but the hint does not account for the runtime header offset added by typed keys. This can leave part of the write region un-COW'd. Fix by moving skb_ensure_writable() inside the per-key loop where the actual write offset is known, and add overflow checking on the offset arithmetic. For negative offsets (e.g. Ethernet header edits at ingress), use skb_cow() to COW the headroom instead. Guard offset_valid() against INT_MIN, where negation is undefined. Additionally, linearize skbs with shared frags upfront to prevent silent data corruption when pedit operates on zero-copy pages (e.g. from sendfile).Sashiko makes the claim that: " .... ..... If the payload part extending past offset 0 resides in a shared fragment for a cloned skb, wouldn't skb_store_bits() write the positive offset portion directly into the shared page, causing a data corruption regression? " It missed the earlier code (before the quoted code is hit) which is very well documented! /* * If the skb has shared frags the user is likely using zero-copy * (e.g. sendfile). Those page frags may point to page-cache pages; * writing into them would silently corrupt the page cache. * Linearize so pedit operates on a private copy. * TL;DR: if you want zero-copy, don't use pedit. */ if (skb_has_shared_frag(skb)) { if (__skb_linearize(skb)) goto bad; } TLDR: If the skb has shared frags, it is linearized upfront using __skb_linearize() which means any risk of modifying a shared fragment is non-existent.It seems there's a second sashiko ive been made aware of - these things are just multiplying ;-> https://netdev-ai.bots.linux.dev/sashiko/#/patchset/20260519033950.2037-1-rajat.gupta%40oss.qualcomm.com Both points it makes are valid. Dont have time right now. And infact looking at the second point sashiko 2 made I realize I was only paying attention to externally shared fragments which the poc used. So probably the sashiko1 had a point as well. Anyways i dont have time - will respond by am tommorow.Sashiko2, claim 1 (rephrased for brevity): "With the pre-loop skb_ensure_writable() removed, tcfp_off_max_hint and related use is no longer needed. I was going to say "separate patch" but since the cause is this patch that removed the use, at the expense of a much bigger patch we should remove it (delete it from include/net/tc_act/tc_pedit.h::struct tcf_pedit_parms and net/sched/act_pedit.c - as described in sashiko2 claim. Sashiko2, claim 2 (summarized): "..skb_store_bits() will memcpy() the tail bytes of that 4-byte write onto frag pages that are still referenced by the original clone, so a clone held by a tap, AF_PACKET listener, or mirroring path would see the modified bytes on its shared pages." This is the same claim as sashiko1 but sashiko2 gave a much more convincing description ;-> skb_has_shared_frag() is only true if the frags are flagged as SKBFL_SHARED_FRAG (which is what the repro did); however, if we get frags from eg a driver on ingress and that skb gets cloned with frags we won't catch it. One approach is to do an if (skb_has_any_shared_frags(skb)) and then do a skb_linearize_cow() but that sounds like overkill.Yeah, this would be overkill - imagine running tcpdump 100% will be clonedquoted
Another which will make the patch even uglier (but less expensive) is to add an extra check insde the patch's "if (write_offset < 0)" to do: if (write_offset + (int)sizeof(hdata) > 0) { skb_ensure_writable()}To be precise, something like attached (untested, uncompiled)
Tested this and it looks correct. I used a small kernel module to inject a cloned skb through the real tc ingress path with a crafted layout: crafted skb_mac_offset(), tiny/no linear head after skb->data, and the rest in a clone-shared frag tail. With Rajat's patch alone, the peer clone could still be changed. With this added hunk, the peer clone stayed unchanged. Thanks.