Thread (40 messages) 40 messages, 6 authors, 3d ago

Re: [PATCH net] net/sched: fix pedit partial COW leading to page cache corruption

From: Han Guidong <hidden>
Date: 2026-05-22 07:50:13

On Thu, May 21, 2026 at 6:15 PM Jamal Hadi Salim [off-list ref] wrote:
On Thu, May 21, 2026 at 5:53 AM Jamal Hadi Salim [off-list ref] wrote:
quoted
::

On Wed, May 20, 2026 at 4:00 PM Jamal Hadi Salim [off-list ref] wrote:
quoted
On Wed, May 20, 2026 at 5:23 AM Jamal Hadi Salim [off-list ref] wrote:
quoted
On Mon, May 18, 2026 at 11:42 PM Rajat Gupta
[off-list ref] wrote:
quoted
tcf_pedit_act() computes the COW range for skb_ensure_writable()
once before the key loop using tcfp_off_max_hint, but the hint does
not account for the runtime header offset added by typed keys. This
can leave part of the write region un-COW'd.

Fix by moving skb_ensure_writable() inside the per-key loop where
the actual write offset is known, and add overflow checking on the
offset arithmetic. For negative offsets (e.g. Ethernet header edits
at ingress), use skb_cow() to COW the headroom instead. Guard
offset_valid() against INT_MIN, where negation is undefined.

Additionally, linearize skbs with shared frags upfront to prevent
silent data corruption when pedit operates on zero-copy pages
(e.g. from sendfile).
Sashiko makes the claim that:
"
....
.....
If the payload part extending past offset 0 resides in a shared
fragment for a cloned skb, wouldn't skb_store_bits() write the
positive offset  portion directly into the shared page, causing a data
corruption regression?
"

It missed the earlier code (before the quoted code is hit) which is
very well documented!
/*
 * If the skb has shared frags the user is likely using zero-copy
 * (e.g. sendfile).  Those page frags may point to page-cache pages;
 * writing into them would silently corrupt the page cache.
 * Linearize so pedit operates on a private copy.
 * TL;DR: if you want zero-copy, don't use pedit.
 */
 if (skb_has_shared_frag(skb)) {
        if (__skb_linearize(skb))
             goto bad;
 }

TLDR:
If the  skb  has shared frags, it is linearized upfront using
__skb_linearize() which means any risk of modifying a shared fragment
is non-existent.
It seems there's a second sashiko ive been made aware of - these
things are just multiplying ;->
https://netdev-ai.bots.linux.dev/sashiko/#/patchset/20260519033950.2037-1-rajat.gupta%40oss.qualcomm.com

Both points it makes are valid. Dont have time right now.
And infact looking at the second point sashiko 2 made I realize I was
only paying attention to externally shared fragments which the poc
used.
So probably the sashiko1 had a point as well. Anyways i dont have time
- will respond by am tommorow.
Sashiko2, claim 1 (rephrased for brevity):
"With the pre-loop skb_ensure_writable() removed, tcfp_off_max_hint
and related use is no longer needed.

I was going to say "separate patch" but since the cause is this patch
that removed the use, at the expense of a much bigger patch we should
remove it (delete it from include/net/tc_act/tc_pedit.h::struct
tcf_pedit_parms and net/sched/act_pedit.c - as described in sashiko2
claim.

Sashiko2, claim 2 (summarized):
"..skb_store_bits() will memcpy() the tail bytes of that 4-byte write
onto frag pages that are still referenced by the original clone, so a
clone held by a tap, AF_PACKET listener, or mirroring path would see
the modified bytes on its shared pages."

This is the same claim as sashiko1 but sashiko2 gave a much more
convincing description ;->
skb_has_shared_frag() is only true if the frags are flagged as
SKBFL_SHARED_FRAG (which is what the repro did); however, if we get
frags from eg a driver on ingress and that skb gets cloned with frags
we won't catch it.
One approach is to do an if (skb_has_any_shared_frags(skb)) and then
do a skb_linearize_cow() but that sounds like overkill.
Yeah, this would be overkill - imagine running tcpdump 100% will be cloned
quoted
Another which will make the patch even uglier (but less expensive) is
to add an extra check insde the patch's "if (write_offset < 0)"
to do:  if (write_offset + (int)sizeof(hdata) > 0) { skb_ensure_writable()}
To be precise, something like attached (untested, uncompiled)
Tested this and it looks correct.

I used a small kernel module to inject a cloned skb through the real
tc ingress path with a crafted layout: crafted skb_mac_offset(),
tiny/no linear head after skb->data, and the rest in a clone-shared
frag tail. With Rajat's patch alone, the peer clone could still be
changed. With this added hunk, the peer clone stayed unchanged.

Thanks.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help