Re: WARNING in sk_stream_kill_queues (5)
From: Marco Elver <elver@google.com>
Date: 2020-12-10 19:02:58
Also in:
lkml
On Thu, 10 Dec 2020 at 18:14, Eric Dumazet [off-list ref] wrote:
On Thu, Dec 10, 2020 at 5:51 PM Marco Elver [off-list ref] wrote:
[...]
quoted
So I started putting gdb to work, and whenever I see an allocation exactly like the above that goes through tso_fragment() a warning immediately follows. Long story short, I somehow synthesized this patch that appears to fix things, but I can't explain why exactly: | --- a/net/core/skbuff.c | +++ b/net/core/skbuff.c | @@ -1679,13 +1679,6 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, | | skb_metadata_clear(skb); | | - /* It is not generally safe to change skb->truesize. | - * For the moment, we really care of rx path, or | - * when skb is orphaned (not attached to a socket). | - */ | - if (!skb->sk || skb->destructor == sock_edemux) | - skb->truesize += size - osize; | - | return 0; | | nofrags: Now, here are the breadcrumbs I followed: 1. Breakpoint on kfence_ksize() -- first allocation that matches the above: | #0 __kfence_ksize (s=18446612700164612096) at mm/kfence/core.c:726 | #1 0xffffffff816fbf30 in kfence_ksize (addr=0xffff888436856000) at mm/kfence/core.c:737 | #2 0xffffffff816217cf in ksize (objp=0xffff888436856000) at mm/slab_common.c:1178 | #3 0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217 | #4 0xffffffff84d0ba73 in alloc_skb_fclone (priority=<optimized out>, size=<optimized out>) at ./include/linux/skbuff.h:1144 | #5 sk_stream_alloc_skb (sk=0xffff8881176cc000, size=0, gfp=2592, force_schedule=232) at net/ipv4/tcp.c:888 | #6 0xffffffff84d41c36 in tso_fragment (gfp=<optimized out>, mss_now=<optimized out>, len=<optimized out>, | skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_output.c:2124 | #7 tcp_write_xmit (sk=0xffff8881176cc000, mss_now=21950, nonagle=3096, push_one=-1996874776, gfp=0) | at net/ipv4/tcp_output.c:2674 | #8 0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0) | at ./include/net/sock.h:918 | #9 0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864 | #10 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374 | #11 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869 | #12 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f52ea0) at net/ipv4/tcp_ipv4.c:1668 | [...] Set watchpoint on skb->truesize: | (gdb) frame 3 | #3 0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217 | 217 size = SKB_WITH_OVERHEAD(ksize(data)); | (gdb) p &skb->truesize | $5 = (unsigned int *) 0xffff888117f55f90 | (gdb) awatch *0xffff888117f55f90 | Hardware access (read/write) watchpoint 6: *0xffff888117f55f90 2. Some time later, we see that the skb with kfence-allocated data is cloned: | Thread 7 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90 | | Value = 1570 | 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002 | 1002 C(truesize); | (gdb) bt | #0 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002 | #1 0xffffffff8488bfb9 in skb_clone (skb=0xffff888117f55ec0, gfp_mask=2592) at net/core/skbuff.c:1454 | #2 0xffffffff84d3cd1c in __tcp_transmit_skb (sk=0xffff8881176cc000, skb=0xffff888117f55ec0, clone_it=0, gfp_mask=2592, | rcv_nxt=0) at net/ipv4/tcp_output.c:1267 | #3 0xffffffff84d4125b in tcp_transmit_skb (gfp_mask=<optimized out>, clone_it=<optimized out>, skb=<optimized out>, | sk=<optimized out>) at ./include/linux/tcp.h:439 | #4 tcp_write_xmit (sk=0xffff8881176cc000, mss_now=392485600, nonagle=1326, push_one=-1996875104, gfp=0) | at net/ipv4/tcp_output.c:2688 | #5 0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0) | at ./include/net/sock.h:918 | #6 0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864 | #7 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374 | #8 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869 | #9 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f57820) at net/ipv4/tcp_ipv4.c:1668 | #10 0xffffffff8487bf67 in sk_backlog_rcv (skb=<optimized out>, sk=<optimized out>) at ./include/net/sock.h:1010 [...] 3. The original skb (that was cloned) has its truesize adjusted after a pskb_expand_head(): | Thread 2 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90 | | Old value = 1570 | New value = 1954 ^^ the difference between the old and the new value is exactly 384, which is also the final underflow of the sk_wmem_queued that triggers the warning. Presumably if the original allocation had been through kmalloc-1k and not KFENCE, the difference here would have been 0, since ksize() of the original allocation in step (1) would have been 1024, and not 640 (difference of 384). | 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192) | at net/core/skbuff.c:1687 | 1687 skb->truesize += size - osize; | (gdb) bt | #0 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192) | at net/core/skbuff.c:1687 | #1 0xffffffff8488de01 in skb_prepare_for_shift (skb=<optimized out>) at ./arch/x86/include/asm/atomic.h:29 | #2 skb_prepare_for_shift (skb=0xffff888117f55ec0) at net/core/skbuff.c:3276 | #3 0xffffffff848936b1 in skb_shift (tgt=0xffff888117f549c0, skb=0xffff888117f55ec0, shiftlen=674) at net/core/skbuff.c:3351 | #4 0xffffffff84d264de in tcp_skb_shift (shiftlen=<optimized out>, pcount=<optimized out>, from=<optimized out>, | to=<optimized out>) at net/ipv4/tcp_input.c:1497 | #5 tcp_shift_skb_data (dup_sack=<optimized out>, end_seq=<optimized out>, start_seq=<optimized out>, state=<optimized out>, | skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_input.c:1605 | #6 tcp_sacktag_walk (skb=0xffff888117f55ec0, sk=0xffff8881176cc000, next_dup=0x894, | state=0xffffffff88fa1aa0 <watchpoints+192>, start_seq=0, end_seq=401956752, dup_sack_in=false) | at net/ipv4/tcp_input.c:1670 | #7 0xffffffff84d276de in tcp_sacktag_write_queue (sk=0xffff888117f55f90, ack_skb=0x1888117f55f90, prior_snd_una=2196, | state=0xffffffff88fa1aa0 <watchpoints+192>) at net/ipv4/tcp_input.c:1931 | #8 0xffffffff84d2ca1d in tcp_ack (sk=0xffff8881176cc000, skb=0x1888117f55f90, flag=16643) at net/ipv4/tcp_input.c:3758 | #9 0xffffffff84d32387 in tcp_rcv_established (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_input.c:5858 | #10 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_ipv4.c:1668 [...] Any of this make sense?Very nice debugging ! I guess we could fix this in skb_prepare_for_shift(), eventually caring for the truesize manipulation (or reverting the change done in pskb_expand_head(), since only kfence is having this issue.
Phew, good to hear I finally got lucky. :-) Either option is fine, as long as it avoids this problem in future. Hopefully it can be fixed for 5.11.
quoted hunk ↗ jump to hunk
(All TCP skbs in output path have the same allocation size for skb->head)diff --git a/net/core/skbuff.c b/net/core/skbuff.c index e578544b2cc7110ec2f6bcf4c29d93e4b4b1ad14..798b51eeeaa4fbed65d41d9eab207dbbf438dab3100644--- a/net/core/skbuff.c +++ b/net/core/skbuff.c@@ -3270,7 +3270,14 @@ EXPORT_SYMBOL(skb_split); */ static int skb_prepare_for_shift(struct sk_buff *skb) { - return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC); + unsigned int ret = 0, save; + + if (skb_cloned(skb)) { + save = skb->truesize; + ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC); + skb->truesize = save; + } + return ret; }
FWIW,
Tested-by: Marco Elver [off-list ref]
Thanks,
-- Marco