Re: WARNING in sk_stream_kill_queues (5)

From: Eric Dumazet <edumazet@google.com>
Date: 2020-12-14 10:48:41
Also in: lkml

On Mon, Dec 14, 2020 at 11:09 AM Marco Elver [off-list ref] wrote:

On Thu, 10 Dec 2020 at 20:01, Marco Elver [off-list ref] wrote:

quoted

On Thu, 10 Dec 2020 at 18:14, Eric Dumazet [off-list ref] wrote:

quoted

On Thu, Dec 10, 2020 at 5:51 PM Marco Elver [off-list ref] wrote:

[...]

quoted

So I started putting gdb to work, and whenever I see an allocation
exactly like the above that goes through tso_fragment() a warning
immediately follows.

Long story short, I somehow synthesized this patch that appears to fix
things, but I can't explain why exactly:

| --- a/net/core/skbuff.c
| +++ b/net/core/skbuff.c
| @@ -1679,13 +1679,6 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
|
|       skb_metadata_clear(skb);
|
| -     /* It is not generally safe to change skb->truesize.
| -      * For the moment, we really care of rx path, or
| -      * when skb is orphaned (not attached to a socket).
| -      */
| -     if (!skb->sk || skb->destructor == sock_edemux)
| -             skb->truesize += size - osize;
| -
|       return 0;
|
|  nofrags:

Now, here are the breadcrumbs I followed:


1.      Breakpoint on kfence_ksize() -- first allocation that matches the above:

        | #0  __kfence_ksize (s=18446612700164612096) at mm/kfence/core.c:726
        | #1  0xffffffff816fbf30 in kfence_ksize (addr=0xffff888436856000) at mm/kfence/core.c:737
        | #2  0xffffffff816217cf in ksize (objp=0xffff888436856000) at mm/slab_common.c:1178
        | #3  0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
        | #4  0xffffffff84d0ba73 in alloc_skb_fclone (priority=<optimized out>, size=<optimized out>) at ./include/linux/skbuff.h:1144
        | #5  sk_stream_alloc_skb (sk=0xffff8881176cc000, size=0, gfp=2592, force_schedule=232) at net/ipv4/tcp.c:888
        | #6  0xffffffff84d41c36 in tso_fragment (gfp=<optimized out>, mss_now=<optimized out>, len=<optimized out>,
        |     skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_output.c:2124
        | #7  tcp_write_xmit (sk=0xffff8881176cc000, mss_now=21950, nonagle=3096, push_one=-1996874776, gfp=0)
        |     at net/ipv4/tcp_output.c:2674
        | #8  0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
        |     at ./include/net/sock.h:918
        | #9  0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
        | #10 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
        | #11 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
        | #12 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f52ea0) at net/ipv4/tcp_ipv4.c:1668
        | [...]

        Set watchpoint on skb->truesize:

        | (gdb) frame 3
        | #3  0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
        | 217             size = SKB_WITH_OVERHEAD(ksize(data));
        | (gdb) p &skb->truesize
        | $5 = (unsigned int *) 0xffff888117f55f90
        | (gdb) awatch *0xffff888117f55f90
        | Hardware access (read/write) watchpoint 6: *0xffff888117f55f90

2.      Some time later, we see that the skb with kfence-allocated data
        is cloned:

        | Thread 7 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
        |
        | Value = 1570
        | 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
        | 1002            C(truesize);
        | (gdb) bt
        | #0  0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
        | #1  0xffffffff8488bfb9 in skb_clone (skb=0xffff888117f55ec0, gfp_mask=2592) at net/core/skbuff.c:1454
        | #2  0xffffffff84d3cd1c in __tcp_transmit_skb (sk=0xffff8881176cc000, skb=0xffff888117f55ec0, clone_it=0, gfp_mask=2592,
        |     rcv_nxt=0) at net/ipv4/tcp_output.c:1267
        | #3  0xffffffff84d4125b in tcp_transmit_skb (gfp_mask=<optimized out>, clone_it=<optimized out>, skb=<optimized out>,
        |     sk=<optimized out>) at ./include/linux/tcp.h:439
        | #4  tcp_write_xmit (sk=0xffff8881176cc000, mss_now=392485600, nonagle=1326, push_one=-1996875104, gfp=0)
        |     at net/ipv4/tcp_output.c:2688
        | #5  0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
        |     at ./include/net/sock.h:918
        | #6  0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
        | #7  tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
        | #8  tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
        | #9  0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f57820) at net/ipv4/tcp_ipv4.c:1668
        | #10 0xffffffff8487bf67 in sk_backlog_rcv (skb=<optimized out>, sk=<optimized out>) at ./include/net/sock.h:1010
        [...]


3.      The original skb (that was cloned) has its truesize adjusted
        after a pskb_expand_head():

        | Thread 2 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
        |
        | Old value = 1570
        | New value = 1954

        ^^ the difference between the old and the new value is exactly
        384, which is also the final underflow of the sk_wmem_queued
        that triggers the warning. Presumably if the original allocation
        had been through kmalloc-1k and not KFENCE, the difference here
        would have been 0, since ksize() of the original allocation in
        step (1) would have been 1024, and not 640 (difference of 384).

        | 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
        |     at net/core/skbuff.c:1687
        | 1687                    skb->truesize += size - osize;
        | (gdb) bt
        | #0  0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
        |     at net/core/skbuff.c:1687
        | #1  0xffffffff8488de01 in skb_prepare_for_shift (skb=<optimized out>) at ./arch/x86/include/asm/atomic.h:29
        | #2  skb_prepare_for_shift (skb=0xffff888117f55ec0) at net/core/skbuff.c:3276
        | #3  0xffffffff848936b1 in skb_shift (tgt=0xffff888117f549c0, skb=0xffff888117f55ec0, shiftlen=674) at net/core/skbuff.c:3351
        | #4  0xffffffff84d264de in tcp_skb_shift (shiftlen=<optimized out>, pcount=<optimized out>, from=<optimized out>,
        |     to=<optimized out>) at net/ipv4/tcp_input.c:1497
        | #5  tcp_shift_skb_data (dup_sack=<optimized out>, end_seq=<optimized out>, start_seq=<optimized out>, state=<optimized out>,
        |     skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_input.c:1605
        | #6  tcp_sacktag_walk (skb=0xffff888117f55ec0, sk=0xffff8881176cc000, next_dup=0x894,
        |     state=0xffffffff88fa1aa0 <watchpoints+192>, start_seq=0, end_seq=401956752, dup_sack_in=false)
        |     at net/ipv4/tcp_input.c:1670
        | #7  0xffffffff84d276de in tcp_sacktag_write_queue (sk=0xffff888117f55f90, ack_skb=0x1888117f55f90, prior_snd_una=2196,
        |     state=0xffffffff88fa1aa0 <watchpoints+192>) at net/ipv4/tcp_input.c:1931
        | #8  0xffffffff84d2ca1d in tcp_ack (sk=0xffff8881176cc000, skb=0x1888117f55f90, flag=16643) at net/ipv4/tcp_input.c:3758
        | #9  0xffffffff84d32387 in tcp_rcv_established (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_input.c:5858
        | #10 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_ipv4.c:1668
        [...]


Any of this make sense?

Very nice debugging !

I guess we could fix this in skb_prepare_for_shift(), eventually
caring for the truesize manipulation
(or reverting the change done in pskb_expand_head(), since only kfence
is having this issue.

Phew, good to hear I finally got lucky. :-)

Either option is fine, as long as it avoids this problem in future.
Hopefully it can be fixed for 5.11.

quoted

(All TCP skbs in output path have the same allocation size for skb->head)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e578544b2cc7110ec2f6bcf4c29d93e4b4b1ad14..798b51eeeaa4fbed65d41d9eab207dbbf438dab3

--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c

@@ -3270,7 +3270,14 @@ EXPORT_SYMBOL(skb_split);
  */
 static int skb_prepare_for_shift(struct sk_buff *skb)
 {
-       return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
+       unsigned int ret = 0, save;
+
+       if (skb_cloned(skb)) {
+               save = skb->truesize;
+               ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
+               skb->truesize = save;
+       }
+       return ret;
 }

FWIW,

    Tested-by: Marco Elver [off-list ref]

Has this patch, or similar, already been sent?


Not yet, we have few weeks left before 5.11 is released ;)

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help