Thread (43 messages) 43 messages, 6 authors, 2025-03-14

RE: [PATCH] tcp: check socket state before calling WARN_ON

From: Dujeong.lee <hidden>
Date: 2025-01-03 04:16:06
Also in: lkml

On Thu, Jan 2, 2025 at 5:17 PM Eric Dumazet [off-list ref]
wrote:
To: Dujeong.lee <redacted>
Cc: Youngmin Nam <redacted>; Jakub Kicinski
[off-list ref]; Neal Cardwell [off-list ref];
davem@davemloft.net; dsahern@kernel.org; pabeni@redhat.com;
horms@kernel.org; guo88.liu@samsung.com; yiwang.cai@samsung.com;
netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
joonki.min@samsung.com; hajun.sung@samsung.com; d7271.choe@samsung.com;
sw.ju@samsung.com; iamyunsu.kim@samsung.com; kw0619.kim@samsung.com;
hsl.lim@samsung.com; hanbum22.lee@samsung.com; chaemoo.lim@samsung.com;
seungjin1.yu@samsung.com
Subject: Re: [PATCH] tcp: check socket state before calling WARN_ON

On Thu, Jan 2, 2025 at 1:22 AM Dujeong.lee [off-list ref] wrote:
quoted
On Mon, Dec 30, 2024 at 6:34 PM Eric Dumazet [off-list ref]
wrote:
quoted
On Mon, Dec 30, 2024 at 1:24 AM Dujeong.lee
[off-list ref]
wrote:
quoted
On Wed, Dec 18, 2024 7:28 PM Eric Dumazet [off-list ref]
wrote:
quoted
quoted
quoted
quoted
On Wed, Dec 18, 2024 at 11:18 AM Dujeong.lee
[off-list ref]
wrote:
quoted
Tue, December 10, 2024 at 4:10 PM Dujeong Lee wrote:
quoted
On Tue, Dec 10, 2024 at 12:39 PM Dujeong Lee wrote:
quoted
On Mon, Dec 9, 2024 at 7:21 PM Eric Dumazet
[off-list ref]
wrote:
quoted
quoted
quoted
quoted
On Mon, Dec 9, 2024 at 11:16 AM Dujeong.lee
[off-list ref]
wrote:
quoted
quoted
Thanks for all the details on packetdrill and we are
also exploring
USENIX 2013 material.
quoted
I have one question. The issue happens when DUT
receives TCP ack with
large delay from network, e.g., 28seconds since last Tx.
Is packetdrill able to emulate this network delay (or
congestion) in script
level?
quoted
Yes, the packetdrill scripts can wait an arbitrary
amount of time between each event

+28 <next event>

28 seconds seems okay. If the issue was triggered after
4 days, packetdrill would be impractical ;)
Hi all,

We secured new ramdump.
Please find the below values with TCP header details.

tp->packets_out = 0
tp->sacked_out = 0
tp->lost_out = 1
tp->retrans_out = 1
tp->rx_opt.sack_ok = 5 (tcp_is_sack(tp)) mss_cache = 1400
((struct inet_connection_sock *)sk)->icsk_ca_state = 4
((struct inet_connection_sock *)sk)->icsk_pmtu_cookie =
1500

Hex from ip header:
45 00 00 40 75 40 00 00 39 06 91 13 8E FB 2A CA C0 A8 00
F7 01 BB
A7 CC 51
F8 63 CC 52 59 6D A6 B0 10 04 04 77 76 00 00 01 01 08 0A
89 72
C8
42
62 F5
F5 D1 01 01 05 0A 52 59 6D A5 52 59 6D A6

Transmission Control Protocol Source Port: 443 Destination
Port: 42956 TCP Segment Len: 0 Sequence Number (raw):
1375232972 Acknowledgment number (raw):
1381592486
1011 .... = Header Length: 44 bytes (11)
Flags: 0x010 (ACK)
Window: 1028
Calculated window size: 1028 Urgent Pointer: 0
Options: (24 bytes), No-Operation (NOP), No-Operation
(NOP), Timestamps, No-Operation (NOP), No-Operation (NOP),
SACK

If anyone wants to check other values, please feel free to
ask me

Thanks,
Dujeong.
I have a question.

From the latest ramdump I could see that
1) tcp_sk(sk)->packets_out = 0
2) inet_csk(sk)->icsk_backoff = 0
3) sk_write_queue.len = 0
which suggests that tcp_write_queue_purge was indeed called.

Noting that:
1) tcp_write_queue_purge reset packets_out to 0 and
2) in_flight should be non-negative where in_flight =
packets_out - left_out + retrans_out, what if we reset
left_out and retrans_out as well in tcp_write_queue_purge?

Do we see any potential issue with this?
Hello Eric and Neal.

It is a gentle reminder.
Could you please review the latest ramdump values and and
question?
quoted
quoted
quoted
quoted
It will have to wait next year, Neal is OOO.

I asked a packetdrill reproducer, I can not spend days working
on an issue that does not trigger in our production hosts.

Something could be wrong in your trees, or perhaps some eBPF
program changing the state of the socket...
Hi Eric

I tried to make packetdrill script for local mode, which injects
delayed
acks for data and FIN after close.
quoted
// Test basic connection teardown where local process closes first:
// the local process calls close() first, so we send a FIN.
// Then we receive an delayed ACK for data and FIN.
// Then we receive a FIN and ACK it.

`../common/defaults.sh`
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
//
quoted
quoted
Create socket
quoted
   +.01...0.011 connect(3, ..., ...) = 0
//
quoted
quoted
Initiate connection
quoted
   +0 >  S 0:0(0) <...>                                                       //
Send
quoted
quoted
SYN
quoted
   +0 < S. 0:0(0) ack 1 win 32768 <mss 1000,nop,wscale
6,nop,nop,sackOK>
// Receive SYN-ACK with TCP options
quoted
   +0 >  . 1:1(0) ack 1                                                       //
Send
quoted
quoted
ACK
quoted
   +0 write(3, ..., 1000) = 1000                                              //
Write 1000 bytes
quoted
   +0 >  P. 1:1001(1000) ack 1                                                //
Send
quoted
quoted
data with PSH flag
quoted
   +0 close(3) = 0                                                            //
Local
quoted
quoted
side initiates close
quoted
   +0 >  F. 1001:1001(0) ack 1                                                //
Send
quoted
quoted
FIN
quoted
   +1 < . 1:1(0) ack 1001 win 257
//
quoted
quoted
Receive ACK for data
quoted
   +0 < . 1:1(0) ack 1002 win 257                                             //
Receive ACK for FIN
quoted
   +0 < F. 1:1(0) ack 1002 win 257                                            //
Receive FIN from remote
quoted
   +0 >  . 1002:1002(0) ack 2                                                 //
Send
quoted
quoted
ACK for FIN
quoted

But got below error when I run the script.

$ sudo ./packetdrill ../tcp/close/close-half-delayed-ack.pkt
../tcp/close/close-half-delayed-ack.pkt:22: error handling packet:
live packet field tcp_fin: expected: 0 (0x0) vs actual: 1 (0x1)
script
packet:  1.010997 . 1002:1002(0) ack 2 actual packet:  0.014840 F.
1001:1001(0) ack 1 win 256
This means the FIN was retransmited earlier.
Then the data segment was probably also retransmit.

You can use "tcpdump -i any &" while developing your script.

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
       // Create socket
   +.01...0.111 connect(3, ..., ...) = 0
       // Initiate connection
   +0 >  S 0:0(0) <...>
       // Send SYN
  +.1 < S. 0:0(0) ack 1 win 32768 <mss 1000,nop,wscale
6,nop,nop,sackOK>      // Receive SYN-ACK with TCP options
   +0 >  . 1:1(0) ack 1
       // Send ACK

   +0 write(3, ..., 1000) = 1000
       // Write 1000 bytes
   +0 >  P. 1:1001(1000) ack 1
       // Send data with PSH flag

   +0 close(3) = 0
       // Local side initiates close
   +0 >  F. 1001:1001(0) ack 1
       // Send FIN
  +.2 >  F. 1001:1001(0) ack 1    // FIN retransmit
+.2~+.4 >  P. 1:1001(1000) ack 1 // RTX

   +0 < . 1:1(0) ack 1001 win 257
        // Receive ACK for data
   +0 > F. 1001:1001(0) ack 1 // FIN retransmit
   +0 < . 1:1(0) ack 1002 win 257
       // Receive ACK for FIN

   +0 < F. 1:1(0) ack 1002 win 257
       // Receive FIN from remote
   +0 >  . 1002:1002(0) ack 2
       // Send ACK for FIN
Hi Eric,

I modified the script and inlined tcpdump capture

`../common/defaults.sh`
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3                               //
Create socket
quoted
   +.01...0.011 connect(3, ..., ...) = 0                                      //
Initiate connection
quoted
   +0 >  S 0:0(0) <...>                                                       // Send
SYN
quoted
1 0.000000 192.168.114.235 192.0.2.1 TCP 80 40784 → 8080 [SYN] Seq=0
Win=65535 Len=0 MSS=1460 SACK_PERM TSval=2913446377 TSecr=0 WS=256

   +0 < S. 0:0(0) ack 1 win 32768 <mss 1000,nop,wscale 6,nop,nop,sackOK>
// Receive SYN-ACK with TCP options
quoted
2 0.000209 192.0.2.1 192.168.114.235 TCP 72 8080 → 40784 [SYN, ACK]
Seq=0 Ack=1 Win=32768 Len=0 MSS=1000 WS=64 SACK_PERM

   +0 >  . 1:1(0) ack 1                                                       // Send
ACK
quoted
3 0.000260 192.168.114.235 192.0.2.1 TCP 60 40784 → 8080 [ACK] Seq=1
Ack=1 Win=65536 Len=0

   +0 write(3, ..., 1000) = 1000                                              //
Write 1000 bytes
quoted
   +0 >  P. 1:1001(1000) ack 1                                                // Send
data with PSH flag
quoted
4 0.000344 192.168.114.235 192.0.2.1 TCP 1060 40784 → 8080 [PSH, ACK]
Seq=1 Ack=1 Win=65536 Len=1000

   +0 close(3) = 0                                                            // Local
side initiates close
quoted
   +0 >  F. 1001:1001(0) ack 1                                                // Send
FIN
quoted
5 0.000381 192.168.114.235 192.0.2.1 TCP 60 40784 → 8080 [FIN, ACK]
Seq=1001 Ack=1 Win=65536 Len=0

   +.2 >  F. 1001:1001(0) ack 1                                               // FIN
retransmit
quoted
6 0.004545 192.168.114.235 192.0.2.1 TCP 60 [TCP Retransmission] 40784
→ 8080 [FIN, ACK] Seq=1001 Ack=1 Win=65536 Len=0

   +.2~+.4 >  P. 1:1001(1000) ack 1                                           // RTX
   +0 < . 1:1(0) ack 1001 win 257                                             //
Receive ACK for data
quoted
   +0 < . 1:1(0) ack 1002 win 257                                             //
Receive ACK for FIN
quoted
   +0 < F. 1:1(0) ack 1002 win 257                                            //
Receive FIN from remote
quoted
   +0 >  . 1002:1002(0) ack 2                                                 // Send
ACK for FIN
quoted

And hit below error.
../tcp/close/close-half-delayed-ack.pkt:18: error handling packet:
timing error: expected outbound packet at 0.210706 sec but happened at
0.014838 sec; tolerance 0.025002 sec script packet:  0.210706 F.
1001:1001(0) ack 1 actual packet:  0.014838 F. 1001:1001(0) ack 1 win
256

For me, it looks like delay in below line does not take effect by
packetdrill.
quoted
+.2 >  F. 1001:1001(0) ack 1                                               // FIN
retransmit

I think you misunderstood how packetdrill works.

In packetdrill, you can specify delays for incoming packets (to account
for network delays, or remote TCP stack bugs/behavior)

But outgoing packets are generated by the kernel TCP stack.
Packetdrill checks that these packets have the expected layouts and sent
at the expected time.

Hi Eric and Neal

Thanks for explanation.
Now I updated script based on local packet Tx pattern based on tcpdump
and injected delay for remote packet. Now it works without any issue.
// Test basic connection teardown where local process closes first:
// the local process calls close() first, so we send a FIN.
// Then we receive an delayed ACK for data and FIN.
// Then we receive a FIN and ACK it.

`../common/defaults.sh`
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3                               // Create socket
   +.01...0.011 connect(3, ..., ...) = 0                                      // Initiate connection
   +0 >  S 0:0(0) <...>                                                       // Send SYN
   +0 < S. 0:0(0) ack 1 win 32768 <mss 1000,nop,wscale 6,nop,nop,sackOK>      // Receive SYN-ACK with TCP options
   +0 >  . 1:1(0) ack 1                                                       // Send ACK

   +0 write(3, ..., 1000) = 1000                                              // Write 1000 bytes

   +0 >  P. 1:1001(1000) ack 1                                                // Send data with PSH flag

   +0 close(3) = 0                                                            // Local side initiates close

   +0 >  F. 1001:1001(0) ack 1                                                // Send FIN

   +0 >  F. 1001:1001(0) ack 1                                                // FIN retransmit

   +.2  >  P. 1:1001(1000) ack 1                                              // RTX
   +.4  >  P. 1:1001(1000) ack 1                                              // RTX
   +.8  >  P. 1:1001(1000) ack 1                                              // RTX
   +1.6 >  P. 1:1001(1000) ack 1                                              // RTX
   +3.2 >  P. 1:1001(1000) ack 1                                              // RTX
   +6.4 >  P. 1:1001(1000) ack 1                                              // RTX
   +13  >  P. 1:1001(1000) ack 1                                              // RTX
   +26  >  P. 1:1001(1000) ack 1                                              // RTX

   +1 < . 1:1(0) ack 1002 win 257                                            // Receive ACK for FIN

   +1 < . 1:1(0) ack 1001 win 257                                            // Receive ACK for data

   +0 < F. 1:1(0) ack 1002 win 257                                            // Receive FIN from remote

We will develop the script to reliably reproduce the case.
Maybe we need to get good tcpdump trace when issue happens. But it would take sometime.

In the meantime, since we have complete set of ramdump snapshot, it would be appreciated if Neal could find anything from the values I provided earlier.

tp->packets_out = 0
tp->sacked_out = 0
tp->lost_out = 1
tp->retrans_out = 1
tp->rx_opt.sack_ok = 5 (tcp_is_sack(tp)) mss_cache = 1400
((struct inet_connection_sock *)sk)->icsk_ca_state = 4 
((struct inet_connection_sock *)sk)->icsk_pmtu_cookie = 1500
Hex from ip header:
45 00 00 40 75 40 00 00 39 06 91 13 8E FB 2A CA C0 A8 00 F7 01 BB A7 
CC 51
F8 63 CC 52 59 6D A6 B0 10 04 04 77 76 00 00 01 01 08 0A 89 72 C8 42
62 F5
F5 D1 01 01 05 0A 52 59 6D A5 52 59 6D A6
Transmission Control Protocol
Source Port: 443
Destination Port: 42956
TCP Segment Len: 0
Sequence Number (raw): 1375232972
Acknowledgment number (raw): 1381592486
1011 .... = Header Length: 44 bytes (11)
Flags: 0x010 (ACK)
Window: 1028
Calculated window size: 1028
Urgent Pointer: 0
Options: (24 bytes), No-Operation (NOP), No-Operation (NOP), 
Timestamps, No-Operation (NOP), No-Operation (NOP), SACK

Thanks,
Dujeong.

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help