Re: TCP stack bug related to F-RTO?

From: Joe Cao <hidden>
Date: 2009-09-25 06:42:43
Also in: lkml

Hi,

The wrong TCP checksums are caused by hardware checksum offload.

As for the seq/ack numbers: because the trace is long, I deliberately removed the irrelevant packets between the three-way handshake and the point where the problem starts. That can be seen from the timestamps.

Please also note that I intentionally replaced the IP addresses and MAC addresses in the trace to hide proprietary information.

In any case, the problem is not related to the checksums or the seq/ack numbers; otherwise, you wouldn't see the behavior shown in the trace.

Thanks,
Joe

--- On Thu, 9/24/09, zhigang gong <zhigang.gong@gmail.com> wrote:

From: zhigang gong <redacted>
Subject: Re: TCP stack bug related to F-RTO?
To: "Joe Cao" <redacted>
Cc: linux-kernel@vger.kernel.org, jcaoco2002@yahoo.com, netdev@vger.kernel.org
Date: Thursday, September 24, 2009, 7:32 PM
On Fri, Sep 25, 2009 at 1:43 AM, Joe Cao [off-list ref] wrote:

quoted

Hello,

I have found the following behavior with different versions of the Linux kernel. The attached pcap trace was collected with the server (192.168.0.13) running 2.6.24 and shows the problem. Basically, the behavior is like this:

1. The client opens up a big window.
2. The server sends 19 packets in a row (pkt #14-#32 in the trace), but all of them are dropped due to some congestion.
3. The server hits RTO and retransmits pkt #14 in #33.
4. The client immediately acks #33 (=#14), and the server (which seems to enter F-RTO) expands the window and sends *NEW* pkts #35 & #36. The timeout is doubled to 2*RTO. The client immediately sends two dup-acks in response to #35 and #36.
5. After 2*RTO, pkt #15 is retransmitted in #39.
6. The client immediately acks #39 (=#15) in #40, and the server continues to expand the window and sends two *NEW* pkts #41 & #42. Now the timeout is doubled to 4*RTO.
8. After the 4*RTO timeout, #16 is retransmitted.
9. ...
10. The above steps repeat for retransmitting pkts #16-#32, and each time the timeout is doubled.
11. It takes a very long time to retransmit all the lost packets, and before that is done, the client sends a RST because of a timeout.
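
For a rough sense of why recovery takes so long under the behavior described in steps 3-11, the following Python sketch adds up the waits when each timeout recovers only one lost packet and the timer doubles every time. The 0.2 s initial RTO and the 120 s cap are illustrative assumptions, not values read from the trace, and `retransmit_schedule` is a hypothetical helper written for this note, not kernel code.

```python
def retransmit_schedule(lost_packets, initial_rto, max_rto):
    """Return the cumulative time (seconds) at which each lost packet is
    retransmitted, assuming exactly one retransmission per timeout and a
    timeout that doubles after every expiry, capped at max_rto."""
    times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(lost_packets):
        elapsed += rto                 # wait for the current timer to expire
        times.append(elapsed)          # one lost packet is retransmitted
        rto = min(rto * 2, max_rto)    # exponential backoff with a cap
    return times


# Assumed values: 19 lost packets (pkt #14-#32), 0.2 s initial RTO, 120 s cap.
schedule = retransmit_schedule(lost_packets=19, initial_rto=0.2, max_rto=120.0)
for offset, t in enumerate(schedule):
    print(f"pkt #{14 + offset} retransmitted after {t:.1f} s")
```

Because the waits grow geometrically until they hit the cap, almost all of the total time is spent on the last few packets, which fits the observation that the client gives up and sends a RST before the retransmissions finish.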

quoted

The above behavior looks like F-RTO is in effect. And there seems to be a bug in TCP's congestion control and retransmission algorithm. Why doesn't TCP on the server (running 2.6.24) enter slow start?

As far as I know, early implementations did not enter slow start if the remote end was on the same network. I'm not sure about version 2.6.24. But after having a look at your trace, I think this is not the point of your problem. The behaviour of your client 192.168.0.82 is very strange. The client always sends packets with a wrong TCP checksum, and packets #4 to #13 sent by the client don't conform to the TCP protocol at all: not only do they have a wrong TCP checksum, they also have incorrect seq and ack numbers.

My suggestion is that before you start to investigate the server side's behaviour, you need to correct your client side's TCP/IP stack implementation first.

quoted

Why should the server take that long to recover from a short period of packet loss?

Has anyone else noticed a similar problem before? If my analysis is wrong, can anyone give me some pointers to what's really wrong and how to fix it?

Thanks a lot,
Joe

PS. Please cc me when replying to this message.
