Re: TCP stack bug related to F-RTO?
From: Joe Cao <hidden>
Date: 2009-09-26 20:48:28
Also in: lkml
Hi Ilpo,

Thanks for the reply.

We noticed the problem while debugging a connection failure reported by one of our customers (we are a network device vendor). We have already suggested that the customer upgrade their server software to fix the problem, and we are still waiting for their feedback.

Meanwhile, I asked all those questions just because I want to understand the issue and the fixes. We also have to convince the customer to move to the right kernel, and we don't want them to come back with the same problem again.

Again, thanks for the help!

Joe

--- On Sat, 9/26/09, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
> From: Ilpo Järvinen <redacted>
> Subject: Re: TCP stack bug related to F-RTO?
> To: "Joe Cao" <redacted>
> Cc: "Ray Lee" <redacted>, "Netdev" <redacted>, "LKML" <redacted>
> Date: Saturday, September 26, 2009, 10:51 AM
>
> On Sat, 26 Sep 2009, Joe Cao wrote:
>
> > Can you elaborate on "Some retransmission would happen here as step 3"?
> > When the second timeout happens, it will again go into FRTO and then
> > retransmit the write queue head.
>
> Why do you think that the second RTO will happen with anything other
> than 2.6.24? And it's perfectly ok to go into FRTO for the second time.
>
> > I looked at the patch (debian Bug#478062) that's probably what you
> > mentioned as the fix. All it does is exclude the SACK case when
> > considering FRTO. But in my case, SACK was enabled, as seen in the
> > trace.
>
> You should be looking where I said rather than picking up your own
> sources and assuming that they'll tell you the whole story :-). In fact,
> there are two fixes that were made in a row and one workaround in the
> same timeframe. ...And you managed to pick the wrong one of the fixes,
> so I kind of understand why you got confused :-).
>
> > In other words, do we still have a problem with FRTO when SACK is
> > enabled in the latest kernel?
>
> For sure we might have all kinds of problems no one has yet
> noticed/reported :-). ...However, it seems that this particular problem
> your trace is showing is solved. Can you please test with a fixed kernel
> before coming back here with these claims.
>
> --
> i.
>
> > --- On Fri, 9/25/09, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
> >
> > > From: Ilpo Järvinen <redacted>
> > > Subject: Re: TCP stack bug related to F-RTO?
> > > To: "Joe Cao" <redacted>
> > > Cc: "Ray Lee" <redacted>, "Netdev" [off-list ref], "LKML" [off-list ref]
> > > Date: Friday, September 25, 2009, 11:03 AM
> > >
> > > On Fri, 25 Sep 2009, Joe Cao wrote:
> > >
> > > > Thanks for the reply! Do you happen to know which patch fixed the
> > > > problem?
> > >
> > > You can find those patches in the stable queue git tree. I gave you a
> > > hint about which release to look at in the last mail. However, as
> > > 2.6.24 is obsolete anyway, my recommendation is that you should
> > > probably consider upgrading to fix all the other bugs that have been
> > > found since 2.6.24 was obsoleted.
> > >
> > > > Is there a bug tracking system for linux kernel?
> > >
> > > Nothing that knows everything about everything.
> > >
> > > > I studied the FRTO code in latest kernel 2.6.31. It seems the
> > > > problem is still there:
> > > >
> > > > 1. Every time an RTO fires, because tcp_is_sackfrto(tp) returns 1,
> > > >    tcp_use_frto() returns true. And the server tcp enters FRTO.
> > > >
> > > > 2. After the head of write queue is retransmitted, two new data
> > > >    packets are transmitted, the server receives two dup-ACKs. That
> > > >    will make the TCP enter tcp_enter_frto_loss(), however, that only
> > > >    resets ssthresh and some other fields.
> > >
> > > Perhaps those other fields are far more important than you
> > > think... :-)
> > >
> > > ...Some retransmission would happen here as step 3.
> > >
> > > > 3. After another longer RTO fires, because tcp_is_sackfrto(tp)
> > > >    returns 1, tcp_use_frto() again returns true. The stack enters
> > > >    FRTO again.
> > > >
> > > > 4. The above repeats and the stack can't retransmit the lost
> > > >    packets any faster.
> > > >
> > > > Is my understanding above correct?
> > >
> > > ...No. All the magic that happens in tcp_enter_frto_loss should be
> > > enough to really do more than a single retransmission (that is, in
> > > any kernel other than the 2.6.24 series). There was an unfortunate bug
> > > in this area in 2.6.24 which basically undid the effect of the correct
> > > actions tcp_enter_frto_loss took, which effectively prevented
> > > tcp_xmit_retransmit_queue from doing its part.
> > >
> > > --
> > > i.
> > >
> > > > --- On Fri, 9/25/09, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
> > > >
> > > > > From: Ilpo Järvinen <redacted>
> > > > > Subject: Re: TCP stack bug related to F-RTO?
> > > > > To: "Ray Lee" <redacted>
> > > > > Cc: "Joe Cao" <redacted>, "Netdev" [off-list ref], "LKML" [off-list ref], jcaoco2002@yahoo.com
> > > > > Date: Friday, September 25, 2009, 6:09 AM
> > > > >
> > > > > On Thu, 24 Sep 2009, Ray Lee wrote:
> > > > >
> > > > > > [adding netdev cc:]
> > > > > >
> > > > > > On Thu, Sep 24, 2009 at 10:43 AM, Joe Cao [off-list ref] wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I have found the following behavior with different versions
> > > > > > > of linux kernel. The attached pcap trace is collected with
> > > > > > > server (192.168.0.13) running 2.6.24 and shows the problem.
> > > > > > > Basically the behavior is like this:
> > > > > > >
> > > > > > > 1. The client opens up a big window,
> > > > > > >
> > > > > > > 2. the server sends 19 packets in a row (pkt #14-#32 in the
> > > > > > >    trace), but all of them are dropped due to some congestion.
> > > > > > >
> > > > > > > 3. The server hits RTO and retransmits pkt #14 in #33
> > > > > > >
> > > > > > > 4. The client immediately acks #33 (=#14), and the server
> > > > > > >    (seems to enter F-RTO) expands the window and sends *NEW*
> > > > > > >    pkt #35 & #36. Timeout is doubled to 2*RTO; the client
> > > > > > >    immediately sends two Dup-acks to #35 and #36.
> > > > > > >
> > > > > > > 5. after 2*RTO, pkt #15 is retransmitted in #39.
> > > > > > >
> > > > > > > 6. The client immediately acks #39 (=#15) in #40, and the
> > > > > > >    server continues to expand the window and sends two *NEW*
> > > > > > >    pkt #41 & #42. Now the timeout is doubled to 4*RTO.
> > > > > > >
> > > > > > > 8. After 4*RTO timeout, #16 is retransmitted.
> > > > > > >
> > > > > > > 9. ...
> > > > > > >
> > > > > > > 10. The above steps repeat for retransmitting pkt #16-#32 and
> > > > > > >     each time the timeout is doubled.
> > > > > > >
> > > > > > > 11. It takes a long long time to retransmit all the lost
> > > > > > >     packets and before that is done, the client sends a RST
> > > > > > >     because of timeout.
> > > > > > >
> > > > > > > The above behavior looks like F-RTO is in effect. And there
> > > > > > > seems to be a bug in the TCP's congestion control and
> > > > > > > retransmission algorithm.
> > > > > > >
> > > > > > > Why doesn't the TCP on server (running 2.6.24) enter the slow
> > > > > > > start? Why should the server take that long to recover from a
> > > > > > > short period of packet loss?
> > > > > > >
> > > > > > > Has anyone else noticed similar problem before? If my
> > > > > > > analysis was wrong, can anyone give me some pointers to
> > > > > > > what's really wrong and how to fix it?
> > > > >
> > > > > Yes, 2.6.24 is an obsoleted version with known wrongs in FRTO
> > > > > implementation. Fixes never went to 2.6.24 stable series as it
> > > > > was _already_ obsoleted when the problems were reported and
> > > > > found. The correct fixes may be found from 2.6.25.7 (.7 iirc) and
> > > > > are included from 2.6.26 onward too.
> > > > >
> > > > > Just in case you happen to run ubuntu based kernel from that era
> > > > > (of course you should be reporting the bug here then...), a word
> > > > > of warning: it seemed nearly impossible for them to get a simple
> > > > > thing like that fixed, I haven't been looking if they'd
> > > > > eventually come to some sensible conclusion in that matter or is
> > > > > it still unresolved (or e.g., closed without real resolution).
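
Editorial sketch: the original report says each of the 19 lost packets waits for its own RTO, and that the timeout doubles every time. The arithmetic behind "a long long time" can be illustrated in Python. This is a simplified model under stated assumptions, not kernel code: it assumes exactly one lost packet is retransmitted per timer expiry, the function name recovery_time and its parameters are hypothetical, and the 1-second initial RTO is a placeholder because the trace's actual RTO is not given in the thread. The 120-second default cap mirrors the Linux TCP_RTO_MAX constant.

```python
def recovery_time(lost_packets, initial_rto, rto_max=120.0):
    """Seconds until the last lost packet is retransmitted.

    Simplified model (an assumption, not the kernel's logic): every
    retransmission has to wait for a full timer expiry, and the timer
    doubles after each expiry up to rto_max.
    """
    elapsed = 0.0
    rto = initial_rto
    for _ in range(lost_packets):
        elapsed += rto                # wait for the timer, retransmit one packet
        rto = min(rto * 2, rto_max)   # exponential backoff, capped
    return elapsed


if __name__ == "__main__":
    # 19 lost packets as in the report; 1.0 s initial RTO is a placeholder.
    print(recovery_time(19, 1.0))                        # capped at 120 s
    print(recovery_time(19, 1.0, rto_max=float("inf")))  # uncapped doubling
```

Even with the cap, the model gives a recovery time in the tens of minutes for 19 packets, which is consistent with the client giving up and sending a RST first. A stack that retransmits more than one packet per timeout, as the fixed kernels are said to do, avoids this sum entirely.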