Re: TCP stack bug related to F-RTO?
From: Ilpo Järvinen <hidden>
Date: 2009-09-26 17:51:09
Also in:
lkml
On Sat, 26 Sep 2009, Joe Cao wrote:
> Can you elaborate on "Some retransmission would happen here as step 3"? When the second timeout happens, it will again go into FRTO and then retransmit the write queue head.

Why do you think that the second RTO will happen with anything other than 2.6.24? And it's perfectly OK to go into FRTO for the second time.

> I looked at the patch (debian Bug#478062) that's probably what you mentioned as the fix. All it did was exclude the SACK case when considering FRTO. But in my case, SACK was enabled, as seen in the trace.

You should be looking where I said rather than picking up your own sources and assuming that they'll tell you the whole story :-). In fact, there were two fixes made in a row and one workaround in the same timeframe. ...And you managed to pick the wrong one of the fixes, so I kind of understand why you got confused :-).

> In other words, do we still have a problem with FRTO when SACK is enabled in the latest kernel?

For sure we might have all kinds of problems no one has yet noticed/reported :-). ...However, it seems that this particular problem your trace is showing is solved. Can you please test with a fixed kernel before coming back here with these claims?

-- i.
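
For anyone following along, the decision F-RTO makes after a timeout can be sketched roughly as follows. This is a simplified model of the basic algorithm in RFC 5682, not the kernel code: the function name `frto_outcome` and the "new"/"dup" labels are made up for illustration, and the SACK-enhanced variant is left out.

```python
def frto_outcome(first_ack, second_ack):
    """Classify a retransmission timeout from the two ACKs that follow
    the retransmission of the head of the write queue.

    Each argument is "new" (the ACK advances the window) or "dup"
    (duplicate ACK). Hypothetical helper, loosely after RFC 5682.
    """
    for ack in (first_ack, second_ack):
        if ack not in ("new", "dup"):
            raise ValueError("ACK must be 'new' or 'dup'")

    # First ACK does not advance the window: give up on F-RTO and
    # fall back to ordinary RTO recovery.
    if first_ack == "dup":
        return "conventional recovery"

    # First ACK advanced the window, so the sender transmits up to two
    # NEW segments instead of retransmitting more old ones.
    # A duplicate ACK in response means the loss was real.
    if second_ack == "dup":
        return "conventional recovery"

    # Both ACKs advanced the window: the timeout was spurious.
    return "spurious timeout"
```

In the trace from the original report, the first ACK is "new" and the next one is "dup", so this model lands in conventional recovery, which is where the further retransmissions referred to as "step 3" would come from.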
> --- On Fri, 9/25/09, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
>> From: Ilpo Järvinen <redacted>
>> Subject: Re: TCP stack bug related to F-RTO?
>> To: "Joe Cao" <redacted>
>> Cc: "Ray Lee" <redacted>, "Netdev" <redacted>, "LKML" <redacted>
>> Date: Friday, September 25, 2009, 11:03 AM
>>
>> On Fri, 25 Sep 2009, Joe Cao wrote:
>>
>>> Thanks for the reply! Do you happen to know which patch fixed the problem?
>>
>> You can find those patches in the stable queue git tree. I gave you a hint about which release to look at in the last mail. However, as 2.6.24 is obsolete anyway, my recommendation is that you should probably consider upgrading to fix all the other bugs that have been found since 2.6.24 was obsoleted.
>>
>>> Is there a bug tracking system for linux kernel?
>>
>> Nothing that knows everything about everything.
>>
>>> I studied the FRTO code in latest kernel 2.6.31. It seems the problem is still there:
>>>
>>> 1. Every time a RTO fires, because tcp_is_sackfrto(tp) returns 1, tcp_use_frto() returns true. And the server tcp enters FRTO.
>>> 2. After the head of write queue is retransmitted, two new data packets are transmitted, the server receives two dup-ACKs. That will make the TCP enter tcp_enter_frto_loss(), however, that only resets ssthresh and some other fields.
>>
>> Perhaps those other fields are far more important than you think... :-) ...Some retransmission would happen here as step 3.
>>
>>> 3. After another longer RTO fires, because tcp_is_sackfrto(tp) returns 1, tcp_use_frto() again returns true. The stack enters FRTO again.
>>> 4. The above repeats and the stack couldn't retransmit the lost packets faster.
>>>
>>> Is my understanding above correct?
>>
>> ...No. All magic that happens in tcp_enter_frto_loss should be enough to really do more than a single retransmission (that is, in any other than 2.6.24 series kernel). There was an unfortunate bug in this area in 2.6.24 which basically undid the effect of the correct actions tcp_enter_frto_loss did, which effectively prevented tcp_xmit_retransmit_queue from doing its part.
>>
>> -- i.
>>
>>> --- On Fri, 9/25/09, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
>>>> From: Ilpo Järvinen <redacted>
>>>> Subject: Re: TCP stack bug related to F-RTO?
>>>> To: "Ray Lee" <redacted>
>>>> Cc: "Joe Cao" <redacted>, "Netdev" [off-list ref], "LKML" [off-list ref], jcaoco2002@yahoo.com
>>>> Date: Friday, September 25, 2009, 6:09 AM
>>>>
>>>> On Thu, 24 Sep 2009, Ray Lee wrote:
>>>>
>>>>> [adding netdev cc:]
>>>>>
>>>>> On Thu, Sep 24, 2009 at 10:43 AM, Joe Cao [off-list ref] wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have found the following behavior with different versions of linux kernel. The attached pcap trace is collected with server (192.168.0.13) running 2.6.24 and shows the problem. Basically the behavior is like this:
>>>>>>
>>>>>> 1. The client opens up a big window,
>>>>>> 2. the server sends 19 packets in a row (pkt #14-#32 in the trace), but all of them are dropped due to some congestion.
>>>>>> 3. The server hits RTO and retransmits pkt #14 in #33.
>>>>>> 4. The client immediately acks #33 (=#14), and the server (seems to enter F-RTO) expands the window and sends *NEW* pkt #35 & #36. Timeout is doubled to 2*RTO; the client immediately sends two Dup-acks to #35 and #36.
>>>>>> 5. After 2*RTO, pkt #15 is retransmitted in #39.
>>>>>> 6. The client immediately acks #39 (=#15) in #40, and the server continues to expand the window and sends two *NEW* pkt #41 & #42. Now the timeout is doubled to 4*RTO.
>>>>>> 8. After 4*RTO timeout, #16 is retransmitted.
>>>>>> 9. ...
>>>>>> 10. The above steps repeat for retransmitting pkt #16-#32 and each time the timeout is doubled.
>>>>>> 11. It takes a long long time to retransmit all the lost packets and before that is done, the client sends a RST because of timeout.
>>>>>>
>>>>>> The above behavior looks like F-RTO is in effect. And there seems to be a bug in the TCP's congestion control and retransmission algorithm.
>>>>>>
>>>>>> Why doesn't the TCP on server (running 2.6.24) enter the slow start?
>>>>>>
>>>>>> Why should the server take that long to recover from a short period of packet loss?
>>>>>>
>>>>>> Has anyone else noticed similar problem before? If my analysis was wrong, can anyone give me some pointers to what's really wrong and how to fix it?
>>>>
>>>> Yes, 2.6.24 is an obsoleted version with known wrongs in FRTO implementation. Fixes never went to 2.6.24 stable series as it was _already_ obsoleted when the problems were reported and found. The correct fixes may be found from 2.6.25.7 (.7 iirc) and are included from 2.6.26 onward too.
>>>>
>>>> Just in case you happen to run ubuntu based kernel from that era (of course you should be reporting the bug here then...), a word of warning: it seemed nearly impossible for them to get a simple thing like that fixed, I haven't been looking if they'd eventually come to some sensible conclusion in that matter or is it still unresolved (or e.g., closed without real resolution).
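
To put a rough number on "a long long time" in the original report: if each lost packet really had to wait for its own timeout, with the timer doubling every time, the waits add up as in the sketch below. The helper name `total_backoff_wait`, the 1-second initial RTO in the usage note, and the optional `rto_max` cap are assumptions for illustration; the trace does not state the actual RTO.

```python
def total_backoff_wait(lost_packets, initial_rto, rto_max=None):
    """Sum the timer waits when every lost packet is recovered by its
    own timeout and the timer doubles after each one.

    initial_rto and rto_max are in seconds. rto_max stands in for
    whatever upper bound the stack puts on the timer (None = no cap).
    Hypothetical helper, not taken from the kernel.
    """
    total = 0.0
    rto = initial_rto
    for _ in range(lost_packets):
        # Apply the cap, if any, to the wait for this retransmission.
        wait = rto if rto_max is None else min(rto, rto_max)
        total += wait
        # Exponential backoff: double the timer for the next loss.
        rto *= 2
    return total
```

With the 19 lost packets from the report and an assumed initial RTO of 1 second, `total_backoff_wait(19, 1.0)` gives 524287.0 seconds without a cap, so it is no surprise the client gave up with a RST long before the last packet was retransmitted.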