Thread (8 messages), 3 authors, 2009-09-26

Re: TCP stack bug related to F-RTO?

From: Ilpo Järvinen <hidden>
Date: 2009-09-26 17:51:09
Also in: lkml

On Sat, 26 Sep 2009, Joe Cao wrote:
> Can you elaborate on "Some retransmission would happen here as step 3"?
> When the second timeout happens, it will again go into FRTO and then
> retransmit the write queue head.

Why do you think the second RTO will happen with anything other than
2.6.24? And it's perfectly OK to go into FRTO a second time.

> I looked at the patch (debian Bug#478062) that's probably what you
> mentioned as the fix. All it did was exclude the SACK case when
> considering FRTO.  But in my case, SACK was enabled, as seen in the
> trace.

You should be looking where I said rather than picking up your own
sources and assuming that they'll tell you the whole story :-). In fact,
there were two fixes made in a row and one workaround in the same
timeframe. ...And you managed to pick the wrong one of the fixes, so
I kind of understand why you got confused :-).

> In other words, do we still have a problem with FRTO when SACK is
> enabled in the latest kernel?

For sure we might have all kinds of problems no one has yet
noticed/reported :-). ...However, it seems that the particular problem
your trace is showing is solved. Can you please test with a fixed kernel
before coming back here with these claims?


-- 
 i.
--- On Fri, 9/25/09, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
From: Ilpo Järvinen <redacted>
Subject: Re: TCP stack bug related to F-RTO?
To: "Joe Cao" <redacted>
Cc: "Ray Lee" <redacted>, "Netdev" <redacted>, "LKML" <redacted>
Date: Friday, September 25, 2009, 11:03 AM
On Fri, 25 Sep 2009, Joe Cao wrote:
> Thanks for the reply!  Do you happen to know which patch fixed the
> problem?

You can find those patches in the stable queue git tree. I gave you a
hint about which release to look at in my last mail. However, as 2.6.24
is obsolete anyway, my recommendation is that you should probably
consider upgrading to fix all the other bugs that have been found since
2.6.24 was obsoleted.

> Is there a bug tracking system for linux kernel?

Nothing that knows everything about everything.
> I studied the FRTO code in latest kernel 2.6.31. It seems the problem
> is still there:
>
> 1. Every time an RTO fires, because tcp_is_sackfrto(tp) returns 1,
> tcp_use_frto() returns true.  And the server tcp enters FRTO.
> 2. After the head of write queue is retransmitted, two new data packets
> are transmitted, the server receives two dup-ACKs.  That will make the
> TCP enter tcp_enter_frto_loss(), however, that only resets ssthresh and
> some other fields.

Perhaps those other fields are far more important than you
think... :-)
...Some retransmission would happen here as step 3.

> 3. After another longer RTO fires, because tcp_is_sackfrto(tp) returns
> 1, tcp_use_frto() again returns true.  The stack enters FRTO again.
> 4. The above repeats and the stack couldn't retransmit the lost packets
> faster.
>
> Is my understanding above correct?
...No. All the magic that happens in tcp_enter_frto_loss should be
enough to really do more than a single retransmission (that is, in any
kernel other than the 2.6.24 series). There was an unfortunate bug in
this area in 2.6.24 which basically undid the effect of the correct
actions tcp_enter_frto_loss took, which effectively prevented
tcp_xmit_retransmit_queue from doing its part.
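
For readers following along, the behaviour being argued about can be sketched as a tiny decision function. This is a simplified model of the basic F-RTO algorithm as described in RFC 5682, not the kernel's code: the function name and the string labels are hypothetical, and it leaves out the "recover" sequence check and the SACK-enhanced variant.

```python
def frto_decision(first_ack_advances, second_ack_advances):
    """Simplified F-RTO decision after an RTO retransmission.

    Assumption: each argument says whether the corresponding ACK
    acknowledged new data (True) or was a duplicate ACK (False).
    """
    if not first_ack_advances:
        # A duplicate first ACK means the loss is treated as real:
        # fall back to ordinary RTO recovery right away.
        return "conventional_recovery"

    # The first ACK advanced the window, so the sender transmits up to
    # two previously unsent segments and waits for the next ACK.
    if second_ack_advances:
        # Two advancing ACKs in a row: the timeout was spurious.
        return "spurious_timeout"

    # A duplicate ACK for the new segments: real loss, so the sender
    # should retransmit the unacknowledged segments in slow start.
    return "conventional_recovery"


# The pattern in the reported trace: the retransmitted head is ACKed,
# then the two new segments draw duplicate ACKs.
trace_outcome = frto_decision(True, False)
```

In the trace pattern the sketch ends in conventional recovery, which is where retransmission of the remaining lost segments is expected to continue rather than waiting for another timeout.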

-- 
 i.
--- On Fri, 9/25/09, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
From: Ilpo Järvinen <redacted>
Subject: Re: TCP stack bug related to F-RTO?
To: "Ray Lee" <redacted>
Cc: "Joe Cao" <redacted>, "Netdev" [off-list ref], "LKML" [off-list ref], jcaoco2002@yahoo.com
Date: Friday, September 25, 2009, 6:09 AM
On Thu, 24 Sep 2009, Ray Lee wrote:
> [adding netdev cc:]
>
> On Thu, Sep 24, 2009 at 10:43 AM, Joe Cao [off-list ref] wrote:
>> Hello,
>>
>> I have found the following behavior with different versions of linux
>> kernel. The attached pcap trace is collected with server
>> (192.168.0.13) running 2.6.24 and shows the problem. Basically the
>> behavior is like this:
>>
>> 1. The client opens up a big window,
>> 2. the server sends 19 packets in a row (pkt #14-#32 in the trace),
>> but all of them are dropped due to some congestion.
>> 3. The server hits RTO and retransmits pkt #14 in #33
>> 4. The client immediately acks #33 (=#14), and the server (which
>> seems to enter F-RTO) expands the window and sends *NEW* pkt #35 &
>> #36. Timeout is doubled to 2*RTO; the client immediately sends two
>> Dup-ack to #35 and #36.
>> 5. after 2*RTO, pkt #15 is retransmitted in #39.
>> 6. The client immediately acks #39 (=#15) in #40, and the server
>> continues to expand the window and sends two *NEW* pkt #41 & #42.
>> Now the timeout is doubled to 4*RTO.
>> 8. After 4*RTO timeout, #16 is retransmitted.
>> 9....
>> 10. The above steps repeat for retransmitting pkt #16-#32 and each
>> time the timeout is doubled.
>> 11. It takes a long long time to retransmit all the lost packets and
>> before that is done, the client sends a RST because of timeout.
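
To put rough numbers on the steps above, here is a minimal sketch (not kernel code) that adds up the waiting time when every lost packet costs one timeout and the timeout doubles each time. The trace does not state the RTO, so the 0.2 s starting value and the 120 s cap are assumptions, picked to resemble Linux's usual minimum and maximum RTO; the function name is hypothetical.

```python
def total_recovery_time(lost_packets, initial_rto=0.2, rto_max=120.0):
    """Sum the waits if each lost packet needs its own backed-off timeout.

    Assumptions: the first timeout is initial_rto seconds, every later
    timeout is double the previous one, and the timeout never exceeds
    rto_max seconds.
    """
    total = 0.0
    rto = initial_rto
    for _ in range(lost_packets):
        total += rto                  # wait for this timeout to fire
        rto = min(rto * 2, rto_max)   # exponential backoff with a cap
    return total


# 19 lost packets, as in the trace (pkt #14-#32).
seconds = total_recovery_time(19)
```

Under these assumed values the 19 packets would need on the order of twenty minutes of accumulated timeouts, which fits the report that the client gave up and sent a RST first.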
>> The above behavior looks like F-RTO is in effect. And there seems to
>> be a bug in the TCP's congestion control and retransmission algorithm.
>>
>> Why doesn't the TCP on server (running 2.6.24) enter the slow start?
>> Why should the server take that long to recover from a short period
>> of packet loss?
>>
>> Has anyone else noticed similar problem before? If my analysis was
>> wrong, can anyone give me some pointers to what's really wrong and
>> how to fix it?
Yes, 2.6.24 is an obsoleted version with known wrongs in its FRTO
implementation. Fixes never went to the 2.6.24 stable series as it was
_already_ obsoleted when the problems were reported and found. The
correct fixes may be found in 2.6.25.7 (.7 iirc) and are included from
2.6.26 onward too.

Just in case you happen to run an ubuntu based kernel from that era (of
course you should be reporting the bug here then...), a word of warning:
it seemed nearly impossible for them to get a simple thing like that
fixed. I haven't checked whether they eventually came to some sensible
conclusion on that matter or whether it is still unresolved (or, e.g.,
closed without real resolution).