Thread: 58 messages, 7 authors, 2009-12-09

Re: scp stalls mysteriously

From: Ilpo Järvinen <hidden>
Date: 2009-12-03 10:29:34

I've added Greg to CC to make him aware of this issue early, as it now 
affects 2.6.32 too (it is rather important to get this dealt with quickly 
in stable once we have a tested solution, since TCP is pretty broken with 
the silent deaths this problem seems to cause). ...One possibility would 
be to just queue the tested revert to stable and sort this thing out for 
2.6.33 in net-2.6.

Opinions, Dave? Greg?

Now back to the issue...

You said in the other mail that "All further test are on linus-stable 
tree.", which is contradictory since Linus does not maintain the stable 
trees. Which tree exactly was used for the .9. test, Linus' tree or 
the 2.6.31 stable tree? I suppose the former, since the revert wouldn't 
apply to 2.6.31, but I just want to confirm.


On Thu, 3 Dec 2009, Frederic Leroy wrote:

> On Wed, Dec 02, 2009 at 08:17:44PM +0100, Damian Lukowski wrote:
> > could you please printk retrans_stamp just before the return in 
> > include/net/tcp.h:retransmits_timed_out()?
> > If the value is not monotonically increasing but is reset to 0 at some
> > point, this might lead to problems in tcp_write_timeout().
> > It's the only idea I have now.
> 
> Your idea is good.
> Only one out of 4 value is not null.
> 
> Logs corresponding on http://wwW.starox.org/pub/scp_stall is .10
> 
> I make 2 attempts. Printk corresponding to .10 are those after the line 
> "wlan1 enter promiscuous mode"

Nice thinking indeed Damian, thanks. ...But where exactly did you 
print? ...There are multiple returns, and the return false branch is 
expected to have a zero retrans_stamp in a typical case, but that is not
a problem because we never use the value there.

...Anyway, if I'm wrong in my suspicion and it still holds that we have a 
zero retrans_stamp in the subtraction too, it could have something to do 
with this snippet:

static void tcp_try_to_open(struct sock *sk, int flag)
{
        struct tcp_sock *tp = tcp_sk(sk);

        tcp_verify_left_out(tp);

        if (!tp->frto_counter && tp->retrans_out == 0)
                tp->retrans_stamp = 0;

...It bit me last time when FRTO was enabled after a very small 
modification (without running a full verification after the trivial 
looking modification). ...So I've worked around this clearing for FRTO, 
as you can see :-).
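
To make the concern about a zero retrans_stamp in the subtraction concrete, 
here is a minimal userspace sketch. It assumes the timeout check compares 
(current timestamp - retrans_stamp) against a timeout using unsigned 
arithmetic. elapsed_exceeds() and the tick values mentioned below are 
hypothetical, made up for illustration; this is not the kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of an "elapsed time >= timeout" check.
 * Not kernel API: the name and signature are made up for illustration.
 * The subtraction is unsigned, so it wraps modulo 2^32 the way
 * tick-based timestamps do.
 */
static bool elapsed_exceeds(uint32_t now, uint32_t retrans_stamp,
                            uint32_t timeout)
{
        return (uint32_t)(now - retrans_stamp) >= timeout;
}
```

With now = 500000 and timeout = 30000 ticks (both assumed values), a stamp 
taken 100 ticks ago gives false, while a stamp that has been cleared to 0 
gives true, because the "elapsed" time then equals the whole timestamp. 
If that is what happens, the connection would be declared timed out on 
the very first check.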


Also, we have another mystery to be solved: fast retransmission is 
not triggered for some reason (or alternatively not captured into a 
log), even in the working .9. case. It would be easy to see whether it 
works at all from TCP's point of view by looking into the MIBs once you 
have some transfers in a working configuration:

grep -A1 TCP /proc/net/netstat

...luckily this fast retransmit issue is less crucial as almost all people 
are pretty happy already if their RTO-based recovery works even if the 
fast recovery would not. So figuring it out can be postponed (if one has 
to prioritize) until the silent death issue is out of the way.
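
If picking single counters out of the grep output by eye gets tedious, a 
small helper along these lines could do the lookup. It relies on 
/proc/net/netstat printing each group as a line of names followed by a 
line of values in the same column order. mib_lookup() is a hypothetical 
helper, and the counter names to look for (e.g. TCPFastRetrans, 
TCPTimeouts) should be checked against what your kernel actually prints.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not an existing API.
 * Given a header line and the matching value line (same column order),
 * find the column named 'want' and store its value in *out.
 * Returns 0 on success, -1 if the name is not present.
 */
static int mib_lookup(const char *names, const char *values,
                      const char *want, unsigned long long *out)
{
        size_t wlen = strlen(want);

        for (;;) {
                /* Skip separators, then measure the next token on each line. */
                names += strspn(names, " \n");
                values += strspn(values, " \n");
                size_t nlen = strcspn(names, " \n");
                size_t vlen = strcspn(values, " \n");

                if (nlen == 0 || vlen == 0)
                        return -1;      /* ran out of columns */
                if (nlen == wlen && strncmp(names, want, wlen) == 0) {
                        *out = strtoull(values, NULL, 10);
                        return 0;
                }
                names += nlen;
                values += vlen;
        }
}
```

Feed it two consecutive lines read with fgets() from /proc/net/netstat. 
A fast retransmit counter that stays at zero while the timeout counter 
grows would suggest that only the RTO-based recovery is doing any work.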


-- 
 i.