Re: [TCP bug] stuck distcc connections in latest -git
From: Ingo Molnar <hidden>
Date: 2008-07-22 13:57:59
Also in: lkml
* David Newall [off-list ref] wrote:
> > The hung condition seemed permanent (i waited a couple of minutes).
>
> Not nearly long enough. Retransmits can be sent as infrequently as per 180 seconds. I think there's an argument to use one of the various patches that reduce your TCP_RTO_MAX, for example OBATA Noboru's (http://marc.info/?l=linux-netdev&m=118422471428855): you don't have to wait unreasonably long before seeing a retransmit. Remember, three minutes!
i know, i waited much more than 180 seconds - about 15 minutes. That is more than enough for this LAN connection. It's all on the LAN directly via a single gigabit switch, with no packet drops. I noticed the hung build immediately as it happened.
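To make the "15 minutes is more than enough" point concrete, here is a rough sketch of an exponential-backoff retransmit schedule. It is only an illustration, not the kernel's actual timer logic: the 0.2 second starting RTO is an assumed value for a LAN connection, the 180 second cap is the figure quoted above rather than a value checked against the source, and the function name retransmit_times is made up for this example. It also ignores the point at which the stack gives up on the connection.

```python
def retransmit_times(initial_rto=0.2, rto_max=180.0, window=900.0):
    """Return the cumulative times (in seconds) at which retransmits
    would fire inside `window`, doubling the timeout after each one
    and capping it at `rto_max`.

    All defaults are assumptions for illustration:
      initial_rto - assumed starting timeout on a LAN
      rto_max     - the 180 s cap quoted in this thread
      window      - the 15 minutes that were actually waited
    """
    times = []
    rto = initial_rto
    elapsed = 0.0
    while elapsed + rto <= window:
        elapsed += rto
        times.append(round(elapsed, 1))
        rto = min(rto * 2, rto_max)  # exponential backoff with a cap
    return times


if __name__ == "__main__":
    times = retransmit_times()
    print(len(times), "retransmits would fall inside the 15 minute wait")
    print("last one at about", times[-1], "seconds")
```

Under those assumptions, several retransmit attempts spaced at the full cap would still fit inside a 15 minute wait, so a connection that was merely backing off should have shown some activity in that time.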
> > I retried the same build 10 times and it would not reproduce - so this again is a hard to reproduce condition. (and there's no chance to get a proper tcpdump either, at these traffic levels)
>
> You really should start that capture, and on both client and server. You don't need to dump everything, only traffic to or from server:distcc.
It's not feasible. That box did in excess of 200 GB of network traffic in the past 7 hours alone. ~10 clients are doing make -j200 type of kernel builds to this 16-way buildbox so it is not realistic to tcpdump it - especially given the rarity of this problem. (it has not reoccurred since then) The network is local LAN, gigabit ethernet over a single gigabit switch.

	Ingo