Re: [TCP bug] stuck distcc connections in latest -git
From: Ingo Molnar <hidden>
Date: 2008-07-23 08:28:00
Also in: lkml
* Willy Tarreau [off-list ref] wrote:
> On Tue, Jul 22, 2008 at 05:34:43PM +0200, Ingo Molnar wrote:
> > * David Newall [off-list ref] wrote:
> > > Ingo Molnar wrote:
> > > > * David Newall [off-list ref] wrote:
> > > > > You really should start that capture, and on both client and
> > > > > server. You don't need to dump everything, only traffic to or
> > > > > from server:distcc.
> > > >
> > > > It's not feasible. That box did in excess of 200 GB of network
> > > > traffic in the past 7 hours alone.
> > >
> > > You only need distcc traffic, and perhaps only after it's hung.
> > > With 250k outstanding per socket, are you certain that no traffic
> > > was sent? Is it certain that one packet wasn't being sent each
> > > three minutes? I suppose you're right and the stack really is
> > > stuck, but this is such an easy thing to check and eliminate that
> > > you should do so. I suppose, too, that you should trace the
> > > server-side processes and confirm that they are waiting for socket
> > > input. You should dump tcp (for the distcc port) next time the
> > > problem recurs and also check that the server processes are
> > > waiting for socket input.
> >
> > ok, will do that if it happens again.
>
> Ingo, if it can help, I have a "capture" script which allows you to
> define a size and will rotate captures within that size. That's what
> I'm using to troubleshoot rarely occuring problems in datacenters, so
> it's horrible but efficient :-) You just have to stop it once the
> problem has happened again. Ping me if you're interested (I'm lazy to
> start my laptop right just for it now in fact).
yeah, that would be handy, thanks. Alas, the problem has not reoccurred
since then - more than a thousand kernel builds down the line.

Yesterday it triggered so quickly when i updated the buildbox to the new
kernel, and happened repeatedly when i tried to build a new kernel, that
i didn't assume it was something hard to reproduce - but it went poof
after i restarted distccd on the server.

So i'd suggest we do not count this as a regression: i've got no way at
the moment of reproducing it reliably.

	Ingo