Thread: 39 messages, 4 authors, 2008-11-03

Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd) [SOLVED]

From: Ilpo Järvinen <hidden>
Date: 2008-11-03 15:37:12

On Sun, 2 Nov 2008, Dâniel Fraga wrote:
On Thu, 30 Oct 2008 12:43:05 +0200 (EET)
"Ilpo Järvinen" [off-list ref] wrote:
Perhaps we could try to solve it through stracing syslogd...
	Well Ilpo, you're right, what I'm about to write here will make
me very ashamed, but the truth must be told! The culprit was syslogd!
Almost unbelievable, but I had been using an old syslogd version for
about 5 years!

	How can I be sure that it's syslogd's fault? Simply because I
had a stall today and when I killed syslogd everything was back to
normal.
Once there's any kind of flow control, anything jamming downstream will 
eventually make upstream stall as well (or appear not to work as 
expected; sadly, that's exactly the opposite from a correctness point of 
view, as flow control is a feature in TCP, not a bug :-)). Thus I 
occasionally run into these "TCP flow control not working" reports which 
turn out to be totally unrelated.

This still doesn't explain everything though, afaik... E.g., why the 
sendto() to a SOCK_DGRAM socket hung.
	But no problem. I'll just wait a few more days to test if
syslogd is the only one responsible for this, but I'm 90% sure it is.
And you had the same old syslogd on both hosts?

In any case, the deterministic loss of every other character sounds 
like a real bug in syslogd, since it doesn't make much sense for that to 
happen in kernel->syslogd communication (where I'd expect it not to show 
up in such a consistent pattern but to cause more randomness).
	I apologize for thinking that it was a kernel fault.
It's not clear what caused this to happen _now_, nor the exact mechanism.
	Ps: just out of curiosity, I was using a syslogd binary from
Mar 3, 2003! Extremely old! This is so old, it was compiled for Linux
2.2.5. Or maybe I was too lazy and copied it from another machine...
In theory this shouldn't be too big a problem, but I'm hardly an expert on 
those things, and syslogd is anyway more tightly coupled to the kernel than 
some random app.
	Ps3: anyway, it's interesting how a small piece of the system
(syslogd) can generate those kinds of problems... I mean, a simple
error on syslogd could lead to a complete stall on connections, just
because everything is waiting for it to log through /dev/log.
This is more of a philosophical question than anything else... it's 
always a balance between data loss (= possibly losing a logline of an 
important event) and the possibility of a stall. But this shouldn't be a 
concern in the case where SOCK_DGRAM was used by sudo (like in the 
strace you sent to the sudo people). In general UDP doesn't guarantee 
reliability, so not delivering wouldn't be a problem, but I don't know if 
the PF_FILE domain does something different there.
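
The open question above (whether a PF_FILE, i.e. AF_UNIX, SOCK_DGRAM socket drops or stalls) can be probed with a small experiment. The following is a minimal sketch, assuming Linux and Python 3; the socket path is a hypothetical stand-in for /dev/log, and the function name is made up for illustration. It does not reproduce the reported stall; it only shows that on Linux a local datagram sender is refused (EAGAIN when non-blocking, so a blocking sendto() would wait) once the receiver stops reading, instead of the datagram being silently discarded.

```python
import os
import socket
import tempfile


def fill_until_blocked(max_attempts=100_000):
    """Send datagrams to a bound AF_UNIX socket that nobody reads from.

    Returns (sent, resumed): how many datagrams were accepted before the
    kernel refused more, and whether sending worked again once the
    receiver consumed a single datagram.
    """
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "log")  # hypothetical stand-in for /dev/log

    receiver = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    receiver.bind(path)  # plays the stuck daemon: bound, but never reads

    sender = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sender.setblocking(False)  # get EAGAIN instead of hanging in sendto()

    sent = 0
    blocked = False
    resumed = False
    try:
        for _ in range(max_attempts):
            try:
                sender.sendto(b"x" * 64, path)
                sent += 1
            except BlockingIOError:
                blocked = True  # queue is full; a blocking sender would stall here
                break
        if blocked:
            receiver.recv(1024)  # the "daemon" finally reads one message
            try:
                sender.sendto(b"x" * 64, path)
                resumed = True
            except BlockingIOError:
                resumed = False
    finally:
        sender.close()
        receiver.close()
        os.unlink(path)
        os.rmdir(tmpdir)
    return sent, resumed


if __name__ == "__main__":
    print(fill_until_blocked())
```

The number of datagrams accepted before the sender is refused depends on kernel settings, so the sketch reports it rather than assuming a value.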
Of course
the problem was the binary, but it could have a timeout, so that even
if it was in fact a buggy syslogd, it wouldn't cause such a stall on the
system. I really don't know what changed from 2.6.24 to 2.6.25, but
maybe 2.6.24 had such a timeout? Maybe I'm just silly writing that...
you guys know much more than me.
Until we know more than the fact that killing syslogd helped, it's hard to 
tell what the actual cause is. And I have no clue about the semantics of 
/dev/log anyway.
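
The timeout idea raised above can also be sketched from the client side. This is only an illustration under assumptions, not what syslogd, sudo, or any kernel version actually does: the helper names `send_or_drop` and `demo` and the 0.1 s timeout are invented, and Python stands in for the C code a real logging client would use. It shows the trade-off in concrete terms: with a send timeout the client gives up and loses the logline instead of stalling behind a daemon that no longer reads.

```python
import os
import socket
import tempfile


def send_or_drop(sock, message):
    """Try to log one message; return False if it was dropped on timeout."""
    try:
        sock.send(message)
        return True
    except socket.timeout:
        return False  # daemon did not make room in time: the logline is lost


def demo(timeout=0.1, max_attempts=100_000):
    """Log to a stuck receiver until the first message is dropped.

    Returns (delivered, dropped).
    """
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "log")  # hypothetical stand-in for /dev/log

    daemon = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    daemon.bind(path)  # bound, but never reads: simulates a hung daemon

    client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    client.settimeout(timeout)  # bound the time spent waiting in send()
    client.connect(path)

    delivered = 0
    dropped = 0
    try:
        for _ in range(max_attempts):
            if send_or_drop(client, b"logline"):
                delivered += 1
            else:
                dropped += 1
                break  # one drop is enough to show the behaviour
    finally:
        client.close()
        daemon.close()
        os.unlink(path)
        os.rmdir(tmpdir)
    return delivered, dropped


if __name__ == "__main__":
    print(demo())
```

Whether losing the message is acceptable is exactly the data-loss-versus-stall balance discussed earlier in this reply.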
	Ps4: maybe now we can understand why nmap solved the issue...
Not very clear but at least sudo does some writing there too.

-- 
 i.