Re: bonding: time limits too tight in bond_ab_arp_inspect

From: Chris Friesen <hidden>
Date: 2012-08-22 19:00:14

On 08/22/2012 12:42 PM, Jay Vosburgh wrote:

Chris Friesen wrote:

On 08/22/2012 11:45 AM, Jiri Bohac wrote:

This code is run from bond_activebackup_arp_mon() about
delta_in_ticks jiffies after the previous ARP probe has been
sent. If the delayed work gets executed exactly in delta_in_ticks
jiffies, there is a chance the slave will be brought up.  If the
delayed work runs one jiffy later, the slave will stay down.

	Presumably the ARP reply is coming back in less than one jiffy,
then, so the slave_last_rx() value is the same jiffy as when the
_inspect was previously called?

<snip>

Should they perhaps all be increased by, say, delta_in_ticks/2, to make this
less dependent on the current scheduling latencies?

We have been using a patch that tracks the arpmon requested sleep time vs
the actual sleep time and adds any scheduling latency to the allowed
delta.  That way if we sleep too long due to scheduling latency it doesn't
affect the calculation.

	How much scheduling latency do you see?

	Is that really better than just permitting a bit more slack in
the timing window?

We hit enough latency that it triggered arpmon to falsely mark multiple
links as lost.  This triggered our system maintenance code to go into an
"oh no we can't talk to the outside world" scenario, which does fairly
intrusive things to try to bring connectivity back up.  Basically a bad
thing to happen just because of a random scheduler latency spike.

I should note that we added this some time back and are still running
older kernels, so I have no idea what latency is like on modern kernels.
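
To make the idea concrete, here is a rough userspace sketch of the
compensation described above: measure how much longer than requested the
monitor actually slept, and widen the allowed delta by that amount.  This
is a simplified model, not the bonding driver code and not the patch
itself; the helper names (tick_before_eq, sched_latency, rx_within_window)
are made up for illustration, and plain unsigned long counters stand in
for jiffies.

```c
#include <stdbool.h>

/* Wrap-safe "a is at or before b" on tick counters, in the spirit of
 * the kernel's time_before_eq() helper. */
static bool tick_before_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) <= 0;
}

/* Hypothetical helper: ticks slept beyond the requested interval.
 * requested_at is when the delayed work was queued, now is when it
 * actually ran.  Returns 0 if it ran on time. */
static unsigned long sched_latency(unsigned long requested_at,
				   unsigned long delta_in_ticks,
				   unsigned long now)
{
	unsigned long expected = requested_at + delta_in_ticks;

	return tick_before_eq(now, expected) ? 0 : now - expected;
}

/* Hypothetical helper: true if the last receive time is still within
 * "allowed" ticks of now. */
static bool rx_within_window(unsigned long last_rx, unsigned long now,
			     unsigned long allowed)
{
	return tick_before_eq(now, last_rx + allowed);
}
```

With delta_in_ticks = 100 and a reply received in the same tick the probe
was sent (tick 1000), rx_within_window(1000, 1100, 100) passes but
rx_within_window(1000, 1101, 100) fails, which is the one-jiffy
sensitivity Jiri describes.  Passing 100 + sched_latency(1000, 100, 1101)
as the allowed value makes the late run pass again.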

Chris
