Re: [Question] neighbor entry doesn't switch to the STALE state after the reachable timer expires
From: Julian Anastasov <ja@ssi.bg>
Date: 2023-01-29 19:44:21
Also in: lkml
Hello,

On Sun, 29 Jan 2023, Zhang Changzhong wrote:

Hi,

We got the following weird neighbor cache entry on a machine that's been running for over a year:

172.16.1.18 dev bond0 lladdr 0a:0e:0f:01:12:01 ref 1 used 350521/15994171/350520 probes 4 REACHABLE
The confirmed time (15994171) is either 13 days in the future or, more likely, 185 days behind (very outdated). Anything above 99 days is invalid.
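The two readings of the confirmed value can be checked with a little arithmetic. The sketch below assumes the reported age is a 32-bit tick delta converted to seconds, with CONFIG_HZ=250 as mentioned later in this thread. The helper names are hypothetical and exist only for illustration.

```c
/* Seconds covered by one full wrap of a 32-bit tick counter at the given HZ. */
static double wrap_period_seconds(unsigned int hz)
{
    return 4294967296.0 / hz; /* 2^32 ticks */
}

/* Reading 1: the timestamp really is age_s seconds in the past. */
static double days_behind(double age_s)
{
    return age_s / 86400.0;
}

/* Reading 2: the age is a wrapped delta, so the timestamp lies in the future. */
static double days_ahead_if_wrapped(double age_s, unsigned int hz)
{
    return (wrap_period_seconds(hz) - age_s) / 86400.0;
}
```

Under these assumptions, days_behind(15994171) is about 185.1 and days_ahead_if_wrapped(15994171, 250) is about 13.7, which matches the two figures quoted above.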
350520 seconds have elapsed since this entry was last updated, but it is still in the REACHABLE
state (base_reachable_time_ms is 30000), preventing lladdr from being updated through probe.
After some analysis, we found a scenario that may cause such a neighbor entry:
          Entry used          DELAY_PROBE_TIME expired
NUD_STALE ------------> NUD_DELAY ------------------------> NUD_PROBE
                            |
                            | DELAY_PROBE_TIME not expired
                            v
                      NUD_REACHABLE
The neigh_timer_handler() uses time_before_eq() to compare 'now' with 'neigh->confirmed +
NEIGH_VAR(neigh->parms, DELAY_PROBE_TIME)', but time_before_eq() only works if delta < ULONG_MAX/2.
This means that if an entry stays in the NUD_STALE state for more than ULONG_MAX/2 ticks, it enters
the NUD_REACHABLE state directly when it is used again and cannot be switched back to the NUD_STALE
state (the timer is set too far in the future).
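A minimal user-space sketch of the wraparound described above, assuming 32-bit jiffies. time_before_eq32() restates the kernel's signed-difference comparison with fixed-width types, and delay_looks_confirmed() is a hypothetical helper (not a kernel function) that mirrors the check in the NUD_DELAY branch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * 32-bit stand-in for time_before_eq(a, b): true when a is at or before b.
 * The unsigned difference is reinterpreted as signed, so the result is only
 * meaningful while the two timestamps are less than 2^31 ticks apart.
 * (The unsigned-to-signed conversion is implementation-defined in older C
 * standards; common compilers wrap as expected.)
 */
static bool time_before_eq32(uint32_t a, uint32_t b)
{
    return (int32_t)(uint32_t)(b - a) >= 0;
}

/* Hypothetical helper: does 'now' still fall inside the delay window? */
static bool delay_looks_confirmed(uint32_t now, uint32_t confirmed,
                                  uint32_t delay_probe_time)
{
    return time_before_eq32(now, confirmed + delay_probe_time);
}
```

Assuming HZ=250 and a 5-second delay (1250 ticks), an entry confirmed at tick 0 fails the check at tick 2000, as expected, but passes it again at tick 0x80000000 + 2000, even though roughly 99 days have elapsed since the confirmation.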
On 64-bit machines, ULONG_MAX/2 ticks are an extremely long time, but in my case (32-bit machine and
kernel compiled with CONFIG_HZ=250), ULONG_MAX/2 ticks are about 99.42 days, which is possible in
reality.
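The 99.42-day figure follows directly from the tick rate. A short sketch of the arithmetic, assuming a 32-bit unsigned long (half_range_days() is a hypothetical name used only here):

```c
/* Days until a 32-bit tick delta exceeds ULONG_MAX/2 at the given HZ. */
static double half_range_days(unsigned int hz)
{
    const double half_range_ticks = 2147483647.0; /* ULONG_MAX/2 on 32-bit */

    return half_range_ticks / hz / 86400.0;       /* ticks -> seconds -> days */
}
```

With hz = 250 this gives about 99.42 days. A higher tick rate shrinks the window further: hz = 1000 gives about 24.9 days.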
Does anyone have a good idea to solve this problem? Or are there other scenarios that might cause
such a neighbor entry?

Is the neigh entry modified somehow, for example, with 'arp -s' or 'ip neigh change'? Or is bond0 reconfigured after initial setup? I mean, 4 days ago?

Looking at __neigh_update, there are a few cases that can assign NUD_STALE without touching neigh->confirmed: the 'lladdr = neigh->ha' assignment should be reached and NEIGH_UPDATE_F_ADMIN should be provided.

Later, as you explain, it can wrongly switch to the NUD_REACHABLE state for a long time. Maybe there should be some measures to keep neigh->confirmed valid during admin modifications.

What is the kernel version?

Regards

--
Julian Anastasov