Re: [Question] neighbor entry doesn't switch to the STALE state after the reachable timer expires
From: Zhang Changzhong <hidden>
Date: 2023-01-30 03:19:58
Also in: lkml
On 2023/1/30 3:43, Julian Anastasov wrote:
> Hello,
>
> On Sun, 29 Jan 2023, Zhang Changzhong wrote:
>
>> Hi,
>>
>> We got the following weird neighbor cache entry on a machine that's been
>> running for over a year:
>>
>> 172.16.1.18 dev bond0 lladdr 0a:0e:0f:01:12:01 ref 1 used 350521/15994171/350520 probes 4 REACHABLE
>
> The confirmed time (15994171) is 13 days in the future, or more likely
> 185 days behind (very outdated); anything above 99 days is invalid.
>
>> 350520 seconds have elapsed since this entry was last updated, but it is
>> still in the REACHABLE state (base_reachable_time_ms is 30000), preventing
>> lladdr from being updated through probe.
>>
>> After some analysis, we found a scenario that may cause such a neighbor
>> entry:
>>
>>            Entry used             DELAY_PROBE_TIME expired
>> NUD_STALE ------------> NUD_DELAY ------------------------> NUD_PROBE
>>                             |
>>                             | DELAY_PROBE_TIME not expired
>>                             v
>>                       NUD_REACHABLE
>>
>> neigh_timer_handler() uses time_before_eq() to compare 'now' with
>> 'neigh->confirmed + NEIGH_VAR(neigh->parms, DELAY_PROBE_TIME)', but
>> time_before_eq() only works if delta < ULONG_MAX/2.
>>
>> This means that if an entry stays in the NUD_STALE state for more than
>> ULONG_MAX/2 ticks, it enters the NUD_REACHABLE state directly when it is
>> used again and cannot be switched back to the NUD_STALE state (the timer
>> is set too long).
>>
>> On 64-bit machines, ULONG_MAX/2 ticks is an extremely long time, but in
>> my case (32-bit machine and kernel compiled with CONFIG_HZ=250),
>> ULONG_MAX/2 ticks is about 99.42 days, which is possible in reality.
>>
>> Does anyone have a good idea to solve this problem? Or are there other
>> scenarios that might cause such a neighbor entry?
>
> Is the neigh entry modified somehow, for example, with 'arp -s' or
> 'ip neigh change'? Or is bond0 reconfigured after the initial setup?
> I mean, 4 days ago?
So far, we haven't found any user-space program that modifies the neigh entry or bond0. In fact, the neigh entry has rarely been used since initialization. 4 days ago, our machine just needed to download files from 172.16.1.18. However, the lladdr had changed, and the neigh entry wrongly switched to the NUD_REACHABLE state, so the lladdr could not be updated.
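To illustrate why the entry goes straight from NUD_DELAY to NUD_REACHABLE, here is a minimal user-space sketch of the comparison, assuming 32-bit jiffies, HZ=250 and a DELAY_PROBE_TIME of 5 seconds (assumed to be the default). time_before_eq32() and looks_confirmed() are hypothetical helpers that only model the time_before_eq() check in neigh_timer_handler(); they are not kernel code.

```c
#include <stdint.h>

#define HZ 250u                      /* CONFIG_HZ from the report */
#define DELAY_PROBE_TIME (5u * HZ)   /* assumption: 5 seconds, in ticks */

/*
 * Models time_before_eq(a, b) for a 32-bit unsigned long: the unsigned
 * difference is reinterpreted as signed, so the result is only meaningful
 * while a and b are less than 2^31 ticks apart. The cast relies on the
 * usual two's-complement conversion.
 */
static int time_before_eq32(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) >= 0;
}

/*
 * Returns 1 if the NUD_DELAY branch would consider the entry recently
 * confirmed (and move it to NUD_REACHABLE), 0 if it would go on to probe.
 */
static int looks_confirmed(uint32_t now, uint32_t confirmed)
{
    return time_before_eq32(now, confirmed + DELAY_PROBE_TIME);
}
```

With confirmed = 1000 and one day equal to 86400 * HZ ticks, looks_confirmed() returns 0 when 'now' is 10 seconds or 98 days past 'confirmed', but returns 1 when 'now' is 100 days past it: the delta has crossed 2^31 ticks, so the stale timestamp appears to lie in the future.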
> Looking at __neigh_update, there are few cases that can assign NUD_STALE
> without touching neigh->confirmed: lladdr = neigh->ha should be called,
> NEIGH_UPDATE_F_ADMIN should be provided. Later, as you explain, it can
> wrongly switch to the NUD_REACHABLE state for a long time. Maybe there
> should be some measures to keep neigh->confirmed valid during admin
> modifications.
This problem can also occur if the neigh entry stays in NUD_STALE state for more than 99 days, even if it is not modified by the administrator.
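The figures in this thread can be checked with a short calculation. The sketch below assumes 32-bit jiffies and HZ=250 as reported above, and assumes the second number in 'used 350521/15994171/350520' is the confirmed age in seconds; the helper names are hypothetical.

```c
#include <stdint.h>

#define HZ 250.0   /* CONFIG_HZ from the report */

/* Converts a duration in seconds to days. */
static double seconds_to_days(double seconds)
{
    return seconds / 86400.0;
}

/* Half of the 32-bit jiffies range in days: the time_before_eq() limit. */
static double half_range_days(void)
{
    return seconds_to_days((double)(UINT32_MAX / 2) / HZ);
}

/* Seconds until a 32-bit jiffies counter wraps completely (2^32 ticks). */
static double wrap_period_seconds(void)
{
    return 4294967296.0 / HZ;
}

/*
 * If an age in seconds is really a wrapped difference, this is how far
 * "in the future" the same timestamp would appear, in days.
 */
static double future_alias_days(double age_seconds)
{
    return seconds_to_days(wrap_period_seconds() - age_seconds);
}
```

Under these assumptions, half_range_days() is about 99.42, seconds_to_days(15994171) is about 185.1, and future_alias_days(15994171) is about 13.7. This matches the 99-day limit and the "185 days behind" versus "13 days in the future" readings quoted above.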
> What is the kernel version?
We encountered this problem in 4.4 LTS, and mainline doesn't seem to have fixed it yet.

Regards,
Changzhong Zhang