Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
From: ѽ҉ᶬḳ℠ <hidden>
Date: 2020-01-10 16:53:15
On 10/01/2020 16:32, Russell King - ARM Linux admin wrote:
On Fri, Jan 10, 2020 at 03:45:15PM +0000, ѽ҉ᶬḳ℠ wrote:quoted
On 10/01/2020 15:09, Russell King - ARM Linux admin wrote:quoted
On Fri, Jan 10, 2020 at 03:02:51PM +0000, ѽ҉ᶬḳ℠ wrote:quoted
On 10/01/2020 12:53, Russell King - ARM Linux admin wrote:quoted
quoted
quoted
Which is also indicating everything is correct. When the problem occurs, check the state of the signals again as close as possible to the event - it depends how long the transceiver keeps it asserted. You will probably find tx-fault is indicating "in hi IRQ".just discovered userland - gpioinfo pca9538 - which seems more verbose gpiochip2 - 8 lines: line 0: unnamed "tx-fault" input active-high [used] line 1: unnamed "tx-disable" output active-high [used] line 2: unnamed "rate-select0" input active-high [used] line 3: unnamed "los" input active-high [used] line 4: unnamed "mod-def0" input active-low [used] line 5: unnamed unused input active-high line 6: unnamed unused input active-high line 7: unnamed unused input active-high The above is depicting the current state with the module working, i.e. being online. Will do some testing and report back, not sure yet how to keep a close watch relating to the failure events.However, that doesn't give the current levels of the inputs, so it's useless for the purpose I've asked for.Fair enough. Operational (online) state gpiochip2: GPIOs 504-511, parent: i2c/8-0071, pca9538, can sleep: gpio-504 ( |tx-fault ) in lo IRQ gpio-505 ( |tx-disable ) out lo gpio-506 ( |rate-select0 ) in lo gpio-507 ( |los ) in lo IRQ gpio-508 ( |mod-def0 ) in lo IRQ And the same remained (unchanged) during/after the events (as closely I was able to monitor) -> module transmit fault indicatedTry: while ! grep -A4 'tx-fault.*in hi' /sys/kernel/debug/gpio; do :; done which may have a better chance of catching it.Suppose you are not interested in what happens with ifdown wan, so just for posterity gpio-504 ( |tx-fault ) in hi IRQ gpio-505 ( |tx-disable ) out hi gpio-506 ( |rate-select0 ) in lo gpio-507 ( |los ) in lo IRQ gpio-508 ( |mod-def0 ) in lo IRQRight, because the state of TX_FAULT is not defined while TX_DISABLE is high.quoted
When the iif is brought up again and happens to trigger a transmit fault the hi is not being triggered however. And it did not try 5 times to recover from the fault, unless dmesg missed some [Fri Jan 10 15:30:57 2020] mvneta f1034000.ethernet eth2: Link is Down [Fri Jan 10 15:30:57 2020] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not readyHere is where you brought the interface down.quoted
[Fri Jan 10 15:31:13 2020] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link modeHere is where you brought the interface back up.quoted
[Fri Jan 10 15:31:13 2020] sfp sfp: module transmit fault indicatedThe module failed to deassert TX_FAULT within the 300ms window.quoted
[Fri Jan 10 15:31:15 2020] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control off [Fri Jan 10 15:31:16 2020] sfp sfp: module transmit fault recoveredI'm not sure why these two messages are this way round; they should be the other way (I guess that's something to do with printk() no longer being synchronous.) Given that it seems to have taken two seconds to recover, that will be two reset attempts.quoted
[Fri Jan 10 15:31:16 2020] mvneta f1034000.ethernet eth2: Link is Down [Fri Jan 10 15:31:16 2020] sfp sfp: module transmit fault indicatedThe module re-asserts TX_FAULT...quoted
[Fri Jan 10 15:31:19 2020] sfp sfp: module persistently indicates fault, disablingAfter another three attempts, the module is still asserting TX_FAULT. We don't report every attempt made to recover the fault; that would flood the log. Instead, we report when the fault occurred, then try to reset the fault every second for up to five attempts _total_. If we exhaust the attempts, you get "module persistently indicates fault, disabling". If successfully recovered, you'll get "module transmit fault recovered". We don't reset the retries after a successful recovery as that would mean a transitory safety fault occurring once every few seconds would endlessly fill the kernel log, and also may go unnoticed. If it's spitting out safety faults like that, the module would be faulty.
Seems that the debug avenue has been exhausted, short of running SFP.C in debug mode. There is still no explanation why the module passes the 300 ms deassert TX_FAULT test most of the time but fails intermittently at other times, being kind of incoherent. Maybe it is just wishful thinking but it seems a bit far-fetched that the module is really causing this, least the readings from GPIO do not provide any such indicator. Could there be something choking / blocking the communication channel between the module and the kernel, some kernel code getting stuck / leaked in memory? Could the ipupdown routine, which has its own implementation in OpenWrt, be an interfering agent, e.g. the way it constructs or tears down the iif, though I do not see how?