Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
From: Lijun Pan <hidden>
Date: 2021-04-22 05:07:33
Also in:
netdev
On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu [off-list ref] wrote:
Lijun Pan [ljp@linux.vnet.ibm.com] wrote:quoted
quoted
On Apr 20, 2021, at 4:35 PM, Dany Madden [off-list ref] wrote: When ibmvnic gets a FATAL error message from the vnicserver, it marks the Command Respond Queue (CRQ) inactive and resets the adapter. If this FATAL reset fails and a transmission timeout reset follows, the CRQ is still inactive, ibmvnic's attempt to set link down will also fail. If ibmvnic abandons the reset because of this failed set link down and this is the last reset in the workqueue, then this adapter will be left in an inoperable state. Instead, make the driver ignore this link down failure and continue to free and re-register CRQ so that the adapter has an opportunity to recover.This v2 does not adddress the concerns mentioned in v1. And I think it is better to exit with error from do_reset, and schedule a thorough do_hard_reset if the the adapter is already in unstable state.We had a FATAL error and when handling it, we failed to send a link-down message to the VIOS. So what we need to try next is to reset the connection with the VIOS. For this we must talk to the firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset() does just that in ibmvnic_reset_crq(). Now, sure we can attempt a "thorough hard reset" which also does the same hcalls to reestablish the connection. Is there any other magic in do_hard_reset()? But in addition, it also frees lot more Linux kernel buffers and reallocates them for instance.
Working around everything in do_reset will make the code very difficult to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset can be removed completely or merged into do_reset.
If we are having a communication problem with the VIOS, what is the point of freeing and reallocating Linux kernel buffers? Beside being inefficient, it would expose us to even more errors during reset under heavy workloads?
No real customer runs the system under that heavy load created by HTX stress test, which can tear down any working system.
From what I understand so far, do_reset() is complicated because it is attempting some optimizations. If we are going to fall back to hard reset for every error we might as well drop the do_reset() and just do the "thorough hard reset" every time right?
I think such optimizations are catered for passing HTX tests. Whether the optimization benefits the adapter, say making the adapter more stable, I doubt it. I think there should be a trade off between optimization and stability.