Re: [PATCH 0/2] nvme-fabrics: short-circuit connect retries | linux-nvme

On 2021/6/27 21:39, James Smart wrote:
On 6/26/2021 5:09 AM, Hannes Reinecke wrote:
On 6/26/21 3:03 AM, Chao Leng wrote:

On 2021/6/24 16:10, Hannes Reinecke wrote:
On 6/24/21 9:29 AM, Chao Leng wrote:

On 2021/6/24 13:51, Hannes Reinecke wrote:
On 6/23/21 11:38 PM, Sagi Grimberg wrote:

Hi all,

commit f25f8ef70ce2 ("nvme-fc: short-circuit reconnect retries")
allowed the fc transport to honour the DNR bit during reconnect
retries, which speeds up error recovery.
How does this speed up error recovery?
Well, not exactly error recovery (as there is nothing to recover).
But we won't attempt pointless retries, thereby reducing the noise in
the message log.
This conflicts with the tcp and rdma targets.
You may need to delete the improper NVME_SC_DNR at the target.
However, this will cause compatibility issues between different versions.
Which ones?
In many scenarios, the destination sets DNR for abnormal packets,
but each new connection may not have the same error.
This patch series is only for the DNR bit set in response to the 'connect' command.
If the target is not able to process the 'connect' command, but may be able to in the future, it really should not set the DNR bit.

I checked the DNR usage in the target code, and they seem to set it
correctly (i.e. the result would not change when the command is retried).
With the possible exception of ENOSPC handling, as this is arguably
dynamic and might change with a retry.
The DNR status of the old connection may not be relevant to the re-established connection.
See above.
We are just checking the DNR settings for the 'connect' command (or any other commands being sent during initial controller configuration).
If that fails, the connection was never properly initialized; if the controller would return a different status after reconnect, it simply should not set the DNR bit ...
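
The check described above could be sketched roughly as follows. This is only an illustration, not the code from the patch series: `sketch_should_reconnect()` and `SKETCH_SC_DNR` are hypothetical names, and the 0x4000 value is assumed to match `NVME_SC_DNR` in include/linux/nvme.h (status field with the phase bit already shifted out). It also assumes the convention that a positive status is an NVMe status from the 'connect' command and a negative one is a transport errno.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: same bit as NVME_SC_DNR in include/linux/nvme.h. */
#define SKETCH_SC_DNR 0x4000u

/*
 * Hypothetical helper, not a kernel function.
 *
 * status == 0: connect succeeded, nothing to retry
 * status  > 0: NVMe status returned for the 'connect' command
 * status  < 0: transport-level errno (e.g. link down); treated as retryable
 */
bool sketch_should_reconnect(int status, unsigned int attempts,
                             unsigned int max_attempts)
{
    if (status == 0)
        return false;

    /* Target said "do not retry": a new association is expected to fail too. */
    if (status > 0 && ((uint32_t)status & SKETCH_SC_DNR))
        return false;

    /* Otherwise keep retrying until the retry budget is used up. */
    return attempts < max_attempts;
}
```

With this, a connect failure carrying the DNR bit stops the reconnect loop immediately, while a status without DNR or a transport error still consumes the normal retry budget.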

Cheers,

Hannes
Agreed. Since the 1.3 spec says: "If set to ‘1’, indicates that if the same command is re-submitted to any controller in the NVM subsystem, then that re-submitted command is expected to fail."
The target side does not strictly follow this definition from the protocol.
In nvme/target, DNR is set to 1 for many external errors.
For example, an abnormal fabrics command.
Thus, if the initial connect fails in this manner, any new association will be on a different controller, and connect on that controller is now expected to fail too. So why continue to connect when each attempt is expected to fail?
Agree.
We should not attempt to re-establish the connection if the target cannot work due to an internal target error.
However, the target currently does not behave exactly like this, so there are conflicts and compatibility issues.
-- james
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
