Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs | linux-nvme

On 2021/3/17 7:51, Sagi Grimberg wrote:

These patches on their own are correct because they fixed a controller reset
regression.

When we reset/teardown a controller, we must freeze and quiesce the namespaces'
request queues to make sure that we safely stop inflight I/O submissions.
Freeze is mandatory because if our hctx map changed between reconnects,
blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and
if it still has pending submissions (that are still quiesced) it will hang.
This is what the above patches fixed.

However, by freezing the namespaces' request queues, and only unfreezing them
when we successfully reconnect, inflight submissions that are running
concurrently can now block while holding the nshead srcu until either we
successfully reconnect or ctrl_loss_tmo expires (or the user explicitly
disconnects).

This caused a deadlock [1] when a different controller (different path on the
same subsystem) became live (i.e. optimized/non-optimized). This is because
nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O
in order to make sure that current_path is visible to future (re)submissions.
However the srcu lock is taken by a blocked submission on a frozen request
queue, and we have a deadlock.

For multipath, we obviously cannot allow that as we want to failover I/O asap.
However for non-mpath, we do not want to fail I/O (at least until controller
FASTFAIL expires, and that is disabled by default btw).

This creates a non-symmetric behavior of how the driver should behave in the
presence or absence of multipath.

[1]:
Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
Call Trace:
   __schedule+0x293/0x730
   schedule+0x33/0xa0
   schedule_timeout+0x1d3/0x2f0
   wait_for_completion+0xba/0x140
   __synchronize_srcu.part.21+0x91/0xc0
   synchronize_srcu_expedited+0x27/0x30
   synchronize_srcu+0xce/0xe0
   nvme_mpath_set_live+0x64/0x130 [nvme_core]
   nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
   nvme_update_ana_state+0xcd/0xe0 [nvme_core]
   nvme_parse_ana_log+0xa1/0x180 [nvme_core]
   nvme_read_ana_log+0x76/0x100 [nvme_core]
   nvme_mpath_init+0x122/0x180 [nvme_core]
   nvme_init_identify+0x80e/0xe20 [nvme_core]
   nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
   nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]

In order to fix this, we recognize the different behavior a driver needs to take
in error recovery scenarios for mpath and non-mpath scenarios and expose this
awareness with a new helper nvme_ctrl_is_mpath and use that to know what needs
to be done.

Christoph, Keith,

Any thoughts on this? The RFC part is getting the transport driver to
behave differently for mpath vs. non-mpath.

Would it work if nvme mpath used the request NOWAIT flag for its submit_bio()
call, and added the bio to the requeue_list if blk_queue_enter() fails? I
think that looks like another way to resolve the deadlock, but we need
the block layer to return a failed status to the original caller.

But who would kick the requeue list? And that would make near-tag-exhaust
performance stink...

Moving nvme_start_freeze from nvme_rdma_teardown_io_queues to
nvme_rdma_configure_io_queues can fix it. It also avoids I/O hanging for a
long time if reconnection fails.
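
To make the ordering argument concrete, here is a small user-space toy model
of the two freeze placements. It is only a sketch: struct toy_queue, enum
freeze_point and the toy_* functions are hypothetical stand-ins invented for
illustration, not the kernel's nvme or blk-mq API, and the model reduces
"frozen" and "quiesced" to two booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical stand-ins, not the kernel nvme/blk-mq API. */
struct toy_queue {
	bool frozen;   /* new submitters would block entering the queue */
	bool quiesced; /* dispatch to the transport is stopped */
};

enum freeze_point { FREEZE_AT_TEARDOWN, FREEZE_AT_CONFIGURE };

/* Models the teardown path: always quiesce, optionally start the freeze. */
static void toy_teardown(struct toy_queue *q, enum freeze_point fp)
{
	if (fp == FREEZE_AT_TEARDOWN)
		q->frozen = true;
	q->quiesced = true;
}

/* Models a successful reconnect: freeze (if not already), update, unfreeze. */
static void toy_configure(struct toy_queue *q, enum freeze_point fp)
{
	assert(q->quiesced); /* configure only runs after teardown */
	if (fp == FREEZE_AT_CONFIGURE)
		q->frozen = true;
	q->quiesced = false; /* let pending requests drain */
	assert(q->frozen);   /* the hw queue update needs a frozen queue */
	q->frozen = false;   /* unfreeze once the update is done */
}

/* Would a new submitter block while the reconnect is still pending? */
static bool toy_blocks_while_reconnecting(enum freeze_point fp)
{
	struct toy_queue q = { .frozen = false, .quiesced = false };

	toy_teardown(&q, fp);
	return q.frozen;
}
```

In this model, freezing at teardown leaves the queue frozen for the whole
reconnect window, which is where a submitter could block while holding the
nshead srcu. Freezing at configure time keeps the frozen interval confined to
the successful reconnect path.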
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme