Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
From: Chao Leng <hidden>
Date: 2021-03-17 02:56:30
On 2021/3/17 7:51, Sagi Grimberg wrote:
These patches on their own are correct because they fixed a controller reset regression.

When we reset/teardown a controller, we must freeze and quiesce the namespaces request queues to make sure that we safely stop inflight I/O submissions. Freeze is mandatory because if our hctx map changed between reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and if it still has pending submissions (that are still quiesced) it will hang. This is what the above patches fixed.

However, by freezing the namespaces request queues, and only unfreezing them when we successfully reconnect, inflight submissions that are running concurrently can now block grabbing the nshead srcu until either we successfully reconnect or ctrl_loss_tmo expires (or the user explicitly disconnects).

This caused a deadlock [1] when a different controller (different path on the same subsystem) became live (i.e. optimized/non-optimized). This is because nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O in order to make sure that current_path is visible to future (re)submissions. However the srcu lock is taken by a blocked submission on a frozen request queue, and we have a deadlock.

For multipath, we obviously cannot allow that as we want to failover I/O asap. However for non-mpath, we do not want to fail I/O (at least until controller FASTFAIL expires, and that is disabled by default btw).

This creates a non-symmetric behavior of how the driver should behave in the presence or absence of multipath.
[1]:
Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
Call Trace:
 __schedule+0x293/0x730
 schedule+0x33/0xa0
 schedule_timeout+0x1d3/0x2f0
 wait_for_completion+0xba/0x140
 __synchronize_srcu.part.21+0x91/0xc0
 synchronize_srcu_expedited+0x27/0x30
 synchronize_srcu+0xce/0xe0
 nvme_mpath_set_live+0x64/0x130 [nvme_core]
 nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
 nvme_update_ana_state+0xcd/0xe0 [nvme_core]
 nvme_parse_ana_log+0xa1/0x180 [nvme_core]
 nvme_read_ana_log+0x76/0x100 [nvme_core]
 nvme_mpath_init+0x122/0x180 [nvme_core]
 nvme_init_identify+0x80e/0xe20 [nvme_core]
 nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
 nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]

In order to fix this, we recognize the different behavior a driver needs to take in error recovery scenarios for mpath and non-mpath scenarios and expose this awareness with a new helper nvme_ctrl_is_mpath and use that to know what needs to be done.

Christoph, Keith, Any thoughts on this? The RFC part is getting the transport driver to behave differently for mpath vs. non-mpath.

Will it work if nvme mpath used request NOWAIT flag for its submit_bio() call, and add the bio to the requeue_list if blk_queue_enter() fails? I think that looks like another way to resolve the deadlock, but we need the block layer to return a failed status to the original caller.

But who would kick the requeue list? and that would make near-tag-exhaust performance stink...
Moving nvme_start_freeze from nvme_rdma_teardown_io_queues to nvme_rdma_configure_io_queues can fix it. It also avoids I/O hanging for a long time if reconnection fails.
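
To illustrate the ordering argument, here is a toy user-space model in C. It is not kernel code and not a patch: the enum, the function name and the "attempt" counting are all made up for illustration. It also assumes that a failed reconnect attempt fails before the I/O queues are configured. It only counts how many failed reconnect attempts the namespace queues spend frozen, depending on where the freeze is started.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; this only models where the freeze is started. */
enum freeze_site { FREEZE_AT_TEARDOWN, FREEZE_AT_CONFIGURE };

/*
 * Return the number of failed reconnect attempts during which the
 * namespace queues are frozen, i.e. during which a concurrent submitter
 * could sit blocked on the frozen queue.
 *
 * Assumption: every failed attempt fails before I/O queue configuration.
 */
int frozen_failed_attempts(enum freeze_site site, int failed_attempts)
{
    bool frozen = false;
    int frozen_while_down = 0;

    assert(failed_attempts >= 0);

    /* Teardown step of error recovery. */
    if (site == FREEZE_AT_TEARDOWN)
        frozen = true;

    /* Reconnect attempts that fail before reaching queue configuration. */
    for (int i = 0; i < failed_attempts; i++) {
        if (frozen)
            frozen_while_down++;
    }

    /* Successful reconnect: configure the I/O queues. */
    if (site == FREEZE_AT_CONFIGURE)
        frozen = true;  /* freeze only around the hw queue update */
    /* ... the hw queue count would be updated here ... */
    frozen = false;     /* unfreeze once the update is done */

    assert(!frozen);
    return frozen_while_down;
}
```

With three failed attempts, frozen_failed_attempts(FREEZE_AT_TEARDOWN, 3) returns 3 and frozen_failed_attempts(FREEZE_AT_CONFIGURE, 3) returns 0: in this model the queues are frozen only inside the successful configure step, so a submitter is not left blocked across failed reconnects.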
_______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme .