Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
From: Christoph Hellwig <hch@lst.de>
Date: 2021-03-18 04:46:19
On Thu, Mar 18, 2021 at 09:51:14AM +0800, Chao Leng wrote:
> The multipath code would have to kick the list. We could also try to split into two flags, one that affects blk_queue_enter and one that affects the tag allocation.
>
> Moving nvme_start_freeze from nvme_rdma_teardown_io_queues to nvme_rdma_configure_io_queues can fix it. It also avoids I/O hanging for a long time if reconnection fails.
>
> Can you explain how we'd still ensure that no new commands get queued during teardown using that scheme?
>
> 1. Teardown will cancel all inflight requests, and then multipath will clear the path.
> 2. Then we may freeze the controller.
> 3. nvme_ns_head_submit_bio cannot find the reconnecting controller as a valid path, so it is safe.
>
> In non-mpath (which unfortunately is a valid use-case), there is no failover, and we cannot freeze the queue after we stopped (and/or started) the queues because then fail_non_ready_command() constantly returns BLK_STS_RESOURCE (just causing a re-submission over and over again) and the freeze will never complete (the commands are still inflight from the queue->q_usage_counter perspective).
>
> If a request has the REQ_FAILFAST_xxx flags set, it will hang for a long time if reconnection fails. This is not expected. Also, if the controller is not live and the controller is frozen, fast_io_fail_tmo will not work. This is also not expected. So I think freezing the controller while reconnecting is not a good idea. It's really not good behavior to retry again and again, but that is at least better than requests hanging for a long time.
Well, it is pretty clear that REQ_FAILFAST_* (and I'm still confused about the three variants of that) should not block in blk_queue_enter, and we should make sure nvme-multipath triggers that. Let me think of a good way to refactor blk_queue_enter first to make that least painful.
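To illustrate the behavior being asked for, here is a minimal userspace sketch, not kernel code. Every name in it (model_queue, model_queue_enter, MODEL_REQ_FAILFAST and the result codes) is a hypothetical stand-in invented for illustration; it does not reflect the actual blk_queue_enter signature or any eventual refactoring. It only shows the decision: a fail-fast request hitting a frozen queue gets an immediate failure instead of waiting.

```c
#include <stdbool.h>

/* Hypothetical flag bit; it mirrors the idea only, not any kernel value. */
enum model_req_flags {
	MODEL_REQ_FAILFAST = 1 << 0,
};

/* Hypothetical outcomes of trying to enter the queue. */
enum model_enter_result {
	MODEL_ENTER_OK,		/* request may proceed */
	MODEL_ENTER_FAIL_FAST,	/* fail now so the caller can fail over */
	MODEL_ENTER_WOULD_WAIT,	/* a real implementation would sleep here */
};

/* Toy queue: a frozen flag plus a stand-in for a usage counter. */
struct model_queue {
	bool frozen;
	unsigned int usage;
};

/*
 * Decide what happens to a request that tries to enter the queue.
 * Fail-fast requests never wait on a frozen queue in this model.
 */
static enum model_enter_result
model_queue_enter(struct model_queue *q, unsigned int flags)
{
	if (!q->frozen) {
		q->usage++;
		return MODEL_ENTER_OK;
	}
	if (flags & MODEL_REQ_FAILFAST)
		return MODEL_ENTER_FAIL_FAST;
	return MODEL_ENTER_WOULD_WAIT;
}
```

In this model a multipath submitter would pass MODEL_REQ_FAILFAST and, on MODEL_ENTER_FAIL_FAST, pick another path or requeue, while a non-multipath submitter would still wait for the unfreeze.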
> So I think we should still start queue freeze before we quiesce the queues.
>
> We should unquiesce and unfreeze the queues when reconnecting, otherwise fast_io_fail_tmo will not work.
>
> I still don't see how the mpath NOWAIT suggestion works either...
>
> mpath will queue the request to another live path, or requeue the request (if there is no usable path), so it will not wait.
Yes.