Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
From: Sagi Grimberg <sagi@grimberg.me>
Date: 2021-03-16 05:05:20
Does the problem exist on the latest version?
This was seen on 5.4 stable, not upstream, but nothing prevents it from happening in upstream code.
We also found similar deadlocks in the older version. However, with the latest code, it does not block grabbing the nshead srcu when the ctrl is frozen. Related patches:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=fe2008640ae36e3920cf41507a84fb5d3227435a
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=ed00aabd5eb9fb44d6aff1173234a2e911b9fead
I am not sure they are the same problem.
It's not the same problem. When we tear down the I/O queues, we freeze the namespaces' request queues. This means that concurrent mpath submit_bio calls can now block with the srcu lock taken. When another path calls nvme_mpath_set_live, it needs to wait for the srcu to sync before kicking the requeue work (to make sure the updated current_path is visible). And this is where the hang is: the only thing that will free it is if the offending controller reconnects (and unfreezes the queue) or disconnects (automatically or manually). Both can take a very long time, or even forever in some cases.
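To make the dependency chain concrete, here is a toy userspace model of the hang described above. It is only a sketch, not kernel code: the struct and every model_* function are hypothetical names made up for illustration, and the SRCU read side is reduced to a plain reader counter, assuming (as described above) that a submitter blocked on a frozen queue keeps its SRCU read section open.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-threaded model. All names are hypothetical, not kernel APIs. */
struct model {
	bool queue_frozen;  /* ns request queue frozen by io queue teardown */
	int srcu_readers;   /* submitters inside the head's srcu read section */
	int blocked_bios;   /* submitters parked on the frozen queue */
};

/*
 * Models an mpath submit_bio: enter the srcu read side, then try to enter
 * the path's request queue. Returns false if the submitter is left blocked
 * on the frozen queue with the srcu read section still open.
 */
static bool model_submit_bio(struct model *m)
{
	m->srcu_readers++;              /* stands in for srcu_read_lock() */
	if (m->queue_frozen) {
		m->blocked_bios++;      /* sleeps here, read section held */
		return false;
	}
	m->srcu_readers--;              /* stands in for srcu_read_unlock() */
	return true;
}

/*
 * Models the srcu sync done when another path goes live: it can only
 * complete once no reader is left inside the read section.
 */
static bool model_set_live_can_finish(const struct model *m)
{
	return m->srcu_readers == 0;
}

/* Models the offending controller reconnecting and unfreezing its queue. */
static void model_unfreeze(struct model *m)
{
	m->queue_frozen = false;
	m->srcu_readers -= m->blocked_bios; /* blocked submitters proceed */
	m->blocked_bios = 0;
}

/* Walks the scenario from the text; returns 0 if it behaves as described. */
static int model_demo(void)
{
	struct model m = { .queue_frozen = true };

	assert(!model_submit_bio(&m));          /* blocks under srcu */
	assert(!model_set_live_can_finish(&m)); /* other path is stuck */
	model_unfreeze(&m);                     /* reconnect finally happens */
	assert(model_set_live_can_finish(&m));  /* only now can it proceed */
	return 0;
}
```

In this model the other path cannot make progress until model_unfreeze() runs, which mirrors the point above: the wait ends only when the offending controller reconnects or is disconnected.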