Re: nvme tcp receive errors
From: Sagi Grimberg <sagi@grimberg.me>
Date: 2021-05-11 09:31:11
Subsystem:
nvm express driver, the rest · Maintainers:
Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg, Linus Torvalds
quoted
quoted
Sagi, Just wanted to give you an update on where we're at with this. All tests run with your earlier patch removing the inline dispatch from nvme_tcp_queue_request() are successful. At this point, I am leaning to remove that optimization from mainline.Thanks Keith, Did you run it with the extra information debug patch I sent you? What I'm concerned about is that given that you have the only environment where this reproduces, and this is removed it will be very difficult to add it back in. Also, what about the read issue? that one is still unresolved from my PoV.The test team reports no issues for both read and write tests with the inline dispatch removed. That single patch appears to resolve all problems. Sorry, I should have clarified that initially. We will go back to the extra debug from last week on top of a pristine mainline kernel. There's been some confusion of which patches to apply on top of others in the mix, but I'm getting better at coordinating that.
Keith, I may have a theory to this issue. I think that the problem is in cases where we send commands with data to the controller and then in nvme_tcp_send_data between the last successful kernel_sendpage and before nvme_tcp_advance_req, the controller sends back a successful completion. If that is the case, then the completion path could be triggered, the tag would be reused, triggering a new .queue_rq, setting again the req.iter with the new bio params (all is not taken by the send_mutex) and then the send context would call nvme_tcp_advance_req progressing the req.iter with the former sent bytes... And given that the req.iter is used for reads/writes, it is possible that it can explain both issues. While this is not easy to trigger, there is nothing I think that can prevent that. The driver used to have a single context that would do both send and recv so this could not have happened, but now that we added the .queue_rq send context, I guess this can indeed confuse the driver. There are 3 possible options to resolve this: 1. Never touch the command after the last send (only advance the request iter after we check if we are done): --
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index c51b70aec6dd..5c576eda5ba1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c@@ -933,7 +933,6 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
if (ret <= 0)
return ret;
- nvme_tcp_advance_req(req, ret);
if (queue->data_digest)
nvme_tcp_ddgst_update(queue->snd_hash, page,
offset, ret);@@ -950,6 +949,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
}
return 1;
}
+ nvme_tcp_advance_req(req, ret);
}
return -EAGAIN;
}
--
2. only initialize the request/cmd/pdu (nvme_tcp_setup_cmd_pdu)
when we have taken the send_mutex (say in nvme_tcp_fetch_request
or something), which is more cumbersome as this stuff may be
called multiple times in the life of a request.
3. Add refcounting to never complete an I/O before both send
and receive has fully completed on this command, besides having some
more overhead on the datapath, there maybe also some challanges with
moving the request state machine that may be based on the refounting.
I suggest that you guys give (1) a shot and see if the theory holds...
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme