Re: nvme tcp receive errors | linux-nvme

quoted

Sagi,

Just wanted to give you an update on where we're at with this.

All tests run with your earlier patch removing the inline dispatch from
nvme_tcp_queue_request() are successful. At this point, I am leaning to
remove that optimization from mainline.
Thanks Keith,

Did you run it with the extra information debug patch I sent you? What
I'm concerned about is that given that you have the only environment
where this reproduces, and this is removed it will be very difficult
to add it back in.

Also, what about the read issue? that one is still unresolved from
my PoV.
The test team reports no issues for both read and write tests with the
inline dispatch removed. That single patch appears to resolve all
problems. Sorry, I should have clarified that initially.

We will go back to the extra debug from last week on top of a pristine
mainline kernel. There's been some confusion of which patches to apply
on top of others in the mix, but I'm getting better at coordinating
that.
Keith,

I may have a theory to this issue. I think that the problem is in
cases where we send commands with data to the controller and then in
nvme_tcp_send_data between the last successful kernel_sendpage
and before nvme_tcp_advance_req, the controller sends back a successful
completion.

If that is the case, then the completion path could be triggered,
the tag would be reused, triggering a new .queue_rq, setting again
the req.iter with the new bio params (all is not taken by the
send_mutex) and then the send context would call nvme_tcp_advance_req
progressing the req.iter with the former sent bytes... And given that
the req.iter is used for reads/writes, it is possible that it can
explain both issues.

While this is not easy to trigger, there is nothing I think that
can prevent that. The driver used to have a single context that
would do both send and recv so this could not have happened, but
now that we added the .queue_rq send context, I guess this can
indeed confuse the driver.

There are 3 possible options to resolve this:
1. Never touch the command after the last send (only advance the
request iter after we check if we are done):
--

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index c51b70aec6dd..5c576eda5ba1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -933,7 +933,6 @@ static int nvme_tcp_try_send_data(struct 
nvme_tcp_request *req)
                 if (ret <= 0)
                         return ret;

-               nvme_tcp_advance_req(req, ret);
                 if (queue->data_digest)
                         nvme_tcp_ddgst_update(queue->snd_hash, page,
                                         offset, ret);
@@ -950,6 +949,7 @@ static int nvme_tcp_try_send_data(struct 
nvme_tcp_request *req)
                         }
                         return 1;
                 }
+               nvme_tcp_advance_req(req, ret);
         }
         return -EAGAIN;
  }
--

2. only initialize the request/cmd/pdu (nvme_tcp_setup_cmd_pdu)
when we have taken the send_mutex (say in nvme_tcp_fetch_request
or something), which is more cumbersome as this stuff may be
called multiple times in the life of a request.

3. Add refcounting to never complete an I/O before both send
and receive has fully completed on this command, besides having some
more overhead on the datapath, there maybe also some challanges with
moving the request state machine that may be based on the refounting.

I suggest that you guys give (1) a shot and see if the theory holds...

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help