Re: Command timeouts with NVMe TCP kernel driver

From: Samuel Jones <hidden>
Date: 2021-09-07 09:06:26

quoted

Hi all,

I'm experiencing command timeouts with recent versions of nvme-tcp driver. I
have set up with a VM running 5.8 which is ok, but the same test load (100%RD
16K blocks) on a VM 5.11 or later, including 5.14-rc7, shows the same issue.
The initatior complains as follows:

Aug 30 14:58:05 centos7 kernel: nvme nvme0: queue 7: timeout request 0x10 type 4
Aug 30 14:58:05 centos7 kernel: nvme nvme0: starting error recovery

My target is a Kalray target, but I have no trace of any outstanding commands
when this situation occurs. Quite the opposite: The Kalray board observes that
the initiator stops sending new requests on the queue some time before this
command times out.

I don't have any similar issues using SPDK as an initiator. I made the following
modification to nvme-tcp code and my problem has disappeared:

--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c

@@ -287,6 +287,7 @@ static inline void nvme_tcp_queue_request(struct

nvme_tcp_request *req,
* directly, otherwise queue io_work. Also, only do that if we
* are on the same cpu, so we don't introduce contention.
*/
+#if 0
if (queue->io_cpu == raw_smp_processor_id() &&
sync && empty && mutex_trylock(&queue->send_mutex)) {
queue->more_requests = !last;

@@ -296,6 +297,9 @@ static inline void nvme_tcp_queue_request(struct

nvme_tcp_request *req,
} else if (last) {
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}
+#else
+ queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+#endif
}

To be honest, this bit of code has always bothered me: I don't understand how we
can guarantee that the thread doesn't change CPU between the call to
raw_smp_processor_id() and the trylock. I assume that even if this does occur,
the fact that we hold queue->send_mutex() is supposed to make sure that there
are no atomicity issues, but I'm concerned about the logic that tries to
determine if there is any more pending work and reschedule io_work if and only
if there is something to do. I can't pinpoint an issue from just reading the
code though..

Does anyone have any thoughts on this? Is my patch proof that there is an issue
in this code, or am I likely to be changing the program flow so significantly
that we can't conclude? Are there any open issues wrt command timeouts at the
moment?

I've seen similar, but I thought we'd fixed all those issues. Your poc
patch indicates to me that there is still at least another condition
that's not being considered. I don't think the possibility of CPU
rescheduling is the culprit, though.

One thing that looks problematic is if 'bd.last' is true but
nvme_tcp_queue_rq() fails early after enqueuing previous commands in the
sequence. That could cause the io_work to never kick. There should have
been something else in dmesg indicating that error, though, so I'll look
for some other possibility.

Thanks for your help on this Keith. I think I have a better idea about what is
happening now. First of all, I have a large number of devices exposed over
NVMe-TCP: I'm connecting to 4 remote NVMe devices each of which is
multi-namespace, so I have a total of 37 block devices at once.

That shouldn't matter really, what matters is the number of controllers
that you have. you have 4 controllers, and how many queues are there
per-controller? and how many cores does the VM have?

quoted

Then I run a userspace benchmark that's running 8 threads per block device and
generating IO to each of them. I'm not sure how the multi queue block devices
are layered over the X NVMe queues per device, but however it works my system
is on its knees. I have been timestamping the time spent by NVMe-TCP threads
inside kernel_sendpage() within nvme_tcp_try_send_cmd_pdu() when called from
nvme_tcp_queue_rq() and I'm getting some ridiculous outliers : up to 60
seconds. I am confident this is what is responsible for my timeouts. If I
increase the nvme_io_timeout to 300 seconds, my test case passes.

That is strange indeed. what are you taking the timing on? just
kernel_sendpage?

quoted

What's interesting is my quick patch that removed the direct call to
kernel_sendpage via queue_rq() makes the system behave much better: no more
outliers and no timeouts. I don't know much about the kernel network stack or
the block device layer but I imagine that the difference is due to the number
of calls into the network stack that can happen in parallel. Is queue_rq
executed from the context of the userspace caller? In this case I will have
close to 300 threads active on this code path. Whereas when we go through the
workqueue, I guess we have only N max calls into the network stack at once,
where N = the number of CPUs on the system?

What do you think? Does my analysis make any sense?

I don't think this is related to the number of contexts you have running
against the namespaces.

However, the difference between the submission from queue_rq and the
workqueue is that queue_rq may happen concurrently and hence may be
blocked by the socket mutex as the io_work does RX. Now I guess that
if RX is intensive (and it sounds like it is in your test) that the
TX path would be potentially starved...

But if the RX is so intensive, then I'd expect you to have timeouts
regardless as the context is not performing sends anyways, but maybe
this needs a bit more thought...

If this theory is correct, then a possible reason for the big delay
inside kernel_sendpage is due to the fact that io_work just ingest as
much as the stack gives it (which can be a lot). If we were to break
the recvs into fixed chunks, it would allow to send between recvs and
avoid this potential starvation.

Will it be possible to tryout this untested patch?
--

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 032e8d582e6d..ec05b26cee38 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c

@@ -837,6 +837,7 @@ static int nvme_tcp_recv_skb(read_descriptor_t

*desc, struct sk_buff *skb,
                }
        }

+       desc->count -= consumed;
        return consumed;
 }

@@ -1107,7 +1108,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue

*queue)
        int consumed;

        rd_desc.arg.data = queue;
-       rd_desc.count = 1;
+       rd_desc.count = 0x4000; /* 16k */
        lock_sock(sk);
        queue->nr_cqe = 0;
        consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
--

Hi Sagi, Keith, all,

I think I have a better idea about what is happening now. First of all, answers to Sagi's questions: I have 4 controllers with 8 queues each, the VM has 16 CPUs. I was timing on kernel_sendpage. I tried your patch Sagi but it didn't make any difference. Indeed, given that the queue stops transmitting requests, I guess the RX path gets very quiet very fast.

After a lot of time spent timing various functions in kernel_sendpage, the only thing I was able to observe was that my thread is descheduled and not rescheduled again for a LONG time. I think what is happening is the following.:

1. Userspace context grabs send_mutex via queue_rq and calls into kernel_sendpage. This context is pinned to a CPU X because that's the way my benchmark works.
2. Userspace context is descheduled.
3. nvme_tcp_io_work is scheduled on the same CPU X because it so happens that io_cpu == X. (I have hundreds of threads which are statically assigned to CPUs and spread over all the CPUs of the VM, so there are necessarily some userspace threads whose CPU coincides with io_cpu).
4. nvme_tcp_io_work obviously can't grab send_mutex because it's held by the userspace. But because req_list is not empty, it doesn't yield but keeps on spinning in the loop until it expires.
5. Since pending = true nvme_tcp_io_work re-schedules itself for immediate execution. Because it's flagged as HIGHPRI, I guess this means it is rescheduled very soon/almost immediately, and my poor userspace context doesn't get enough CPU to make reasonable forward progress. We find ourselves in a kind of livelock.

This seems coherent with the fact that my problem disappears if I do any one of the 3 following things:

1. Modify my userspace benchmark to avoid pinning threads to CPUs => direct send path can execute on a different CPU and make forward progress
2. Modify nvme-tcp to remove the "direct send" path in queue_rq and always post to the work queue => no contention between direct send path and the workqueue
3. Modify the tcp wq to remove the WQ_HIGHPRI flag. => I guess this makes the scheduler more friendly towards my userspace thread

Does this make sense? What do you think is the best way to fix this? I guess the WQ_HIGHPRI flag is there for a reason, and that the "direct send" path can provide lower latency in some cases. What about a heuristic in io_work that will prevent it from looping indefinitely after a certain number of failed attempts to claim the send mutex?

Thanks for your help
Sam

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help