Re: Command timeouts with NVMe TCP kernel driver | linux-nvme

quoted

On 9/9/21 12:50 AM, Keith Busch wrote:
On Wed, Sep 08, 2021 at 11:55:59PM +0300, Sagi Grimberg wrote:

1. Userspace context grabs send_mutex via queue_rq and calls into kernel_sendpage. This context is pinned to a CPU X because that's the way my benchmark works.
2. Userspace context is descheduled.
3. nvme_tcp_io_work is scheduled on the same CPU X because it so happens that io_cpu == X. (I have hundreds of threads which are statically assigned to CPUs and spread over all the CPUs of the VM, so there are necessarily some userspace threads whose CPU coincides with io_cpu).
4. nvme_tcp_io_work obviously can't grab send_mutex because it's held by the userspace. But because req_list is not empty, it doesn't yield but keeps on spinning in the loop until it expires.
5. Since pending = true nvme_tcp_io_work re-schedules itself for immediate execution. Because it's flagged as HIGHPRI, I guess this means it is rescheduled very soon/almost immediately, and my poor userspace context doesn't get enough CPU to make reasonable forward progress. We find ourselves in a kind of livelock.

This seems coherent with the fact that my problem disappears if I do any one of the 3 following things:

1. Modify my userspace benchmark to avoid pinning threads to CPUs => direct send path can execute on a different CPU and make forward progress
2. Modify nvme-tcp to remove the "direct send" path in queue_rq and always post to the work queue => no contention between direct send path and the workqueue
3. Modify the tcp wq to remove the WQ_HIGHPRI flag. => I guess this makes the scheduler more friendly towards my userspace thread

Does this make sense? What do you think is the best way to fix this? I guess the WQ_HIGHPRI flag is there for a reason, and that the "direct send" path can provide lower latency in some cases. What about a heuristic in io_work that will prevent it from looping indefinitely after a certain number of failed attempts to claim the send mutex?
Sounds possible. The "pending" check on the req_list was to fix a race
condition where the io_work could miss handling a request, resulting in
a different command time out. It sounds like that could make the io_work
spin without having anything to do, while starving the lower priority
task. I'll try to think of another way to handle it.
Instead of relying on io_work to determine if there's pending requests,
the queuing action can try to re-schedule it only after the send_mutex
is released, and I think that should address the issue you've described
while still fixing the IO timeout the "pending" trick was addressing.
Can you try the below patch?

Sagi, does this look okay to you?
It looks OK to me, but on the other hand the former
one looked perfectly fine to me.
It's a priority inversion. The .queue_rq() task is holding onto the
'req_list' resource via send_mutex, but it can't schedule in because
io_work is higher priority and will remain in TASK_RUNNING while
'req_list' is non-empty.
Yes, I understand the issue.

I think the user must have pinned the .queue_rq task to that specific
CPU, otherwise I believe it would have scheduled somewhere else to
unblock forward progress.
Yes, I agree.

Basically now the last request in the batch will not trigger
unconditionally (if the direct-send path was not taken), but only if it
sees queued requests, which in theory could be reaped by a running
io_work, that would result in an empty io_work invocation.  On one
hand it is undesired, on the other hand should be harmless as it
shouldn't miss an invocation... So it looks reasonable to me.
I can change the check to the same as .commit_rqs(), which just
consideres 'llist_empty(&queue->req_list))'.

That should be just as safe and have fewer unnecessary io_work
scheduling for the same reasoning commit_rqs() uses it. In any case, I
don't think an occasional unneeded work check is really that harmful.
What we need to think about is if there is a case where we miss
scheduling the io_work

Sam, can you run some more extensive I/O testing with this (different
patterns of r/w and block sizes)? Also Keith, can you ask the WD folks
to also take it for a spin?
Sure, I'll request this patch for their next test cycle.
Great.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help