Thread (16 messages) 16 messages, 4 authors, 2021-02-23

Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl

From: Kanchan Joshi <hidden>
Date: 2021-02-22 13:46:06
Also in: linux-fsdevel, linux-nvme, lkml

On Thu, Jan 28, 2021 at 10:54 PM Jens Axboe [off-list ref] wrote:
On 1/28/21 10:13 AM, Kanchan Joshi wrote:
quoted
On Thu, Jan 28, 2021 at 8:08 PM Jens Axboe [off-list ref] wrote:
quoted
On 1/28/21 5:04 AM, Kanchan Joshi wrote:
quoted
On Wed, Jan 27, 2021 at 9:32 PM Pavel Begunkov [off-list ref] wrote:
quoted
On 27/01/2021 15:42, Pavel Begunkov wrote:
quoted
On 27/01/2021 15:00, Kanchan Joshi wrote:
quoted
This RFC patchset adds asynchronous ioctl capability for NVMe devices.
Purpose of RFC is to get the feedback and optimize the path.

At the uppermost io-uring layer, a new opcode IORING_OP_IOCTL_PT is
presented to user-space applications. Like regular-ioctl, it takes
ioctl opcode and an optional argument (ioctl-specific input/output
parameter). Unlike regular-ioctl, it is made to skip the block-layer
and reach directly to the underlying driver (nvme in the case of this
patchset). This path between io-uring and nvme is via a newly
introduced block-device operation "async_ioctl". This operation
expects io-uring to supply a callback function which can be used to
report completion at later stage.

For a regular ioctl, NVMe driver submits the command to the device and
the submitter (task) is made to wait until completion arrives. For
async-ioctl, completion is decoupled from submission. Submitter goes
back to its business without waiting for nvme-completion. When
nvme-completion arrives, it informs io-uring via the registered
completion-handler. But some ioctls may require updating certain
ioctl-specific fields which can be accessed only in context of the
submitter task. For that reason, NVMe driver uses task-work infra for
that ioctl-specific update. Since task-work is not exported, it cannot
be referenced when nvme is compiled as a module. Therefore, one of the
patch exports task-work API.

Here goes example of usage (pseudo-code).
Actual nvme-cli source, modified to issue all ioctls via this opcode
is present at-
https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5
see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops

Looks like good time to bring that branch/discussion back
a bit more context:
https://github.com/axboe/liburing/issues/270
Thanks, it looked good. It seems key differences (compared to
uring-patch that I posted) are -
1. using file-operation instead of block-dev operation.
Right, it's meant to span wider than just block devices.
quoted
2. repurpose the sqe memory for ioctl-cmd. If an application does
ioctl with <=40 bytes of cmd, it does not have to allocate ioctl-cmd.
That's nifty. We still need to support passing larger-cmd (e.g.
nvme-passthru ioctl takes 72 bytes) but that shouldn't get too
difficult I suppose.
It's actually 48 bytes in the as-posted version, and I've bumped it to
56 bytes in the latest branch. So not quite enough for everything,
nothing ever will be, but should work for a lot of cases without
requiring per-command allocations just for the actual command.
Agreed. But if I got it right, you are open to support both in-the-sqe
command (<= 56 bytes) and out-of-sqe command (> 56 bytes) with this
interface.
Driver processing the ioctl can fetch the cmd from user-space in one
case (as it does now), and skips in another.
Your out-of-seq command would be none of io_urings business, outside of
the fact that we'd need to ensure it's stable if we need to postpone
it. So yes, that would be fine, it just means your actual command is
passed in as a pointer, and you would be responsible for copying it
in for execution

We're going to need something to handle postponing, and something
for ensuring that eg cancelations free the allocated memory.
I have few doubts about allocation/postponing. Are you referring to
uring allocating memory for this case, similar to the way
"req->async_data" is managed for few other opcodes?
Or can it (i.e. larger cmd) remain a user-space pointer, and the
underlying driver fetches the command in.
If submission context changes (for sqo/io-wq case), uring seemed to
apply context-grabbing techniques to make that work.
quoted
quoted
quoted
And for some ioctls, driver may still need to use task-work to update
the user-space pointers (embedded in uring/ioctl cmd) during
completion.

@Jens - will it be fine if I start looking at plumbing nvme-part of
this series on top of your work?
Sure, go ahead. Just beware that things are still changing, so you might
have to adapt it a few times. It's still early days, but I do think
that's the way forward in providing controlled access to what is
basically async ioctls.
Sounds good, I will start with the latest branch that you posted. Thanks.
It's io_uring-fops.v2 for now, use that one.
Moved to v3 now.
nvme_user_io is 48 bytes, while nvme passthrough requires 72 or 80
bytes (passthru with 64 bit result).
The block_uring_cmd has 32 bytes of available space. If NVMe defines
its own "nvme_uring_cmd" (which can be used for nvme char interface)
that will buy some more space, but still won't be enough for passthru
command.

So I am looking at adding support for large-cmd in uring. And felt the
need to clear those doubts I mentioned above.


Thanks
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help