Re: io_uring: BPF controlled I/O

From: Pavel Begunkov <asml.silence@gmail.com>
Date: 2021-06-10 09:10:06
Also in: bpf, io-uring, lkml

On 6/7/21 7:51 PM, Victor Stewart wrote:

On Sat, Jun 5, 2021 at 5:09 AM Pavel Begunkov [off-list ref] wrote:


One of the core ideas behind io_uring is passing requests via memory
shared b/w the userspace and the kernel, a.k.a. queues or rings. That
serves the purpose of reducing the number of context switches or
bypassing them altogether, but the userspace is still responsible for
controlling the flow, reaping and processing completions (a.k.a.
Completion Queue Entries, CQEs), and submitting new requests, which adds
extra context switches even if there is not much work to do. A simple
illustration is read(open()), where io_uring is unable to propagate the
returned fd to the read, with more cases piling up.
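
To make the read(open()) case concrete, here is a minimal single-threaded
sketch of a completion ring and of the glue step userspace has to perform.
It is a toy model, not the io_uring ABI: the toy_* names, the ring size
and the toy_read_req type are assumptions made for illustration, only the
user_data/res pair mirrors what a CQE carries, and the memory barriers a
real shared ring needs are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy completion entry, modelled on the user_data/res pair of a CQE. */
struct toy_cqe {
    uint64_t user_data; /* token chosen by the submitter */
    int32_t  res;       /* result, e.g. the fd returned by an open */
};

#define TOY_CQ_ENTRIES 8u /* hypothetical size, must be a power of two */
_Static_assert((TOY_CQ_ENTRIES & (TOY_CQ_ENTRIES - 1u)) == 0,
               "ring size must be a power of two");

struct toy_cq {
    uint32_t head; /* advanced by the consumer ("userspace") */
    uint32_t tail; /* advanced by the producer ("kernel") */
    struct toy_cqe cqes[TOY_CQ_ENTRIES];
};

/* Producer side: returns false when the ring is full. */
static bool toy_cq_post(struct toy_cq *cq, uint64_t user_data, int32_t res)
{
    if (cq->tail - cq->head == TOY_CQ_ENTRIES)
        return false;
    struct toy_cqe *e = &cq->cqes[cq->tail & (TOY_CQ_ENTRIES - 1u)];
    e->user_data = user_data;
    e->res = res;
    cq->tail++;
    return true;
}

/* Consumer side: returns false when there is nothing to reap. */
static bool toy_cq_reap(struct toy_cq *cq, struct toy_cqe *out)
{
    if (cq->head == cq->tail)
        return false;
    *out = cq->cqes[cq->head & (TOY_CQ_ENTRIES - 1u)];
    cq->head++;
    return true;
}

/* Hypothetical follow-up request: a read that needs the fd of the open. */
struct toy_read_req {
    int32_t  fd;
    uint64_t user_data;
};

/* The glue step: the fd is only known once the open's completion has been
 * reaped, so the consumer has to copy it into the read request itself. */
static bool toy_prepare_read_after_open(struct toy_cq *cq,
                                        struct toy_read_req *rd)
{
    struct toy_cqe cqe;

    if (!toy_cq_reap(cq, &cqe) || cqe.res < 0)
        return false; /* nothing completed yet, or the open failed */
    rd->fd = cqe.res;
    rd->user_data = cqe.user_data;
    return true;
}
```

In this model toy_prepare_read_after_open() is exactly the part that runs
in userspace between the two requests, i.e. the step the proposal wants
to be able to hand to a BPF program.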

The big-picture idea stays the same as last year: hand some of this
control over to BPF, allowing it to check the results of completed
requests, manipulate memory if needed and submit new requests. Apart
from being just glue between two requests, it might even offer more
flexibility, like keeping a QD, doing reduce/broadcast and so on.

The prototype [1,2] is in good shape, but some work still needs to be
done. However, the main concern is getting an understanding of what
features and functionality have to be added for it to be flexible
enough. Various toy examples can be found at [3] ([1] includes an
overview of cases).

Discussion points:
- Use cases, feature requests, benchmarking

hi Pavel,

coincidentally i'm tossing around in my mind at the moment an idea for
offloading the PING/PONG of a QUIC server/client into the kernel via
eBPF.

the problem being that, since QUIC is a userspace-run transport and
NAT-ed UDP mappings can't be expected to stay open longer than 30
seconds, QUIC applications bear a large cost in context-switching
wake-ups to conduct connection lifetime maintenance... especially when
managing a large number of mostly idle long-lived connections. so
offloading this maintenance service into the kernel would be a great
efficiency boon.
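
A rough sketch of the per-connection decision behind that maintenance
work, to show how little logic each wake-up actually runs. Only the 30
second figure comes from the text above; the toy_conn type, the
millisecond clock and the 5 second safety margin are assumptions made for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NAT_TIMEOUT_MS   30000u /* the 30 s figure mentioned above */
#define SAFETY_MARGIN_MS  5000u /* hypothetical margin before expiry */

/* Hypothetical per-connection state: time of the last outgoing datagram. */
struct toy_conn {
    uint64_t last_tx_ms;
};

/* A connection is due for a PING once it has been idle for the NAT
 * timeout minus the safety margin. */
static int toy_conn_needs_ping(const struct toy_conn *c, uint64_t now_ms)
{
    if (now_ms < c->last_tx_ms)
        return 0; /* clock went backwards, treat as freshly active */
    return now_ms - c->last_tx_ms >= NAT_TIMEOUT_MS - SAFETY_MARGIN_MS;
}

/* Count how many of n connections need a keepalive at time now_ms. */
static size_t toy_count_due(const struct toy_conn *conns, size_t n,
                            uint64_t now_ms)
{
    size_t due = 0;

    for (size_t i = 0; i < n; i++)
        due += (size_t)toy_conn_needs_ping(&conns[i], now_ms);
    return due;
}
```

With many mostly idle connections a scan like this is cheap; the cost
described above is the context switch needed to run it and send the
resulting PINGs.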

the main impediment is that access to the kernel crypto libraries isn't
currently possible from eBPF. that said, connection wide crypto offload
into the NIC is a frequently mentioned subject in QUIC circles, so one
could argue better to allocate the time to NIC crypto offload and then
simply conduct this PING/PONG offload in plain text.

CQEs would provide a great way for the offloaded service to be able to
wake up the application when its input is required.

Interesting, want to try out the idea? All pointers are here
and/or in the patchset's cover letter, but if anything is not clear,
inconvenient, lacks needed functionality, etc., let me know.

anyway food for thought.

Victor


- Userspace programming model, code reuse (e.g. liburing)
- BPF-BPF and userspace-BPF synchronisation. There is a
  CQE-based notification approach and plans (see design
  notes), however we need to discuss what else might be
  needed.
- Do we need more contexts passed apart from user_data?
  e.g. specifying a BPF map/array/etc fd in io_uring requests?
- Userspace atomics and efficiency of userspace reads/writes. If
  they prove not performant enough, there are potential ways to
  tackle it, e.g. inlining, having it in the BPF ISA, and
  pre-verifying userspace pointers.

[1] https://lore.kernel.org/io-uring/a83f147b-ea9d-e693-a2e9-c6ce16659749@gmail.com/T/#m31d0a2ac6e2213f912a200f5e8d88bd74f81406b (local)
[2] https://github.com/isilence/linux/tree/ebpf_v2
[3] https://github.com/isilence/liburing/tree/ebpf_v2/examples/bpf

-----------------------------------------------------------------------
Design notes:

Instead of basing it on hooks, it adds support for a new type of
io_uring request, as that gives better control and lets us reuse
internal infrastructure. These requests run a new type of io_uring BPF
program wired with a bunch of new helpers for submitting requests and
dealing with CQEs, are allowed to read/write userspace memory by virtue
of the recently added sleepable BPF feature, and are also provided with
a token (the generic io_uring token, aka user_data, specified at
submission and returned in a CQE), which may be used to pass a
userspace pointer used as a context.
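
The token-as-context idea can be sketched in plain userspace C: a pointer
is packed into the 64-bit user_data value at submission and recovered when
the completion arrives. The toy_ctx layout and the toy_on_complete()
handler are hypothetical. In the proposal a BPF program would go through
helpers to read or write that memory rather than dereference the pointer
directly, so this only shows the round trip of the token itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-operation context kept in userspace memory. */
struct toy_ctx {
    int      fd;       /* filled in once the open completes */
    uint32_t inflight; /* requests still outstanding for this context */
};

/* Pack a context pointer into the 64-bit token given at submission. */
static uint64_t toy_ctx_to_user_data(struct toy_ctx *ctx)
{
    return (uint64_t)(uintptr_t)ctx;
}

/* Recover the context from the token returned with the completion. */
static struct toy_ctx *toy_user_data_to_ctx(uint64_t user_data)
{
    return (struct toy_ctx *)(uintptr_t)user_data;
}

/* Stand-in for a completion handler: it receives only (user_data, res)
 * and finds everything else through the context pointer. */
static void toy_on_complete(uint64_t user_data, int32_t res)
{
    struct toy_ctx *ctx = toy_user_data_to_ctx(user_data);

    assert(ctx != NULL);
    if (res >= 0)
        ctx->fd = res;
    ctx->inflight--;
}
```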

Besides running BPF programs, these requests are able to ask for
waiting. Currently it supports CQ waiting for a number of completions,
but others might be added and/or needed, e.g. futex and/or requeueing
the current BPF request onto an io_uring request/link being submitted.
That hides the overhead of creating BPF requests by keeping them alive
and invoking them multiple times.

Another big chunk solved is figuring out a good way of feeding CQEs
(potentially many) to a BPF program. The current approach
is to enable multiple completion queues (CQs), and to specify for each
request which one its CQE is steered to, so all the synchronisation
is under the control of the userspace. For instance, there may be a
separate CQ for each in-flight BPF request, and they can work with
their own queues and send a CQE to the main CQ to notify the userspace.
It also opens up a notification-like sync through CQE posting to
neighbours' CQs.
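
A toy model of the multi-CQ scheme, assuming three CQs with index 0 as
the main one. Completions are steered to a CQ by index, and a stand-in
for a BPF request waits until a given number of completions sit in its
own CQ, reduces them and posts a single CQE to the main CQ. All toy_*
names, sizes and the sum used as the reduction are made up for
illustration; this is not the helper API of the patchset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_CQS     3u /* hypothetical: main CQ plus two private ones */
#define TOY_CQ_ENTRIES 8u /* hypothetical size, power of two */
#define TOY_MAIN_CQ    0u

struct toy_cqe { uint64_t user_data; int32_t res; };
struct toy_cq  { uint32_t head, tail; struct toy_cqe cqes[TOY_CQ_ENTRIES]; };
struct toy_ring { struct toy_cq cqs[TOY_NR_CQS]; };

/* Number of completions waiting in a CQ. */
static uint32_t toy_ready(const struct toy_cq *cq)
{
    return cq->tail - cq->head;
}

/* Steer a completion to the CQ picked by the request (cq_idx). */
static bool toy_post(struct toy_ring *r, uint32_t cq_idx,
                     uint64_t user_data, int32_t res)
{
    assert(cq_idx < TOY_NR_CQS);
    struct toy_cq *cq = &r->cqs[cq_idx];

    if (toy_ready(cq) == TOY_CQ_ENTRIES)
        return false; /* this CQ is full */
    cq->cqes[cq->tail & (TOY_CQ_ENTRIES - 1u)] =
        (struct toy_cqe){ user_data, res };
    cq->tail++;
    return true;
}

/* Stand-in for a BPF request bound to its own CQ. Returns false if it
 * has to keep waiting: fewer than `want` completions are queued, or the
 * main CQ has no room for the notification. Otherwise it drains `want`
 * entries, sums their results and posts one CQE to the main CQ. */
static bool toy_reduce(struct toy_ring *r, uint32_t my_cq, uint32_t want,
                       uint64_t notify_data)
{
    struct toy_cq *cq = &r->cqs[my_cq];
    int32_t sum = 0;

    if (toy_ready(cq) < want ||
        toy_ready(&r->cqs[TOY_MAIN_CQ]) == TOY_CQ_ENTRIES)
        return false;
    for (uint32_t i = 0; i < want; i++) {
        sum += cq->cqes[cq->head & (TOY_CQ_ENTRIES - 1u)].res;
        cq->head++;
    }
    return toy_post(r, TOY_MAIN_CQ, notify_data, sum);
}
```

Because each stand-in only touches its own CQ and the main one, the
userspace side of this model sees a single notification per batch, which
is the reduce pattern mentioned earlier in the thread.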

--
Pavel Begunkov

