Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
From: Yonghong Song <hidden>
Date: 2021-11-18 19:13:32
Also in:
bpf, linux-fsdevel
On 11/18/21 10:28 AM, Kumar Kartikeya Dwivedi wrote:
On Thu, Nov 18, 2021 at 10:51:59PM IST, Yonghong Song wrote:quoted
On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:quoted
This change adds eBPF iterator for buffers registered in io_uring ctx. It gives access to the ctx, the index of the registered buffer, and a pointer to the io_uring_ubuf itself. This allows the iterator to save info related to buffers added to an io_uring instance, that isn't easy to export using the fdinfo interface (like exact struct page composing the registered buffer). The primary usecase this is enabling is checkpoint/restore support. Note that we need to use mutex_trylock when the file is read from, in seq_start functions, as the order of lock taken is opposite of what it would be when io_uring operation reads the same file. We take seq_file->lock, then ctx->uring_lock, while io_uring would first take ctx->uring_lock and then seq_file->lock for the same ctx. This can lead to a deadlock scenario described below: CPU 0 CPU 1 vfs_read mutex_lock(&seq_file->lock) io_read mutex_lock(&ctx->uring_lock) mutex_lock(&ctx->uring_lock) # switched to mutex_trylock mutex_lock(&seq_file->lock)It is not clear which mutex_lock switched to mutex_trylock.The one in vfs_read.quoted
From below example, it looks like &ctx->uring_lock. But if this is the case, we could have deadlock, right?Sorry, I will make the commit message clearer in the next version. The sequence on CPU 0 is for normal read(2) on iterator. For CPU 1, it is an io_uring instance trying to do same on iterator attached to itself. So CPU 0 does sys_read vfs_read bpf_seq_read mutex_lock(&seq_file->lock) # A io_uring_buf_seq_start mutex_lock(&ctx->uring_lock) # B and CPU 1 does io_uring_enter mutex_lock(&ctx->uring_lock) # B io_read bpf_seq_read mutex_lock(&seq_file->lock) # A ... Since the order of locks is opposite, it can deadlock. So I switched the mutex_lock in io_uring_buf_seq_start to trylock, so it can return an error for this case, then it will release seq_file->lock and CPU 1 will make progress. The second problem without use of trylock is described below (for same case of io_uring reading from iterator attached to itself). Let me know if I missed something.
Thanks for explanation. The above diagram is much better.
quoted
quoted
The trylock also protects the case where io_uring tries to read from iterator attached to itself (same ctx), where the order of locks would be: io_uring_enter mutex_lock(&ctx->uring_lock) <-----------. io_read \ seq_read \ mutex_lock(&seq_file->lock) / mutex_lock(&ctx->uring_lock) # deadlock-` In both these cases (recursive read and contended uring_lock), -EDEADLK is returned to userspace. In the future, this iterator will be extended to directly support iteration of bvec Flexible Array Member, so that when there is no corresponding VMA that maps to the registered buffer (e.g. if VMA is destroyed after pinning pages), we are able to reconstruct the registration on restore by dumping the page contents and then replaying them into a temporary mapping used for registration later. All this is out of scope for the current series however, but builds upon this iterator. Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: io-uring@vger.kernel.org Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> --- fs/io_uring.c | 179 +++++++++++++++++++++++++++++++++ include/linux/bpf.h | 2 + include/uapi/linux/bpf.h | 3 + tools/include/uapi/linux/bpf.h | 3 + 4 files changed, 187 insertions(+)
[...]
quoted
quoted
[...] +static struct bpf_iter_reg io_uring_buf_reg_info = { + .target = "io_uring_buf", + .feature = BPF_ITER_RESCHED, + .attach_target = bpf_io_uring_iter_attach, + .detach_target = bpf_io_uring_iter_detach,Since you have this extra `io_uring_fd` for the iterator, you may want to implement show_fdinfo and fill_link_info callback functions here.Ack, but some questions: What should it have? e.g. it easy to go from map_id to map fd if one wants access to the map attached to the iterator, but not sure how one can obtain more information about target fd from io_uring or epoll iterators.
Just to be clear, I am talking about uapi struct bpf_link_info. I agree that fd is not really useful. So I guess it is up to you whether you want to show fd to user or not. We can always add it later if needed.
Should I delegate to their show_fdinfo and dump using that? The type/target is already available in link_info, not sure what other useful information can be added there, which allows obtaining the io_uring/epoll fd.quoted
quoted
+ .ctx_arg_info_size = 2, + .ctx_arg_info = { + { offsetof(struct bpf_iter__io_uring_buf, ctx), + PTR_TO_BTF_ID }, + { offsetof(struct bpf_iter__io_uring_buf, ubuf), + PTR_TO_BTF_ID_OR_NULL }, + }, + .seq_info = &bpf_io_uring_buf_seq_info, +}; + +static int __init io_uring_iter_init(void) +{ + io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0]; + io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1]; + return bpf_iter_reg_target(&io_uring_buf_reg_info); +} +late_initcall(io_uring_iter_init); + +#endifdiff --git a/include/linux/bpf.h b/include/linux/bpf.h index 56098c866704..ddb9d4520a3f 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h@@ -1509,8 +1509,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags); extern int bpf_iter_ ## target(args); \ int __init bpf_iter_ ## target(args) { return 0; } +struct io_ring_ctx; struct bpf_iter_aux_info { struct bpf_map *map; + struct io_ring_ctx *ctx; };Can we use union here? Note that below bpf_iter_link_info in uapi/linux/bpf.h, map_fd and io_uring_fd is also an union.So the reason I didn't use a union was the link->aux.map check in bpf_iter.c::__get_seq_info. Even if we switch to union bpf_iter_aux_info, it needs some way to determine whether link is for map type, so maybe a string comparison there? Leaving it out of union felt cleaner, also I move both io_ring_ctx and eventpoll file into a union in later patch.
I see. the seq_ops for map element iterator is different from others. the seq_ops is not from main iter registration but from map_ops. I think your change is okay. But maybe a comment to explain why map is different from others in struct bpf_iter_aux_info.
quoted
quoted
typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 6297eafdc40f..3323defa99a1 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h@@ -91,6 +91,9 @@ union bpf_iter_link_info { struct { __u32 map_fd; } map; + struct { + __u32 io_uring_fd; + } io_uring; }; /* BPF syscall commands, see bpf(2) man-page for more details. */diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 6297eafdc40f..3323defa99a1 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h@@ -91,6 +91,9 @@ union bpf_iter_link_info { struct { __u32 map_fd; } map; + struct { + __u32 io_uring_fd; + } io_uring; }; /* BPF syscall commands, see bpf(2) man-page for more details. */-- Kartikeya