Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
From: Andrii Nakryiko <hidden>
Date: 2020-05-14 20:19:04
Also in:
bpf
On Thu, May 14, 2020 at 10:33 AM [off-list ref] wrote:
On 05/13, Andrii Nakryiko wrote:quoted
This commits adds a new MPSC ring buffer implementation into BPF ecosystem, which allows multiple CPUs to submit data to a single shared ring buffer. On the consumption side, only single consumer is assumed.
[...]
quoted
Comparison to alternatives -------------------------- Before considering implementing BPF ring buffer from scratch existing alternatives in kernel were evaluated, but didn't seem to meet the needs. They largely fell into few categores: - per-CPU buffers (perf, ftrace, etc), which don't satisfy two motivations outlined above (ordering and memory consumption); - linked list-based implementations; while some were multi-producer designs, consuming these from user-space would be very complicated and most probably not performant; memory-mapping contiguous piece of memory is simpler and more performant for user-space consumers; - io_uring is SPSC, but also requires fixed-sized elements. Naively turning SPSC queue into MPSC w/ lock would have subpar performance compared to locked reserve + lockless commit, as with BPF ring buffer. Fixed sized elements would be too limiting for BPF programs, given existing BPF programs heavily rely on variable-sized perf buffer already; - specialized implementations (like a new printk ring buffer, [0]) with lots of printk-specific limitations and implications, that didn't seem to fit well for intended use with BPF programs.That's a very nice write up! Does it make sense to put most of it under Documentation/bpf? We were discussing socket storage with KP recently and I mentioned that commit 6ac99e8f23d4 has a really nice description of the architecture with ascii diagrams/etc. Sometimes it's really hard to chase down the commit history to find out these sorts of details.
Sure, can do that. And thanks :)
quoted
[0] https://lwn.net/Articles/779550/quoted
Signed-off-by: Andrii Nakryiko <redacted> --- include/linux/bpf.h | 12 + include/linux/bpf_types.h | 1 + include/linux/bpf_verifier.h | 4 + include/uapi/linux/bpf.h | 33 ++- kernel/bpf/Makefile | 2 +- kernel/bpf/helpers.c | 8 + kernel/bpf/ringbuf.c | 409 +++++++++++++++++++++++++++++++++ kernel/bpf/syscall.c | 12 + kernel/bpf/verifier.c | 156 ++++++++++--- kernel/trace/bpf_trace.c | 8 + tools/include/uapi/linux/bpf.h | 33 ++- 11 files changed, 643 insertions(+), 35 deletions(-) create mode 100644 kernel/bpf/ringbuf.c
[...]
quoted
+ * void bpf_ringbuf_submit(void *data) + * Description + * Submit reserved ring buffer sample, pointed to by *data*. + * Return + * Nothing.Even though you mention self-pacing properties, would it still make sense to add some argument to bpf_ringbuf_submit/bpf_ringbuf_output to indicate whether to wake up userspace or not? Maybe something like a threshold of number of outstanding events in the ringbuf after which we do the wakeup? The default 0/1 preserve the existing behavior. The example I can give is a control plane userspace thread that once a second aggregates the events, it doesn't care about millisecond resolution. With the current scheme, I suppose, if BPF generates events every 1ms, the userspace will be woken up 1000 times (if it can keep up). Most of the time, we don't really care and some buffering properties are desired.
perf buffer has setting like this, and believe me, it's so confusing and dangerous, that I wouldn't want this to be exposed. Even though I was aware of this behavior, I still had to debug and work-around this lack on wakeup few times, it's really-really confusing feature. In your case, though, why wouldn't user-space poll data just once a second, if it's not interested in getting data as fast as possible? [...]
quoted
+struct bpf_ringbuf { + wait_queue_head_t waitq; + u64 mask; + spinlock_t spinlock ____cacheline_aligned_in_smp; + u64 consumer_pos __aligned(PAGE_SIZE); + u64 producer_pos __aligned(PAGE_SIZE); + char data[] __aligned(PAGE_SIZE); +}; + +struct bpf_ringbuf_map { + struct bpf_map map; + struct bpf_map_memory memory; + struct bpf_ringbuf *rb; +}; + +static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node) +{ + const gfp_t flags = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN | + __GFP_ZERO; + int nr_meta_pages = 2 + BPF_RINGBUF_PGOFF;There is a bunch of magic '2's scattered around. Would it make sense to use a proper define (with a comment) instead?
Yep, it's consumer + producer counter pages, I'll add a constant.
quoted
+ int nr_data_pages = data_sz >> PAGE_SHIFT; + int nr_pages = nr_meta_pages + nr_data_pages; + struct page **pages, *page; + size_t array_size; + void *addr; + int i; +
[...] Please trim. I do love my code of course, but scrolling through it so many times gets old still ;)