Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it

From: Andrii Nakryiko <hidden>
Date: 2020-05-14 20:19:04
Also in: bpf

On Thu, May 14, 2020 at 10:33 AM [off-list ref] wrote:

On 05/13, Andrii Nakryiko wrote:

quoted

This commits adds a new MPSC ring buffer implementation into BPF
ecosystem,
which allows multiple CPUs to submit data to a single shared ring buffer.
On
the consumption side, only single consumer is assumed.

[...]

quoted

Comparison to alternatives
--------------------------
Before considering implementing BPF ring buffer from scratch existing
alternatives in kernel were evaluated, but didn't seem to meet the needs.
They
largely fell into few categores:
   - per-CPU buffers (perf, ftrace, etc), which don't satisfy two
motivations
     outlined above (ordering and memory consumption);
   - linked list-based implementations; while some were multi-producer
designs,
     consuming these from user-space would be very complicated and most
     probably not performant; memory-mapping contiguous piece of memory is
     simpler and more performant for user-space consumers;
   - io_uring is SPSC, but also requires fixed-sized elements. Naively
turning
     SPSC queue into MPSC w/ lock would have subpar performance compared to
     locked reserve + lockless commit, as with BPF ring buffer. Fixed sized
     elements would be too limiting for BPF programs, given existing BPF
     programs heavily rely on variable-sized perf buffer already;
   - specialized implementations (like a new printk ring buffer, [0]) with
lots
     of printk-specific limitations and implications, that didn't seem to
fit
     well for intended use with BPF programs.

That's a very nice write up! Does it make sense to put most of it
under Documentation/bpf? We were discussing socket storage with KP
recently and I mentioned that commit 6ac99e8f23d4 has a really nice
description of the architecture with ascii diagrams/etc. Sometimes
it's really hard to chase down the commit history to find out
these sorts of details.

Sure, can do that. And thanks :)

quoted

   [0] https://lwn.net/Articles/779550/

quoted

Signed-off-by: Andrii Nakryiko <redacted>
---
  include/linux/bpf.h            |  12 +
  include/linux/bpf_types.h      |   1 +
  include/linux/bpf_verifier.h   |   4 +
  include/uapi/linux/bpf.h       |  33 ++-
  kernel/bpf/Makefile            |   2 +-
  kernel/bpf/helpers.c           |   8 +
  kernel/bpf/ringbuf.c           | 409 +++++++++++++++++++++++++++++++++
  kernel/bpf/syscall.c           |  12 +
  kernel/bpf/verifier.c          | 156 ++++++++++---
  kernel/trace/bpf_trace.c       |   8 +
  tools/include/uapi/linux/bpf.h |  33 ++-
  11 files changed, 643 insertions(+), 35 deletions(-)
  create mode 100644 kernel/bpf/ringbuf.c

[...]

quoted

+ * void bpf_ringbuf_submit(void *data)
+ *   Description
+ *           Submit reserved ring buffer sample, pointed to by *data*.
+ *   Return
+ *           Nothing.

Even though you mention self-pacing properties, would it still
make sense to add some argument to bpf_ringbuf_submit/bpf_ringbuf_output
to indicate whether to wake up userspace or not? Maybe something like
a threshold of number of outstanding events in the ringbuf after which
we do the wakeup? The default 0/1 preserve the existing behavior.

The example I can give is a control plane userspace thread that
once a second aggregates the events, it doesn't care about millisecond
resolution. With the current scheme, I suppose, if BPF generates events
every 1ms, the userspace will be woken up 1000 times (if it can keep
up). Most of the time, we don't really care and some buffering
properties are desired.

perf buffer has setting like this, and believe me, it's so confusing
and dangerous, that I wouldn't want this to be exposed. Even though I
was aware of this behavior, I still had to debug and work-around this
lack on wakeup few times, it's really-really confusing feature.

In your case, though, why wouldn't user-space poll data just once a
second, if it's not interested in getting data as fast as possible?

[...]

quoted

+struct bpf_ringbuf {
+     wait_queue_head_t waitq;
+     u64 mask;
+     spinlock_t spinlock ____cacheline_aligned_in_smp;
+     u64 consumer_pos __aligned(PAGE_SIZE);
+     u64 producer_pos __aligned(PAGE_SIZE);
+     char data[] __aligned(PAGE_SIZE);
+};
+
+struct bpf_ringbuf_map {
+     struct bpf_map map;
+     struct bpf_map_memory memory;
+     struct bpf_ringbuf *rb;
+};
+
+static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int
numa_node)
+{
+     const gfp_t flags = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN |
+                         __GFP_ZERO;
+     int nr_meta_pages = 2 + BPF_RINGBUF_PGOFF;

There is a bunch of magic '2's scattered around. Would it make sense
to use a proper define (with a comment) instead?

Yep, it's consumer + producer counter pages, I'll add a constant.

quoted

+     int nr_data_pages = data_sz >> PAGE_SHIFT;
+     int nr_pages = nr_meta_pages + nr_data_pages;
+     struct page **pages, *page;
+     size_t array_size;
+     void *addr;
+     int i;
+

[...]

Please trim. I do love my code of course, but scrolling through it so
many times gets old still ;)

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help