Re: [PATCH bpf v2] bpf: preallocate a perf_sample_data per event fd

From: Matt Mullins <hidden>
Date: 2019-06-04 03:19:08
Also in: bpf, lkml

On Tue, 2019-06-04 at 02:43 +0200, Daniel Borkmann wrote:

On 06/04/2019 01:54 AM, Alexei Starovoitov wrote:

quoted

On Mon, Jun 3, 2019 at 4:48 PM Daniel Borkmann [off-list ref] wrote:

quoted

On 06/04/2019 01:27 AM, Alexei Starovoitov wrote:

quoted

On Mon, Jun 3, 2019 at 3:59 PM Matt Mullins [off-list ref] wrote:

quoted

If these are invariably non-nested, I can easily keep bpf_misc_sd when
I resubmit.  There was no technical reason other than keeping the two
codepaths as similar as possible.

What resource gives you worry about doing this for the networking
codepath?

my preference would be to keep tracing and networking the same.
there is already minimal nesting in networking and probably we see
more when reuseport progs will start running from xdp and clsbpf

quoted

Aside from that it's also really bad to miss events like this as exporting
through rb is critical. Why can't you have a per-CPU counter that selects a
sample data context based on nesting level in tracing? (I don't see a discussion
of this in your commit message.)

This change would only drop messages if the same perf_event is
attempted to be used recursively (i.e. the same CPU on the same
PERF_EVENT_ARRAY map, as I haven't observed anything use index !=
BPF_F_CURRENT_CPU in testing).

I'll try to accomplish the same with a percpu nesting level and
allocating 2 or 3 perf_sample_data per cpu.  I think that'll solve the
same problem -- a local patch keeping track of the nesting level is how
I got the above stack trace, too.

I don't think counter approach works. The amount of nesting is unknown.
imo the approach taken in this patch is good.
I don't see any issue when event_outputs will be dropped for valid progs.
Only when user called the helper incorrectly without BPF_F_CURRENT_CPU.
But that's an error anyway.

My main worry with this xchg() trick is that we'll miss to export crucial
data with the EBUSY bailing out especially given nesting could increase in
future as you state, so users might have a hard time debugging this kind of
issue if they share the same perf event map among these programs, and no
option to get to this data otherwise. Supporting nesting up to a certain
level would still be better than a lost event which is also not reported
through the usual way aka perf rb.

Tracing can already be lossy: trace_call_bpf() silently simply doesn't
call the prog and instead returns zero if bpf_prog_active != 1.

quoted

I simply don't see this 'miss to export data' in all but contrived conditions.
Say two progs share the same perf event array.
One prog calls event_output and while rb logic is working
another prog needs to start executing and use the same event array

Correct.

quoted

slot. Today it's only possible for tracing prog combined with networking,
but having two progs use the same event output array is pretty much
a user bug. Just like not passing BPF_F_CURRENT_CPU.

I don't see the user bug part, why should that be a user bug? It's the same
as if we would say that sharing a BPF hash map between networking programs
attached to different hooks or networking and tracing would be a user bug
which it is not. One concrete example would be cilium monitor where we
currently expose skb trace and drop events a well as debug data through
the same rb. This should be usable from any type that has perf_event_output
helper enabled (e.g. XDP and tc/BPF) w/o requiring to walk yet another per
cpu mmap rb from user space.

Neither of these solutions would affect the behavior of sharing the
perf array between networking programs -- since they're never called in
a nested fashion, then you'll never hit the "xchg() returned NULL" at
all.

That said, I think I can logically limit nesting in tracing to 3
levels:
  - a kprobe or raw tp or perf event,
  - another one of the above that irq context happens to run, and
  - if we're really unlucky, the same in nmi
at most one of which can be a kprobe or perf event.

There is also a comment

/*
 * bpf_raw_tp_regs are separate from bpf_pt_regs used from skb/xdp
 * to avoid potential recursive reuse issue when/if tracepoints are added
 * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack
 */

that suggests that one day bpf_perf_event_output might grow a static
tracepoint.  However, if the program attached to such a hypothetical
tracepoint were to call bpf_perf_event_output, that would infinitely
recurse ... it seems fine to let that case return -EBUSY as well.  It
does make me wonder if I should do the same nesting for the pt_regs.

I've now got an experiment running with the counter approach, and the
workload that I hit the original crash with seems to be fine with 2
layers' worth -- though if we decide that's the one I should move
forward with, I'll probably bump it to a third to be safe for an NMI.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help