Re: [PATCH bpf v2] bpf: preallocate a perf_sample_data per event fd

From: Matt Mullins <hidden>
Date: 2019-06-03 22:59:14
Also in: bpf, lkml

On Mon, 2019-06-03 at 15:22 +0200, Daniel Borkmann wrote:

On 06/03/2019 03:08 PM, Daniel Borkmann wrote:

quoted

On 06/01/2019 12:37 AM, Matt Mullins wrote:

quoted

It is possible that a BPF program can be called while another BPF
program is executing bpf_perf_event_output.  This has been observed with
I/O completion occurring as a result of an interrupt:

	bpf_prog_247fd1341cddaea4_trace_req_end+0x8d7/0x1000
	? trace_call_bpf+0x82/0x100
	? sch_direct_xmit+0xe2/0x230
	? blk_mq_end_request+0x1/0x100
	? blk_mq_end_request+0x5/0x100
	? kprobe_perf_func+0x19b/0x240
	? __qdisc_run+0x86/0x520
	? blk_mq_end_request+0x1/0x100
	? blk_mq_end_request+0x5/0x100
	? kprobe_ftrace_handler+0x90/0xf0
	? ftrace_ops_assist_func+0x6e/0xe0
	? ip6_input_finish+0xbf/0x460
	? 0xffffffffa01e80bf
	? nbd_dbg_flags_show+0xc0/0xc0 [nbd]
	? blkdev_issue_zeroout+0x200/0x200
	? blk_mq_end_request+0x1/0x100
	? blk_mq_end_request+0x5/0x100
	? flush_smp_call_function_queue+0x6c/0xe0
	? smp_call_function_single_interrupt+0x32/0xc0
	? call_function_single_interrupt+0xf/0x20
	? call_function_single_interrupt+0xa/0x20
	? swiotlb_map_page+0x140/0x140
	? refcount_sub_and_test+0x1a/0x50
	? tcp_wfree+0x20/0xf0
	? skb_release_head_state+0x62/0xc0
	? skb_release_all+0xe/0x30
	? napi_consume_skb+0xb5/0x100
	? mlx5e_poll_tx_cq+0x1df/0x4e0
	? mlx5e_poll_tx_cq+0x38c/0x4e0
	? mlx5e_napi_poll+0x58/0xc30
	? mlx5e_napi_poll+0x232/0xc30
	? net_rx_action+0x128/0x340
	? __do_softirq+0xd4/0x2ad
	? irq_exit+0xa5/0xb0
	? do_IRQ+0x7d/0xc0
	? common_interrupt+0xf/0xf
	</IRQ>
	? __rb_free_aux+0xf0/0xf0
	? perf_output_sample+0x28/0x7b0
	? perf_prepare_sample+0x54/0x4a0
	? perf_event_output+0x43/0x60
	? bpf_perf_event_output_raw_tp+0x15f/0x180
	? blk_mq_start_request+0x1/0x120
	? bpf_prog_411a64a706fc6044_should_trace+0xad4/0x1000
	? bpf_trace_run3+0x2c/0x80
	? nbd_send_cmd+0x4c2/0x690 [nbd]

This also cannot be alleviated by further splitting the per-cpu
perf_sample_data structs (as in commit 283ca526a9bd ("bpf: fix
corruption on concurrent perf_event_output calls")), as a raw_tp could
be attached to the block:block_rq_complete tracepoint and execute during
another raw_tp.  Instead, keep a pre-allocated perf_sample_data
structure per perf_event_array element and fail a bpf_perf_event_output
if that element is concurrently being used.

Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data")
Signed-off-by: Matt Mullins <redacted>

You do not elaborate why is this needed for all the networking programs that
use this functionality. The bpf_misc_sd should therefore be kept as-is. There
cannot be nested occurrences there (xdp, tc ingress/egress). Please explain why
non-tracing should be affected here...

If these are invariably non-nested, I can easily keep bpf_misc_sd when
I resubmit.  There was no technical reason other than keeping the two
codepaths as similar as possible.

What resource gives you worry about doing this for the networking
codepath?

Aside from that it's also really bad to miss events like this as exporting
through rb is critical. Why can't you have a per-CPU counter that selects a
sample data context based on nesting level in tracing? (I don't see a discussion
of this in your commit message.)

This change would only drop messages if the same perf_event is
attempted to be used recursively (i.e. the same CPU on the same
PERF_EVENT_ARRAY map, as I haven't observed anything use index !=
BPF_F_CURRENT_CPU in testing).

I'll try to accomplish the same with a percpu nesting level and
allocating 2 or 3 perf_sample_data per cpu.  I think that'll solve the
same problem -- a local patch keeping track of the nesting level is how
I got the above stack trace, too.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help