Thread (17 messages) 17 messages, 7 authors, 2019-06-07

Re: [PATCH bpf v2] bpf: preallocate a perf_sample_data per event fd

From: Alexei Starovoitov <hidden>
Date: 2019-06-01 03:13:05
Also in: bpf, lkml

On Fri, May 31, 2019 at 6:28 PM Song Liu [off-list ref] wrote:

quoted
On May 31, 2019, at 3:37 PM, Matt Mullins [off-list ref] wrote:

It is possible that a BPF program can be called while another BPF
program is executing bpf_perf_event_output.  This has been observed with
I/O completion occurring as a result of an interrupt:

      bpf_prog_247fd1341cddaea4_trace_req_end+0x8d7/0x1000
      ? trace_call_bpf+0x82/0x100
      ? sch_direct_xmit+0xe2/0x230
      ? blk_mq_end_request+0x1/0x100
      ? blk_mq_end_request+0x5/0x100
      ? kprobe_perf_func+0x19b/0x240
      ? __qdisc_run+0x86/0x520
      ? blk_mq_end_request+0x1/0x100
      ? blk_mq_end_request+0x5/0x100
      ? kprobe_ftrace_handler+0x90/0xf0
      ? ftrace_ops_assist_func+0x6e/0xe0
      ? ip6_input_finish+0xbf/0x460
      ? 0xffffffffa01e80bf
      ? nbd_dbg_flags_show+0xc0/0xc0 [nbd]
      ? blkdev_issue_zeroout+0x200/0x200
      ? blk_mq_end_request+0x1/0x100
      ? blk_mq_end_request+0x5/0x100
      ? flush_smp_call_function_queue+0x6c/0xe0
      ? smp_call_function_single_interrupt+0x32/0xc0
      ? call_function_single_interrupt+0xf/0x20
      ? call_function_single_interrupt+0xa/0x20
      ? swiotlb_map_page+0x140/0x140
      ? refcount_sub_and_test+0x1a/0x50
      ? tcp_wfree+0x20/0xf0
      ? skb_release_head_state+0x62/0xc0
      ? skb_release_all+0xe/0x30
      ? napi_consume_skb+0xb5/0x100
      ? mlx5e_poll_tx_cq+0x1df/0x4e0
      ? mlx5e_poll_tx_cq+0x38c/0x4e0
      ? mlx5e_napi_poll+0x58/0xc30
      ? mlx5e_napi_poll+0x232/0xc30
      ? net_rx_action+0x128/0x340
      ? __do_softirq+0xd4/0x2ad
      ? irq_exit+0xa5/0xb0
      ? do_IRQ+0x7d/0xc0
      ? common_interrupt+0xf/0xf
      </IRQ>
      ? __rb_free_aux+0xf0/0xf0
      ? perf_output_sample+0x28/0x7b0
      ? perf_prepare_sample+0x54/0x4a0
      ? perf_event_output+0x43/0x60
      ? bpf_perf_event_output_raw_tp+0x15f/0x180
      ? blk_mq_start_request+0x1/0x120
      ? bpf_prog_411a64a706fc6044_should_trace+0xad4/0x1000
      ? bpf_trace_run3+0x2c/0x80
      ? nbd_send_cmd+0x4c2/0x690 [nbd]

This also cannot be alleviated by further splitting the per-cpu
perf_sample_data structs (as in commit 283ca526a9bd ("bpf: fix
corruption on concurrent perf_event_output calls")), as a raw_tp could
be attached to the block:block_rq_complete tracepoint and execute during
another raw_tp.  Instead, keep a pre-allocated perf_sample_data
structure per perf_event_array element and fail a bpf_perf_event_output
if that element is concurrently being used.

Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data")
Signed-off-by: Matt Mullins <redacted>
This looks great. Thanks for the fix.

Acked-by: Song Liu <redacted>
quoted
---
v1->v2:
      keep a pointer to the struct perf_sample_data rather than directly
      embedding it in the structure, avoiding the circular include and
      removing the need for in_use.  Suggested by Song.

include/linux/bpf.h      |  1 +
kernel/bpf/arraymap.c    |  3 ++-
kernel/trace/bpf_trace.c | 29 ++++++++++++++++-------------
3 files changed, 19 insertions(+), 14 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4fb3aa2dc975..47fd85cfbbaf 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -472,6 +472,7 @@ struct bpf_event_entry {
      struct file *perf_file;
      struct file *map_file;
      struct rcu_head rcu;
+     struct perf_sample_data *sd;
};

bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 584636c9e2eb..c7f5d593e04f 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -654,11 +654,12 @@ static struct bpf_event_entry *bpf_event_entry_gen(struct file *perf_file,
{
      struct bpf_event_entry *ee;

-     ee = kzalloc(sizeof(*ee), GFP_ATOMIC);
+     ee = kzalloc(sizeof(*ee) + sizeof(struct perf_sample_data), GFP_ATOMIC);
      if (ee) {
              ee->event = perf_file->private_data;
              ee->perf_file = perf_file;
              ee->map_file = map_file;
+             ee->sd = (void *)ee + sizeof(*ee);
This bit looks quite weird, but I don't have better ideas
to avoid circular .h pain.

Applied to bpf tree. Thanks
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help