Re: [PATCH v2 bpf-next 4/4] selftest/bpf/benchs: add bpf_loop benchmark

From: Toke Høiland-Jørgensen <hidden>
Date: 2021-11-25 11:37:30

Joanne Koong [off-list ref] writes:

On 11/24/21 1:59 PM, Toke Høiland-Jørgensen wrote:

quoted

Andrii Nakryiko [off-list ref] writes:

quoted

On Wed, Nov 24, 2021 at 4:56 AM Toke Høiland-Jørgensen [off-list ref] wrote:

quoted

Joanne Koong [off-list ref] writes:

quoted

On 11/23/21 11:19 AM, Toke Høiland-Jørgensen wrote:

quoted

Joanne Koong [off-list ref] writes:

quoted

Add benchmark to measure the throughput and latency of the bpf_loop
call.

Testing this on qemu on my dev machine on 1 thread, the data is
as follows:

          nr_loops: 1
bpf_loop - throughput: 43.350 ± 0.864 M ops/s, latency: 23.068 ns/op

          nr_loops: 10
bpf_loop - throughput: 69.586 ± 1.722 M ops/s, latency: 14.371 ns/op

          nr_loops: 100
bpf_loop - throughput: 72.046 ± 1.352 M ops/s, latency: 13.880 ns/op

          nr_loops: 500
bpf_loop - throughput: 71.677 ± 1.316 M ops/s, latency: 13.951 ns/op

          nr_loops: 1000
bpf_loop - throughput: 69.435 ± 1.219 M ops/s, latency: 14.402 ns/op

          nr_loops: 5000
bpf_loop - throughput: 72.624 ± 1.162 M ops/s, latency: 13.770 ns/op

          nr_loops: 10000
bpf_loop - throughput: 75.417 ± 1.446 M ops/s, latency: 13.260 ns/op

          nr_loops: 50000
bpf_loop - throughput: 77.400 ± 2.214 M ops/s, latency: 12.920 ns/op

          nr_loops: 100000
bpf_loop - throughput: 78.636 ± 2.107 M ops/s, latency: 12.717 ns/op

          nr_loops: 500000
bpf_loop - throughput: 76.909 ± 2.035 M ops/s, latency: 13.002 ns/op

          nr_loops: 1000000
bpf_loop - throughput: 77.636 ± 1.748 M ops/s, latency: 12.881 ns/op

  From this data, we can see that the latency per loop decreases as the
number of loops increases. On this particular machine, each loop had an
overhead of about ~13 ns, and we were able to run ~70 million loops
per second.

The latency figures are great, thanks! I assume these numbers are with
retpolines enabled? Otherwise 12ns seems a bit much... Or is this
because of qemu?

I just tested it on a machine (without retpoline enabled) that runs on
actual
hardware and here is what I found:

              nr_loops: 1
      bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op

              nr_loops: 10
      bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op

              nr_loops: 100
      bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op

              nr_loops: 500
      bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op

              nr_loops: 1000
      bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op

              nr_loops: 5000
      bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op

              nr_loops: 10000
      bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op

              nr_loops: 50000
      bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op

              nr_loops: 100000
      bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op

              nr_loops: 500000
      bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op

              nr_loops: 1000000
      bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op

The latency is about ~4ns / loop.

I will update the commit message in v3 with these new numbers as well.

Right, awesome, thank you for the additional test. This is closer to
what I would expect: on the hardware I'm usually testing on, a function
call takes ~1.5ns, but the difference might just be the hardware, or
because these are indirect calls.

Another comparison just occurred to me (but it's totally OK if you don't
want to add any more benchmarks):

The difference between a program that does:

bpf_loop(nr_loops, empty_callback, NULL, 0);

and

for (i = 0; i < nr_loops; i++)
   empty_callback();

You are basically trying to measure the overhead of bpf_loop() helper
call itself, because other than that it should be identical.

No, I'm trying to measure the difference between the indirect call in
the helper, and the direct call from the BPF program. Should be minor
without retpolines, and somewhat higher where they are enabled...

quoted

We can estimate that already from the numbers Joanne posted above:

              nr_loops: 1
       bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
              nr_loops: 1000
       bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op

nr_loops:1 is bpf_loop() overhead and one static callback call.
bpf_loop()'s own overhead will be in the ballpark of 21.4 - 3.8 =
17.6ns. I don't think we need yet another benchmark just for this.

That seems really high, though? The helper is a pretty simple function,
and the call to it should just be JIT'ed into a single regular function
call, right? So why the order-of-magnitude difference?

I think the overhead of triggering the bpf program from the userspace
benchmarking program is also contributing to this. When nr_loops = 1, we
have to do the context switch between userspace + kernel per every 1000 
loops;
this overhead also contributes to the latency numbers above

Right okay. But then that data point is not really measuring what it's
purporting to measure? That's a bit misleading, so maybe better to leave
it out entirely?

-Toke

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help