Re: [bpf-next V5 PATCH 14/15] xdp: transition into using xdp_frame for return API

From: Jesper Dangaard Brouer <hidden>
Date: 2018-03-26 11:43:02

On Fri, 23 Mar 2018 10:29:29 -0700 Alexander Duyck [off-list ref] wrote:

On Fri, Mar 23, 2018 at 5:19 AM, Jesper Dangaard Brouer
[off-list ref] wrote:

quoted

Changing API xdp_return_frame() to take struct xdp_frame as argument,
seems like a natural choice. But there are some subtle performance
details here that needs extra care, which is a deliberate choice.

When de-referencing xdp_frame on a remote CPU during DMA-TX
completion, result in the cache-line is change to "Shared"
state. Later when the page is reused for RX, then this xdp_frame
cache-line is written, which change the state to "Modified".

This situation already happens (naturally) for, virtio_net, tun and
cpumap as the xdp_frame pointer is the queued object.  In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-referencing xdp_frame.

It is only the ixgbe driver that had an optimization, in which it can
avoid doing the de-reference of xdp_frame.  The driver already have
TX-ring queue, which (in case of remote DMA-TX completion) have to be
transferred between CPUs anyhow.  In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.

To compensate for this, a prefetchw is used for telling the cache
coherency protocol about our access pattern.  My benchmarks show that
this prefetchw is enough to compensate the ixgbe driver.

Signed-off-by: Jesper Dangaard Brouer <redacted>

I'm really not a fan of this patch. It seems like it is adding a ton
of overhead for one case that is going to penalize all of the other
use cases for XDP. Just the extra prefetch is going to have a
significant impact on things like the XDP_DROP case.

[...]

quoted

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ff069597fccf..e6e9b28ecfba 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c

@@ -1207,7 +1207,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,

                /* free the skb */
                if (ring_is_xdp(tx_ring))
-                       xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
+                       xdp_return_frame(tx_buffer->xdpf);
                else
                        napi_consume_skb(tx_buffer->skb, napi_budget);

@@ -2376,6 +2376,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
                        xdp.data_hard_start = xdp.data -
                                              ixgbe_rx_offset(rx_ring);
                        xdp.data_end = xdp.data + size;
+                       prefetchw(xdp.data_hard_start); /* xdp_frame write */

                        skb = ixgbe_run_xdp(adapter, rx_ring, &xdp);
                }

Do we really need to be prefetching yet another cache line? This is
going to hurt general performance since for most cases you are now
prefetching a cache line that will never be used.

My intuition were also that this would hurt performance.  But I've done
many tests, and they all show no performance regression from this change!

XDP_DROP testing with xdp1 on ixgbe:

 Baseline: 14,703,344 pkt/s
 Patched:  14,711,574 pkt/s

For people reproducing, notice that it requires tuning on the generator
side (with pktgen) to reach these extreme speeds.

As you might have noticed, the prefetchw is only active when a XDP
program is loaded.  Thus, I created a benchmark[1] that loads an XDP
program and always returns XDP_PASS (via --action)

Then, I drop packets in iptables raw table:

 Baseline: 5,783,509 pps
 Patched:  5,789,195 pps

Then unload netfilter modules and let packets reach UDP step
"UdpNoPorts" listening stage:

 Baseline: 3,832,956 pps
 Patched:  3,855,470 pps

Then, add a udp_sink (--recvmsg) to allow packets to travel deeper into
the network stack (and force udp_sink to run on another CPU):

 Baseline: 2,278,855 pps
 Patched:  2,270,373 pps

By measurements, it should be clear that this patchset does not add "a
ton of overhead" and does not "penalize all of the other use cases for
XDP".

For the XDP_REDIRECT use-case, I actually find the prefetchw() quite
beautiful, because xdp_buff (which is stack variable) is used by the
bpf_prog, and the prefetch have time to fetch the memory area that the
conversion of xdp_buff to xdp_frame goes into, and xdp_buff is known to
be hot.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_bench01_mem_access_cost_kern.c

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help