Re: [EXT] Re: [PATCH 1/1] net: fec: add initial XDP support

From: Jesper Dangaard Brouer <hidden>
Date: 2022-10-04 11:22:02
Also in: imx, lkml

On 03/10/2022 14.49, Shenwei Wang wrote:

Hi Jesper,

quoted

On mvneta driver/platform we saw huge speedup replacing:

     page_pool_release_page(rxq->page_pool, page); with
     skb_mark_for_recycle(skb);

After replacing the page_pool_release_page with the
skb_mark_for_recycle, I found something confused me a little in the
testing result. >
I tested with the sample app of "xdpsock" under two modes: 
 1. Native (xdpsock -i eth0). 
 2. Skb-mode (xdpsock -S -i eth0).

Great that you are also testing AF_XDP, but do you have a particular
use-case that needs AF_XDP on this board?

What packet size are used in below results?

The following are the testing result:

      With page_pool_release_page (pps)  With skb_mark_for_recycle (pps)

  SKB-Mode                          90K                            200K
  Native                           190K                            190K

The default AF_XDP test with xdpsock is rxdrop IIRC.

Can you test the normal XDP code path and do a XDP_DROP test via the
samples tool 'xdp_rxq_info' and cmdline:

   sudo ./xdp_rxq_info --dev eth42 --act XDP_DROP --read

And then same with --skb-mode

The skb_mark_for_recycle solution boosted the performance of SKB-Mode
to 200K+ PPS. That is even higher than the performance of Native
solution.  Is this result reasonable? Do you have any clue why the
SKB-Mode performance can go higher than that of Native one?

I might be able to explain this (Cc. AF_XDP maintainers to keep me honest).

When you say "native" *AF_XDP* that isn't Zero-Copy AF_XDP.

Sure, XDP runs in native driver mode and redirects the raw frames into 
the AF_XDP socket, but as this isn't zero-copy AF_XDP. Thus, the packets 
needs to be copied into the AF_XDP buffers.

As soon as the frame or SKB (for generic XDP) have been copied it is 
released/freed by AF_XDP/xsk code (either via xdp_return_buff() or 
consume_skb()). Thus, it looks like it really pays off to recycle the 
frame via page_pool, also for the SKB consume_skb() case.

I am still a little surprised that to can be faster than native AF_XDP, 
as the SKB-mode ("XDP-generic") needs to call through lot more software 
layers and convert the SKB to look like an xdp_buff.

--Jesper

quoted

-----Original Message-----
From: Jesper Dangaard Brouer <redacted>
Sent: Thursday, September 29, 2022 1:55 PM
To: Shenwei Wang <shenwei.wang@nxp.com>; Jesper Dangaard Brouer
[off-list ref]; Andrew Lunn [off-list ref]
Cc: brouer@redhat.com; Joakim Zhang <redacted>; David S.
Miller [off-list ref]; Eric Dumazet [off-list ref]; Jakub
Kicinski [off-list ref]; Paolo Abeni [off-list ref]; Alexei
Starovoitov [off-list ref]; Daniel Borkmann [off-list ref];
Jesper Dangaard Brouer [off-list ref]; John Fastabend
[off-list ref]; netdev@vger.kernel.org; linux-
kernel@vger.kernel.org; imx@lists.linux.dev
Subject: Re: [EXT] Re: [PATCH 1/1] net: fec: add initial XDP support

Caution: EXT Email

On 29/09/2022 17.52, Shenwei Wang wrote:

quoted

From: Jesper Dangaard Brouer <redacted>

On 29/09/2022 15.26, Shenwei Wang wrote:

quoted

From: Andrew Lunn <andrew@lunn.ch>
Sent: Thursday, September 29, 2022 8:23 AM

[...]

quoted

I actually did some compare testing regarding the page pool for
normal traffic.  So far I don't see significant improvement in the
current implementation. The performance for large packets improves
a little, and the performance for small packets get a little worse.

What hardware was this for? imx51? imx6? imx7 Vybrid? These all use the

FEC.

quoted

I tested on imx8qxp platform. It is ARM64.

On mvneta driver/platform we saw huge speedup replacing:

     page_pool_release_page(rxq->page_pool, page); with
     skb_mark_for_recycle(skb);

As I mentioned: Today page_pool have SKB recycle support (you might
have looked at drivers that didn't utilize this yet), thus you don't
need to release the page (page_pool_release_page) here.  Instead you
could simply mark the SKB for recycling, unless driver does some page refcnt

tricks I didn't notice.

quoted

On the mvneta driver/platform the DMA unmap (in
page_pool_release_page) was very expensive. This imx8qxp platform
might have faster DMA unmap in case is it cache-coherent.

I would be very interested in knowing if skb_mark_for_recycle() helps
on this platform, for normal network stack performance.

Did a quick compare testing for the following 3 scenarios:

Thanks for doing this! :-)

quoted

1. original implementation

shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1
------------------------------------------------------------
Client connecting to 10.81.16.245, TCP port 5001 TCP window size:  416
KByte (WARNING: requested 1.91 MByte)
------------------------------------------------------------
[  1] local 10.81.17.20 port 49154 connected with 10.81.16.245 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-1.0000 sec   104 MBytes   868 Mbits/sec
[  1] 1.0000-2.0000 sec   105 MBytes   878 Mbits/sec
[  1] 2.0000-3.0000 sec   105 MBytes   881 Mbits/sec
[  1] 3.0000-4.0000 sec   105 MBytes   879 Mbits/sec
[  1] 4.0000-5.0000 sec   105 MBytes   878 Mbits/sec
[  1] 5.0000-6.0000 sec   105 MBytes   878 Mbits/sec
[  1] 6.0000-7.0000 sec   104 MBytes   875 Mbits/sec
[  1] 7.0000-8.0000 sec   104 MBytes   875 Mbits/sec
[  1] 8.0000-9.0000 sec   104 MBytes   873 Mbits/sec
[  1] 9.0000-10.0000 sec   104 MBytes   875 Mbits/sec
[  1] 0.0000-10.0073 sec  1.02 GBytes   875 Mbits/sec

2. Page pool with page_pool_release_page

shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1
------------------------------------------------------------
Client connecting to 10.81.16.245, TCP port 5001 TCP window size:  416
KByte (WARNING: requested 1.91 MByte)
------------------------------------------------------------
[  1] local 10.81.17.20 port 35924 connected with 10.81.16.245 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-1.0000 sec   101 MBytes   849 Mbits/sec
[  1] 1.0000-2.0000 sec   102 MBytes   860 Mbits/sec
[  1] 2.0000-3.0000 sec   102 MBytes   860 Mbits/sec
[  1] 3.0000-4.0000 sec   102 MBytes   859 Mbits/sec
[  1] 4.0000-5.0000 sec   103 MBytes   863 Mbits/sec
[  1] 5.0000-6.0000 sec   103 MBytes   864 Mbits/sec
[  1] 6.0000-7.0000 sec   103 MBytes   863 Mbits/sec
[  1] 7.0000-8.0000 sec   103 MBytes   865 Mbits/sec
[  1] 8.0000-9.0000 sec   103 MBytes   862 Mbits/sec
[  1] 9.0000-10.0000 sec   102 MBytes   856 Mbits/sec
[  1] 0.0000-10.0246 sec  1.00 GBytes   858 Mbits/sec


3. page pool with skb_mark_for_recycle

shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1
------------------------------------------------------------
Client connecting to 10.81.16.245, TCP port 5001 TCP window size:  416
KByte (WARNING: requested 1.91 MByte)
------------------------------------------------------------
[  1] local 10.81.17.20 port 42724 connected with 10.81.16.245 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-1.0000 sec   111 MBytes   931 Mbits/sec
[  1] 1.0000-2.0000 sec   112 MBytes   935 Mbits/sec
[  1] 2.0000-3.0000 sec   111 MBytes   934 Mbits/sec
[  1] 3.0000-4.0000 sec   111 MBytes   934 Mbits/sec
[  1] 4.0000-5.0000 sec   111 MBytes   934 Mbits/sec
[  1] 5.0000-6.0000 sec   112 MBytes   935 Mbits/sec
[  1] 6.0000-7.0000 sec   111 MBytes   934 Mbits/sec
[  1] 7.0000-8.0000 sec   111 MBytes   933 Mbits/sec
[  1] 8.0000-9.0000 sec   112 MBytes   935 Mbits/sec
[  1] 9.0000-10.0000 sec   111 MBytes   933 Mbits/sec
[  1] 0.0000-10.0069 sec  1.09 GBytes   934 Mbits/sec

This is a very significant performance improvement (page pool with
skb_mark_for_recycle).  This is very close to the max goodput for a 1Gbit/s link.

quoted

For small packet size (64 bytes), all three cases have almost the same result:

To me this indicate, that the DMA map/unmap operations on this platform are
indeed more expensive on larger packets.  Given this is what page_pool does,
keeping the DMA mapping intact when recycling.

Driver still need DMA-sync, although I notice you set page_pool feature flag
PP_FLAG_DMA_SYNC_DEV, this is good as page_pool will try to reduce sync size
where possible. E.g. in this SKB case will reduce the DMA-sync to the
max_len=FEC_ENET_RX_FRSIZE which should also help on performance.

quoted

shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1 -l 64
------------------------------------------------------------
Client connecting to 10.81.16.245, TCP port 5001 TCP window size:  416
KByte (WARNING: requested 1.91 MByte)
------------------------------------------------------------
[  1] local 10.81.17.20 port 58204 connected with 10.81.16.245 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-1.0000 sec  36.9 MBytes   309 Mbits/sec
[  1] 1.0000-2.0000 sec  36.6 MBytes   307 Mbits/sec
[  1] 2.0000-3.0000 sec  36.6 MBytes   307 Mbits/sec
[  1] 3.0000-4.0000 sec  36.5 MBytes   307 Mbits/sec
[  1] 4.0000-5.0000 sec  37.1 MBytes   311 Mbits/sec
[  1] 5.0000-6.0000 sec  37.2 MBytes   312 Mbits/sec
[  1] 6.0000-7.0000 sec  37.1 MBytes   311 Mbits/sec
[  1] 7.0000-8.0000 sec  37.1 MBytes   311 Mbits/sec
[  1] 8.0000-9.0000 sec  37.1 MBytes   312 Mbits/sec
[  1] 9.0000-10.0000 sec  37.2 MBytes   312 Mbits/sec
[  1] 0.0000-10.0097 sec   369 MBytes   310 Mbits/sec

Regards,
Shenwei

quoted

By small packets, do you mean those under the copybreak limit?

Please provide some benchmark numbers with your next patchset.

Yes, the packet size is 64 bytes and it is under the copybreak limit.
As the impact is not significant, I would prefer to remove the
copybreak  logic.

+1 to removing this logic if possible, due to maintenance cost.

--Jesper

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help