Re: [EXT] Re: [PATCH 1/1] net: fec: add initial XDP support
From: Jesper Dangaard Brouer <hidden>
Date: 2022-10-04 11:22:02
Also in:
imx, lkml
On 03/10/2022 14.49, Shenwei Wang wrote:
Hi Jesper,quoted
quoted
quoted
On mvneta driver/platform we saw huge speedup replacing: page_pool_release_page(rxq->page_pool, page); with skb_mark_for_recycle(skb);After replacing the page_pool_release_page with the skb_mark_for_recycle, I found something confused me a little in the testing result. > I tested with the sample app of "xdpsock" under two modes: 1. Native (xdpsock -i eth0). 2. Skb-mode (xdpsock -S -i eth0).
Great that you are also testing AF_XDP, but do you have a particular use-case that needs AF_XDP on this board? What packet size are used in below results?
The following are the testing result:
>
With page_pool_release_page (pps) With skb_mark_for_recycle (pps) SKB-Mode 90K 200K Native 190K 190K
The default AF_XDP test with xdpsock is rxdrop IIRC. Can you test the normal XDP code path and do a XDP_DROP test via the samples tool 'xdp_rxq_info' and cmdline: sudo ./xdp_rxq_info --dev eth42 --act XDP_DROP --read And then same with --skb-mode
The skb_mark_for_recycle solution boosted the performance of SKB-Mode to 200K+ PPS. That is even higher than the performance of Native solution. Is this result reasonable? Do you have any clue why the SKB-Mode performance can go higher than that of Native one?
I might be able to explain this (Cc. AF_XDP maintainers to keep me honest).
When you say "native" *AF_XDP* that isn't Zero-Copy AF_XDP.
Sure, XDP runs in native driver mode and redirects the raw frames into
the AF_XDP socket, but as this isn't zero-copy AF_XDP. Thus, the packets
needs to be copied into the AF_XDP buffers.
As soon as the frame or SKB (for generic XDP) have been copied it is
released/freed by AF_XDP/xsk code (either via xdp_return_buff() or
consume_skb()). Thus, it looks like it really pays off to recycle the
frame via page_pool, also for the SKB consume_skb() case.
I am still a little surprised that to can be faster than native AF_XDP,
as the SKB-mode ("XDP-generic") needs to call through lot more software
layers and convert the SKB to look like an xdp_buff.
--Jesper
quoted
-----Original Message----- From: Jesper Dangaard Brouer <redacted> Sent: Thursday, September 29, 2022 1:55 PM To: Shenwei Wang <shenwei.wang@nxp.com>; Jesper Dangaard Brouer [off-list ref]; Andrew Lunn [off-list ref] Cc: brouer@redhat.com; Joakim Zhang <redacted>; David S. Miller [off-list ref]; Eric Dumazet [off-list ref]; Jakub Kicinski [off-list ref]; Paolo Abeni [off-list ref]; Alexei Starovoitov [off-list ref]; Daniel Borkmann [off-list ref]; Jesper Dangaard Brouer [off-list ref]; John Fastabend [off-list ref]; netdev@vger.kernel.org; linux- kernel@vger.kernel.org; imx@lists.linux.dev Subject: Re: [EXT] Re: [PATCH 1/1] net: fec: add initial XDP support Caution: EXT Email On 29/09/2022 17.52, Shenwei Wang wrote:quoted
quoted
From: Jesper Dangaard Brouer <redacted> On 29/09/2022 15.26, Shenwei Wang wrote:quoted
quoted
From: Andrew Lunn <andrew@lunn.ch> Sent: Thursday, September 29, 2022 8:23 AM[...]quoted
quoted
quoted
I actually did some compare testing regarding the page pool for normal traffic. So far I don't see significant improvement in the current implementation. The performance for large packets improves a little, and the performance for small packets get a little worse.What hardware was this for? imx51? imx6? imx7 Vybrid? These all use theFEC.quoted
quoted
quoted
I tested on imx8qxp platform. It is ARM64.On mvneta driver/platform we saw huge speedup replacing: page_pool_release_page(rxq->page_pool, page); with skb_mark_for_recycle(skb); As I mentioned: Today page_pool have SKB recycle support (you might have looked at drivers that didn't utilize this yet), thus you don't need to release the page (page_pool_release_page) here. Instead you could simply mark the SKB for recycling, unless driver does some page refcnttricks I didn't notice.quoted
quoted
On the mvneta driver/platform the DMA unmap (in page_pool_release_page) was very expensive. This imx8qxp platform might have faster DMA unmap in case is it cache-coherent. I would be very interested in knowing if skb_mark_for_recycle() helps on this platform, for normal network stack performance.Did a quick compare testing for the following 3 scenarios:Thanks for doing this! :-)quoted
1. original implementation shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1 ------------------------------------------------------------ Client connecting to 10.81.16.245, TCP port 5001 TCP window size: 416 KByte (WARNING: requested 1.91 MByte) ------------------------------------------------------------ [ 1] local 10.81.17.20 port 49154 connected with 10.81.16.245 port 5001 [ ID] Interval Transfer Bandwidth [ 1] 0.0000-1.0000 sec 104 MBytes 868 Mbits/sec [ 1] 1.0000-2.0000 sec 105 MBytes 878 Mbits/sec [ 1] 2.0000-3.0000 sec 105 MBytes 881 Mbits/sec [ 1] 3.0000-4.0000 sec 105 MBytes 879 Mbits/sec [ 1] 4.0000-5.0000 sec 105 MBytes 878 Mbits/sec [ 1] 5.0000-6.0000 sec 105 MBytes 878 Mbits/sec [ 1] 6.0000-7.0000 sec 104 MBytes 875 Mbits/sec [ 1] 7.0000-8.0000 sec 104 MBytes 875 Mbits/sec [ 1] 8.0000-9.0000 sec 104 MBytes 873 Mbits/sec [ 1] 9.0000-10.0000 sec 104 MBytes 875 Mbits/sec [ 1] 0.0000-10.0073 sec 1.02 GBytes 875 Mbits/sec 2. Page pool with page_pool_release_page shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1 ------------------------------------------------------------ Client connecting to 10.81.16.245, TCP port 5001 TCP window size: 416 KByte (WARNING: requested 1.91 MByte) ------------------------------------------------------------ [ 1] local 10.81.17.20 port 35924 connected with 10.81.16.245 port 5001 [ ID] Interval Transfer Bandwidth [ 1] 0.0000-1.0000 sec 101 MBytes 849 Mbits/sec [ 1] 1.0000-2.0000 sec 102 MBytes 860 Mbits/sec [ 1] 2.0000-3.0000 sec 102 MBytes 860 Mbits/sec [ 1] 3.0000-4.0000 sec 102 MBytes 859 Mbits/sec [ 1] 4.0000-5.0000 sec 103 MBytes 863 Mbits/sec [ 1] 5.0000-6.0000 sec 103 MBytes 864 Mbits/sec [ 1] 6.0000-7.0000 sec 103 MBytes 863 Mbits/sec [ 1] 7.0000-8.0000 sec 103 MBytes 865 Mbits/sec [ 1] 8.0000-9.0000 sec 103 MBytes 862 Mbits/sec [ 1] 9.0000-10.0000 sec 102 MBytes 856 Mbits/sec [ 1] 0.0000-10.0246 sec 1.00 GBytes 858 Mbits/sec 3. page pool with skb_mark_for_recycle shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1 ------------------------------------------------------------ Client connecting to 10.81.16.245, TCP port 5001 TCP window size: 416 KByte (WARNING: requested 1.91 MByte) ------------------------------------------------------------ [ 1] local 10.81.17.20 port 42724 connected with 10.81.16.245 port 5001 [ ID] Interval Transfer Bandwidth [ 1] 0.0000-1.0000 sec 111 MBytes 931 Mbits/sec [ 1] 1.0000-2.0000 sec 112 MBytes 935 Mbits/sec [ 1] 2.0000-3.0000 sec 111 MBytes 934 Mbits/sec [ 1] 3.0000-4.0000 sec 111 MBytes 934 Mbits/sec [ 1] 4.0000-5.0000 sec 111 MBytes 934 Mbits/sec [ 1] 5.0000-6.0000 sec 112 MBytes 935 Mbits/sec [ 1] 6.0000-7.0000 sec 111 MBytes 934 Mbits/sec [ 1] 7.0000-8.0000 sec 111 MBytes 933 Mbits/sec [ 1] 8.0000-9.0000 sec 112 MBytes 935 Mbits/sec [ 1] 9.0000-10.0000 sec 111 MBytes 933 Mbits/sec [ 1] 0.0000-10.0069 sec 1.09 GBytes 934 Mbits/secThis is a very significant performance improvement (page pool with skb_mark_for_recycle). This is very close to the max goodput for a 1Gbit/s link.quoted
For small packet size (64 bytes), all three cases have almost the same result:To me this indicate, that the DMA map/unmap operations on this platform are indeed more expensive on larger packets. Given this is what page_pool does, keeping the DMA mapping intact when recycling. Driver still need DMA-sync, although I notice you set page_pool feature flag PP_FLAG_DMA_SYNC_DEV, this is good as page_pool will try to reduce sync size where possible. E.g. in this SKB case will reduce the DMA-sync to the max_len=FEC_ENET_RX_FRSIZE which should also help on performance.quoted
shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1 -l 64 ------------------------------------------------------------ Client connecting to 10.81.16.245, TCP port 5001 TCP window size: 416 KByte (WARNING: requested 1.91 MByte) ------------------------------------------------------------ [ 1] local 10.81.17.20 port 58204 connected with 10.81.16.245 port 5001 [ ID] Interval Transfer Bandwidth [ 1] 0.0000-1.0000 sec 36.9 MBytes 309 Mbits/sec [ 1] 1.0000-2.0000 sec 36.6 MBytes 307 Mbits/sec [ 1] 2.0000-3.0000 sec 36.6 MBytes 307 Mbits/sec [ 1] 3.0000-4.0000 sec 36.5 MBytes 307 Mbits/sec [ 1] 4.0000-5.0000 sec 37.1 MBytes 311 Mbits/sec [ 1] 5.0000-6.0000 sec 37.2 MBytes 312 Mbits/sec [ 1] 6.0000-7.0000 sec 37.1 MBytes 311 Mbits/sec [ 1] 7.0000-8.0000 sec 37.1 MBytes 311 Mbits/sec [ 1] 8.0000-9.0000 sec 37.1 MBytes 312 Mbits/sec [ 1] 9.0000-10.0000 sec 37.2 MBytes 312 Mbits/sec [ 1] 0.0000-10.0097 sec 369 MBytes 310 Mbits/sec Regards, Shenweiquoted
quoted
quoted
By small packets, do you mean those under the copybreak limit? Please provide some benchmark numbers with your next patchset.Yes, the packet size is 64 bytes and it is under the copybreak limit. As the impact is not significant, I would prefer to remove the copybreak logic.+1 to removing this logic if possible, due to maintenance cost. --Jesper