Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

From: Jesper Dangaard Brouer <hidden>
Date: 2023-06-21 11:55:50
Also in: bpf, lkml


On 20/06/2023 23.16, Lorenzo Bianconi wrote:

[...]

quoted

I did some experiments using page_frag_cache/page_frag_alloc() instead of
page_pools in a simple environment I used to test XDP for veth driver.
In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
the page_frag_cache in order to copy the full skb in the new one, actually
"linearizing" the packet (since we know the original skb length).
I ran an iperf TCP connection over a veth pair where the
remote device runs the xdp_rxq_info sample (available in the kernel source
tree, with action XDP_PASS):

TCP client -- v0 === v1 (xdp_rxq_info) -- TCP server

net-next (page_pool):
- MTU 1500B: ~  7.5 Gbps
- MTU 8000B: ~ 15.3 Gbps

net-next + page_frag_alloc:
- MTU 1500B: ~  8.4 Gbps
- MTU 8000B: ~ 14.7 Gbps

It seems there is no clear "win" situation here (at least in this environment
and with this simple approach). Moreover:

For the 1500B packets it is a win, but for 8000B it looks like there
is a regression. Any idea what is causing it?

nope, I have not looked into it yet.

I think I can explain it using micro-benchmark numbers.
(Lorenzo and I have discussed this over IRC, so this is our summary)

*** MTU 1500 ***

* The MTU 1500 case, where page_frag_alloc is faster than PP (page_pool):

The PP allocates a 4K page for MTU 1500. The cost of alloc + recycle via
ptr_ring is 48 cycles (page_pool02_ptr_ring Per elem: 48 cycles(tsc)).

The page_frag_alloc API allocates a 32KB order-3 page, and chops it up
for packets.  The order-3 alloc + free costs 514 cycles (page_bench01:
alloc_pages order:3(32768B) 514 cycles). MTU 1500 needs an alloc size of
1514+320+256 = 2090 bytes.  In 32KB we can fit 15 packets
(32768/2090 = 15.68).  Thus, the amortized cost per packet is only
34.3 cycles (514/15).

Thus, this explains why the page_frag_alloc API has an advantage here, as
the amortized cost per packet is lower (for page_frag_alloc).
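
To make the arithmetic above easy to re-run with other numbers, here is a minimal sketch in Python. It only reproduces the figures quoted in this mail; the variable names are made up for illustration, and reading 320 and 256 as tailroom and headroom is an assumption, since the mail only gives the numbers.

```python
# Back-of-the-envelope check of the MTU 1500 numbers quoted above.
PAGE_FRAG_BYTES = 32 * 1024   # order-3 page used by page_frag_alloc
ORDER3_CYCLES = 514           # page_bench01: alloc_pages order:3
PP_CYCLES = 48                # page_pool02_ptr_ring, per element

# 1514 is MTU 1500 plus the 14-byte Ethernet header; 320 and 256 are the
# extra bytes from the mail (assumed here to be tailroom and headroom).
buf_1500 = 1514 + 320 + 256

pkts_per_page = PAGE_FRAG_BYTES // buf_1500   # whole packets only
frag_cost = ORDER3_CYCLES / pkts_per_page     # amortized cycles per packet

print(f"buffer size       : {buf_1500} bytes")
print(f"packets per 32KB  : {pkts_per_page}")
print(f"page_frag_alloc   : {frag_cost:.1f} cycles/packet")
print(f"page_pool (4K)    : {PP_CYCLES} cycles/packet")
```

With the measured values this gives 15 packets per 32KB page and roughly 34.3 cycles per packet for page_frag_alloc, against 48 for the PP.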


*** MTU 8000 ***

* The MTU 8000 case, where PP is faster than page_frag_alloc.

The page_frag_alloc API cannot slice the same 32KB into as many packets.
MTU 8000 needs an alloc size of 8000+14+256+320 = 8590 bytes.  A 32KB
page can thus only store 3 full packets (32768/8590 = 3.81).
Thus, the cost is 514/3 = 171 cycles.

The PP is actually challenged at MTU 8000, because it unfortunately
leads to allocating 3 full pages (12KiB), due to the needed alloc size of
8590 bytes. Thus the cost is 3x 48 cycles = 144 cycles.
(There is also a chance that Jakub's "allow_direct" optimization in
page_pool_return_skb_page will increase performance for PP).

Thus, this explains why PP is fastest in this case.
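
The same comparison for MTU 8000 can be sketched as below. Again this is only a restatement of the numbers in this mail, with illustrative variable names; it assumes the PP cost scales linearly with the number of 4K pages, which is how the 3x 48 figure above is derived.

```python
# Back-of-the-envelope check of the MTU 8000 numbers quoted above.
PAGE_BYTES = 4096             # single page handed out by page_pool
PAGE_FRAG_BYTES = 32 * 1024   # order-3 page used by page_frag_alloc
ORDER3_CYCLES = 514           # page_bench01: alloc_pages order:3
PP_CYCLES = 48                # page_pool02_ptr_ring, per element

buf_8000 = 8000 + 14 + 256 + 320   # alloc size from the mail

# page_frag_alloc: only whole packets fit into one 32KB page.
frag_pkts = PAGE_FRAG_BYTES // buf_8000
frag_cost = ORDER3_CYCLES / frag_pkts

# page_pool: round the buffer up to whole 4K pages, one alloc+recycle each.
pp_pages = (buf_8000 + PAGE_BYTES - 1) // PAGE_BYTES
pp_cost = pp_pages * PP_CYCLES

print(f"buffer size       : {buf_8000} bytes")
print(f"page_frag_alloc   : {frag_pkts} packets/32KB, {frag_cost:.1f} cycles/packet")
print(f"page_pool         : {pp_pages} pages, {pp_cost} cycles/packet")
```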


*** Surprising insights ***

My (maybe) surprising conclusion is that we should combine the two
approaches, which is basically what Lin's patchset is doing!
Thus, I've actually suddenly become a fan of this patchset...

The insight is that PP can also work with higher-order pages and the
cost of PP recycles via ptr_ring will be the same, regardless of page
order size.  Thus, we can reduce the order-3 cost from 514 cycles to
basically 48 cycles, and fit 15 packets (MTU 1500), resulting in an
amortized allocator cost of 48/15 = 3.2 cycles.

On the PP alloc-side this will be amazingly fast. On the PP recycle side
for frags (see page_pool_defrag_page()) there is an atomic_sub operation.
I've measured atomic_inc to cost 17 cycles (for the optimal non-contended
case), thus 3+17 = 20 cycles, so it should still be a win.
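
As a rough sketch, the combined estimate works out as follows. It assumes, as argued above, that recycling an order-3 page through the PP ptr_ring costs the same 48 cycles as a 4K page, and that the per-frag atomic costs about as much as the measured atomic_inc. Variable names are illustrative only.

```python
# Estimated per-packet cost when page_pool hands out frags of an order-3 page.
PP_CYCLES = 48        # page_pool02_ptr_ring, assumed independent of page order
ATOMIC_CYCLES = 17    # measured atomic_inc, non-contended case
PKTS_PER_PAGE = 15    # MTU 1500 buffers (2090 bytes) per 32KB page

alloc_cost = PP_CYCLES / PKTS_PER_PAGE    # amortized alloc + recycle
total_cost = alloc_cost + ATOMIC_CYCLES   # plus one atomic per frag

print(f"amortized alloc   : {alloc_cost:.1f} cycles/packet")
print(f"with frag atomic  : {total_cost:.1f} cycles/packet")
print("page_frag_alloc   : 34.3 cycles/packet (514/15)")
print("page_pool (4K)    : 48 cycles/packet")
```

Even with the atomic included, the estimate of about 20 cycles stays below both the 34.3 cycles of page_frag_alloc and the 48 cycles of a 4K-page PP at MTU 1500.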


--Jesper
