Re: [PATCH net-next 0/6] page_pool: recycle buffers
From: Alexander Lobakin <hidden>
Date: 2021-03-23 20:04:57
Also in:
lkml
From: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Date: Tue, 23 Mar 2021 19:01:52 +0200
On Tue, Mar 23, 2021 at 04:55:31PM +0000, Alexander Lobakin wrote:
[...]
> Thanks for the testing! Any chance you can get a perf measurement on this?
> I guess you mean perf-report (--stdio) output, right?
> Yeah. As hinted below, I am just trying to figure out whether, on Alexander's platform, the cost of syncing is bigger than that of free-allocate. I remember one armv7 where that was the case.
> Is DMA syncing taking a substantial amount of your CPU usage?
> (+1 this is an important question)
> Sure, I'll drop perf tools to my test env and share the results, maybe tomorrow or in a few days.
Oh we-e-e-ell...
Looks like I've been fooled by I-cache misses or smth like that.
That happens sometimes, not only on my machines, and not only on
MIPS if I'm not mistaken.
Sorry for confusing you guys.
I got drastically different numbers after I enabled CONFIG_KALLSYMS +
CONFIG_PERF_EVENTS for perf tools.
The only difference in code is that I rebased onto Mel's
mm-bulk-rebase-v6r4.
(lunar is my WIP NIC driver)
1. 5.12-rc3 baseline:
TCP: 566 Mbps
UDP: 615 Mbps
perf top:
4.44% [lunar] [k] lunar_rx_poll_page_pool
3.56% [kernel] [k] r4k_wait_irqoff
2.89% [kernel] [k] free_unref_page
2.57% [kernel] [k] dma_map_page_attrs
2.32% [kernel] [k] get_page_from_freelist
2.28% [lunar] [k] lunar_start_xmit
1.82% [kernel] [k] __copy_user
1.75% [kernel] [k] dev_gro_receive
1.52% [kernel] [k] cpuidle_enter_state_coupled
1.46% [kernel] [k] tcp_gro_receive
1.35% [kernel] [k] __rmemcpy
1.33% [nf_conntrack] [k] nf_conntrack_tcp_packet
1.30% [kernel] [k] __dev_queue_xmit
1.22% [kernel] [k] pfifo_fast_dequeue
1.17% [kernel] [k] skb_release_data
1.17% [kernel] [k] skb_segment
free_unref_page() and get_page_from_freelist() consume a lot.
2. 5.12-rc3 + Page Pool recycling by Matteo:
TCP: 589 Mbps
UDP: 633 Mbps
perf top:
4.27% [lunar] [k] lunar_rx_poll_page_pool
2.68% [lunar] [k] lunar_start_xmit
2.41% [kernel] [k] dma_map_page_attrs
1.92% [kernel] [k] r4k_wait_irqoff
1.89% [kernel] [k] __copy_user
1.62% [kernel] [k] dev_gro_receive
1.51% [kernel] [k] cpuidle_enter_state_coupled
1.44% [kernel] [k] tcp_gro_receive
1.40% [kernel] [k] __rmemcpy
1.38% [nf_conntrack] [k] nf_conntrack_tcp_packet
1.37% [kernel] [k] free_unref_page
1.35% [kernel] [k] __dev_queue_xmit
1.30% [kernel] [k] skb_segment
1.28% [kernel] [k] get_page_from_freelist
1.27% [kernel] [k] r4k_dma_cache_inv
Roughly +20 Mbps on both TCP (+23) and UDP (+18). free_unref_page() and
get_page_from_freelist() dropped down the list significantly.
3. 5.12-rc3 + Page Pool recycling + PP bulk allocator (Mel & Jesper):
TCP: 596 Mbps
UDP: 641 Mbps
perf top:
4.38% [lunar] [k] lunar_rx_poll_page_pool
3.34% [kernel] [k] r4k_wait_irqoff
3.14% [kernel] [k] dma_map_page_attrs
2.49% [lunar] [k] lunar_start_xmit
1.85% [kernel] [k] dev_gro_receive
1.76% [kernel] [k] free_unref_page
1.76% [kernel] [k] __copy_user
1.65% [kernel] [k] inet_gro_receive
1.57% [kernel] [k] tcp_gro_receive
1.48% [kernel] [k] cpuidle_enter_state_coupled
1.43% [nf_conntrack] [k] nf_conntrack_tcp_packet
1.42% [kernel] [k] __rmemcpy
1.25% [kernel] [k] skb_segment
1.21% [kernel] [k] r4k_dma_cache_inv
Roughly +10 Mbps on top of recycling (+7 TCP, +8 UDP).
get_page_from_freelist() is gone.
NAPI polling, CPU idle cycle (r4k_wait_irqoff) and DMA mapping
routine became the top consumers.
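For anyone who wants to recompute the deltas between the three runs, here is a minimal sketch in Python. It only uses the throughput figures quoted above; the names RUNS, gain() and compare() are made up for illustration and are not part of any tool used in this thread.

```python
# Throughput figures (Mbps) copied from the three runs above.
RUNS = {
    "baseline": {"TCP": 566, "UDP": 615},
    "recycling": {"TCP": 589, "UDP": 633},
    "recycling+bulk": {"TCP": 596, "UDP": 641},
}


def gain(old, new):
    """Return (absolute gain in Mbps, relative gain in percent)."""
    return new - old, 100.0 * (new - old) / old


def compare(before, after):
    """Per-protocol gain when moving from one run to another."""
    return {proto: gain(RUNS[before][proto], RUNS[after][proto])
            for proto in RUNS[before]}


if __name__ == "__main__":
    steps = [("baseline", "recycling"),
             ("recycling", "recycling+bulk"),
             ("baseline", "recycling+bulk")]
    for before, after in steps:
        for proto, (mbps, pct) in compare(before, after).items():
            print(f"{before} -> {after} {proto}: +{mbps} Mbps ({pct:.2f}%)")
```

With these inputs, recycling alone works out to +23 Mbps TCP and +18 Mbps UDP over baseline, and the bulk allocator adds a further +7 and +8 Mbps. These are single-run numbers, so the percentages should be read as rough indications only.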
4-5. __always_inline for rmqueue_bulk() and __rmqueue_pcplist(),
removing 'noinline' from net/core/page_pool.c etc.
...makes absolutely no sense anymore.
I see Mel took Jesper's patch to make __rmqueue_pcplist() inline into
mm-bulk-rebase-v6r5, not sure if it's really needed now.
So I'm really glad we sorted things out and I can see the real
performance improvements from both recycling and bulk allocations.
> From what I know for sure about MIPS and my platform, post-Rx syncing (dma_sync_single_for_cpu()) is a no-op, and pre-Rx (dma_sync_single_for_device() etc.) is a bit expensive. I always have a sane page_pool->pp.max_len value (smth about 1668 for MTU of 1500) to minimize the overhead.
> By the way, IIRC, all machines shipped with mvpp2 have hardware cache coherency units and don't suffer from sync routines at all. That may be the reason why mvpp2 wins the most from this series.
> Yep, exactly. It's also the reason why you explicitly have to opt in to the recycling (by marking the skb for it), instead of hiding the feature in the page pool internals.
> Cheers
> /Ilias
> That would be the same as for mvneta:
>
> Overhead  Shared Object  Symbol
>   24.10%  [kernel]       [k] __pi___inval_dcache_area
>   23.02%  [mvneta]       [k] mvneta_rx_swbm
>    7.19%  [kernel]       [k] kmem_cache_alloc
>
> Anyway, I tried to use the recycling *and* napi_build_skb on mvpp2, and I get lower packet rate than recycling alone. I don't know why, we should investigate it.
mvpp2 driver doesn't use napi_consume_skb() on its Tx completion path. As a result, NAPI percpu caches get refilled only through kmem_cache_alloc_bulk(), and most of skbuff_head recycling doesn't work.
> Regards,
> --
> per aspera ad upstream
Oh, I love that one!
Al
Thanks, Al