Thread: 5 messages, 2 authors, 2026-03-30

Re: [PATCH net-next, v3] net: mana: Force full-page RX buffers for 4K page size on specific systems.

From: Dipayaan Roy <hidden>
Date: 2026-03-30 19:41:37
Also in: linux-hyperv, linux-rdma, lkml

On Fri, Mar 20, 2026 at 05:29:08PM -0700, Jakub Kicinski wrote:
On Fri, 20 Mar 2026 11:37:36 -0700 Dipayaan Roy wrote:
On Sat, Mar 14, 2026 at 12:50:53PM -0700, Jakub Kicinski wrote:
On Tue, 10 Mar 2026 21:00:49 -0700 Dipayaan Roy wrote:  
On certain systems configured with 4K PAGE_SIZE, utilizing page_pool
fragments for RX buffers results in a significant throughput regression.
Profiling reveals that this regression correlates with high overhead in the
fragment allocation and reference counting paths on these specific
platforms, rendering the multi-buffer-per-page strategy counterproductive.  
Can you say more? We could technically take two references on the page
right away if MTU is small and avoid some of the cost.  
There is a 15-20% shortfall in achieving line rate for MANA (180+ Gbps)
on a particular ARM64 SKU. The issue is specific to this processor SKU and
is not seen on other ARM64 SKUs (e.g., GB200) or x86 SKUs. Critically, the
regression only manifests beyond 16 TCP connections, which strongly indicates
that it appears only under high contention and traffic.

  no. of     | rx buf backed       | rx buf backed
 connections | with page fragments | with full page
-------------+---------------------+---------------
           4 |         139 Gbps    |     138 Gbps
           8 |         140 Gbps    |     162 Gbps
          16 |         186 Gbps    |     186 Gbps
These results look a bit odd: 4 and 16 streams have the same perf,
while all other cases indeed show a delta. What I was hoping for was
a more precise attribution of the performance issue, like perf top
showing that it's indeed the atomic ops on the refcount that stall.
          32 |         136 Gbps    |     183 Gbps
          48 |         159 Gbps    |     185 Gbps
          64 |         165 Gbps    |     184 Gbps
         128 |         170 Gbps    |     180 Gbps
 
HW team is still working to RCA this hw behaviour.

Regarding "We could technically take two references on the page right
away", are you suggesting moving the page reference counting logic into
the driver instead of relying on page pool?
Yes, either that or adjust the page pool APIs. 
page_pool_alloc_frag_netmem() currently sets the refcount to BIAS
which it then has to subtract later. So we get:

  set(BIAS)
  .. driver allocates chunks ..
  sub(BIAS_MAX - pool->frag_users)

Instead of using BIAS we could make the page pool guess that the caller
will keep asking for the same frame size. So initially take
(PAGE_SIZE/size) references.
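
The idea above can be illustrated with a small userspace model. This is
only a sketch under stated assumptions, not the page pool implementation:
struct toy_page and the toy_* helpers are hypothetical names, the counter
is a plain long standing in for an atomic, and atomic_ops merely counts
the operations that would be atomic in a real pool. It shows that when the
guess of PAGE_SIZE/size fragments is right, no correction is needed after
the initial set.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096u /* assumption: 4K pages, as in the report */

/* Toy stand-in for a pooled page; not the kernel's struct page. */
struct toy_page {
	long refcnt;             /* stands in for the atomic fragment refcount */
	long guessed;            /* references taken up front */
	long handed_out;         /* fragments actually given to the caller */
	unsigned int offset;     /* next free byte within the page */
	unsigned int atomic_ops; /* simulated atomic operations so far */
};

/* Start a fresh page, guessing the caller keeps asking for frag_size. */
static void toy_page_init(struct toy_page *p, unsigned int frag_size)
{
	assert(frag_size > 0 && frag_size <= TOY_PAGE_SIZE);
	p->guessed = TOY_PAGE_SIZE / frag_size; /* one ref per expected frag */
	p->refcnt = p->guessed;
	p->handed_out = 0;
	p->offset = 0;
	p->atomic_ops = 1; /* the initial set */
}

/* Hand out one fragment; returns its offset, or -1 if it does not fit. */
static long toy_frag_alloc(struct toy_page *p, unsigned int frag_size)
{
	long off;

	if (frag_size == 0 || frag_size > TOY_PAGE_SIZE - p->offset)
		return -1;
	off = p->offset;
	p->offset += frag_size;
	p->handed_out++;
	return off;
}

/* On retiring the page, correct the count only if the guess was wrong. */
static void toy_page_retire(struct toy_page *p)
{
	if (p->guessed != p->handed_out) {
		p->refcnt -= p->guessed - p->handed_out;
		p->atomic_ops++;
	}
}
```

With a steady 2048-byte request size the model performs one simulated
atomic operation per page; a caller that changes its request size pays for
one extra correction when the page is retired.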
OK, I will do some experimentation with this approach to see if it
helps the current scenario.
The driver doesn't seem to set skb->truesize accordingly after this
change. So you're lying to the stack about how much memory each packet
consumes. This is a blocker for the change.
  
ACK. I will send out a separate patch with a Fixes tag to correct
skb->truesize.
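
To make the truesize concern concrete, here is a minimal userspace sketch
of the accounting rule being asked for. It is not taken from the mana
driver; toy_rx_truesize and TOY_PAGE_SIZE are hypothetical names, and the
thread does not show how the driver computes the value today. The point is
only that a buffer owning a whole page should be charged for the whole
page, while a fragment is charged for its own allocation size.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096u /* assumption: 4K pages, as in the report */

/*
 * Memory to charge to one RX packet.
 * frag_alloc_size: bytes reserved per buffer when the page is shared.
 * full_page:       true when the buffer owns the entire page.
 */
static unsigned int toy_rx_truesize(unsigned int frag_alloc_size,
				    bool full_page)
{
	assert(frag_alloc_size > 0 && frag_alloc_size <= TOY_PAGE_SIZE);
	return full_page ? TOY_PAGE_SIZE : frag_alloc_size;
}
```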
To mitigate this, bypass the page_pool fragment path and force a single RX
packet per page allocation when all the following conditions are met:
  1. The system is configured with a 4K PAGE_SIZE.
  2. A processor-specific quirk is detected via SMBIOS Type 4 data.  
I don't think we want the kernel to be in the business of carrying
matching on platform names and providing optimal config by default.
This sort of logic needs to live in user space or the hypervisor 
(which can then pass a single bit to the driver to enable the behavior)
  
As per our internal discussion, the hypervisor cannot provide the CPU
version info (in VM as well as bare-metal offerings).
Why? I suppose it's much more effort for you but it's much more effort
for the community to carry the workaround. So..
As per the hypervisor team, that does not solve the issue for the
bare-metal offering, so I will go ahead with an alternate solution
as suggested by you: "This sort of logic needs to live in user space..,
which can then pass a single bit to the driver to enable the behavior"
On handling it from the user side, are you suggesting that we introduce a
new ethtool private flag and have udev rules set it so that the driver
switches to full-page RX buffers? Given the wide range of distros we
support, this might be harder to maintain/backport.

Also, the DMI parsing design was influenced by other wireless net
drivers, such as /wireless/ath/ath10k/core.c. If this approach is not
acceptable for the MANA driver, then we will have to take an alternate
route based on the discussion right above.
Plenty of ugly hacks in the kernel, it's no excuse.
Hi Jakub,

As we are still working with the HW team to root-cause the actual issue,
we would like to give the user a tunable option to achieve line rate by
running with full-page RX buffers. I will be sending out a next version
that introduces an ethtool private flag for mana that allows the user to
force one RX buffer per page.
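
As a rough illustration of the proposed knob, the following userspace
sketch models how a private-flag bit could select the buffer layout. It is
an assumption-laden sketch, not the planned patch: the flag name
TOY_PRIV_FLAG_FULL_PAGE_RX and the helper toy_rx_bufs_per_page are
hypothetical, and in a real driver the bit would be read and written
through the get_priv_flags/set_priv_flags callbacks of struct ethtool_ops.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PRIV_FLAG_FULL_PAGE_RX (1u << 0) /* hypothetical flag bit */

/*
 * Number of RX buffers carved out of one page.
 * With the flag set, every buffer gets a page to itself; otherwise as
 * many buffers as fit are packed into the page.
 */
static unsigned int toy_rx_bufs_per_page(uint32_t priv_flags,
					 unsigned int page_size,
					 unsigned int buf_size)
{
	assert(buf_size > 0 && buf_size <= page_size);
	if (priv_flags & TOY_PRIV_FLAG_FULL_PAGE_RX)
		return 1;
	return page_size / buf_size;
}
```

Keeping the decision in one helper means the same code path serves both
layouts, and the default (flag clear) leaves the fragment behaviour
unchanged.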


Regards