Thread: 23 messages, 3 authors, 2021-08-26

Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support

From: David Ahern <hidden>
Date: 2021-08-24 03:34:29
Also in: lkml, netdev

On 8/22/21 9:32 PM, Yunsheng Lin wrote:
I assumed the "either Rx or Tx is cpu bound" meant either Rx or Tx is the
bottleneck?
yes.
It seems iperf3 supports Tx ZC. I retested using iperf3; the Rx settings
were not changed during testing, and the MTU is 1500:
-Z == sendfile API. That works fine to a point and that point is well
below 100G.

I mean TCP with MSG_ZEROCOPY and SO_ZEROCOPY.
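
For readers unfamiliar with the API named here, a minimal sketch of the socket-level opt-in follows. The helper names enable_zerocopy and send_zerocopy are hypothetical, and the numeric fallbacks are the Linux values on common architectures (x86, arm64), used only if the Python build does not export the constants. It is an illustration of the opt-in sequence, not the code used for the measurements in this thread.

```python
import socket

# Fallback values from the Linux headers on common architectures; an
# assumption for builds where the socket module lacks these names.
SO_ZEROCOPY = getattr(socket, "SO_ZEROCOPY", 60)
MSG_ZEROCOPY = getattr(socket, "MSG_ZEROCOPY", 0x4000000)


def enable_zerocopy(sock):
    """Opt the socket in to MSG_ZEROCOPY; return False if the kernel refuses."""
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_ZEROCOPY, 1)
        return True
    except OSError:
        return False


def send_zerocopy(sock, payload, zerocopy_enabled):
    """Send payload, passing MSG_ZEROCOPY only when the opt-in succeeded."""
    flags = MSG_ZEROCOPY if zerocopy_enabled else 0
    return sock.send(payload, flags)
```

A real sender must also keep each buffer unchanged until the kernel reports completion on the socket error queue (read with recvmsg and MSG_ERRQUEUE); that bookkeeping is omitted here.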
IOMMU in strict mode:
1. Tx ZC case:
   22Gbit with Tx being the bottleneck (cpu bound)
2. Tx non-ZC case with pfrag pool enabled:
   40Gbit with Rx being the bottleneck (cpu bound)
3. Tx non-ZC case with pfrag pool disabled:
   30Gbit, the bottleneck does not seem to be cpu bound, as neither Rx nor Tx
   has a single CPU reaching about 100% usage.
At 1500 MTU lowering CPU usage on the Tx side does not accomplish much
on throughput since the Rx is 100% cpu.
As the performance data above shows, enabling ZC does not seem to help when
the IOMMU is involved: it is about 30% slower than the non-ZC case with pfrag
pool disabled and about 50% slower than the non-ZC case with pfrag pool enabled.
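
The rounded percentages can be checked against the three throughput figures reported above (22, 30 and 40 Gbit). The sketch below assumes "degrade" means the relative drop of the ZC result from each non-ZC result; the function name degradation is hypothetical.

```python
def degradation(baseline_gbit, measured_gbit):
    """Relative throughput drop of measured_gbit compared with baseline_gbit."""
    return (baseline_gbit - measured_gbit) / baseline_gbit


# ZC (22 Gbit) versus non-ZC with pfrag pool disabled (30 Gbit)
print(f"vs pfrag pool disabled: {degradation(30, 22):.0%}")  # prints 27%
# ZC (22 Gbit) versus non-ZC with pfrag pool enabled (40 Gbit)
print(f"vs pfrag pool enabled:  {degradation(40, 22):.0%}")  # prints 45%
```

Under that reading the exact drops are about 27% and 45%, which the message rounds to 30% and 50%.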
In a past response you showed numbers for Tx ZC API with a custom
program. That program showed the dramatic reduction in CPU cycles for Tx
with the ZC API.
At 3300 MTU you have ~47% the pps for the same throughput. Lower pps
reduces Rx processing and lower CPU to process the incoming stream. Then
using the Tx ZC API you lower the Tx overhead, allowing a single stream
to go faster - sending more data which in the end results in much higher
pps and throughput. At the limit you are CPU bound (both ends in my
testing as Rx side approaches the max pps, and Tx side as it continually
tries to send data).
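
The pps relationship can be approximated with a back-of-the-envelope calculation. This sketch assumes every packet is one full MTU-sized IP datagram and ignores Ethernet framing and TCP/IP header overhead; the function name and the 10 Gbit/s figure are illustrative only.

```python
def packets_per_second(throughput_bps, mtu_bytes):
    """Approximate pps, treating each packet as one full MTU-sized datagram."""
    return throughput_bps / (mtu_bytes * 8)


# The throughput cancels out of the ratio; 10 Gbit/s is an arbitrary choice.
ratio = packets_per_second(10e9, 3300) / packets_per_second(10e9, 1500)
print(f"pps at 3300 MTU relative to 1500 MTU: {ratio:.1%}")  # prints 45.5%
```

This simple size ratio lands near the ~47% quoted above; the exact figure depends on which per-packet overheads are counted.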

Lowering CPU usage on the Tx side is a win regardless of whether there
is a big increase in throughput at 1500 MTU, since that configuration
is an Rx CPU bound problem. Hence my point that we have a good starting
point for lowering CPU usage on the Tx side; we should improve it rather
than add per-socket page pools.
Actually it is not a per-socket page pool; the page pool is still per
NAPI. This patchset adds multiple allocation contexts to the page pool, so that
Tx can reuse the same page pool as Rx, which is quite useful if
ARFS is enabled.
You can stress the Tx side and emphasize its overhead by modifying the
receiver to drop the data on Rx rather than copy to userspace which is a
huge bottleneck (e.g., MSG_TRUNC on recv). This allows the single flow
As the frag page is supported in page pool for Rx, the Rx probably is not
a bottleneck any more, at least not for IOMMU in strict mode.

It seems iperf3 does not support MSG_TRUNC yet. Is there any testing tool that
supports MSG_TRUNC? Or do I have to hack the kernel or the iperf3 tool to do that?
https://github.com/dsahern/iperf, mods branch

--zc_api is the Tx ZC API; --rx_drop adds MSG_TRUNC to recv.
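
As an illustration of what dropping data on Rx means at the socket level, here is a minimal sketch of a receive loop that passes MSG_TRUNC on a TCP socket, which on Linux asks the kernel to discard the bytes instead of copying them to userspace. The function name drain_discarding is hypothetical, and this is not the code in the iperf branch above.

```python
import socket


def drain_discarding(sock, chunk=65536):
    """Count bytes until EOF while asking the kernel to discard the data.

    With MSG_TRUNC on a Linux TCP socket the returned buffer contents are
    meaningless; only its length (the number of bytes consumed) is used.
    """
    total = 0
    while True:
        n = len(sock.recv(chunk, socket.MSG_TRUNC))
        if n == 0:  # peer closed the connection
            return total
        total += n
```

Python still allocates a buffer per call, so this only demonstrates the flag; a benchmark tool written in C avoids that extra cost.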

stream to go faster and emphasize Tx bottlenecks as the pps at 3300
approaches the top pps at 1500. e.g., doing this with iperf3 shows the
spinlock overhead with tcp_sendmsg, overhead related to 'select' and
then gup_pgd_range.
When the IOMMU is in strict mode, the IOMMU overhead seems to be much
bigger than the spinlock overhead (23% vs. 10%).

Anyway, I still think ZC mostly benefits packets bigger than a
specific size and the case where the IOMMU is disabled.
