Thread: 12 messages, 7 authors, 12d ago

Re: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket

From: Eric Dumazet <edumazet@google.com>
Date: 2026-06-16 04:17:49
Also in: driver-core, linux-iommu, lkml

On Mon, Jun 15, 2026 at 4:42 PM Luigi Rizzo [off-list ref] wrote:
> The use of swiotlb causes an extra data copy on I/O.  For tx sockets,
> especially with greedy senders, this has a high chance of happening in
> the softirq handler for tx network interrupts, creating a significant
> performance bottleneck.
>
> Allow tx sockets to allocate socket buffers directly from the bounce
> buffers. This avoids the second copy and removes the above bottleneck.
> The fraction of swiotlb buffers allowed for this feature is set with
>    /sys/module/swiotlb/parameters/zerocopy_tx_percent

Strange name, because your patch targets the regular tcp sendmsg()
path (with a user -> kernel copy).

Typical high performance RPC libraries use TCP TX zerocopy these days.
They won't benefit from this idea.
Perhaps you should state this in your changelog or documentation.

Also, what is the typical size of the bounce buffers in your guests?

With standard tcp_wmem settings, each TCP flow can consume 4 MB.
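
For a rough sense of scale, the arithmetic can be sketched as below. This is only an illustration: the helper name is hypothetical, and the 64 MB pool used in the usage note is an assumption (the usual swiotlb default when nothing overrides it), not a number taken from the patch. The percent argument stands for the proposed zerocopy_tx_percent.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of flows, each holding per_flow_bytes of
 * send buffer, that fit in the share of the bounce pool reserved for tx
 * sockets (percent of pool_bytes).
 */
static uint64_t flows_that_fit(uint64_t pool_bytes, unsigned int percent,
                               uint64_t per_flow_bytes)
{
    uint64_t share;

    if (per_flow_bytes == 0)
        return 0;

    /* Bytes of the pool that tx sockets may use. */
    share = pool_bytes * percent / 100;

    /* Integer division: partially covered flows do not count. */
    return share / per_flow_bytes;
}
```

Under those assumptions, flows_that_fit(64ULL << 20, 90, 4ULL << 20) yields 14, so a handful of greedy senders could use up the whole tx share.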

> (0 means disabled, 90 is the maximum, to avoid persistent I/O failures).
>
> Implementation:
> - define a new page type to unambiguously identify bounce buffers used
>   as backing storage for socket buffers
> - modify skb_page_frag_refill to perform the modified allocation
> - modify the destructors __free_frozen_pages(), free_unref_folio() to
>   handle those pages and return them to the pool.
>
> The savings are especially visible with fewer queues. In synthetic
> benchmarks, senders with 1-2 queues would cap around 50Gbps with
> conventional swiotlb, and reach over 170Gbps with the feature enabled.

This patch is too large; please split it into smaller functional
units, so that each domain expert can focus on their part.

I see you test SOCK_ZEROCOPY, but some applications setting this flag
can mix tcp sendmsg() with or without zero-copy.
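
A minimal userspace sketch of what such mixing looks like, assuming a connected TCP socket; the helper names are hypothetical, and the fallback defines carry the asm-generic UAPI values for libc headers that predate them:

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Fallbacks for older libc headers (asm-generic UAPI values). */
#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Opt the socket in to TX zerocopy. Returns 0 on success, -1 on error. */
static int enable_tx_zerocopy(int fd)
{
    int one = 1;

    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

/*
 * Hypothetical helper: the path is chosen per call, not per socket.
 * With MSG_ZEROCOPY the kernel pins the user pages, and buf must stay
 * unmodified until the completion is read from the error queue
 * (recvmsg() with MSG_ERRQUEUE). Without the flag, the same socket
 * takes the regular user -> kernel copy path.
 */
static ssize_t send_one(int fd, const void *buf, size_t len, int zerocopy)
{
    return send(fd, buf, len, zerocopy ? MSG_ZEROCOPY : 0);
}
```

An application might call enable_tx_zerocopy() once and then use send_one(fd, buf, len, 1) for large payloads and send_one(fd, hdr, hdr_len, 0) for small ones, so the socket-level flag alone does not tell which path a given send will take.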

I also see your patch missed the CONFIG_PREEMPT_RT case.