Re: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
From: Eric Dumazet <edumazet@google.com>
Date: 2026-06-16 04:17:49
Also in: driver-core, linux-iommu, lkml
On Mon, Jun 15, 2026 at 4:42 PM Luigi Rizzo [off-list ref] wrote:
The use of swiotlb causes an extra data copy on I/O. For tx sockets, especially with greedy senders, this has a high chance of happening in the softirq handler for tx network interrupts, creating a significant performance bottleneck. Allow tx sockets to allocate socket buffers directly from the bounce buffers. This avoids the second copy and removes the above bottleneck. The fraction of swiotlb buffers allowed for this feature is set with /sys/module/swiotlb/parameters/zerocopy_tx_percent
Strange name, because your patch targets the regular tcp sendmsg() path (with a user -> kernel copy).

Typical high performance RPC libraries use TCP TX zerocopy these days. They won't benefit from this idea. Perhaps you should state this in your changelog or documentation.

Also, what is the typical size of the bounce buffers in your guests? With standard tcp_wmem settings, each TCP flow can consume 4 MB.
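To make the sizing concern concrete, here is a rough back-of-the-envelope sketch. It assumes the tx share of the pool is simply pool size times zerocopy_tx_percent / 100 and that every flow fills its 4 MB send buffer; the pool sizes (64 MiB and 1 GiB) and the helper name max_full_flows are illustrative assumptions, not values or code taken from the patch.

```python
MIB = 1024 * 1024


def max_full_flows(pool_bytes, tx_percent, per_flow_bytes=4 * MIB):
    """Count flows whose full send buffer fits in the tx share of the pool.

    Hypothetical helper: assumes the tx share is pool_bytes * tx_percent / 100
    and that each flow pins per_flow_bytes of bounce-buffer memory.
    """
    # The changelog states 0 disables the feature and 90 is the maximum.
    if not 0 <= tx_percent <= 90:
        raise ValueError("tx_percent must be in 0..90")
    tx_budget = pool_bytes * tx_percent // 100
    return tx_budget // per_flow_bytes


# Pool sizes below are assumptions chosen only for illustration.
for pool_mib in (64, 1024):
    for pct in (25, 90):
        flows = max_full_flows(pool_mib * MIB, pct)
        print(f"pool={pool_mib} MiB percent={pct}: {flows} flows at 4 MiB each")
```

Under these assumptions a small pool is exhausted by a handful of greedy flows even at the maximum percentage, which is why the actual bounce buffer size in the guests matters.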
(0 means disabled, 90 is the maximum, to avoid persistent I/O failures).

Implementation:
- define a new page type to unambiguously identify bounce buffers used as backing storage for socket buffers
- modify skb_page_frag_refill to perform the modified allocation
- modify the destructors __free_frozen_pages(), free_unref_folio() to handle those pages and return them to the pool.

The savings are especially visible with fewer queues. In synthetic benchmarks, senders with 1-2 queues would cap around 50Gbps with conventional swiotlb, and reach over 170Gbps with the feature enabled.
This patch is too large; please split it into smaller functional units, so that each domain expert can focus on their part.

I see you test SOCK_ZEROCOPY, but some applications setting this flag can mix tcp sendmsg() calls with and without zero-copy.

I also see your patch missed the CONFIG_PREEMPT_RT case.