Re: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers
From: Catalin Marinas <catalin.marinas@arm.com>
Date: 2023-05-16 17:59:44
Also in:
dri-devel, linux-iommu, lkml
On Tue, May 16, 2023 at 08:39:42AM +0200, Petr Tesařík wrote:
On Tue, 16 May 2023 08:13:09 +0200 Christoph Hellwig [off-list ref] wrote:quoted
On Mon, May 15, 2023 at 07:43:52PM +0000, Michael Kelley (LINUX) wrote:quoted
FWIW, I don't think the approach you have implemented here will be practical to use for CoCo VMs (SEV, TDX, whatever else). The problem is that dma_direct_alloc_pages() and dma_direct_free_pages() must call dma_set_decrypted() and dma_set_encrypted(), respectively. In CoCo VMs, these calls are expensive because they require a hypercall to the host, and the operation on the host isn't trivial either. I haven't measured the overhead, but doing a hypercall on every DMA map operation and on every unmap operation has long been something we thought we must avoid. The fixed swiotlb bounce buffer space solves this problem by doing set_decrypted() in batch at boot time, and never doing set_encrypted().I also suspect it doesn't really scale too well due to the number of allocations. I suspect a better way to implement things would be to add more large chunks that are used just like the main swiotlb buffers. That is when we run out of space try to allocate another chunk of the same size in the background, similar to what we do with the pool in dma-pool.c. This means we'll do a fairly large allocation, so we'll need compaction or even CMA to back it up, but the other big upside is that it also reduces the number of buffers that need to be checked in is_swiotlb_buffer or the free / sync side.I have considered this approach. The two main issues I ran into were: 1. MAX_ORDER allocations were too small (at least with 4K pages), and even then they would often fail. 2. Allocating from CMA did work but only from process context. I made a stab at modifying the CMA allocator to work from interrupt context, but there are non-trivial interactions with the buddy allocator. Making them safe from interrupt context looked like a major task.
Can you kick off a worker thread when the swiotlb allocation gets past some reserve limit? It still has a risk of failing to bounce until the swiotlb buffer is extended.
I also had some fears about the length of the dynamic buffer list. I observed maximum length for block devices, and then it roughly followed the queue depth. Walking a few hundred buffers was still fast enough. I admit the list length may become an issue with high-end NVMe and I/O-intensive applications.
You could replace the list with an rbtree, O(log n) look-up vs O(n), could be faster if you have many bounces active. -- Catalin