Thread (28 messages) 28 messages, 5 authors, 2023-05-23

Re: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

From: Catalin Marinas <catalin.marinas@arm.com>
Date: 2023-05-16 17:59:44
Also in: dri-devel, linux-iommu, lkml

On Tue, May 16, 2023 at 08:39:42AM +0200, Petr Tesařík wrote:
On Tue, 16 May 2023 08:13:09 +0200
Christoph Hellwig [off-list ref] wrote:
quoted
On Mon, May 15, 2023 at 07:43:52PM +0000, Michael Kelley (LINUX) wrote:
quoted
FWIW, I don't think the approach you have implemented here will be
practical to use for CoCo VMs (SEV, TDX, whatever else).  The problem
is that dma_direct_alloc_pages() and dma_direct_free_pages() must
call dma_set_decrypted() and dma_set_encrypted(), respectively.  In CoCo
VMs, these calls are expensive because they require a hypercall to the host,
and the operation on the host isn't trivial either.  I haven't measured the
overhead, but doing a hypercall on every DMA map operation and on
every unmap operation has long been something we thought we must
avoid.  The fixed swiotlb bounce buffer space solves this problem by
doing set_decrypted() in batch at boot time, and never
doing set_encrypted().  
I also suspect it doesn't really scale too well due to the number of
allocations.  I suspect a better way to implement things would be to
add more large chunks that are used just like the main swiotlb buffers.

That is when we run out of space try to allocate another chunk of the
same size in the background, similar to what we do with the pool in
dma-pool.c.  This means we'll do a fairly large allocation, so we'll
need compaction or even CMA to back it up, but the other big upside
is that it also reduces the number of buffers that need to be checked
in is_swiotlb_buffer or the free / sync side.
I have considered this approach. The two main issues I ran into were:

1. MAX_ORDER allocations were too small (at least with 4K pages), and
   even then they would often fail.

2. Allocating from CMA did work but only from process context.
   I made a stab at modifying the CMA allocator to work from interrupt
   context, but there are non-trivial interactions with the buddy
   allocator. Making them safe from interrupt context looked like a
   major task.
Can you kick off a worker thread when the swiotlb allocation gets past
some reserve limit? It still has a risk of failing to bounce until the
swiotlb buffer is extended.
I also had some fears about the length of the dynamic buffer list. I
observed maximum length for block devices, and then it roughly followed
the queue depth. Walking a few hundred buffers was still fast enough.
I admit the list length may become an issue with high-end NVMe and
I/O-intensive applications.
You could replace the list with an rbtree, O(log n) look-up vs O(n),
could be faster if you have many bounces active.

-- 
Catalin
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help