RE: [RFC 1/7] swiotlb: Introduce swiotlb throttling

From: Michael Kelley <hidden>
Date: 2024-08-27 17:31:02
Also in: linux-coco, linux-iommu, linux-nvme, linux-scsi, lkml

From: Petr Tesařík <redacted> Sent: Tuesday, August 27, 2024 8:56 AM

On Fri, 23 Aug 2024 20:41:15 +0000
Michael Kelley [off-list ref] wrote:

quoted

From: Petr Tesařík <redacted> Sent: Friday, August 23, 2024 12:41 AM

quoted

On Thu, 22 Aug 2024 11:37:12 -0700
mhkelley58@gmail.com wrote:

[...]

quoted

@@ -71,12 +72,15 @@
  *		from each index.
  * @pad_slots:	Number of preceding padding slots. Valid only in the first
  *		allocated non-padding slot.
+ * @throttled:  Boolean indicating the slot is used by a request that was
+ *		throttled. Valid only in the first allocated non-padding slot.
  */
 struct io_tlb_slot {
 	phys_addr_t orig_addr;
 	size_t alloc_size;
 	unsigned short list;
-	unsigned short pad_slots;
+	u8 pad_slots;
+	u8 throttled;

I'm not sure this flag is needed for each slot.

SWIOTLB mappings should be throttled when the total SWIOTLB usage is
above a threshold. Conversely, it can be unthrottled when the total
usage goes below a threshold, and it should not matter if that happens
due to an unmap of the exact buffer which previously pushed the usage
over the edge, or due to an unmap of any other unrelated buffer.

I think I understand what you are proposing. But I don't see a way
to make it work without adding global synchronization beyond
the current atomic counter for the number of uI'm sed slabs. At a minimum
we would need a global spin lock instead of the atomic counter. The spin
lock would protect the (non-atomic) slab count along with some other
accounting, and that's more global references. As described in the
cover letter, I was trying to avoid doing that.

I have thought about this for a few days. And I'm still not convinced.
You have made it clear in multiple places that the threshold is a soft
limit, and there are many ways the SWIOTLB utilization may exceed the
threshold. In fact I'm not even 100% sure that an atomic counter is
needed, because the check is racy anyway.

Atomic operations are expensive at the memory bus level, particularly
in high CPU count systems with NUMA topologies. However,
maintaining an imprecise global count doesn't work because the
divergence from reality can become unbounded over time. The
alternative is to sum up all the per-area counters each time a
reasonably good global value is needed, and that can be expensive itself
with high area counts. A hybrid might be to maintain an imprecise global
count, but periodically update it by summing up all the per-area counters
so that the divergence from reality isn't unbounded.

Another task may increase
(or decrease) the counter between atomic_long_read(&mem->total_used)
and a subsequent down(&mem->throttle_sem).

I consider it a feature, not a flaw, because the real important checks
happen later while searching for free slots, and those are protected
with a spinlock.

quoted

If you can see how to do what you propose with just the current
atomic counter, please describe.

I think I'm certainly missing something obvious, but let me open the
discussion to improve my understanding of the matter.

Suppose we don't protect the slab count with anything. What is the
worst possible outcome? IIUC the worst scenario is that multiple tasks
unmap swiotlb buffers simultaneously and all of them believe that their
action made the total usage go below the low threshold, so all of them
try to release the semaphore.

That's obviously not good, but AFAICS all that's needed is a
test_and_clear_bit() on a per-io_tlb_mem throttled flag just before
calling up(). Since up() would acquire the semaphore's spinlock, and
there's only one semaphore per io_tlb_mem, adding an atomic flag doesn't
look like too much overhead to me, especially if it ends up in the same
cache line as the semaphore.

Yes, the semaphore management is the problem. Presumably we want
each throttled request to wait on the semaphore, forming an ordered
queue of waiters. Each up() on the semaphore releases one of those
waiters. We don’t want to release all the waiters when the slab count
transitions from "above throttle" to "below throttle" because that
creates a thundering herd problem.

So consider this example scenario:
1) Two waiters ("A" and "B") are queued the semaphore, each wanting 2 slabs.
2) An unrelated swiotlb unmap frees 10 slabs, taking the slab count
from 2 above threshold to 8 below threshold. This does up() on
the semaphore and awakens "A".
3) "A" does his request for 2 slabs, and the slab count is now 6 below
threshold.
4) "A" does swiotlb unmap.  The slab count goes from 6 below threshold back
to 8 below threshold, so no semaphore operation is done. "B" is still waiting.
5) System-wide, swiotlb requests decline, and the slab count never goes above
the threshold again. At this point, "B" is still waiting and never gets awakened.

An ordered queue of waiters is incompatible with wakeups determined solely
on whether the slab count is below the threshold after swiotlb unmap. You
would have to wait up all waiters and let them re-contend for the slots that
are available below the threshold, with most probably losing out and going
back on the semaphore wait queue (i.e., a thundering herd).

Separately, what does a swiotlb unmap do if it takes the slab count from above
threshold to below threshold, and there are no waiters? It should not do up()
in that case, but how can it make that decision in a way that doesn't race
with a swiotlb map operation running at the same time?

Michael

Besides, this all happens only in the slow path, i.e. only after the
current utilization has just dropped below the low threshold, not if
the utilization was already below the threshold before freeing up some
slots.

I have briefly considered subtracting the low threshold as initial bias
from mem->total_used and using atomic_long_add_negative() to avoid the
need for an extra throttled flag, but at this point I'm not sure it can
be implemented without any races. We can try to figure out the details
if it sounds like a good idea.

Petr T

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help