Thread: 101 messages, 15 authors, 2017-06-23

[RFC PATCH 04/30] iommu/arm-smmu-v3: Add support for PCI ATS

From: Roy Franz Cavium <hidden>
Date: 2017-05-25 18:27:18
Also in: kvm, linux-iommu, linux-pci

On Tue, May 23, 2017 at 4:21 AM, Jean-Philippe Brucker
[off-list ref] wrote:
On 23/05/17 09:41, Leizhen (ThunderTown) wrote:
On 2017/2/28 3:54, Jean-Philippe Brucker wrote:
PCIe devices can implement their own TLB, named Address Translation Cache
(ATC). Steps involved in the use and maintenance of such caches are:

* Device sends an Address Translation Request for a given IOVA to the
  IOMMU. If the translation succeeds, the IOMMU returns the corresponding
  physical address, which is stored in the device's ATC.

* Device can then use the physical address directly in a transaction.
  A PCIe device does so by setting the TLP AT field to 0b10 - translated.
  The SMMU might check that the device is allowed to send translated
  transactions, and let it pass through.

* When an address is unmapped, the CPU sends a CMD_ATC_INV command to the
  SMMU, which relays it to the device.

In theory, this doesn't require a lot of software intervention. The IOMMU
driver needs to enable ATS when adding a PCI device, and send an
invalidation request when unmapping. Note that this invalidation is
allowed to take up to a minute, according to the PCIe spec. In
addition, the invalidation queue on the ATC side is fairly small, 32 by
default, so we cannot keep many invalidations in flight (see ATS spec
section 3.5, Invalidate Flow Control).

Handling these constraints properly would require postponing
invalidations and keeping the stale mappings until we're certain that all
devices have forgotten about them. This requires major work in the page
table managers, and is therefore not done by this patch.

  Range calculation
  -----------------

The invalidation packet itself is a bit awkward: the range must be
naturally aligned, which means that the start address is a multiple of the
range size. In addition, the size must be a power-of-two number of 4KB
pages. We have a few options to enforce this constraint:

(1) Find the smallest naturally aligned region that covers the requested
    range. This is simple to compute and only takes one ATC_INV, but it
    will spill on lots of neighbouring ATC entries.

(2) Align the start address to the region size (rounded up to a power of
    two), and send a second invalidation for the next range of the same
    size. Still not great, but reduces spilling.

(3) Cover the range exactly with the smallest number of naturally aligned
    regions. This would be interesting to implement but as for (2),
    requires multiple ATC_INV.

As I suspect ATC invalidation packets will be a very scarce resource,
we'll go with option (1) for now, and only send one big invalidation.
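
As an illustration of option (1), here is a minimal sketch of the range
calculation. It is not taken from the patch: the names `atc_inv`,
`atc_inv_single` and `ATC_PAGE_SHIFT` are hypothetical, and it assumes 4KB
pages and a range that does not wrap around the address space. The span is
derived from the highest bit in which the first and last page numbers
differ.

```c
#include <assert.h>
#include <stdint.h>

#define ATC_PAGE_SHIFT 12 /* assumption: 4KB invalidation granule */

/* Hypothetical description of one naturally aligned invalidation. */
struct atc_inv {
	uint64_t start_page;     /* first page, a multiple of 2^log2_pages */
	unsigned int log2_pages; /* span is 2^log2_pages pages */
};

/* Option (1): smallest naturally aligned region covering [iova, iova + size). */
static struct atc_inv atc_inv_single(uint64_t iova, uint64_t size)
{
	struct atc_inv inv;
	uint64_t first, last, diff;

	assert(size > 0 && iova + size - 1 >= iova);

	first = iova >> ATC_PAGE_SHIFT;
	last = (iova + size - 1) >> ATC_PAGE_SHIFT;

	/* Count the bits up to the highest one that differs. */
	inv.log2_pages = 0;
	for (diff = first ^ last; diff; diff >>= 1)
		inv.log2_pages++;

	/* Clear the low bits so that the start is naturally aligned. */
	inv.start_page = first & ~((1ULL << inv.log2_pages) - 1);
	return inv;
}
```

With this sketch, unmapping the two pages at IOVA 0x7000-0x8fff yields a
single invalidation of 16 pages starting at page 0, which shows how far
option (1) can spill when the range straddles an alignment boundary.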

Note that with io-pgtable, the unmap function is called for each page, so
this doesn't matter. The problem shows up when sharing page tables with
the MMU.
Supposing this is true, I'd prefer option (2): the worst cases of both (1)
and (2) will not happen, but the code for (2) will look clearer, and (2)
is technically more acceptable.
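
For comparison, option (2) could be sketched roughly as follows. This is
only an illustration under the same assumptions (4KB pages, no wrap-around);
`atc_inv`, `atc_inv_pair` and `ATC_PAGE_SHIFT` are hypothetical names, not
part of the patch. The region size is the page count rounded up to a power
of two, and a second invalidation is issued only when the range crosses
into the next region.

```c
#include <assert.h>
#include <stdint.h>

#define ATC_PAGE_SHIFT 12 /* assumption: 4KB invalidation granule */

/* Hypothetical description of one naturally aligned invalidation. */
struct atc_inv {
	uint64_t start_page;     /* first page, a multiple of 2^log2_pages */
	unsigned int log2_pages; /* span is 2^log2_pages pages */
};

/*
 * Option (2): align the start down to the region size and, if needed,
 * invalidate the next region of the same size as well.
 * Returns the number of entries written to out[] (1 or 2).
 */
static int atc_inv_pair(uint64_t iova, uint64_t size, struct atc_inv out[2])
{
	uint64_t first, last, npages, span, base;
	unsigned int log2_pages = 0;

	assert(size > 0 && iova + size - 1 >= iova);

	first = iova >> ATC_PAGE_SHIFT;
	last = (iova + size - 1) >> ATC_PAGE_SHIFT;
	npages = last - first + 1;

	/* Round the page count up to a power of two. */
	while ((1ULL << log2_pages) < npages)
		log2_pages++;
	span = 1ULL << log2_pages;
	base = first & ~(span - 1);

	out[0].start_page = base;
	out[0].log2_pages = log2_pages;
	if (last < base + span)
		return 1; /* the first region already covers the range */

	out[1].start_page = base + span;
	out[1].log2_pages = log2_pages;
	return 2;
}
```

For the two pages at IOVA 0x7000-0x8fff this produces two invalidations of
two pages each (pages 6-7 and 8-9), at the cost of a second ATC_INV.
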
I agree that (2) is a bit clearer, but the question is of performance
rather than readability. I'd like to see some benchmarks or experiment on
my own before switching to a two-invalidation system.

Intuitively, one big invalidation will result in more ATC thrashing and
will bring overall device performance down. But according to the PCI spec,
ATC invalidations are grossly expensive: they have an upper bound of a
minute. I agree that this is highly improbable and might depend on the
range size, but purely from an architectural standpoint, reducing the
number of ATC invalidation requests is the priority, because their cost is
much worse than any slow-down incurred by ATC thrashing. And for the
moment I can only base my decisions on the architecture.

So I'd like to keep (1) for now, and update it to (2) (or even (3)) once
we have more hardware to experiment with.
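
To make option (3) concrete, one possible approach is a greedy split: at
each step, take the largest naturally aligned power-of-two block that starts
at the current page and still fits in the range. The sketch below is an
assumption about how this could look, not code from the series;
`atc_inv`, `atc_inv_exact` and `ATC_PAGE_SHIFT` are hypothetical names, and
4KB pages and a non-wrapping range are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ATC_PAGE_SHIFT 12 /* assumption: 4KB invalidation granule */

/* Hypothetical description of one naturally aligned invalidation. */
struct atc_inv {
	uint64_t start_page;     /* first page, a multiple of 2^log2_pages */
	unsigned int log2_pages; /* span is 2^log2_pages pages */
};

/*
 * Option (3): cover [iova, iova + size) exactly with naturally aligned
 * regions. Writes at most max entries to out[] and returns the number of
 * regions the range needs, which may exceed max.
 */
static size_t atc_inv_exact(uint64_t iova, uint64_t size,
			    struct atc_inv *out, size_t max)
{
	uint64_t page, end;
	size_t n = 0;

	assert(size > 0 && iova + size - 1 >= iova);

	page = iova >> ATC_PAGE_SHIFT;
	end = ((iova + size - 1) >> ATC_PAGE_SHIFT) + 1; /* exclusive */

	while (page < end) {
		unsigned int log2_pages = 0;

		/* Grow the block while it stays aligned and inside the range. */
		while (log2_pages < 62 &&
		       (page & ((1ULL << (log2_pages + 1)) - 1)) == 0 &&
		       page + (1ULL << (log2_pages + 1)) <= end)
			log2_pages++;

		if (n < max) {
			out[n].start_page = page;
			out[n].log2_pages = log2_pages;
		}
		n++;
		page += 1ULL << log2_pages;
	}
	return n;
}
```

For IOVA 0x3000 with size 0xa000 (pages 3 to 12), this gives four
invalidations: page 3, pages 4-7, pages 8-11 and page 12. Nothing spills,
but the number of ATC_INV commands grows with the misalignment of the
range, which is the cost weighed against option (1) above.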

Thanks,
Jean
I think (1) is a good place to start, as the same restricted encoding
that is used in the invalidations is also used in the translation
responses - all of the ATC entries were created with regions described
this way.  We still may end up with nothing but STU sized ATC entries, as
TAs are free to respond to large translation requests with multiple STU
sized translations, and in some cases this is the best that they can do.
Picking the optimal strategy will depend on hardware, and maybe workload
as well.

Thanks,
Roy

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel at lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel