Thread: 233 messages, 15 authors, 2021-10-28

Re: [RFC] /dev/ioasid uAPI proposal

From: David Gibson <hidden>
Date: 2021-06-08 06:54:47
Also in: linux-iommu, lkml

On Thu, Jun 03, 2021 at 09:11:05AM -0300, Jason Gunthorpe wrote:
On Thu, Jun 03, 2021 at 03:45:09PM +1000, David Gibson wrote:
On Wed, Jun 02, 2021 at 01:58:38PM -0300, Jason Gunthorpe wrote:
On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
	/* Bind guest I/O page table  */
	bind_data = {
		.ioasid	= gva_ioasid,
		.addr	= gva_pgtable1,
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
Again I do wonder if this should just be part of alloc_ioasid. Is
there any reason to split these things? The only advantage to the
split is the device is known, but the device shouldn't impact
anything..
I'm pretty sure the device(s) could matter, although they probably
won't usually. 
It is a bit subtle, but the /dev/iommu fd itself is connected to the
devices first. This prevents wildly incompatible devices from being
joined together, and allows some "get info" to report the capability
union of all devices if we want to do that.
Right.. but I've not been convinced that having a /dev/iommu fd
instance be the boundary for these types of things actually makes
sense.  For example if we were doing the preregistration thing
(whether by child ASes or otherwise) then that still makes sense
across wildly different devices, but we couldn't share that layer if
we have to open different instances for each of them.
It is something that still seems up in the air.. What seems clear for
/dev/iommu is that it
 - holds a bunch of IOASIDs organized into a tree
 - holds a bunch of connected devices
Right, and it's still not really clear to me what devices connected to
the same /dev/iommu instance really need to have in common, as
distinct from what devices connected to the same specific ioasid need
to have in common.
 - holds a pinned memory cache

One thing it must do is enforce IOMMU group security. A device cannot
be attached to an IOASID unless all devices in its IOMMU group are
part of the same /dev/iommu FD.
Well, you can't attach a device to an individual IOASID unless all
devices in its group are attached to the same individual IOASID
either, so I'm not clear what benefit there is to enforcing it at the
/dev/iommu instance as well as at the individual ioasid level.
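The group rule being debated here can be sketched in a few lines. This is only an illustration under stated assumptions: `struct toy_device`, its fields, and `group_owned_by_fd()` are hypothetical names invented for this sketch, not part of the proposed uAPI or of any kernel interface. It shows the per-FD form of the check (every device in the group must be bound to the same /dev/iommu instance); the per-IOASID form would compare an IOASID handle instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a device, for illustration only. */
struct toy_device {
	int group_id;	/* IOMMU group the device belongs to */
	int iommu_fd;	/* /dev/iommu instance it is bound to, -1 if none */
};

/*
 * Return true if every device in group_id is bound to iommu_fd.
 * An attach to any IOASID under that FD would be refused otherwise.
 */
static bool group_owned_by_fd(const struct toy_device *devs, size_t n,
			      int group_id, int iommu_fd)
{
	assert(n == 0 || devs != NULL);

	for (size_t i = 0; i < n; i++) {
		if (devs[i].group_id == group_id &&
		    devs[i].iommu_fd != iommu_fd)
			return false;
	}
	return true;
}
```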
The big open question is what parameters govern allowing devices to
connect to the /dev/iommu:
 - all devices can connect and we model the differences inside the API
   somehow.
 - Only sufficiently "similar" devices can be connected
 - The FD's capability is the minimum of all the connected devices

There are some practical problems here, when an IOASID is created the
kernel does need to allocate a page table for it, and that has to be
in some definite format.

It may be that we had a false start thinking the FD container should
be limited. Perhaps creating an IOASID should pass in a list
of the "device labels" that the IOASID will be used with and that can
guide the kernel what to do?
Right, but at this stage I'm just not seeing a really clear (across
platforms and device types) boundary for what things have to be per
IOASID container and what have to be per IOASID, so I'm just not sure
the /dev/iommu instance grouping makes any sense.
I would push as much stuff as possible to be per-IOASID..
I agree.  But the question is what's *not* possible to be per-IOASID,
so what's the semantic boundary that defines when things have to be in
the same /dev/iommu instance, but not the same IOASID.
I don't know if that small advantage is worth the extra complexity
though.
But it would certainly be possible for a system to have two
different host bridges with two different IOMMUs with different
pagetable formats.  Until you know which devices (and therefore
which host bridge) you're talking about, you don't know what formats
of pagetable to accept.  And if you have devices from *both* bridges
you can't bind a page table at all - you could theoretically support
a kernel managed pagetable by mirroring each MAP and UNMAP to tables
in both formats, but it would be pretty reasonable not to support
that.
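As a rough illustration of the "mirror each MAP and UNMAP to tables in both formats" idea, the sketch below keeps two toy tables in step and rolls back the first if the second cannot take the mapping. Everything in it is an assumption made for the example: `struct toy_pgtable` is a flat array standing in for a real IOMMU page table, and `toy_map()`, `toy_unmap()`, `mirror_map()` and `mirror_unmap()` are hypothetical helpers, not kernel or uAPI functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX 8

/* Toy stand-in for one IOMMU's page table; format details are elided. */
struct toy_pgtable {
	uint64_t iova[TOY_MAX];
	uint64_t pa[TOY_MAX];
	size_t used;
	size_t cap;	/* capacity limit, at most TOY_MAX */
};

static int toy_map(struct toy_pgtable *t, uint64_t iova, uint64_t pa)
{
	assert(t->cap <= TOY_MAX);
	if (t->used >= t->cap)
		return -1;
	t->iova[t->used] = iova;
	t->pa[t->used] = pa;
	t->used++;
	return 0;
}

static void toy_unmap(struct toy_pgtable *t, uint64_t iova)
{
	for (size_t i = 0; i < t->used; i++) {
		if (t->iova[i] == iova) {
			/* Move the last entry into the freed slot. */
			t->used--;
			t->iova[i] = t->iova[t->used];
			t->pa[i] = t->pa[t->used];
			return;
		}
	}
}

/* Apply one MAP to both tables, undoing the first if the second fails. */
static int mirror_map(struct toy_pgtable *a, struct toy_pgtable *b,
		      uint64_t iova, uint64_t pa)
{
	if (toy_map(a, iova, pa))
		return -1;
	if (toy_map(b, iova, pa)) {
		toy_unmap(a, iova);
		return -1;
	}
	return 0;
}

/* Apply one UNMAP to both tables. */
static void mirror_unmap(struct toy_pgtable *a, struct toy_pgtable *b,
			 uint64_t iova)
{
	toy_unmap(a, iova);
	toy_unmap(b, iova);
}
```

The rollback is the part that makes mirroring awkward: a mapping only succeeds if both formats can express it, so the effective capability is the intersection of the two tables.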
The basic process for a user space owned pgtable mode would be:

 1) qemu has to figure out what format of pgtable to use

    Presumably it uses query functions using the device label.
No... in the qemu case it would always select the page table format
that it needs to present to the guest.  That's part of the
guest-visible platform that's selected by qemu's configuration.
I should have said "vfio user" here because apps like DPDK might use
this path
Ok.
 4) For the next device qemu would have to figure out if it can re-use
    an existing IOASID based on the required properties.
Nope.  Again, what devices share an IO address space is a guest
visible part of the platform.  If the host kernel can't supply that,
then qemu must not start (or fail the hotplug if the new device is
being hotplugged).
qemu can always emulate.
No, not always, only sometimes.  The host side IOVA has to be able to
process all the IOVAs that the guest might generate, and it needs to
have an equal or smaller pagesize than the guest expects.
If the config requires two devices that cannot
share an IOASID because the local platform is wonky then qemu needs to
shadow and duplicate the IO page table from the guest into two IOASID
objects to make it work. This is a SW emulation option.
For this reason, amongst some others, I think when selecting a kernel
managed pagetable we need to also have userspace explicitly request
which IOVA ranges are mappable, and what (minimum) page size it
needs.
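To make the two constraints concrete (the host must be able to map every IOVA the guest might generate, and the host page size must be no larger than the one the guest expects), here is a minimal sketch of the check a kernel could run when userspace states its requirements up front. The types and names (`struct iova_range`, `struct ioas_caps`, `ioas_request_ok()`) are hypothetical and exist only for this example. As a simplification, each requested range must fit inside a single host window; adjacent windows are not merged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not part of the proposed uAPI. */
struct iova_range {
	uint64_t start;	/* first IOVA, inclusive */
	uint64_t last;	/* last IOVA, inclusive */
};

struct ioas_caps {
	const struct iova_range *ranges;	/* windows the host can map */
	size_t nr_ranges;
	uint64_t min_pgsize;			/* smallest host IOMMU page size */
};

/* True if one host window fully contains the requested range. */
static bool range_covered(const struct ioas_caps *host,
			  const struct iova_range *want)
{
	for (size_t i = 0; i < host->nr_ranges; i++) {
		if (want->start >= host->ranges[i].start &&
		    want->last <= host->ranges[i].last)
			return true;
	}
	return false;
}

/*
 * Check a userspace request: all wanted IOVA ranges must be mappable,
 * and the host page size must not exceed the page size the guest needs.
 */
static bool ioas_request_ok(const struct ioas_caps *host,
			    const struct iova_range *want, size_t nr_want,
			    uint64_t guest_pgsize)
{
	assert(nr_want == 0 || want != NULL);

	if (host->min_pgsize > guest_pgsize)
		return false;
	for (size_t i = 0; i < nr_want; i++) {
		if (!range_covered(host, &want[i]))
			return false;
	}
	return true;
}
```

With such a check a request fails at setup time, rather than later when the guest first touches an IOVA or page size the host cannot honour.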
It does make sense

Jason
-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
