Re: VFIO v2 design plan

From: David Gibson <hidden>
Date: 2011-09-01 04:29:18
Also in: linuxppc-dev, qemu-devel

On Tue, Aug 30, 2011 at 08:51:38AM -0600, Alex Williamson wrote:

On Tue, 2011-08-30 at 17:48 +1000, David Gibson wrote:

quoted

On Mon, Aug 29, 2011 at 10:24:43PM -0600, Alex Williamson wrote:

quoted

On Tue, 2011-08-30 at 13:04 +1000, David Gibson wrote:

quoted

On Fri, Aug 26, 2011 at 11:05:23AM -0600, Alex Williamson wrote:

[snip]

quoted

 Once the group is viable, the user may bind the
group to another group, retrieve the iommu fd, or retrieve device fds.
Internally, each of these operations will result in an iommu domain
being allocated and all of the devices attached to the domain.

The purpose of binding groups is to share the iommu domain.  Groups
making use of incompatible iommu domains will fail to bind.  Groups
making use of different mm's will fail to bind.  The vfio driver may
reject some binding based on domain capabilities, but final veto power
is left to the iommu driver[1].  If a user makes use of a group
independently and later wishes to bind it to another group, all the
device fds and the iommu fd must first be closed.  This prevents using a
stale iommu fd or accessing devices while the iommu is being switched.
Operations on any group fds of a merged group are performed globally on
the group (ie. enumerating the devices lists all devices in the merged
group, retrieving the iommu fd from any group fd results in the same fd,
device fds from any group can be retrieved from any group fd[2]).
Groups can be merged and unmerged dynamically.  Unmerging a group
requires the device fds for the outgoing group are closed.  The iommu fd
will remain persistent for the remaining merged group.

As I've said I prefer a persistent group model, rather than this
transient group model, but it's not a dealbreaker by itself.  How are
unmerges specified?

VFIO_GROUP_UNMERGE ioctl taking a group fd parameter.

quoted

 I'm also assuming that in this model closing a
(bound) group fd will unmerge everything down to atomic groups again.

Yes, it will unmerge the closed group down to the atomic group.

Hrm, not thrilled with the merging semantics, but I can probably live
with them.  Still some clarifications, though..

If you open a group, merge in a bunch of other groups, then re-open
/dev/vfio/NNN for one of the groups mergeed, presumably the new fd
must also see the merged group?  So presumably you must only unmerge
everything when all the fds are closed.

The device fds for the group to be unmerged must be closed before an
unmerge.  The iommu fd is tricky.  The iommu fd is really the iommu for
the merged group, not the individual groups, so it's context stays with
the remaining group.  Therefore I don't enforce a refcnt on the iommu
fd.  The usage model I expect is that if a merge works, the user will
probably use a single iommu fd for the whole merged group.  Maybe that
should be enforced?

I thought I recalled you saying earlier that the iommu fd could not be
open when merging new groups in.  I would expect that also to be true
when unmerging in that case.

quoted

If you open groups a and b, then merge a (disjoint) bunch of things
into each, then merge b into a, what are the semantics?  Wheat about
if you then unmerge b from a - does it just remove the atomic group b,
or everything you merged into b earlier?  Or, what happens if you open
group a, merge in some things, then attempt to unmerge a from the
merged group?

Simple, don't allow merging and unmerging of merged groups.  Merge and
unmerge only work on singleton groups.

Ok, in that case I think we should call it "add" and "remove" rather
than merge and unmerge.

 The last case we must support.
In that case you just use:

ioctl(a.fd, VFIO_GROUP_MERGE, b.fd)
ioctl(b.fd, VFIO_GROUP_UNMERGE, a.fd)

The groups are peers when merged, so b can remove a as easily as a can
remove b.  Group b is left with any iommu context setup while
merged.

Um *goes cross-eyed*.  So, if you open (atomic) groups a and b, then
add group b to a, are the two open fds now functionally identical?
And likewise if you then open either a or b again straight from from
/dev/vfio?

Except, that the b fd must then retain the fact that it was originally
for atomic group (b), so that it can be used as a handle for an
unmerge/remove.

The more I dig into the details of these semantics the more I dislike
them.

[snip]

quoted

Beyond unbind, we also need to think about hotplug.  If a system had
multiple hotplug slots below a P2P bridge and a device was added while
the group is in use, what do we do?  Maybe we can somehow disable it or
mark it for vfio in our bus notifier routines(?).

That is a very good point.  It actually brings into focus a niggling
concern I had about this model where the group becomes vfio usable
once all the devices in it are bound to vfio.  Because of the
possibility of hotplug, I think its conceptually more correct to not
treat vfio as just another kernel driver which can bind devices, but a
special state that the whole group goes into atomically.  So the
sequence would be:
	- Admin asks that a group go into vfio state
	- kernel (attempts to) unbind kernel drivers from every device
in the group
	- group is marked in vfio/limbo state

At this point no kernel drivers may bind to anything in the group,
including things that are hotplugged into the group after this initial
sequence.

It seems like this is a mode that could only be accessible if the group
is opened w/ admin capabilities, I don't think we'd want to let the vfio
group chrdev owner be able to do that automatically.

They have to do something that's just as restrictive automatically.
If new devices enter an atomic group that's in use by a guest, the
kernel must not bind drivers to them.  I'm just trying to make the
semantics clearer, than proxying the restrictions by binding a dummy
driver to everything.

 I don't know of
any other drivers that behave like this, being able to unbind running
drivers and pull devices into itself.

Well, it's not a driver behaving like this, it's an explicit admin
operation to unbind all drivers from the whole group and put it in a
state that's suitable for vfio assignment.

quoted

 If the device fds are not released and
subsequently the iommu fd released as well, vfio will kill the user
process after some delay.

Ouch, this seems to me a problematic semantic.  Whether the user
process survives depends on whether it processes the remove requests
fast enough - and a user process could be slowed down by system load
or other factors not entirely in its control.

I was assuming "ample" time to process a hot remove, but yes, it's an
area of concern.  I'm not sure how much of a problem it is in practice
though.  Yes you can shoot your VM accidentally as root... don't do
that.

They can, but with this semantic they can't know in advance whether
the command is going to kill the VM or not.  I can just see a
situation where the admin issues a command to remove the device from
the guest, and usually that goes through the hot guest unplug
mechanisms, the guest keeps running and everything is happy.  Then one
time they issue *exactly the same command* and the VM dies, because
the system is running really slow for some reason (huge load, or maybe
someone switched the VM into full emulation for debugging).

Not sure how to handle this other than leave a trail of bread crumbs.

I have no idea what you mean by that.

quoted

I'd be more comfortable with a model where there was a distinction
between a "soft" and "hard" remove.  The soft would either simply
fail, if the device is in use by vfio, or block indefinitely.  The
hard would kill the user process without delay.  This effectively
allows your semantics to be implemented in userspace (soft remove,
wait, hard remove) - where it's easier to tweak the policy of how long
to wait.

Your first example is essentially what current vfio does now, request
remove, wait indefinitely and qemu triggers an abort if the guest
doesn't respond.  The trouble with moving this policy to userspace is
that we're not protecting the host.

How is the host not protected?  Bear in mind that when I say
"userspace" I'm not thinking qemu, I'm thinking the admin equipped
with whatever tools he uses for moving devices between guests.  So
they go:
	- Please remove this group from the guest
	- Waits for an amount of time of their choice
	- Decide, crap, the guest is broken
	- Hard remove the group from the guest, killing the guest

It's basic in perfect analogy to the old:
	- kill -15
	- *drum fingers*
	- Damn, it's stuck
	- kill -9

And what if the remove is initiated by a hardware admin that walks over
to the system, and presses the PCI device hot unplug doorbell?  It just
looks like a driver hang.  Thanks,

Hm, true.  How is this case handled on the host side?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help