Re: [PATCH RFC v2] vfio: Documentation for the migration region

From: Cornelia Huck <cohuck@redhat.com>
Date: 2021-12-06 16:05:23
Also in: kvm

On Fri, Dec 03 2021, Alex Williamson [off-list ref] wrote:

On Wed, 1 Dec 2021 19:25:02 -0400
Jason Gunthorpe [off-list ref] wrote:

quoted

On Wed, Dec 01, 2021 at 01:03:14PM -0700, Alex Williamson wrote:

quoted

On Tue, 30 Nov 2021 23:14:07 -0400
Jason Gunthorpe [off-list ref] wrote:

quoted

On Tue, Nov 30, 2021 at 03:35:41PM -0700, Alex Williamson wrote:

quoted

OTOH "supportable" qemu could certainly make the default choice to
require devices for simplicity.

I get a bit lost every time I try to sketch out how QEMU would
implement it.  Forgive the stream of consciousness and rhetorical
discussion below...

 - Does it make sense that a device itself can opt-out of p2p mappings?
   This is simply an MMIO access from another device rather than from
   the CPU.  Vendors cannot preclude this use case on bare metal,
   should they be able to in a VM?  We create heterogeneous p2p maps,
   device A can access device B, but device B maybe can't access device
   A.  Seems troublesome.

 - If we can't have a per-device policy, then we'd need a per VM
   policy, likely some way to opt-out of all p2p mappings for vfio
   devices.  We need to support hotplug of devices though, so is it a
   per VM policy or is it a per device policy which needs to be
   consistent among all attached devices?  Perhaps a
   "enable_p2p=on/off" option on the vfio-pci device, default [on] to
   match current behavior.  For any case of this option being set to
   non-default, all devices would need to set it to the same value,
   non-compliant devices rejected.

 - We could possibly allow migration=on,enable_p2p=on for a non-NDMA
   device, but the rules change if additional devices are added, they
   need to be rejected or migration support implicitly disappears.  That
   seems hard to document, non-deterministic as far as a user is
   concerned. So maybe for a non-NDMA device we'd require
   enable_p2p=off in order to set migration=on.  That means we could
   never enable migration on non-NDMA devices by default, which
   probably means we also cannot enable it by default on NDMA devices
   or we get user confusion/inconsistency.

 - Can a user know which devices will require enable_p2p=off in order
   to set migration=on?  "Read the error log" is a poor user experience
   and difficult hurdle for libvirt.

So in order to create a predictable QEMU experience in the face of
optional NDMA per device, I think we preclude being able to enable
migration support for any vfio device by default and we have an
exercise to determine how a user or management tool could easily
determine NDMA compatibility :-\

Hm, maybe we can take a page out of the confidential guest support book
and control this like we do for the virtio iommu flag?

I'm not sure whether there is a pressing need to support p2p for
non-NDMA devices?

There's a fine line between writing an inter-operable driver and
writing a driver for the current QEMU implementation.  Obviously we want
to support the current QEMU implementation, but we want an interface
that can accommodate how that implementation might evolve.  Once we
start telling driver authors to expect specific flows rather than
looking at the operation of each bit, then our implementations become
more fragile, less versatile relative to the user.

What is actually on the table regarding the current QEMU implementation?
The interface is still marked as experimental, so we have some room for
changes, but we probably don't want to throw everything away.

quoted

Userspace can attempt RESUMING -> RUNNING regardless of what we specify,
so a driver needs to be prepared for such an attempted state change
either way.  So what's the advantage to telling a driver author that
they can expect a given behavior?

The above didn't tell a driver author to expect a certain behavior, it
tells userspace what to do.

  "The migration driver can rely on user-space issuing a
   VFIO_DEVICE_RESET prior to starting RESUMING."

I trimmed too much, the original text you quoted was

"To abort a RESUMING issue a VFIO_DEVICE_RESET."

Which I still think is fine.

If we're writing a specification, that's really a MAY statement,
userspace MAY issue a reset to abort the RESUMING process and return
the device to RUNNING.  They MAY also write the device_state directly,
which MAY return an error depending on various factors such as whether
data has been written to the migration state and whether that data is
complete.  If a failed transitions results in an ERROR device_state,
the user MUST issue a reset in order to return it to a RUNNING state
without closing the interface.

Are we actually writing a specification? If yes, we need to be more
clear on what is mandatory (MUST), advised (SHOULD), or allowed
(MAY). If I look at the current proposal, I'm not sure into which
category some of the statements fall.

A recommendation to use reset to skip over potential error conditions
when the goal is simply a new, clean RUNNING state irrespective of data
written to the migration region, is fine.  But drivers shouldn't be
written with only that expectation, just like they shouldn't expect a
reset precondition to entering RESUMING.

quoted

Tracing that shows a reset preceding entering RESUMING doesn't suggest
to me that QEMU is performing a reset for the specific purpose of
entering RESUMING.  Correlation != causation.

Kernel doesn't care why qemu did it - it was done. Intent doesn't
matter :)

This is exactly the sort of "designed for QEMU implementation"
inter-operability that I want to avoid.  It doesn't take much of a
crystal ball to guess that gratuitous and redundant device resets slow
VM instantiation and are a likely target for optimization.

Which brings me back to my question above: Are we wedded to all details
of the current QEMU implementation? Should we consider some of them more
as a MAY? Can we change QEMU to do things differently, given the
experimental nature of the support?

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help