Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state

From: Alex Williamson <hidden>
Date: 2021-12-20 22:26:31
Also in: kvm

On Tue, 14 Dec 2021 12:26:54 -0400
Jason Gunthorpe [off-list ref] wrote:

On Mon, Dec 13, 2021 at 01:40:38PM -0700, Alex Williamson wrote:

quoted

We do specify that a migration driver has discretion in using the error
state for failed transitions, so there are options to simplify error
handling.

This could be OK if we can agree ERROR has undefined behavior.

quoted

I think the basis of your priority scheme comes from that.  Ordering
of the remaining items is more subtle though, for instance
0 -> SAVING | RUNNING can be broken down as:

  0 -> SAVING
  SAVING -> SAVING | RUNNING 

  vs

  0 -> RUNNING
  RUNNING -> SAVING | RUNNING

I'd give preference to enabling logging before running and I believe
that holds for transition (e) -> (d) as well.

IMHO, any resolution to an arbitary choice should pick an order that
follows the reference flow because we know those are useful sequences
to have today.

Generally we have no use for the sequence SAVING -> SAVING | RUNNING,
while RUNNING -> SAVING | RUNNING is part of the standard flow.

SAVING -> SAVING | RUNNING or SAVING -> RUNNING could both be used as
recovery sequences should the target VM fail.  The latter might be a
full abort of the migration while the former might be a means to reset
the downtime clock without fully restarting the migration.

It also raises the question that it seems not well defined what the
sequence:

SAVING -> SAVING | RUNNING

Even does to the migration window?

Nor can I imagine what mlx5 could do beyond fail this or corrupt the
migration..

I think this comes down to the robustness of the migration driver.  The
migration data window and control of how userspace is to interact with
it is essentially meant to allow the migration driver to implement its
own transport protocol.  In the case of mlx5 where it expects only to
apply the received migration data on the RESUMING -> RUNNING
transition, a "short" data segment might be reserved for providing
sequencing information.  Each time the device enters SAVING | !RUNNING
the driver might begin by presenting a new sequence header.  On the
target, a new sequence header would cause any previously received
migration data to be discarded.  A similar header would also be
suggested to validate the migration data stream is appropriate for the
target device.

Also, I hope it goes without saying, but I'll say it anyway, the
migration data stream would make for an excellent exploit vector and
each migration driver needs to be responsible to make sure that
userspace cannot use it to break containment of the device.

If we keep the "should implement transitions" language below then I
expect mlx5 to fail this, and then we have a conflict where mlx5
cannot implement these precedence rules.

Per above, I think it can.

This is the kind of precedence resolution I was trying to avoid.

As far as I can see the requirements are broadly:
 - Do not transit through an invalid state
 - Do not loose NDMA during the transit, eg NDMA | SAVING | RUNNING -> 0
   should not have a race where a DMA can leak out
 - Do not do transit through things like SAVING -> SAVING | RUNNING,
   and I'm not confident I have a good list of these

"Things like", yeah, that's not really spec material.  There's nothing
special about that transition.  The migration driver should take into
account management of the migration data stream and support such
states, or error the transition if it isn't sufficient ROI and expect
device specific bug reports at some point in the future.

For NDMA, that's a valid consideration.  I think that means though that
NDMA doesn't simply bookend the pseudo algorithm I provided.  Perhaps
it's nested between RUNNING and SAVING handlers on either end.

quoted

And I think that also addresses the claim that we're doomed to untested
and complicated error code handling, we unwind by simply swapping the
args to our set state function and enter the ERROR state should that
recursive call fail.

I had the same thought the day after I wrote this, it seems workable.

I remain concerned however that we still can't seem to reach to a
working precedence after all this time. This is a very bad sign. 

Even if we work something out adding a new state someday is
terrifying. What if we can't work out any precedence that is
compatible with todays and supports the new state?

IMHO, we should be simplifing this before it becomes permanent API,
not trying to force it to work.

I agree, this is our opportunity to simplify before we're committed,
but I don't see how we can throw out perfectly valid transitions like
SAVING -> SAVING | RUNNING just because the driver hasn't accounted for
managing data in the data stream.

quoted

If we put it in the user's hands and prescribe only single bit flips,
they don't really have device knowledge to optimize further than this
like a migration driver might be able to do.

If so this argues we should go back to the enforced FSM that the v1
mlx5 posting had and forget about device_state as a bunch of bits.

Most of things I brought up in this post are resolved by the forced
FSM.

Until userspace tries to do something different than exactly what it
does today, and then what?

Yes, we give up some flexability, but I think the quest for
flexability is a little misguided. If the drivers don't consistently
implement the flexability then it is just cruft we cannot make use of
from userspace.

eg what practical use is SAVING -> SAVING | RUNNING if today's mlx5
implementation silently corrupts the migration stream? That instantly
makes that a no-go for userspace from an interoperability perspective
and we've accomplished nothing by allowing for it.

Failure to support that transition is a deficiency of the driver and
represented by a non-silent error in making that transition.  Silently
corrupting the migration stream is simply a driver bug.

Please think about it, it looks like an easy resolution to all this
discussion to simply specify a fixed FSM and be done with it.

quoted

I thought we could tackled this when you first suggested it (eg copy
the mlx5 logic and be OK), but now I'm very skeptical. The idea that
every driver can do this right in all the corner cases doesn't seem
reasonable given we've made so many errors here already just in mlx5.

quoted

+ *     - Bit 1 (SAVING) [REQUIRED]:
+ *        - Setting this bit enables and initializes the migration region data

I would use the word clear instead of initialize - the point of this
is to throw away any data that may be left over in the window from any
prior actions.

"Clear" to me suggests that there's some sort of internal shared buffer
implementation that needs to be wiped between different modes.  I chose
"initialize" because I think it offers more independence to the
implementation.

The data window is expressed as a shared buffer in this API, there is
only one data_offset/size and data window for everything.

Any access to the data window outside of that directed by the driver is
undefined, it's up to the driver where and how to populate the data.  A
driver might make a portion of the data window available as an mmap
that gets zapped and faulted to the correct device backing between
operations.

I think it is fine to rely on that for the description, and like all
abstractions an implementation can do whatever so long as externally
it looks like this shared buffer API.

The requirement here is that anything that pre-existed in the data
window from any prior operation is cleaned and the data window starts
empty before any data related to this SAVING is transfered.

IOW, it's initialized.  We're picking out colors for the bike shed at
this point.

quoted

+ *          window and associated fields within vfio_device_migration_info for
+ *          capturing the migration data stream for the device.  The migration
+ *          driver may perform actions such as enabling dirty logging of device
+ *          state with this bit.  The SAVING bit is mutually exclusive with the
+ *          RESUMING bit defined below.
+ *        - Clearing this bit (ie. !SAVING) de-initializes the migration region
+ *          data window and indicates the completion or termination of the
+ *          migration data stream for the device.

I don't know what "de-initialized" means as something a device should
do? IMHO there is no need to talk about the migration window here,
SAVING says initialize/clear - and data_offset/etc say their values
are undefined outside SAVING/RUNNING. That is enough.

If "initializing" the migration data region puts in place handlers for
pending_bytes and friends, "de-initializing" would undo that operation.
Perhaps I should use "deactivates"?

And if you don't use "initializing" we don't need to talk about
"de-initializating".

Reading the data window outside SAVING is undefined behavior it seems,
so nothing needs to be said.

Exactly why I thought simply describing it as the reciprocal of setting
the bit would be sufficient.  Taupe!

quoted

+ *     - Bit 2 (RESUMING) [REQUIRED]:
+ *        - Setting this bit enables and initializes the migration region data
+ *          window and associated fields within vfio_device_migration_info for
+ *          restoring the device from a migration data stream captured from a
+ *          SAVING session with a compatible device.  The migration driver may
+ *          perform internal device resets as necessary to reinitialize the
+ *          internal device state for the incoming migration data.
+ *        - Clearing this bit (ie. !RESUMING) de-initializes the migration
+ *          region data window and indicates the end of a resuming session for
+ *          the device.  The kernel migration driver should complete the
+ *          incorporation of data written to the migration data window into the
+ *          device internal state and perform final validity and consistency
+ *          checking of the new device state.  If the user provided data is
+ *          found to be incomplete, inconsistent, or otherwise invalid, the
+ *          migration driver must indicate a write(2) error and follow the
+ *          previously described protocol to return either the previous state
+ *          or an error state.

Prefer this is just 'go to an error state' to avoid unnecessary
implementation differences.

Then it becomes a special case versus other device_state changes and
we're forcing what you've described as an undefined state into the
protocol.

Lets look at what recovery actions something the VMM would need to
take along the reference flow:

RUNNING -> SAVING | RUNNING
  If this fails and we are still in RUNNING and can continue

 -> SAVING | RUNNING | NDMA
 -> SAVING  
  If these fail we need to go to RUNNING
  -> RUNNING  
    If this fails we need to RESET

I won't argue that there aren't transition failures where the next
logical step is likely a reset, but I also don't see the point in
defining special rules for certain cases.  When in doubt, leave policy
decisions to userspace?

 -> 0  
  Migration succeeded? Failure here should RESET

RUNNING -> RESUMING
  If this fails and we are still in RUNNING continue
 -> NDMA | RUNNING  
  If this fails RESET
 -> RUNNING  
  If this fails RESET, VM could be corrupted.

One view is that what the device does is irrelevant as qemu should
simply unconditionally reset in these case.

Another view is that staying in a useless state is also pointless and
we may as well return ERROR anyhow. Eg if exiting RESUMING failed
there is no other action to take besides RESET, so why didn't we
return ERROR to tell this directly to userspace?

And then the last packet arrives, gets written to the device that's
still in RESUMING, and now can transition to RUNNING.

Both are reasonable views, which is why I wrote "prefer".

There's no going back from the ERROR state, the only path the user has
forward is to reset the device.  Therefore the only case I'm willing to
say it's the preferred next state is in the case of an irrecoverable
internal fault.  I'd also like to avoid another lengthy discussion
trying to define which specific transitions should default to an ERROR
state if they fail versus simply return -errno.  Userspace is free to
define a policy where an -errno is considered fatal for the device.  I
prefer consistent userspace handling, letting userspace define policy,
and robust drivers that avoid forcing unnecessary user decisions.

quoted

+ *     - Bit 3 (NDMA) [OPTIONAL]:
+ *        The NDMA or "No DMA" state is intended to be a quiescent state for
+ *        the device for the purposes of managing multiple devices within a
+ *        user context where peer-to-peer DMA between devices may be active.
+ *        Support for the NDMA bit is indicated through the presence of the
+ *        VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by
+ *        VFIO_DEVICE_GET_REGION_INFO for the associated device migration
+ *        region.
+ *        - Setting this bit must prevent the device from initiating any
+ *          new DMA or interrupt transactions.  The migration driver must

I'm not sure about interrupts.

In the common case an interrupt is a DMA write, so the name, if not
intention of this state gets a bit shaky if interrupts are allowed.

Interrupts have their own masking protocol. For instance a device like
the huawei one is halting DMA by manipulating the queue registers, it
may still generate interrupts.

Yes, MSI is a MemWr, but I've never heard anyone call it a DMA - there
is no memory access here since the TLP is routed to the interrupt
controller.

That's a pretty subtle distinction.  Can't that controller live in MMIO
space and isn't it then just a peer-to-peer DMA?  I know I'm not the
first to consider MSI to be just another DMA, that seems to be the
basis of it's handling on ARM SMMU.

This is why I'm not sure. A hostile VM certainly can corrupt the MSI,
even today and thus turn it into a DMA. As we talked before this may
be OK, but is security risky that it allows the guest to impact the
hypervisor.

Overall it seems like this is more trouble for a device like huawei's
if they want to implement NDMA using the trapping or something. Given
your right concern that NDMA should be implemented as widely as
possible making it more difficult that stricly necessary is perhaps
not good.

Other peope should comment here.

Yeah, I'm not clear on what devices can and cannot do in the NDMA state.
Ultimately the goal is that once all devices are in the NDMA state, we
can safely transition them to the !RUNNING state without concern
regarding access from another device.  Specifically we want to avoid
things like DeviceA moves to !RUNNING while DeviceB initiates a DMA
access to DeviceA which now cannot respond without advancing internal
state which violates the condition of !RUNNING.

But I think we're really only after that p2p behavior and we've
discussed that disabling p2p mappings in the VM would be a sufficient
condition to support multiple devices without NDMA support.  I think
that means DMA to memory is fine and DMA related to MSI is fine... but
how does a device know which DMA is memory and which DMA is another
device?

quoted

+ *          complete any such outstanding operations prior to completing
+ *          the transition to the NDMA state.  The NDMA device_state

Reading this as you wrote it and I suddenly have a doubt about the PRI
use case. Is it reasonable that the kernel driver will block on NDMA
waiting for another userspace thread to resolve any outstanding PRIs?

Can that allow userspace to deadlock the kernel or device? Is there an
alterative?

I'd hope we could avoid deadlock in the kernel, but it seems trickier
for userspace to be waiting on a write(2) operation to the device while
also handling page request events for that same device.  Is this
something more like a pending transaction bit where userspace asks the
device to go quiescent and polls for that to occur?

Hum. I'm still looking into this question, but some further thoughts.

PRI doesn't do DMA, it just transfers a physical address into the PCI
device's cache that can be later used with DMA.

PRI also doesn't imply the vPRI Intel is talking about.

For PRI controlled by the hypervisor, it is completely reasonable that
NDMA returns synchronously after the PRI and the DMA that triggered it
completes. The VMM would have to understand this and ensure it doesn't
block the kernel's fault path while going to NDMA eg with userfaultfd
or something else crazy.

The other reasonable option is that NDMA cancels the DMA that
triggered the PRI and simply doesn't care how the PRI is completed
after NDMA returns.

The later is interesting because it is a possible better path to solve
the vPRI problem Intel brought up. Waiting for the VCPU is just asking
for a DOS, if NDMA can cancel the DMAs we can then just directly fail
the open PRI in the hypervisor and we don't need to care about the
VCPU. Some mess to fixup in the vIOMMU protocol on resume, but the
resume'd device simply issues a new DMA with an empty ATS cache and
does a new PRI.

It is uncertain enough that qemu should not support vPRI with
migration until we define protocol(s) and a cap flag to say the device
supports it.

quoted

+ *   All combinations for the above defined device_state bits are considered
+ *   valid with the following exceptions:
+ *     - RESUMING and SAVING are mutually exclusive, all combinations of
+ *       (RESUMING | SAVING) are invalid.  Furthermore the specific combination
+ *       (!NDMA | RESUMING | SAVING | !RUNNING) is reserved to indicate the
+ *       device error state VFIO_DEVICE_STATE_ERROR.  This variant is
+ *       specifically chosen due to the !RUNNING state of the device as the
+ *       migration driver should do everything possible, including an internal
+ *       reset of the device, to ensure that the device is fully stopped in
+ *       this state.

Prefer we don't specify this. ERROR is undefined behavior and
userspace should reset. Any path that leads along to ERROR already
includes possiblities for wild DMAs and what not, so there is nothing
to be gained by this other than causing a lot of driver complexity,
IMHO.

This seems contrary to your push for consistent, interoperable
behavior.

Formal "undefined behavior" can be a useful part of a spec, especially
if the spec is 'when you see ERROR you must do RESET', we don't really
need to constrain the device further to continue to have
interoperability.

quoted

What's the benefit to actually leaving the state undefined or the
drawback to preemptively resetting a device if the migration driver
cannot determine if the device is quiesced,

RESET puts the device back to RUNNING, so RESET alone does not remedy
the problem.

RESET followed by !RUNNING can fail, meaning the best mlx5 can do is
"SHOULD", in which case lets omit the RESET since userspace can't rely
on it.

Even if it did work reliably, the requirement is userspace must issue
RESET to exit ERROR and if we say the driver has to issue reset to
enter ERROR we are just doing a pointless double RESET.

Please read what I wrote:

    This variant is specifically chosen due to the !RUNNING state of
    the device as the migration driver should do everything possible,
    including an internal reset of the device, to ensure that the
    device is fully stopped in this state.

That does not say that a driver must issue a reset to enter the ERROR
state.  Perhaps it's wrong that I'm equating this so formally to the
!RUNNING state when really we don't care about the internal state of
the device, we just want it to not corrupt memory or generate spurious
interrupts.  I'm thinking the equivalent of clear bus-master for PCI
devices.  Would it be sufficient if I clarified !RUNNING relative to
DMA and interrupt generation?

quoted

would need to reset the device to enter a new state anyway?  I added
this because language in your doc suggested the error state was far
more undefined that I understood it to be, ie. !RUNNING.

Yes it was like that, because the implementation of this strict
requirement is not nice.

Perhaps a middle ground can work:

  For device_state ERROR the device SHOULD have the device
  !RUNNING. If the ERROR arose due to a device_state change and
  if the new and old states have NDMA behavior then the device MUST
  maintain NDMA behavior while processing the device_state and
  continuing while in ERROR. Userspace MUST reset the device to
  recover from ERROR, therefore devices SHOULD NOT do a redundant
  internal reset

I don't have a problem if the reset is redundant to one the user needs
to do anyway, I'd rather see any externally visible operation of the
device stopped ASAP.  The new and old state NDMA-like properties is
also irrelevant, if a device enters an ERROR state moving from RUNNING
-> SAVING | RUNNING it shouldn't continue manipulating memory and
generating interrupts in the background.  What about:

    The !RUNNING variant is used here specifically to reflect that the
    device should immediately cease all external operations such as DMA
    and interrupts.  The migration driver should do everything
    possible, up to and including an internal reset of the device, to
    ensure that the device is externally quiescent in this state.

quoted

+ *   Migration drivers should attempt to support any
    transition between valid

should? must, I think.

I think that "must" terminology is a bit contrary to the fact that
    we have a defined error state that can be used at the
    discretion of the migration driver.  To me, "should" tells the
    migration drivers that they ought to make an attempt to support
    all transitions, but userspace needs to be be prepared that
    they might not work.

IMHO this is not inter-operable. At a minimum we should expect that a
driver implements a set of standard transitions, or it has to
implement all of them.

Otherwise what is the point?

If you go back to the mlx5 v1 version it did effectively this. It
enforced a FSM and only allowed some transitions. That meets the
language here with "should" but you didn't like it, and I agreed with
you then.

This is when the trouble stated :)

The mlx5 v1 with the FSM didn't have alot of these problems we are
discussing. It didn't have precedence issues, it didn't have problems
executing odd combinations it can't support, it worked and was simple
to understand.

And an audit of that driver during review found that it grossly failed
to meet the spirit of a "should" requirement.

Using "should" terminology here is meant to give the driver some
leeway, it's not an invitation for abuse.  Even below there's still a
notion that a given state transitions is unsupportable by your device,
what if that was actually true?

So, if we say should here, then I vote mlx5 goes back to enforcing
its FSM and that becomes the LCD that userspace must implement to.

In which case, why not formally specify the FSM now and avoid a driver
pushing a defacto spec?

It really only takes one driver implementing something like SAVING ->
SAVING | RUNNING and QEMU taking advantage of it as a supported
transition per the uAPI for mlx5 to be left out of the feature that
might provide.

If we say MUST here then we need to figure out a precedence and 
say that some transitions are undefined behavior, like SAVING ->
SAVING|RUNNING.

If we say "should" and don't do those thing, then we're still not
implementing to the spirit of the uAPI.  I'm hearing a lot of "may" in
your interpretation of "should".  And again, nothing wrong with that
transition, manage the migration stream better.  Thanks,

Alex

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help