Re: [PATCH RFC v2] vfio: Documentation for the migration region

From: Jason Gunthorpe <jgg@nvidia.com>
Date: 2021-12-01 03:14:13
Also in: kvm

On Tue, Nov 30, 2021 at 03:35:41PM -0700, Alex Williamson wrote:

quoted

From what HNS said the device driver would have to trap every MMIO to
implement NDMA as it must prevent touches to the physical HW MMIO to
maintain the NDMA state.

The issue is that the HW migration registers can stop processing the
queue and thus enter NDMA but a MMIO touch can resume queue
processing, so NDMA cannot be sustained.

Trapping every MMIO would have a huge negative performance impact.  So
it doesn't make sense to do so for a device that is not intended to be
used in any situation where NDMA is required.

But migration is a cooperative activity with userspace.  If necessary
we can impose a requirement that mmap access to regions (other than the
migration region itself) are dropped when we're in the NDMA or !RUNNING
device_state.

It is always NDMA|RUNNING, so we can't fully drop access to
MMIO. Userspace would have to transfer from direct MMIO to
trapping. With enough new kernel infrastructure and qemu support it
could be done.

Even so, we can't trap accesses through the IOMMU so such a scheme
would still require removing IOMMU acess to the device. Given that the
basic qemu mitigation for no NDMA support is to eliminate P2P cases by
removing the IOMMU mappings this doesn't seem to advance anything and
only creates complexity.

At least I'm not going to insist that hns do all kinds of work like
this for a edge case they don't care about as a precondition to get a
migration driver.

There's no reason that mediation while in the NDMA state needs to
impose any performance penalty against the default RUNNING state.

Eh? Mitigation of no NDMA support would have to mediate the MMIO on a
a performance doorbell path, there is no escaping a performance
hit. I'm not sure what you mean

quoted

Some discussion of this requirement would be useful in the doc,
otherwise it seems easier to deprecate the v1 migration region
sub-type, and increment to a v2 where NDMA is a required feature.

I can add some words a the bottom, but since NDMA is a completely
transparent optional feature I don't see any reason to have v2.

It's hardly transparent, aiui userspace is going to need to impose a
variety of loosely defined restrictions for devices without NDMA
support.  It would be far easier if we could declare NDMA support to be
a requirement.

It would make userspace a bit simpler at the cost of excluding or
complicating devices like hns for a use case they don't care about.

On the other hand, the simple solution in qemu is when there is no
universal NDMA it simply doesn't include any MMIO ranges in the
IOMMU.

As I think Connie also had trouble with, combining device_state with
IOMMU migration features and VMM state, without any preceding context
and visual cues makes the section confusing.  I did gain context as I
read further though the doc, but I also had the advantage of being
rather familiar with the topic.  Maybe a table format would help to
segment the responsibilities?

I moved the context to the bottom exactly because Connie said it was
confusing at the start. :)

This is a RST document so I not keen to make huge formatting
adventures for minimal readability gain.

I view this as something that probably needs to be read a few times,
along with the code and header files for someone brand new to
understand. I'm Ok with that, it is about consistent with kernel docs
of this level.

What I would like is if userspace focused readers can get their
important bits of information with less work.

quoted

It is exsisting behavior of qemu - which is why we documented it.

QEMU resets devices as part of initializing the VM, but I don't see
that QEMU specifically resets a device in order to transition it to
the RESUMING device_state.

We instrumented the kernel and monitored qemu, it showed up on the
resume traces.

quoted

Either qemu shouldn't do it as devices must fully self-reset, or we
should have it part of the canonical flow and devices may as well
expect it. It is useful because post VFIO_DEVICE_RESET all DMA is
quiet, no outstanding PRIs exist, etc etc.

It's valid for QEMU to reset the device any time it wants, saying that
it cannot perform a reset before transitioning to the RESUMING state is
absurd.  Userspace can do redundant things for their own convenience.

I didn't say cannot, I said it shouldn't.

Since qemu is the only implementation it would be easy for drivers to
rely on the implicit reset it seems to do, it seems an important point
that should be written either way.

I don't have a particular requirement to have the reset, but it does
seem like a good idea. If you feel strongly, then let's say the
opposite that the driver must enter RESUME with no preconditions,
doing an internal reset if required.

We don't currently specify any precondition for a device to enter the
RESUMING state.  The driver can of course nak the state change with an
errno, or hard nak it with an errno and ERROR device_state, which would
require userspace to make use of VFIO_DEVICE_RESET.

I don't think we should be relying on every driver doing something
totally differnt on the standard path. That is only going to hurt
interoperability.

quoted

As with the previous flows, it seems like there's a ton of implicit
knowledge here.  Why are we documenting these here rather than in the
uAPI header?

Because this is 300 lines already and is too complicated/long to
properly live in a uapi header.

Minimally we need to resolve that this document must be consistent with
the uAPI.  I'm not sure that's entirely the case in this draft.

Can you point to something please? I can't work with "I'm not sure"

IMO the header file doesn't really say much and can be read in a way
that is consistent with this more specific document.

quoted

 - qemu doesn't support P2P cases due to the NDMA topic

Or rather QEMU does support p2p cases regardless of the NDMA topic.

I mean support in a way that is actually usable as without NDMA it
corrupts the VM when it migrates it.

quoted

 - simple devices like HNS will work, but not robustly in the face of
   a hostile VM and multiple VFIO devices.

So what's the goal here, are we trying to make the one currently
implemented and unsupported userspace be the gold standard to which
drivers should base their implementation?

I have no idea anymore. You asked for docs and complete picture as a
percondition for merging a driver. Here it is.

What do you want?

We've tried to define a specification that's more flexible than a
single implementation and by these standards we seem to be flipping
that implementation back into the specification.

What specification!?! All we have is a couple lines in a header file
that is no where near detailed enough for multi-driver
interoperability with userspace. You have no idea how much effort has
been expended to get this far based on the few breadcrumbs that were
left, and we have access to the team that made the only other
implementation!

*flexible* is not a specification.

Userspace can attempt RESUMING -> RUNNING regardless of what we specify,
so a driver needs to be prepared for such an attempted state change
either way.  So what's the advantage to telling a driver author that
they can expect a given behavior?

The above didn't tell a driver author to expect a certain behavior, it
tells userspace what to do.

It doesn't make much sense to me to glue two separate userspace
operations together to say these must be done in this sequence, back to
back.  If we want the device to be reset in order to enter RESUMING, the
driver should simply reset the device as necessary during the state
transition.  The outward effect to the user is to specify that device
internal state may not be retained on transition from RUNNING ->
RESUMING.

Maybe, and I'm happy if you want to specify this instead. It just
doesn't match what we observe qemu to be doing.

quoted

Do you have an alternative language? This is quite complicated, I
advise people to refer to mlx5's implementation.

I agree with Connie on this, if the reader of the documentation needs
to look at a specific driver implementation to understand the
reasoning, the documentation has failed.

Lets agree on some objective here, this is not trying to be fully
comprehensive, or fully standalone. It is intended to drive agreement,
be informative to userspace, and be supplemental to the actual code.

If it can be worked out by looking at the device_state write
function of the mlx5 driver, then surely a sentence or two for each
priority item can be added here.

Please give me a suggestion then, because I don't know what will help
here?

Part of the problem is that the nomenclature is unclear, we're listing
bit combinations, but not the changed bit(s) and we need to infer the
state.

Each line lists the new state, the changed bits are thus any bits that
make up the new state.

If you look at how mlx5 is constructed each if has a 'did it change'
test followed by 'what state is it in now'

So the document is read as listing the order the driver enters the new
states. I clarified it as ""must process the new device_state bits in
a priority order""

flips in the presence of an existing state.  I'm not able to obviously
map the listing above to the latest posted version of the mlx5 driver.

One of the things we've done is align mlx5 more clearly to this. For
instance it no longer has a mixture of state and old state in the if
statements, it always tests the new state so the tests logically
follow what is written here

Stripping away the excess the expressions now look like this:

 !(state & VFIO_DEVICE_STATE_RUNNING)
 ((state & (VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING)) == VFIO_DEVICE_STATE_SAVING))
 (state & VFIO_DEVICE_STATE_RESUMING)

Which mirror what is written here.

quoted

+  As Peer to Peer DMA is a MMIO touch like any other, it is important that
+  userspace suspend these accesses before entering any device_state where MMIO
+  is not permitted, such as !RUNNING. This can be accomplished with the NDMA
+  state. Userspace may also choose to remove MMIO mappings from the IOMMU if the
+  device does not support NDMA and rely on that to guarantee quiet MMIO.

Seems that would have its own set of consequences.

Sure, userspace has to make choices here.

It seems a bit loaded to suggest an alternative choice if it's not
practical or equivalent.  Maybe it's largely the phrasing, I read
"remove MMIO mappings" as to drop them dynamically, when I think we've
discussed that userspace might actually preclude these mappings for
non-NDMA devices such that p2p DMA cannot exist, ever.

I mean the latter. How about "never install MMIO mappings" ?

quoted

Overall it must work in this basic way, and devices have freedom about
what internal state they can/will log. There is just a clear division
that every internal state in the first step is either immutable or
logged, and that the second step is a delta over the first.

I agree that it's a reasonable approach, though as I read the proposed
text, there's no mention of immutable state and no reason a driver
would implement a dirty log for immutable state, therefore we seem to
be suggesting such data for the stop-and-copy phase when it would
actually be preferable to include it in pre-copy.

I'd say that is a detail we don't need to discuss/define, it has no
user space visible consequence.

I think the fact that a user is not required to run the pre-copy
phase until completion is also noteworthy.

This text doesn't try to detail how the migration window works, that
is a different large task. The intention is that the migration window
must be fully drained to be successful.

I added this for some clarity ""The entire migration data, up to each
end of stream must be transported from the saving to resuming side.""

quoted

Yishai has a patch already to add NDMA to mlx5, it will come in the
next iteration once we can agree on this document. qemu will follow
sometime later.

So it's not really a TBD, it's resolved in a uAPI update that will be
included with the next revision?  Thanks,

There is a patch yes, the TBD here is to include a few words about how
to detect NDMA

Jason

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help