Thread (40 messages) 40 messages, 4 authors, 2015-12-09

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

From: Alexander Duyck <hidden>
Date: 2015-12-01 18:36:37
Also in: intel-wired-lan, kvm, lkml, netdev, qemu-devel

On Tue, Dec 1, 2015 at 9:37 AM, Michael S. Tsirkin [off-list ref] wrote:
On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote:
quoted
On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin [off-list ref] wrote:
quoted
quoted
There are several components to this:
- dma_map_* needs to prevent page from
  being migrated while device is running.
  For example, expose some kind of bitmap from guest
  to host, set bit there while page is mapped.
  What happens if we stop the guest and some
  bits are still set? See dma_alloc_coherent below
  for some ideas.
Yeah, I could see something like this working.  Maybe we could do
something like what was done for the NX bit and make use of the upper
order bits beyond the limits of the memory range to mark pages as
non-migratable?

I'm curious.  What we have with a DMA mapped region is essentially
shared memory between the guest and the device.  How would we resolve
something like this with IVSHMEM, or are we blocked there as well in
terms of migration?
I have some ideas. Will post later.
I look forward to it.
quoted
quoted
- dma_unmap_* needs to mark page as dirty
  This can be done by writing into a page.

- dma_sync_* needs to mark page as dirty
  This is trickier as we can not change the data.
  One solution is using atomics.
  For example:
        int x = ACCESS_ONCE(*p);
        cmpxchg(p, x, x);
  Seems to do a write without changing page
  contents.
Like I said we can probably kill 2 birds with one stone by just
implementing our own dma_mark_clean() for x86 virtualized
environments.

I'd say we could take your solution one step further and just use 0
instead of bothering to read the value.  After all it won't write the
area if the value at the offset is not 0.
Really almost any atomic that has no side effect will do.
atomic or with 0
atomic and with ffffffff

It's just that cmpxchg already happens to have a portable
wrapper.
I was originally thinking maybe an atomic_add with 0 would be the way
to go.  Either way though we still are using a locked prefix and
having to dirty a cache line per page which is going to come at some
cost.
quoted
quoted
- dma_alloc_coherent memory (e.g. device rings)
  must be migrated after device stopped modifying it.
  Just stopping the VCPU is not enough:
  you must make sure device is not changing it.

  Or maybe the device has some kind of ring flush operation,
  if there was a reasonably portable way to do this
  (e.g. a flush capability could maybe be added to SRIOV)
  then hypervisor could do this.
This is where things start to get messy. I was suggesting the
suspend/resume to resolve this bit, but it might be possible to also
deal with this via something like this via clearing the bus master
enable bit for the VF.  If I am not mistaken that should disable MSI-X
interrupts and halt any DMA.  That should work as long as you have
some mechanism that is tracking the pages in use for DMA.
A bigger issue is recovering afterwards.
Agreed.
quoted
quoted
  In case you need to resume on source, you
  really need to follow the same path
  as on destination, preferably detecting
  device reset and restoring the device
  state.
The problem with detecting the reset is that you would likely have to
be polling to do something like that.
We could some event to guest to notify it about this event
through a new or existing channel.

Or we could make it possible for userspace to trigger this,
then notify guest through the guest agent.
The first thing that comes to mind would be to use something like PCIe
Advanced Error Reporting, however I don't know if we can put a
requirement on the system supporting the q35 machine type or not in
order to support migration.
quoted
 I believe the fm10k driver
already has code like that in place where it will detect a reset as a
part of its watchdog, however the response time is something like 2
seconds for that.  That was one of the reasons I preferred something
like hot-plug as that should be functioning as soon as the guest is up
and it is a mechanism that operates outside of the VF drivers.
That's pretty minor.
A bigger issue is making sure guest does not crash
when device is suddenly reset under it's legs.
I know the ixgbevf driver should already have logic to address some of
that.  If you look through the code there should be logic there for a
surprise removal support in ixgbevf.  The only issue is that unlike
fm10k it will not restore itself after a resume or slot_reset call.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help