Re: [PATCH V2 mlx5-next 14/14] vfio/mlx5: Use its own PCI reset_done error handler


From: Alex Williamson <hidden>
Date: 2021-10-20 21:38:20
Also in: kvm, linux-pci

On Wed, 20 Oct 2021 15:57:21 -0300
Jason Gunthorpe [off-list ref] wrote:

On Wed, Oct 20, 2021 at 11:45:14AM -0600, Alex Williamson wrote:

quoted

On Wed, 20 Oct 2021 13:46:29 -0300
Jason Gunthorpe [off-list ref] wrote:

quoted

On Wed, Oct 20, 2021 at 11:46:07AM +0300, Yishai Hadas wrote:

quoted

What is the expectation for a reasonable delay? We may expect this system
WQ to run only short tasks and be very responsive.

If the expectation is that qemu will see the error return and then turn
around and issue FLR followed by another state operation then it does
seem strange that there would be a delay.

On the other hand, this doesn't seem that useful. If qemu tries to
migrate and the device fails then the migration operation is toast and
possibly the device is wrecked. It can't really issue a FLR without
coordinating with the VM, and it cannot resume the VM as the device is
now irrecoverably messed up.

If we look at this from a RAS perspective, what would be useful here
is a way for qemu to request fail-safe migration data. This must
always be available and cannot fail.

When the failsafe is loaded into the device it would trigger the
device's built-in RAS features to co-ordinate with the VM driver and
recover. Perhaps qemu would also have to inject an AER or something.

Basically instead of the device starting in an "empty ready to use
state" it would start in a "failure detected, needs recovery" state.

The "fail-safe recovery state" is essentially the reset state of the
device.

This is only the case if qemu does work to isolate the recently FLR'd
device from the VM until the VM acknowledges that it understands it is
FLR'd.

At least it would have to remove it from CPU access and the IOMMU, as
though the memory enable bit was cleared.

Is it reasonable to do this using just qemu, AER and no device
support?

I suspect yes, worst case could be a surprise hot-remove or DPC event,
but IIRC Linux will reset a device on a fatal AER error regardless of
the driver.

quoted

If a device enters an error state during migration, I would
think the ultimate recovery procedure would be to abort the migration,
send an AER to the VM, whereby the guest would trigger a reset, and
the RAS capabilities of the guest would handle failing over to a
multipath device, ejecting the failing device, etc.

Yes, this is my thinking, except I would not abort the migration but
continue on to the new hypervisor and then do the RAS recovery with
the new device.

Potentially a valid option, QEMU might optionally insert a subsection in
the migration stream to indicate the device failed during the migration
process.  The option might also allow migrating devices that don't
support migration, i.e. the recovery process on the target is the same.
This is essentially a policy decision and I think QEMU probably leans
more towards failing the migration and letting a management tool
decide on the next course of action.  Thanks,

Alex
