Thread (100 messages) 100 messages, 8 authors, 2021-11-22

Re: [PATCH V2 mlx5-next 12/14] vfio/mlx5: Implement vfio_pci driver for mlx5 devices

From: Alex Williamson <hidden>
Date: 2021-11-18 18:16:02
Also in: kvm, linux-pci

On Tue, 16 Nov 2021 21:48:31 -0400
Jason Gunthorpe [off-list ref] wrote:
On Tue, Nov 16, 2021 at 02:10:31PM -0700, Alex Williamson wrote:
quoted
On Tue, 16 Nov 2021 15:25:05 -0400
Jason Gunthorpe [off-list ref] wrote:
  
quoted
On Tue, Nov 16, 2021 at 10:57:36AM -0700, Alex Williamson wrote:
  
quoted
quoted
I think userspace should decide if it wants to use mlx5 built in or
the system IOMMU to do dirty tracking.    
What information does userspace use to inform such a decision?    
Kernel can't know which approach performs better. Operators should
benchmark and make a choice for their deployment HW. Maybe device
tracking severely impacts device performance or vice versa.  
I'm all for keeping policy decisions out of the kernel, but it's pretty
absurd to expect a userspace operator to benchmark various combination
and wire various knobs through the user interface for this.   
Huh? This is the standard operating procedure in netdev land, mm and
others. Max performance requires tunables. And here we really do care
alot about migration time. I've seen the mm complexity supporting the
normal vCPU migration to know this is a high priority for many people.
Per previous reply:

"Maybe start with what uAPI visible knobs really make sense
and provide a benefit for per-device dirty bitmap tuning and how a
device agnostic userspace like QEMU is going to make intelligent
decisions about those knobs."
quoted
It seems to me that the kernel, ie. the vfio variant driver, *can*
know the best default.    
If the kernel can have an algorithm to find the best default then qemu
can implement the same algorithm too. I'm also skeptical about this
claim the best is knowable.
QEMU is device agnostic and has no visibility to the system IOMMU
capabilities.
 
quoted
quoted
Kernel doesn't easily know what userspace has done, maybe one device
supports migration driver dirty tracking and one device does not.  
And that's exactly why the current type1 implementation exposes the
least common denominator to userspace, ie. pinned pages only if all
devices in the container have enabled this degree of granularity.  
The current implementation forces no dirty tracking. That is a policy
choice that denies userspace the option to run one device with dirty
track and simply suspend the other.
"Forces no dirty tracking"?  I think you're suggesting that if a device
within an IOMMU context doesn't support dirty tracking that we're
forced to always report all mappings within that context as dirty,
because for some reason we're not allowed to evolve how devices can
interact with a shared dirty context, but we are allowed to invent a new
per-device uAPI?
quoted
quoted
So, I would like to see userspace control most of the policy aspects,
including the dirty track provider.  
This sounds like device specific migration parameter tuning via a
devlink interface to me, tbh.  How would you propose a generic
vfio/iommufd interface to tune this sort of thing?  
As I said, if each page track provider has its own interface it is
straightforward to make policy in userspace. The only tunable beyond
that is the dirty page tracking granularity. That is already
provided by userspace, but not as an parameter during START.

I don't see why we'd need something as complicated as devlink just
yet.
The approaches aren't that different.  It seems like you want userspace
to be able to select which devices to get dirty page info from while
I'm suggesting that we can develop ways that the user can manage how
the device interacts with a shared dirty page state.  Per-device dirty
page state doesn't preclude a shared dirty state, we still expect that
any IOMMU interface is the source of truth when we have multiple
devices sharing an IOMMU context.

What if we have a scenario where devices optionally have per device
dirty page tracking.

A) mlx5 w/ device level dirty page tracking, vGPU w/ page pinning

vs

B) mlx5 w/ device level dirty page tracking, NIC with system IOMMU
   dirty page tracking

Compare and contrast in which case the shared IOMMU context dirty page
tracking reports both device's versus only the devices without
per-device tracking.  Is this obvious to both userspace and the shared
dirty page tracking if it's the same or different in both cases?
quoted
quoted
How does userspace know if dirty tracking works or not? All I see
VFIO_IOMMU_DIRTY_PAGES_FLAG_START unconditionally allocs some bitmaps.  
IIRC, it's always supported by type1.  In the worst case we always
report all mapped pages as dirty.  
Again this denies userspace a policy choice. Why do dirty tracking
gyrations if they don't work? Just directly suspend the devices and
then copy.
As above, it seems like you're freezing one interface and allowing the
other to evolve.  We can create ways that a device without dirty
tracking can be marked as not generating DMA within a shared dirty page
context.

If maybe what you're really trying to get at all along is visibility to
the per-device dirty page contribution, that does sound interesting.
But I also don't know how that interacts with system IOMMU based
reporting.  Does the IOMMU report to the driver that reports to
userspace the per-device dirty pages?
quoted
quoted
I'm surprised it doesn't check that only NO_IOMMU's devices are
attached to the container and refuse to dirty track otherwise - since
it doesn't work..  
No-IOMMU doesn't use type1, the ioctl returns errno.  
Sorry, I mistyped that, I ment emulated iommu, as HCH has called it:

vfio_register_emulated_iommu_dev()
Ok, so that's essentially the vfio IOMMU driver just keeping track of
userspace mappings in order to provide page pinning, where a device
supporting page pinning is our first step in more fine grained dirty
tracking.  Type1 keeps track of pages pinned across all devices sharing
the IOMMU container, thus we have container based dirty page reporting.
 
quoted
quoted
When we mix dirty track with pre-copy, the progression seems to be:

  DITRY TRACKING | RUNNING
     Copy every page to the remote
  DT | SAVING | RUNNING
     Copy pre-copy migration data to the remote
  SAVING | NDMA | RUNNING
     Read and clear dirty track device bitmap
  DT | SAVING | RUNNING
     Copy new dirtied data
     (maybe loop back to NDMA a few times?)
  SAVING | NDMA | RUNNING
     P2P grace state
  0
    Read the dirty track and copy data
    Read and send the migration state

Can we do something so complex using only SAVING?  
I'm not demanding that triggering device dirty tracking on saving is
how this must be done, I'm only stating that's an idea that was
discussed.  If we need more complicated triggers between the IOMMU and
device, let's define those, but I don't see that doing so negates the
benefits of aggregated dirty bitmaps in the IOMMU context.  
Okay. As far as your request to document things as we seem them
upcoming I belive we should have some idea how dirty tracking control
fits in. I agree that is not related to how the bitmap is reported. We
will continue to think about dirty tracking as not connected to
SAVING.
Ok.
 
quoted
quoted
This creates inefficiencies in the kernel, we copy from the mlx5
formed data structure to new memory in the iommu through a very
ineffficent API and then again we do an ioctl to copy it once more and
throw all the extra work away. It does not seem good for something
where we want performance.  
So maybe the dirty bitmaps for the IOMMU context need to be exposed to
and directly modifiable by the drivers using atomic bitmap ops.  Maybe
those same bitmaps can be mmap'd to userspace.  These bitmaps are not
insignificant, do we want every driver managing their own copies?  
If we look at mlx5 for example there is no choice. The dirty log is in
some device specific format and is not sharable. We must allocate
memory to work with it.

What I don't need is the bitmap memory in the iommu container, that is
all useless for mlx5.

So, down this path we need some API for the iommu context to not
allocate its memory at all and refer to storage from the tracking
provider for cases where that makes sense.
That depends whether there are other devices in the context and if the
container dirty context is meant to include all devices or if the
driver is opt'ing out of the shared tracking... if that's a thing.
Alternatively, drivers could register callbacks to report their dirty
pages into the shared tracking for ranges requested by the user.  We
need to figure out how per-device tracking an system IOMMU tracking get
along if that's where we're headed.
 
quoted
quoted
Userspace has to do this anyhow if it has configurations with multiple
containers. For instance because it was forced to split the containers
due to one device not supporting NDMA.  
Huh?  When did that become a requirement?    
It is not a requirement, it is something userspace can do, if it
wants. And we talked about this, if NDMA isn't supported the P2P can't
work, and a way to enforce that is to not include P2P in the IOMMU
mapping. Except if you have an asymmetric NDMA you may want an
asymmetric IOMMU mapping too where the NDMA devices can do P2P and the
others don't. That is two containers and two dirty bitmaps.
I'm not sure how I was supposed to infer that use case from "...forced
to split the containers due to one device not support NDMA".  That's
certainly an option, but not a requirement.  In fact, if QEMU were to
do something like that, then we have some devices that can do p2p and
some devices that cannot... all or none seems like a much more
deterministic choice for QEMU.  How do guest drivers currently test p2p?
 
Who knows, it isn't the job of the kernel to make these choices, the
kernel just provides tools.
Agreed, but I don't see that userspace choosing to use separate
contexts either negates the value of the kernel aggregating dirty pages
within a container or clearly makes the case for per-device dirty pages.
quoted
I feel like there are a lot of excuses listed here, but nothing that
really demands a per device interface,   
"excuses" and "NIH" is a bit rude Alex.

From my side, you are asking for a lot work from us (who else?) to
define and implement a wack of missing kernel functionality.

I think it is very reasonable to ask what the return to the community
is for this work. "makes more sense to me" is not something that
is really compelling.

So, I'd rather you tell me concretely why doing this work, in this
way, is a good idea.
On my end, we have a defined dirty page tracking interface at the IOMMU
context with some, admittedly rough, plans on how we intend to evolve
that interface for both device level and IOMMU level tracking feeding
into a shared dirty page interfaces.  The example scenarios presented
mostly seem to have solutions within that design framework if we put a
little effort into it.  There are unanswered questions down either
path, but one of those paths is the path we previously chose.  The
greater burden is to switch paths, so rather I need to understand why
this is the better path, with due diligence to explore what the same
looks like with the current design.

It's only finally here in the thread that we're seeing some of the mlx5
implementation details that might favor a per-device solution, hints
that per device page granularity might be useful, and maybe that
exposing per-device dirty page footprints to userspace is underlying
this change of course.
quoted
quoted
If the migration driver says it supports tracking, then it only tracks
DMA from that device.  
I don't see what this buys us.  Userspace is only going to do a
migration if all devices support the per device migration region.  At
that point we need the best representation of the dirty bitmap we can
provide per IOMMU context.  It makes sense to me to aggregate per
device dirtying into that one context.  
Again, there are policy choices userspace can make, like just
suspending the device that doesn't do dirty tracking and continuing to
dirty track the ones that do.

This might be very logical if a non-dirty tracking device is something
like IDXD that will just cause higher request latency and the dirty
tracking is mlx5 that will cause externally visable artifacts.

My point is *we don't know* what people will want.

I also think you are too focused on a one-size solution that fits into
a qemu sort of enteprise product. While I can agree with your position
relative to an enterprise style customer, NVIDIA has different
customers, many that have their own customized VMMs that are tuned for
their HW choices. For these customers I do like to see that the kernel
allows options, and I don't think it is appropriate to be so
dismissive of this perspective.
So provide the justification I asked for previously and quoted above,
what are the things we want to be able to tune that cannot be done
through reasonable extensions of the current design?  I'm not trying to
be dismissive, I'm lacking facts and evidence of due diligence that the
current design is incapable of meeting our goals.
 
quoted
quoted
What I see is a lot of questions and limitations with this
approach. If we stick to funneling everything through the iommu then
answering the questions seem to create a large amount of kernel
work. Enough to ask if it is worthwhile..  
If we need a common userspace IOMMU subsystem like IOMMUfd that can
handle driver page pinning, IOMMU faults, and dirty tracking, why does
it suddenly become an unbearable burden to allow other means besides
page pinning for a driver to relay DMA page writes?  
I can see some concrete reasons for iommufd, like it allows this code
to be shared between mutliple subsystems that need it. Its design
supports more functionality than the vfio container can.

Here, I don't quite see it. If userspace does

  for (i = 0; i != num_trackers; i++)
     ioctl(merge dirty bitmap, i, &bitmap)

Or
   ioctl(read diry bitmap, &bitmap)

Hardly seems decisive.
On the same order as "makes more sense to me" ;)
What bothers me is the overhead and kernel
complexity.

If we split them it looks quite simple:
 - new ioctl on vfio device to read & clear dirty bitmap
    & extension flag to show this is supported
 - DIRTY_TRACK migration state bit to start/stop
Is this another device_state bit?
   Simple logic that read is only possible in NDMA/!RUNNING
 - new ioctl on vfio device to report dirty tracking capability flags:
    - internal dirty track supported
    - real dirty track through attached container supported
      (only mdev can set this today)
How does system IOMMU dirty tracking work?
 
Scenarios:

a. If userspace has a mlx5 and a mdev then it does a for loop as above to
   start and read the dirty bitmaps.

b. If userspace has only mlx5 it reads one bitmap

c. If userspace has mlx5 and some other PCI device it can activate
   mlx5 and leave the container alone. Suspend the PCI device early.
   Or directly give up on dirty track

For funneling through the container ioctls.. Humm:
 - Still track per device/container connection if internal/external
   dirty track is supported. Add an new ioctl so userspace can have
   this info for (c)
 - For external dirty track have some ops callbacks for start/stop,
   read bitmap and clear. (So, no migration state flag?)
 - On start make the policy choice if the external/internal will be
   used, then negotiate some uniform tracking size and iterate over
   all externals to call start
 - On read.. to avoid overheads iterate over the internal bitmap
   and read ops on all external bitmaps and or them together
   then copy to user. Just ignore NDMA and rely on userspace to
   do it right?
 - Clear iterates and zeros bitmaps
 - Change logic to not allocate tracking bitmaps if no mdevs
I don't understand what's being described here, I'm glad an attempt is
being made to see what this might look like with the current interface,
but at the same time the outline seems biased towards a complicated
portrayal.

It's a relatively simple progression, with no outside information the
container exposes all mapped pages as dirty.  Individual devices in the
container can expose themselves as enlightened in various ways,
initially we mark devices that pin pages as enlightened and when all
devices in the container are enlightened we can reduce the reported
bitmap.  As devices add their own dirty tracking we can add interfaces
to allow devices to register their status, mark dirty pages, and
potentially poll devices to update their dirty pages against ranges
requested by the user.  A device that's suspended, for example in PCI
D3 state, might mark itself as becoming enlightened and report no page
dirtying, a device ioctl might allow a more explicit instruction for
the same.  As IOMMU dirty page tracking comes online, the IOMMU can
essentially enlighten all the devices attached to the context.
 
a. works the same, kernel turns on both trackers
b. works the same, kernel turns on only mlx5
c. Hum. We have to disable the 'no tracker report everything as
   dirty' feature somehow so we can access only the mlx5 tracker
   without forcing evey page seen as dirty. Needs a details
Doesn't seem as complicated in my rendition.
 
And figure out how this relates to the ongoing iommufd project (which,
BTW, we have now invested a lot in too).

I'm not as convinced as you are that the 2nd is obviously better, and
on principle I don't like avoidable policy choices baked in the
kernel.
Userspace has the same degree of policy decision in my version, what's
missing from my version is userspace visibility to the per device dirty
footprint and the newly mentioned desire for per-device page granularity.
 
And I don't see that it makes userspace really much simpler anyhow. On
the other hand it looks like a lot more kernel work..
There's kernel work either way.
quoted
OTOH, aggregating these features in the IOMMU reduces both overhead
of per device bitmaps and user operations to create their own
consolidated view.  
I don't understand this, funneling will not reduce overhead, at best
with some work we can almost break even by not allocating the SW
bitmaps.
The moment we have more than one device that requires a bitmap, where
the device doesn't have the visibility of the extent of the bitmap, we
introduce both space and time overhead versus a shared bitmap that can
be pre-allocated.

What's being asked for here is a change from the current plan of
record, that entails work and justification, including tangible
benefits versus a fair exploration of how the same might work in the
current design.

Anyway, as Connie mentioned off-list, we're deep into a comment thread
that's becoming more and more tangential to the original patch, I'm not
sure how many relevant contributors are still following this, and it's
probably best to propose a change in course to dirty bitmap tracking in
a new thread.  Thanks,

Alex
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help