Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table... | netdev

Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs

From: Leon Romanovsky <leon@kernel.org>
Date: 2021-02-06 12:43:50
Also in: linux-pci, linux-rdma

On Fri, Feb 05, 2021 at 04:57:03PM -0600, Bjorn Helgaas wrote:

On Fri, Feb 05, 2021 at 07:35:47PM +0200, Leon Romanovsky wrote:

quoted

On Thu, Feb 04, 2021 at 03:12:12PM -0600, Bjorn Helgaas wrote:

quoted

On Thu, Feb 04, 2021 at 05:50:48PM +0200, Leon Romanovsky wrote:

quoted

On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote:

quoted

On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote:

quoted

On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote:

quoted

On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:

quoted

From: Leon Romanovsky <leonro@nvidia.com>

Extend PCI sysfs interface with a new callback that allows
configure the number of MSI-X vectors for specific SR-IO VF.
This is needed to optimize the performance of newly bound
devices by allocating the number of vectors based on the
administrator knowledge of targeted VM.

I'm reading between the lines here, but IIUC the point is that you
have a PF that supports a finite number of MSI-X vectors for use
by all the VFs, and this interface is to control the distribution
of those MSI-X vectors among the VFs.

quoted

This commit log should describe how *this* device manages this
allocation and how the PF Table Size and the VF Table Sizes are
related.  Per PCIe, there is no necessary connection between them.

There is no connection in mlx5 devices either. PF is used as a vehicle
to access VF that doesn't have driver yet. From "table size" perspective
they completely independent, because PF already probed by driver and
it is already too late to change it.

So PF table size is static and can be changed through FW utility only.

This is where description of the device would be useful.

The fact that you need "sriov_vf_total_msix" to advertise how many
vectors are available and "sriov_vf_msix_count" to influence how they
are distributed across the VFs suggests that these Table Sizes are not
completely independent.

Can a VF have a bigger Table Size than the PF does?  Can all the VF
Table Sizes added together be bigger than the PF Table Size?  If VF A
has a larger Table Size, does that mean VF B must have a smaller Table
Size?

VFs are completely independent devices and their table size can be
bigger than PF. FW has two pools, one for PF and another for all VFs.
In real world scenarios, every VF will have more MSI-X vectors than PF,
which will be distributed by orchestration software.

Well, if the sum of all the VF Table Sizes cannot exceed the size of
the FW pool for VFs, I would say the VFs are not completely
independent.  Increasing the Table Size of one VF reduces it for other
VFs.

This is an essential detail because it's the whole reason behind this
interface, so sketching this out in the commit log will make this much
easier to understand.

I think that it is already written, but will recheck.

quoted

Here's the sequence as I understand it:

  1) PF driver binds to PF
  2) PF driver enables VFs
  3) PF driver creates /sys/.../<PF>/sriov_vf_total_msix
  4) PF driver creates /sys/.../<VFn>/sriov_vf_msix_count for each VF
  5) Management app reads sriov_vf_total_msix, writes sriov_vf_msix_count
  6) VF driver binds to VF
  7) VF reads MSI-X Message Control (Table Size)

Is it true that "lspci VF" at 4.1 and "lspci VF" at 5.1 may read
different Table Sizes?  That would be a little weird.

Yes, this is the flow. I think differently from you and think this
is actual good thing that user writes new msix count and it is shown
immediately.

Only weird because per spec Table Size is read-only and in this
scenario it changes, so it may be surprising, but probably not a huge
deal.

quoted

I'm also a little concerned about doing 2 before 3 & 4.  That works
for mlx5 because implements the Table Size adjustment in a way that
works *after* the VFs have been enabled.

It is not related to mlx5, but to the PCI spec that requires us to
create all VFs at the same time. Before enabling VFs, they don't
exist.

Yes.  I can imagine a PF driver that collects characteristics for the
desired VFs before enabling them, sort of like we already collect the
*number* of VFs.  But I think your argument for dynamic changes after
creation below is pretty compelling.

IMHO, the best tool for such pre-configured changes will be devlink.

quoted

But it seems conceivable that a device could implement vector
distribution in a way that would require the VF Table Sizes to be
fixed *before* enabling VFs.  That would be nice in the sense that the
VFs would be created "fully formed" and the VF Table Size would be
completely read-only as documented.

It is not how SR-IOV is used in real world. Cloud providers create many
VFs but don't assign those VFs yet. They do it after customer request
VM with specific properties (number of CPUs) which means number of MSI-X
vectors in our case.

All this is done when other VFs already in use and we can't destroy and
recreate them at that point of time.

Makes sense, thanks for this insight.  The need to change the MSI-X
Table Size dynamically while other VFs are in use would also be useful
in the commit log because it really helps motivate this design.

It is already written, but I will add more info.

quoted

+What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
+Date:		January 2021
+Contact:	Leon Romanovsky [off-list ref]
+Description:
+		This file is associated with the SR-IOV PFs.
+		It returns a total number of possible to configure MSI-X
+		vectors on the enabled VFs.
+
+		The values returned are:
+		 * > 0 - this will be total number possible to consume by VFs,
+		 * = 0 - feature is not supported

quoted

Not sure the "= 0" description is necessary here.  If the value
returned is the number of MSI-X vectors available for assignment to
VFs, "0" is a perfectly legitimate value.  It just means there are
none.  It doesn't need to be described separately.

I wanted to help users and remove ambiguity. For example, mlx5 drivers
will always implement proper .set_...() callbacks but for some devices
without needed FW support, the value will be 0. Instead of misleading
users with wrong promise that feature supported but doesn't have
available vectors, I decided to be more clear. For the users, 0 means, don't
try, it is not working.

Oh, you mean "feature is not supported by the FIRMWARE on some mlx5
devices"?  I totally missed that; I thought you meant "not supported
by the PF driver."  Why not have the PF driver detect that the
firmware doesn't support the feature and just not expose the sysfs
files at all in that case?

I can do it and will do it, but this behavior will be possible because
mlx5 is flexible enough. I imagine that other device won't be so flexible,
for example they will decide to "close" this feature because of other
SW limitation.

What about something like this?

  This file contains the total number of MSI-X vectors available for
  assignment to all VFs associated with this PF.  It may be zero if
  the device doesn't support this functionality.

Sounds good.

Side question: How does user-space use this?  This file contains a
constant, and there's no interface to learn how many vectors are still
available in the pool, right?

Right, it is easy for the kernel implementation and easy for the users.
They don't need from the kernel to see exact utilized number.

Every VF has vectors assigned from the beginning and there is no parallel
race between kernel and user space to claim resources in our flow, so at the
point when orchestration will have a chance to run the system will be in steady
state.

Access to the hypervisor with ability to write to sysfs files requires
root permissions and management software already counts number of users,
their CPUs and number of vectors.

I guess the user just has to manage the values written to
.../VF<n>/sriov_vf_msix_count so the sum of all of them never exceeds
sriov_vf_total_msix?  If that user app crashes, I guess it can
reconstruct this state by reading all the
.../VF<n>/sriov_vf_msix_count files?

Yes, if orchestration software crashes on hypervisor, it will simply
reiterate all VFs from the beginning.

quoted

If we put them in a new "vfs_overlay" directory, it seems like
overkill to repeat the "vf" part, but I'm hoping the new files can end
up next to these existing files.  In that case, I think it makes sense
to include "sriov".  And it probably does make sense to include "vf"
as well.

I put everything in folder to group any possible future extensions.
Those extensions are applicable to SR-IOV VFs only, IMHO they deserve
separate folder.

I'm not convinced (yet) that the possibility of future extensions is
enough justification for adding the "vfs_overlay" directory.  It
really complicates the code flow -- if we skipped the new directory,
I'm pretty sure we could make .is_visible() work, which would be a
major simplification.

Unfortunately not, is_visible() is not dynamic and called when device
creates sysfs and it doesn't rescan it or refresh them. It means that
all files from vfs_overlay folder will exist even driver is removed
from PF.

Also sysfs is created before driver is probed.

Ah, both excellent points, that makes this much clearer to me, thank
you!

quoted

And there's quite a bit of value in the new files being right next to
the existing sriov_* files.

I have no problem to drop folder, just need clear request, should I.

I think we should drop the folder.  I missed the previous discussion
though, so if somebody has a good objection to dropping it, let me
know.

Just to be clear, dropping "vfs_overlay" folder won't remove any
complexity with pci_enable_vfs_overlay()/pci_disable_vfs_overlay(), but
will cause to simple change of the name attribute in the attribute_group.

Dropping the folder has the potential advantage that we *could* decide
to expose these files always, and just have them be non-functional if
the device doesn't support assignment or the PF driver isn't bound.
Then I think we could use static sysfs attributes without
pci_enable_vfs_overlay().

Such static initialization was in v0..v3 versions, but the request was to have
new files to be PF-driven and it requires enable/disable callbacks.

They need to allow the flow when PF driver is bounded after VF devices
already created. For example this flow:
1. modprobe mlx5_core
2. echo 5 > .../sriov_numvf
3. rmmod mlx5_core
...
4. modprobe mlx5_core <- need to reattach to already existing VFs.

Thanks

Bjorn

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help