Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table... | netdev

Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs

From: Leon Romanovsky <leon@kernel.org>
Date: 2021-02-05 17:39:57
Also in: linux-pci, linux-rdma

On Thu, Feb 04, 2021 at 03:12:12PM -0600, Bjorn Helgaas wrote:

On Thu, Feb 04, 2021 at 05:50:48PM +0200, Leon Romanovsky wrote:

quoted

On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote:

quoted

On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote:

quoted

On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote:

quoted

On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:

quoted

From: Leon Romanovsky <leonro@nvidia.com>

Extend PCI sysfs interface with a new callback that allows
configure the number of MSI-X vectors for specific SR-IO VF.
This is needed to optimize the performance of newly bound
devices by allocating the number of vectors based on the
administrator knowledge of targeted VM.

I'm reading between the lines here, but IIUC the point is that you
have a PF that supports a finite number of MSI-X vectors for use
by all the VFs, and this interface is to control the distribution
of those MSI-X vectors among the VFs.

quoted

This commit log should describe how *this* device manages this
allocation and how the PF Table Size and the VF Table Sizes are
related.  Per PCIe, there is no necessary connection between them.

There is no connection in mlx5 devices either. PF is used as a vehicle
to access VF that doesn't have driver yet. From "table size" perspective
they completely independent, because PF already probed by driver and
it is already too late to change it.

So PF table size is static and can be changed through FW utility only.

This is where description of the device would be useful.

The fact that you need "sriov_vf_total_msix" to advertise how many
vectors are available and "sriov_vf_msix_count" to influence how they
are distributed across the VFs suggests that these Table Sizes are not
completely independent.

Can a VF have a bigger Table Size than the PF does?  Can all the VF
Table Sizes added together be bigger than the PF Table Size?  If VF A
has a larger Table Size, does that mean VF B must have a smaller Table
Size?

VFs are completely independent devices and their table size can be
bigger than PF. FW has two pools, one for PF and another for all VFs.
In real world scenarios, every VF will have more MSI-X vectors than PF,
which will be distributed by orchestration software.

In theory, users can assign almost all vectors to one VF and leave others
depleted. In practice, it is not possible, we ensure that all VFs start
with some sensible number of vectors and FW protects with check of max
vector size write.

Obviously I do not understand the details about how this device works.
It would be helpful to have those details here.

Here's the sequence as I understand it:

  1) PF driver binds to PF
  2) PF driver enables VFs
  3) PF driver creates /sys/.../<PF>/sriov_vf_total_msix
  4) PF driver creates /sys/.../<VFn>/sriov_vf_msix_count for each VF
  5) Management app reads sriov_vf_total_msix, writes sriov_vf_msix_count
  6) VF driver binds to VF
  7) VF reads MSI-X Message Control (Table Size)

Is it true that "lspci VF" at 4.1 and "lspci VF" at 5.1 may read
different Table Sizes?  That would be a little weird.

Yes, this is the flow. I think differently from you and think this
is actual good thing that user writes new msix count and it is shown
immediately.

I'm also a little concerned about doing 2 before 3 & 4.  That works
for mlx5 because implements the Table Size adjustment in a way that
works *after* the VFs have been enabled.

It is not related to mlx5, but to the PCI spec that requires us to
create all VFs at the same time. Before enabling VFs, they don't
exist.

But it seems conceivable that a device could implement vector
distribution in a way that would require the VF Table Sizes to be
fixed *before* enabling VFs.  That would be nice in the sense that the
VFs would be created "fully formed" and the VF Table Size would be
completely read-only as documented.

It is not how SR-IOV is used in real world. Cloud providers create many
VFs but don't assign those VFs yet. They do it after customer request
VM with specific properties (number of CPUs) which means number of MSI-X
vectors in our case.

All this is done when other VFs already in use and we can't destroy and
recreate them at that point of time.

The other knob idea you mentioned at [2]:

  echo "0000:01:00.2 123" > sriov_vf_msix_count

This knob doesn't always work if you have many devices and it is
nightmare to guess "new" VF BDF before it is claimed.

As an example on my machine with two devices, VFs are created differently:
[root@c ~]$ lspci |grep nox
08:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
08:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
[root@c ~]# echo 2 > /sys/bus/pci/devices/0000\:08\:00.1/sriov_numvfs
[root@c ~]# lspci |grep nox
08:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
08:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
08:01.2 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
08:01.3 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
[root@c ~]# echo 0 > /sys/bus/pci/devices/0000\:08\:00.1/sriov_numvfs
[root@c ~]# echo 2 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
[root@c ~]# lspci |grep nox
08:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
08:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
08:00.3 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]

would have the advantage of working for both cases.  That's definitely
more complicated, but at the same time, I would hate to carve a sysfs
interface into stone if it might not work for other devices.

As you can see above, it is not really solves pre-creation configuration flow.
For such flows, probably devlink is the best choice.

quoted

+What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
+Date:		January 2021
+Contact:	Leon Romanovsky [off-list ref]
+Description:
+		This file is associated with the SR-IOV PFs.
+		It returns a total number of possible to configure MSI-X
+		vectors on the enabled VFs.
+
+		The values returned are:
+		 * > 0 - this will be total number possible to consume by VFs,
+		 * = 0 - feature is not supported

quoted

Not sure the "= 0" description is necessary here.  If the value
returned is the number of MSI-X vectors available for assignment to
VFs, "0" is a perfectly legitimate value.  It just means there are
none.  It doesn't need to be described separately.

I wanted to help users and remove ambiguity. For example, mlx5 drivers
will always implement proper .set_...() callbacks but for some devices
without needed FW support, the value will be 0. Instead of misleading
users with wrong promise that feature supported but doesn't have
available vectors, I decided to be more clear. For the users, 0 means, don't
try, it is not working.

Oh, you mean "feature is not supported by the FIRMWARE on some mlx5
devices"?  I totally missed that; I thought you meant "not supported
by the PF driver."  Why not have the PF driver detect that the
firmware doesn't support the feature and just not expose the sysfs
files at all in that case?

I can do it and will do it, but this behavior will be possible because
mlx5 is flexible enough. I imagine that other device won't be so flexible,
for example they will decide to "close" this feature because of other
SW limitation.

quoted

If we put them in a new "vfs_overlay" directory, it seems like
overkill to repeat the "vf" part, but I'm hoping the new files can end
up next to these existing files.  In that case, I think it makes sense
to include "sriov".  And it probably does make sense to include "vf"
as well.

I put everything in folder to group any possible future extensions.
Those extensions are applicable to SR-IOV VFs only, IMHO they deserve
separate folder.

I'm not convinced (yet) that the possibility of future extensions is
enough justification for adding the "vfs_overlay" directory.  It
really complicates the code flow -- if we skipped the new directory,
I'm pretty sure we could make .is_visible() work, which would be a
major simplification.

Unfortunately not, is_visible() is not dynamic and called when device
creates sysfs and it doesn't rescan it or refresh them. It means that
all files from vfs_overlay folder will exist even driver is removed
from PF.

Also sysfs is created before driver is probed.

And there's quite a bit of value in the new files being right next to
the existing sriov_* files.

I have no problem to drop folder, just need clear request, should I.

quoted

+void pci_disable_vfs_overlay(struct pci_dev *dev)
+{
+	struct pci_dev *virtfn;
+	int id;
+
+	if (!dev->is_physfn || !dev->sriov->num_VFs)
+		return;
+
+	id = dev->sriov->num_VFs;
+	while (id--) {
+		virtfn = pci_get_domain_bus_and_slot(
+			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
+			pci_iov_virtfn_devfn(dev, id));
+
+		if (!virtfn)
+			continue;
+
+		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
+	}
+	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
+}
+EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay);

I'm not convinced all this sysfs wrangling is necessary.  If it is,
add a hint in a comment about why this is special and can't use
something like sriov_dev_attr_group.

This makes the overlay to be PF-driven. Alexander insisted on this flow.

If you're referring to [1], I think "insisted on this flow" might be
an overstatement of what Alex was looking for.  IIUC Alex wants the
sysfs files to be visible only when they're useful, i.e., when a
driver implements ->sriov_set_msix_vec_count().

It is only one side of the request, the sysfs files shouldn't be visible
if PF driver was removed and visible again when its probed again.

I can't parse this, but it's probably related to the question below.

quoted

That seems reasonable and also seems like something a smarter
.is_visible() function could accomplish without having drivers call
pci_enable_vfs_overlay(), e.g., maybe some variation of this:

  static umode_t sriov_vf_attrs_are_visible(...)
  {
    if (!pdev->msix_cap || dev_is_pf(dev))
      return 0;

    pf = pci_physfn(pdev);
    if (pf->driver && pf->driver->sriov_set_msix_vec_count)
      return a->mode;

    return 0;
  }

It doesn't work with the following flow:
1. load driver
2. disable autoprobe
3. echo to sriov_numvfs
.... <--- you have this sriov_vf_attrs_are_visible() created
4. unload driver
.... <--- sysfs still exists despite not having PF driver.

I missed your point here, sorry.  After unloading the PF driver,
"pf->driver" in the sketch above will be NULL, so the VF sysfs file
would not be visible.  Right?  Maybe it has to do with autoprobe?  I
didn't catch what the significance of disabling autoprobe was.

No, is_visible() is called on file creation only, see fs/sysfs/group.c:
create_files(). After you remove driver, there is no sysfs file refresh.

quoted

+void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count)
+{
+	if (!dev->is_physfn)
+		return;
+
+	dev->sriov->vf_total_msix = count;

The PCI core doesn't use vf_total_msix at all.  The driver, e.g.,
mlx5, calls this, and all the PCI core does is hang onto the value and
expose it via sysfs.  I think I'd rather have a callback in struct
pci_driver and let the driver supply the value when needed.  I.e.,
sriov_vf_total_msix_show() would call the callback instead of looking
at pdev->sriov->vf_total_msix.

It will cause to unnecessary locking to ensure that driver doesn't
vanish during sysfs read. I can change, but don't think that it is right
decision.

Doesn't sysfs already ensure the driver can't vanish while we're
executing a DEVICE_ATTR accessor?

It is not, you can see it by adding device_lock_held() check in any
.show attribute. See drivers/base/core.c: dev_attr_show(), it doesn't
do much. This is why pci_vf_set_msix_vec_count() has double lock.

Aahh, right, I learned something today, thanks!  There are only a few
PCI sysfs attributes that reference the driver, and they do their own
locking.

I do think vf_total_msix is a bit of driver state related to
implementing this functionality and doesn't need to be in the PCI
core.  I think device locking is acceptable; it's very similar to what
is done in sriov_numvfs_store().  Doing the locking and calling a
driver callback makes it obvious that vf_total_msix is part of this PF
driver-specific functionality, not a generic part of the PCI core.

So let's give it a try.  If it turns out to be terrible, we can
revisit it.

No problem.

quoted

Also, pci_vf_set_msix_vec_count() is in pci/msi.c, but AFAICT there's
no actual *reason* for it to be there other than the fact that it has
"msix" in the name.  It uses no MSI data structures.  Maybe it could
be folded into sriov_vf_msix_count_store(), which would make the
analysis even easier.

I put _set_ command near _get_ command, but I can move it to iov.c

You mean you put pci_vf_set_msix_vec_count() near
pci_msix_vec_count()?  That's *true*, but they are not analogues, and
one is not setting the value returned by the other.

pci_vf_set_msix_vec_count() is a completely magical thing that uses a
device-specific mechanism on a PF that happens to change what
pci_msix_vec_count() on a VF will return later.  I think this is more
related to SR-IOV than it is to MSI.

I will do.

Bjorn

quoted

[1] https://lore.kernel.org/r/CAKgT0UcJQ3uy6J_CCLizDLfzGL2saa_PjOYH4nK+RQjfmpNA=w@mail.gmail.com (local)

[2] https://lore.kernel.org/linux-pci/20210118132800.GA4835@unreal/ (local)

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help