Thread (49 messages) 49 messages, 6 authors, 2021-01-19

Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors

From: Leon Romanovsky <leon@kernel.org>
Date: 2021-01-14 17:27:27
Also in: linux-pci, linux-rdma

On Thu, Jan 14, 2021 at 08:40:14AM -0800, Alexander Duyck wrote:
On Wed, Jan 13, 2021 at 10:50 PM Leon Romanovsky [off-list ref] wrote:
quoted
On Wed, Jan 13, 2021 at 02:44:45PM -0800, Alexander Duyck wrote:
quoted
On Tue, Jan 12, 2021 at 10:19 PM Leon Romanovsky [off-list ref] wrote:
quoted
On Tue, Jan 12, 2021 at 01:34:50PM -0800, Alexander Duyck wrote:
quoted
On Mon, Jan 11, 2021 at 10:56 PM Leon Romanovsky [off-list ref] wrote:
quoted
On Mon, Jan 11, 2021 at 11:30:39AM -0800, Alexander Duyck wrote:
quoted
On Sun, Jan 10, 2021 at 7:10 AM Leon Romanovsky [off-list ref] wrote:
quoted
From: Leon Romanovsky <leonro@nvidia.com>

Some SR-IOV capable devices provide an ability to configure specific
number of MSI-X vectors on their VF prior driver is probed on that VF.

In order to make management easy, provide new read-only sysfs file that
returns a total number of possible to configure MSI-X vectors.

cat /sys/bus/pci/devices/.../sriov_vf_total_msix
  = 0 - feature is not supported
  > 0 - total number of MSI-X vectors to consume by the VFs

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 Documentation/ABI/testing/sysfs-bus-pci | 14 +++++++++++
 drivers/pci/iov.c                       | 31 +++++++++++++++++++++++++
 drivers/pci/pci.h                       |  3 +++
 include/linux/pci.h                     |  2 ++
 4 files changed, 50 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 05e26e5da54e..64e9b700acc9 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -395,3 +395,17 @@ Description:
                The file is writable if the PF is bound to a driver that
                supports the ->sriov_set_msix_vec_count() callback and there
                is no driver bound to the VF.
+
+What:          /sys/bus/pci/devices/.../sriov_vf_total_msix
In this case I would drop the "vf" and just go with sriov_total_msix
since now you are referring to a global value instead of a per VF
value.
This field indicates the amount of MSI-X available for VFs, it doesn't
include PFs. The missing "_vf_" will mislead users who will believe that
it is all MSI-X vectors available for this device. They will need to take
into consideration amount of PF MSI-X in order to calculate the VF distribution.

So I would leave "_vf_" here.
The problem is you aren't indicating how many are available for an
individual VF though, you are indicating how many are available for
use by SR-IOV to give to the VFs. The fact that you are dealing with a
pool makes things confusing in my opinion. For example sriov_vf_device
describes the device ID that will be given to each VF.
sriov_vf_device is different and is implemented accordingly to the PCI
spec, 9.3.3.11 VF Device ID (Offset 1Ah)
"This field contains the Device ID that should be presented for every VF
to the SI."

It is one ID for all VFs.
Yes, but that is what I am getting at. It is also what the device
configuration will be for one VF. So when I read sriov_vf_total_msix
it reads as the total for a single VF, not all of the the VFs. That is
why I think dropping the "vf_" part of the name would make sense, as
what you are describing is the total number of MSI-X vectors for use
by SR-IOV VFs.
I can change to anything as long as new name will give clear indication
that this total number is for VFs and doesn't include SR-IOV PF MSI-X.
It is interesting that you make that distinction.

So in the case of the Intel hardware we had one pool of MSI-X
interrupts that was available for the entire port, both PF and VF.
When we enabled SR-IOV we had to repartition that pool in order to
assign interrupts to devices. So it sounds like in your case you don't
do that and instead the PF is static and the VFs are the only piece
that is flexible. Do I have that correct?
It is partially correct. The mlx5 devices have ability to change MSI-X
vectors of PF too, but to do so, you will need driver reload and much
more complex user interface. So we (SW) leave it as static.
quoted
quoted
quoted
quoted
quoted
quoted
quoted
+Date:          January 2021
+Contact:       Leon Romanovsky [off-list ref]
+Description:
+               This file is associated with the SR-IOV PFs.
+               It returns a total number of possible to configure MSI-X
+               vectors on the enabled VFs.
+
+               The values returned are:
+                * > 0 - this will be total number possible to consume by VFs,
+                * = 0 - feature is not supported
+
+               If no SR-IOV VFs are enabled, this value will return 0.
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 42c0df4158d1..0a6ddf3230fd 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -394,12 +394,22 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev,
        return count;
 }

+static ssize_t sriov_vf_total_msix_show(struct device *dev,
+                                       struct device_attribute *attr,
+                                       char *buf)
+{
+       struct pci_dev *pdev = to_pci_dev(dev);
+
+       return sprintf(buf, "%d\n", pdev->sriov->vf_total_msix);
+}
+
You display it as a signed value, but unsigned values are not
supported, correct?
Right, I made it similar to the vf_msix_set. I can change.
quoted
quoted
 static DEVICE_ATTR_RO(sriov_totalvfs);
 static DEVICE_ATTR_RW(sriov_numvfs);
 static DEVICE_ATTR_RO(sriov_offset);
 static DEVICE_ATTR_RO(sriov_stride);
 static DEVICE_ATTR_RO(sriov_vf_device);
 static DEVICE_ATTR_RW(sriov_drivers_autoprobe);
+static DEVICE_ATTR_RO(sriov_vf_total_msix);

 static struct attribute *sriov_dev_attrs[] = {
        &dev_attr_sriov_totalvfs.attr,
@@ -408,6 +418,7 @@ static struct attribute *sriov_dev_attrs[] = {
        &dev_attr_sriov_stride.attr,
        &dev_attr_sriov_vf_device.attr,
        &dev_attr_sriov_drivers_autoprobe.attr,
+       &dev_attr_sriov_vf_total_msix.attr,
        NULL,
 };
@@ -658,6 +669,7 @@ static void sriov_disable(struct pci_dev *dev)
                sysfs_remove_link(&dev->dev.kobj, "dep_link");

        iov->num_VFs = 0;
+       iov->vf_total_msix = 0;
        pci_iov_set_numvfs(dev, 0);
 }
@@ -1116,6 +1128,25 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);

+/**
+ * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs
+ * @dev: the PCI PF device
+ * @numb: the total number of MSI-X vector to consume by the VFs
+ *
+ * Sets the number of MSI-X vectors that is possible to consume by the VFs.
+ * This interface is complimentary part of the pci_set_msix_vec_count()
+ * that will be used to configure the required number on the VF.
+ */
+void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb)
+{
+       if (!dev->is_physfn || !dev->driver ||
+           !dev->driver->sriov_set_msix_vec_count)
+               return;
+
+       dev->sriov->vf_total_msix = numb;
+}
+EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix);
+
This seems broken. What validation is being done on the numb value?
You pass it as int, and your documentation all refers to tests for >=
0, but isn't a signed input a possibility as well? Also "numb" doesn't
make for a good abbreviation as it is already a word of its own. It
might make more sense to use count or something like that rather than
trying to abbreviate number.
"Broken" is a nice word to describe misunderstanding.
Would you prefer "lacking input validation".

I see all this code in there checking for is_physfn and driver and
sriov_set_msix_vec_count before allowing the setting of vf_total_msix.
It just seems like a lot of validation is taking place on the wrong
things if you are just going to be setting a value reporting the total
number of MSI-X vectors in use for SR-IOV.
All those checks are in place to ensure that we are not overwriting the
default value, which is 0.
Okay, so what you really have is surplus interrupts that you are
wanting to give out to VF devices. So when we indicate 0 here as the
default it really means we have no additional interrupts to give out.
Am I understanding that correctly?
The vf_total_msix is static value and shouldn't be recalculated after
every MSI-X vector number change. So 0 means that driver doesn't support
at all this feature. The operator is responsible to make proper assignment
calculations, because he is already doing it for the CPUs and netdev queues.
Honestly that makes things even uglier. So basically this is a feature
where if it isn't supported it will make it look like the SR-IOV
device doesn't support MSI-X. I realize it is just the way it is
worded but it isn't very pretty.
I'm not native English speaker, we can work together later on the
Documentation to make it more clear.
quoted
quoted
The problem is this is very vendor specific so I am doing my best to
understand it as it is different then the other NICs I have worked
with.
There is nothing vendor specific here. There are two types of devices:
1. Support this feature. - The vf_total_msix will be greater than 0 for them
and their FW will do sanity checks when user overwrites their default number
that they sat in the VF creation stage.
2. Doesn't support this feature - The vf_total_msix will be 0.

It is PCI spec, so those "other NICs" that didn't implement the PCI spec
will stay with option #2. It is not different from current situation.
Where in the spec is this?

I know in the PCI spec it says that the MSI-X table size is read-only
and is not supposed to be written by system software. That is what is
being overwritten right now by your patches that has me concerned.
My patches are not over-writting anything, they are asking from FW to
set more appropriate value. The field stays read-only.

<...>
quoted
I remind you that this feature is applicable to all SR-IOV devices, we have
RDMA, NVMe, crypto, FPGA and netdev VFs. The devlink is not supported
outside of netdev world and implementation there will make this feature
is far from being usable.
To quote the documentation:
"devlink is an API to expose device information and resources not directly
related to any device class, such as chip-wide/switch-ASIC-wide configuration."
There is a great world outside of netdev and it doesn't include devlink. :)
quoted
Right now, the configuration of PCI/core is done through sysfs, so let's
review this feature from PCI/core perspective and not from netdev where
sysfs is not widely used.
The problem is what you are configuring is not necessarily PCI device
specific. You are configuring the firmware which operates at a level
above the PCI device. You just have it manifesting as a PCI behavior
as you are editing a read-only configuration space field.
In our devices, PCI config space is managed by FW and it represents HW.
Also as I mentioned a few times now the approach you are taking
violates the PCIe spec by essentially providing a means of making a
read-only field writable.
AS you said, we see the same picture differently.

Thansk
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help