Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
From: Leon Romanovsky <leon@kernel.org>
Date: 2021-02-05 17:39:57
Also in:
linux-pci, linux-rdma
On Thu, Feb 04, 2021 at 03:12:12PM -0600, Bjorn Helgaas wrote:
On Thu, Feb 04, 2021 at 05:50:48PM +0200, Leon Romanovsky wrote:quoted
On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote:quoted
On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote:quoted
On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote:quoted
On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:quoted
From: Leon Romanovsky <leonro@nvidia.com> Extend PCI sysfs interface with a new callback that allows configure the number of MSI-X vectors for specific SR-IO VF. This is needed to optimize the performance of newly bound devices by allocating the number of vectors based on the administrator knowledge of targeted VM.I'm reading between the lines here, but IIUC the point is that you have a PF that supports a finite number of MSI-X vectors for use by all the VFs, and this interface is to control the distribution of those MSI-X vectors among the VFs.quoted
quoted
This commit log should describe how *this* device manages this allocation and how the PF Table Size and the VF Table Sizes are related. Per PCIe, there is no necessary connection between them.There is no connection in mlx5 devices either. PF is used as a vehicle to access VF that doesn't have driver yet. From "table size" perspective they completely independent, because PF already probed by driver and it is already too late to change it. So PF table size is static and can be changed through FW utility only.This is where description of the device would be useful. The fact that you need "sriov_vf_total_msix" to advertise how many vectors are available and "sriov_vf_msix_count" to influence how they are distributed across the VFs suggests that these Table Sizes are not completely independent. Can a VF have a bigger Table Size than the PF does? Can all the VF Table Sizes added together be bigger than the PF Table Size? If VF A has a larger Table Size, does that mean VF B must have a smaller Table Size?
VFs are completely independent devices and their table size can be bigger than PF. FW has two pools, one for PF and another for all VFs. In real world scenarios, every VF will have more MSI-X vectors than PF, which will be distributed by orchestration software. In theory, users can assign almost all vectors to one VF and leave others depleted. In practice, it is not possible, we ensure that all VFs start with some sensible number of vectors and FW protects with check of max vector size write.
Obviously I do not understand the details about how this device works. It would be helpful to have those details here. Here's the sequence as I understand it: 1) PF driver binds to PF 2) PF driver enables VFs 3) PF driver creates /sys/.../<PF>/sriov_vf_total_msix 4) PF driver creates /sys/.../<VFn>/sriov_vf_msix_count for each VF 5) Management app reads sriov_vf_total_msix, writes sriov_vf_msix_count 6) VF driver binds to VF 7) VF reads MSI-X Message Control (Table Size) Is it true that "lspci VF" at 4.1 and "lspci VF" at 5.1 may read different Table Sizes? That would be a little weird.
Yes, this is the flow. I think differently from you and think this is actual good thing that user writes new msix count and it is shown immediately.
I'm also a little concerned about doing 2 before 3 & 4. That works for mlx5 because implements the Table Size adjustment in a way that works *after* the VFs have been enabled.
It is not related to mlx5, but to the PCI spec that requires us to create all VFs at the same time. Before enabling VFs, they don't exist.
But it seems conceivable that a device could implement vector distribution in a way that would require the VF Table Sizes to be fixed *before* enabling VFs. That would be nice in the sense that the VFs would be created "fully formed" and the VF Table Size would be completely read-only as documented.
It is not how SR-IOV is used in real world. Cloud providers create many VFs but don't assign those VFs yet. They do it after customer request VM with specific properties (number of CPUs) which means number of MSI-X vectors in our case. All this is done when other VFs already in use and we can't destroy and recreate them at that point of time.
The other knob idea you mentioned at [2]: echo "0000:01:00.2 123" > sriov_vf_msix_count
This knob doesn't always work if you have many devices and it is nightmare to guess "new" VF BDF before it is claimed. As an example on my machine with two devices, VFs are created differently: [root@c ~]$ lspci |grep nox 08:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] 08:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5] [root@c ~]# echo 2 > /sys/bus/pci/devices/0000\:08\:00.1/sriov_numvfs [root@c ~]# lspci |grep nox 08:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] 08:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5] 08:01.2 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] 08:01.3 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] [root@c ~]# echo 0 > /sys/bus/pci/devices/0000\:08\:00.1/sriov_numvfs [root@c ~]# echo 2 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs [root@c ~]# lspci |grep nox 08:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] 08:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5] 08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] 08:00.3 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
would have the advantage of working for both cases. That's definitely more complicated, but at the same time, I would hate to carve a sysfs interface into stone if it might not work for other devices.
As you can see above, it is not really solves pre-creation configuration flow. For such flows, probably devlink is the best choice.
quoted
quoted
quoted
quoted
quoted
+What: /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix +Date: January 2021 +Contact: Leon Romanovsky [off-list ref] +Description: + This file is associated with the SR-IOV PFs. + It returns a total number of possible to configure MSI-X + vectors on the enabled VFs. + + The values returned are: + * > 0 - this will be total number possible to consume by VFs, + * = 0 - feature is not supportedquoted
Not sure the "= 0" description is necessary here. If the value returned is the number of MSI-X vectors available for assignment to VFs, "0" is a perfectly legitimate value. It just means there are none. It doesn't need to be described separately.I wanted to help users and remove ambiguity. For example, mlx5 drivers will always implement proper .set_...() callbacks but for some devices without needed FW support, the value will be 0. Instead of misleading users with wrong promise that feature supported but doesn't have available vectors, I decided to be more clear. For the users, 0 means, don't try, it is not working.Oh, you mean "feature is not supported by the FIRMWARE on some mlx5 devices"? I totally missed that; I thought you meant "not supported by the PF driver." Why not have the PF driver detect that the firmware doesn't support the feature and just not expose the sysfs files at all in that case?
I can do it and will do it, but this behavior will be possible because mlx5 is flexible enough. I imagine that other device won't be so flexible, for example they will decide to "close" this feature because of other SW limitation.
quoted
quoted
If we put them in a new "vfs_overlay" directory, it seems like overkill to repeat the "vf" part, but I'm hoping the new files can end up next to these existing files. In that case, I think it makes sense to include "sriov". And it probably does make sense to include "vf" as well.I put everything in folder to group any possible future extensions. Those extensions are applicable to SR-IOV VFs only, IMHO they deserve separate folder.I'm not convinced (yet) that the possibility of future extensions is enough justification for adding the "vfs_overlay" directory. It really complicates the code flow -- if we skipped the new directory, I'm pretty sure we could make .is_visible() work, which would be a major simplification.
Unfortunately not, is_visible() is not dynamic and called when device creates sysfs and it doesn't rescan it or refresh them. It means that all files from vfs_overlay folder will exist even driver is removed from PF. Also sysfs is created before driver is probed.
And there's quite a bit of value in the new files being right next to the existing sriov_* files.
I have no problem to drop folder, just need clear request, should I.
quoted
quoted
quoted
quoted
quoted
+void pci_disable_vfs_overlay(struct pci_dev *dev) +{ + struct pci_dev *virtfn; + int id; + + if (!dev->is_physfn || !dev->sriov->num_VFs) + return; + + id = dev->sriov->num_VFs; + while (id--) { + virtfn = pci_get_domain_bus_and_slot( + pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id), + pci_iov_virtfn_devfn(dev, id)); + + if (!virtfn) + continue; + + sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group); + } + sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); +} +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay);I'm not convinced all this sysfs wrangling is necessary. If it is, add a hint in a comment about why this is special and can't use something like sriov_dev_attr_group.This makes the overlay to be PF-driven. Alexander insisted on this flow.If you're referring to [1], I think "insisted on this flow" might be an overstatement of what Alex was looking for. IIUC Alex wants the sysfs files to be visible only when they're useful, i.e., when a driver implements ->sriov_set_msix_vec_count().It is only one side of the request, the sysfs files shouldn't be visible if PF driver was removed and visible again when its probed again.I can't parse this, but it's probably related to the question below.quoted
quoted
That seems reasonable and also seems like something a smarter .is_visible() function could accomplish without having drivers call pci_enable_vfs_overlay(), e.g., maybe some variation of this: static umode_t sriov_vf_attrs_are_visible(...) { if (!pdev->msix_cap || dev_is_pf(dev)) return 0; pf = pci_physfn(pdev); if (pf->driver && pf->driver->sriov_set_msix_vec_count) return a->mode; return 0; }It doesn't work with the following flow: 1. load driver 2. disable autoprobe 3. echo to sriov_numvfs .... <--- you have this sriov_vf_attrs_are_visible() created 4. unload driver .... <--- sysfs still exists despite not having PF driver.I missed your point here, sorry. After unloading the PF driver, "pf->driver" in the sketch above will be NULL, so the VF sysfs file would not be visible. Right? Maybe it has to do with autoprobe? I didn't catch what the significance of disabling autoprobe was.
No, is_visible() is called on file creation only, see fs/sysfs/group.c: create_files(). After you remove driver, there is no sysfs file refresh.
quoted
quoted
quoted
quoted
quoted
+void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) +{ + if (!dev->is_physfn) + return; + + dev->sriov->vf_total_msix = count;The PCI core doesn't use vf_total_msix at all. The driver, e.g., mlx5, calls this, and all the PCI core does is hang onto the value and expose it via sysfs. I think I'd rather have a callback in struct pci_driver and let the driver supply the value when needed. I.e., sriov_vf_total_msix_show() would call the callback instead of looking at pdev->sriov->vf_total_msix.It will cause to unnecessary locking to ensure that driver doesn't vanish during sysfs read. I can change, but don't think that it is right decision.Doesn't sysfs already ensure the driver can't vanish while we're executing a DEVICE_ATTR accessor?It is not, you can see it by adding device_lock_held() check in any .show attribute. See drivers/base/core.c: dev_attr_show(), it doesn't do much. This is why pci_vf_set_msix_vec_count() has double lock.Aahh, right, I learned something today, thanks! There are only a few PCI sysfs attributes that reference the driver, and they do their own locking. I do think vf_total_msix is a bit of driver state related to implementing this functionality and doesn't need to be in the PCI core. I think device locking is acceptable; it's very similar to what is done in sriov_numvfs_store(). Doing the locking and calling a driver callback makes it obvious that vf_total_msix is part of this PF driver-specific functionality, not a generic part of the PCI core. So let's give it a try. If it turns out to be terrible, we can revisit it.
No problem.
quoted
quoted
Also, pci_vf_set_msix_vec_count() is in pci/msi.c, but AFAICT there's no actual *reason* for it to be there other than the fact that it has "msix" in the name. It uses no MSI data structures. Maybe it could be folded into sriov_vf_msix_count_store(), which would make the analysis even easier.I put _set_ command near _get_ command, but I can move it to iov.cYou mean you put pci_vf_set_msix_vec_count() near pci_msix_vec_count()? That's *true*, but they are not analogues, and one is not setting the value returned by the other. pci_vf_set_msix_vec_count() is a completely magical thing that uses a device-specific mechanism on a PF that happens to change what pci_msix_vec_count() on a VF will return later. I think this is more related to SR-IOV than it is to MSI.
I will do.
Bjornquoted
[2] https://lore.kernel.org/linux-pci/20210118132800.GA4835@unreal/ (local)