Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
From: Bjorn Helgaas <helgaas@kernel.org>
Date: 2021-04-01 01:24:35
Also in:
linux-pci, linux-rdma
[+cc Rafael, in case you're interested in the driver core issue here] On Wed, Mar 31, 2021 at 07:08:07AM +0300, Leon Romanovsky wrote:
On Tue, Mar 30, 2021 at 03:41:41PM -0500, Bjorn Helgaas wrote:quoted
On Tue, Mar 30, 2021 at 04:47:16PM -0300, Jason Gunthorpe wrote:quoted
On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote:quoted
On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote:quoted
On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:quoted
I think I misunderstood Greg's subdirectory comment. We already have directories like this:Yes, IIRC, Greg's remark applies if you have to start creating directories with manual kobjects.quoted
and aspm_ctrl_attr_group (for "link") is nicely done with static attributes. So I think we could do something like this: /sys/bus/pci/devices/0000:01:00.0/ # PF directory sriov/ # SR-IOV related stuff vf_total_msix vf_msix_count_BB:DD.F # includes bus/dev/fn of first VF ... vf_msix_count_BB:DD.F # includes bus/dev/fn of last VFIt looks a bit odd that it isn't a subdirectory, but this seems reasonable.Sorry, I missed your point; you'll have to lay it out more explicitly. I did intend that "sriov" *is* a subdirectory of the 0000:01:00.0 directory. The full paths would be: /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_BB:DD.F ...Sorry, I was meaning what you first proposed: /sys/bus/pci/devices/0000:01:00.0/sriov/BB:DD.F/vf_msix_count Which has the extra sub directory to organize the child VFs. Keep in mind there is going to be alot of VFs here, > 1k - so this will be a huge directory.With 0000:01:00.0/sriov/vf_msix_count_BB:DD.F, sriov/ will contain 1 + 1K files ("vf_total_msix" + 1 per VF). With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain 1 file and 1K subdirectories.This is racy by design, in order to add new file and create BB:DD.F directory, the VF will need to do it after or during it's creation. During PF creation it is unknown to PF those BB:DD.F values. The race here is due to the events of PF,VF directory already sent but new directory structure is not ready yet. From code perspective, we will need to add something similar to pci_iov_sysfs_link() with the code that you didn't like in previous variants (the one that messes with sysfs_create_file API). It looks not good for large SR-IOV systems with >1K VFs with gazillion subdirectories inside PF, while the more natural is to see them in VF. So I'm completely puzzled why you want to do these files on PF and not on VF as v0, v7 and v8 proposed.
On both mlx5 and NVMe, the "assign vectors to VF" functionality is
implemented by the PF, so I think it's reasonable to explore the idea
of "associate the vector assignment sysfs file with the PF."
Assume 1K VFs. Either way we have >1K subdirectories of
/sys/devices/pci0000:00/. I think we should avoid an extra
subdirectory level, so I think the choices on the table are:
Associate "vf_msix_count" with the PF:
- /sys/.../<PF>/sriov/vf_total_msix # all on PF
- /sys/.../<PF>/sriov/vf_msix_count_BB:DD.F (1K of these). Greg
says the number of these is not a problem.
- The "vf_total_msix" and "vf_msix_count_*" files are all related
and are grouped together in PF/sriov/.
- The "vf_msix_count_*" files operate directly on the PF. Lock the
PF for serialization, lookup and lock the VF to ensure no VF
driver, call PF driver callback to assign vectors.
- Requires special sysfs code to create/remove "vf_msix_count_*"
files when setting/clearing VF Enable. This code could create
them only when the PF driver actually supports vector assignment.
Unavoidable sysfs/uevent race, see below.
Associate "vf_msix_count" with the VF:
- /sys/.../<PF>/sriov_vf_total_msix # on PF
- /sys/.../<VF>/sriov_vf_msix_count # on each VF
- The "sriov_vf_msix_count" files enter via the VF. Lock the VF to
ensure no VF driver, lookup and lock the PF for serialization,
call PF driver callback to assign vectors.
- Can be done with static sysfs attributes. This means creating
"sriov_vf_msix_count" *always*, even if PF driver doesn't support
vector assignment.
IIUC, putting "vf_msix_count_*" under the PF involves a race. When we
call device_add() for each new VF, it creates the VF sysfs directory
and emits the KOBJ_ADD uevent, but the "vf_msix_count_*" file doesn't
exist yet. It can't be created before device_add() because the sysfs
directory doesn't exist. If we create it after device_add(), the "add
VF" uevent has already been emitted, so userspace may consume it
before "vf_msix_count_*" is created.
sriov_enable
<set VF Enable> <-- VFs created on PCI
sriov_add_vfs
for (i = 0; i < num_vfs; i++) {
pci_iov_add_virtfn
pci_device_add
device_initialize
device_add
device_add_attrs <-- add VF sysfs attrs
kobject_uevent(KOBJ_ADD) <-- emit uevent
<-- add "vf_msix_count_*" sysfs attr
pci_iov_sysfs_link
pci_bus_add_device
pci_create_sysfs_dev_files
device_attach
}
Conceptually, I like having the "vf_total_msix" and "vf_msix_count_*"
files associated directly with the PF. I think that's more natural
because they both operate directly on the PF.
But I don't like the race, and using static attributes seems much
cleaner implementation-wise.
Bjorn