Thread (49 messages) 49 messages, 6 authors, 2021-01-19

Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors

From: Leon Romanovsky <leon@kernel.org>
Date: 2021-01-16 08:22:34
Also in: linux-pci, linux-rdma

On Fri, Jan 15, 2021 at 05:48:59PM -0800, Alexander Duyck wrote:
On Fri, Jan 15, 2021 at 7:53 AM Leon Romanovsky [off-list ref] wrote:
quoted
On Fri, Jan 15, 2021 at 10:06:19AM -0400, Jason Gunthorpe wrote:
quoted
On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote:
quoted
That said, it only works at the driver level. So if the firmware is
the one that is having to do this it also occured to me that if this
update happened on FLR that would probably be preferred.
FLR is not free, I'd prefer not to require it just for some
philosophical reason.
quoted
Since the mlx5 already supports devlink I don't see any reason why the
driver couldn't be extended to also support the devlink resource
interface and apply it to interrupts.
So you are OK with the PF changing the VF as long as it is devlink not
sysfs? Seems rather arbitary?

Leon knows best, but if I recall devlink becomes wonky when the VF
driver doesn't provide a devlink instance. How does it do reload of a
VF then?

I think you end up with essentially the same logic as presented here
with sysfs.
The reasons why I decided to go with sysfs are:
1. This MSI-X table size change is applicable to ALL devices in the world,
and not only netdev.
In the PCI world MSI-X table size is a read only value. That is why I
am pushing back on this as a PCI interface.
And it stays read-only.
quoted
2. This is purely PCI field and apply equally with same logic to all
subsystems and not to netdev only.
Again, calling this "purely PCI" is the sort of wording that has me
concerned. I would prefer it if we avoid that wording. There is much
more to this than just modifying the table size field. The firmware is
having to shift resources between devices and this potentially has an
effect on the entire part, not just one VF.
It is internal to HW implementation, dumb device can solve it differently.
quoted
3. The sysfs interface is the standard way of configuring PCI/core, not
devlink.
This isn't PCI core that is being configured. It is the firmware for
the device. You are working with resources that are shared between
multiple functions.
I'm ensuring that "lspci -vv .." will work correctly after such change.
It is PCI core responsibility.
quoted
4. This is how orchestration software provisioning VFs already. It fits
real world usage of SR-IOV, not the artificial one that is proposed during
the discussion.
What do you mean this is how they are doing it already? Do you have
something out-of-tree and that is why you are fighting to keep the
sysfs? If so that isn't a valid argument.
I have Kubernetes and OpenStack, indeed they are not part of the kernel tree.
They already use sriov_driver_autoprobe sysfs knob to disable autobind
before even starting. They configure MACs and bind VFs through sysfs/netlink
already. For them, the read/write of sysfs that is going to be bound to
the already created VM with known CPU properties, fits perfectly.
quoted
So the idea to use devlink just because mlx5 supports it, sound really
wrong to me. If it was other driver from another subsystem without
devlink support, the request to use devlink won't never come.

Thanks
I am suggesting the devlink resources interface because it would be a
VERY good fit for something like this. By the definition of it:
``devlink`` provides the ability for drivers to register resources, which
can allow administrators to see the device restrictions for a given
resource, as well as how much of the given resource is currently
in use. Additionally, these resources can optionally have configurable size.
This could enable the administrator to limit the number of resources that
are used.
It is not resource, but HW objects. The devlink doesn't even see the VFs
as long as they are not bound to the drivers.

This is an example:

[root@vm ~]# echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_drivers_autoprobe
[root@vm ~]# echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
[ 2370.579711] mlx5_core 0000:01:00.0: E-Switch: Disable: mode(LEGACY), nvfs(2), active vports(3)
[root@vm ~]# echo 2 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
[ 2377.663666] mlx5_core 0000:01:00.0: E-Switch: Enable: mode(LEGACY), nvfs(2), active vports(3)
[ 2377.777010] pci 0000:01:00.1: [15b3:101c] type 00 class 0x020000
[ 2377.784903] pci 0000:01:00.2: [15b3:101c] type 00 class 0x020000
[root@vm ~]# devlink dev
pci/0000:01:00.0
[root@vm ~]# lspci |grep nox
01:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
01:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
01:00.2 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]

So despite us having 2 VFs ready to be given to VMs, administrator doesn't
see them as devices.
Even looking over the example usage I don't see there being much to
prevent you from applying it to this issue. In addition it has the
idea of handling changes that cannot be immediately applied already
included. Your current solution doesn't have a good way of handling
that and instead just aborts with an error.
Yes, because it is HW resource that should be applied immediately to
make sure that it is honored, before it is committed to the users.

It is very tempting to use devlink everywhere, but it is really wrong
tool for this scenario.

Thanks
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help