Re: [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init
From: Mark Bloch <mbloch@nvidia.com>
Date: 2026-05-28 08:15:58
Also in:
linux-doc, linux-rdma, lkml
On 27/05/2026 14:18, Jiri Pirko wrote:
Wed, May 27, 2026 at 09:03:26AM +0200, mbloch@nvidia.com wrote:quoted
On 27/05/2026 8:14, Jiri Pirko wrote:quoted
Tue, May 26, 2026 at 07:13:46PM +0200, mbloch@nvidia.com wrote:quoted
On 26/05/2026 19:23, Jiri Pirko wrote:quoted
Tue, May 26, 2026 at 05:03:57PM +0200, mbloch@nvidia.com wrote:quoted
On 26/05/2026 17:07, Jiri Pirko wrote:quoted
Tue, May 26, 2026 at 11:44:46AM +0200, mbloch@nvidia.com wrote:quoted
On 26/05/2026 10:44, Jiri Pirko wrote:quoted
Thu, May 21, 2026 at 09:24:34AM +0200, tariqt@nvidia.com wrote:quoted
From: Mark Bloch <mbloch@nvidia.com> Apply devlink default eswitch mode for mlx5 devices after successful device initialization while holding the devlink instance lock. At this point the devlink instance is registered and the mlx5 devlink operations are available, so the default eswitch mode can be applied to the matching PCI devlink handle. Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Shay Drori <redacted> Reviewed-by: Moshe Shemesh <redacted> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> --- drivers/net/ethernet/mellanox/mlx5/core/main.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+)diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c index 0c6e4efe38c8..4528097f3d84 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c@@ -1391,6 +1391,21 @@ static void mlx5_unload(struct mlx5_core_dev *dev)mlx5_free_bfreg(dev, &dev->priv.bfreg); } +static void mlx5_devl_apply_default_esw_mode(struct mlx5_core_dev *dev) +{ + struct devlink *devlink = priv_to_devlink(dev); + int err; + + if (!MLX5_ESWITCH_MANAGER(dev)) + return; + + devl_assert_locked(devlink); + err = devl_apply_default_esw_mode(devlink); + if (err) + mlx5_core_warn(dev, "Couldn't apply default eswitch mode, err %d\n", + err); +} + int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev) { bool light_probe = mlx5_dev_is_lightweight(dev);@@ -1437,6 +1452,7 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err); mutex_unlock(&dev->intf_state_mutex); + mlx5_devl_apply_default_esw_mode(dev);I wonder how we can make this work for all. I mean, other driver would silently ignore this command like arg, right? Any idea how to make all drivers follow the arg from very beginning?I have a follow-up series that adds the call to all drivers which support setting eswitch mode. When going over the other drivers, what I found is that the right point to apply the default is driver specific, drivers I have patch for: 46e16c6d9836 net: Apply devlink esw mode defaults ab4f54102ba9 bnxt_en: Apply devlink default eswitch mode during init b48cce1607bb liquidio: Apply devlink default eswitch mode during init 4ea54b0fe04a ice: Apply devlink default eswitch mode during init b7faddaa1c90 octeontx2-af: Apply devlink default eswitch mode during init 74b0c22c47b9 octeontx2-pf: Apply devlink default eswitch mode during init 5000e4c3d768 nfp: Apply devlink default eswitch mode during init 97a218e95e41 netdevsim: Apply devlink default eswitch mode during init I don't think doing this generically from devlink is realistic. devlink doesn't really know when a given driver is ready to change eswitch mode. Some drivers need SR-IOV state, representor setup, or other init pieces to be ready first, and the locking is not identical across drivers either.Low hanging fruit would be just to call ops->eswitch_mode_set at the end of register. Multiple reasons: 1) end of devl_register is exactly the point userspace is free to issue the eswitch mode set. Driver should be ready to handle it. 2) all drivers would transparently get this functionality, without actually knowing this kernel command line arg ever existed, without odd wiring call of related exported function. I prefer that stongly. 3) you should add a there warning for the case this arg is passed yet the driver does not implement eswitch_mode_set. User should get a feedback like this, not silent ignore. The only loose end is see it the void return of devl_register(). Multiple ways to handle the possibly failed eswitch_mode_set(). I would probably just go for pr_warn, seems to be the most correct. Make sense?I see the point, but I don't think devl_register() (at least not the only place) is the right place. There is a small but important difference between userspace doing "devlink eswitch set" after register is done, and devlink core calling eswitch_mode_set() from inside the register flow. Some drivers call devlink_register() while holding the device lock. liquidio is one example. If devlink core calls ops->eswitch_mode_set() from there, we may start the full eswitch mode change while holding that lock. That mode change can create representors, register netdevs, take rtnl, allocate resources, etc. I don't think we want this to become an implicit side effect of devlink registration.I believe your AI may untagle liquidio locking :)I didn't try to solve that one with ai. Most drivers were fairly simple so I didn't use ai at all. bnxt was the one where I needed a bit of help :)quoted
quoted
For mlx5, the placement after intf_state_mutex is also intentional: mutex_unlock(&dev->intf_state_mutex); mlx5_devl_apply_default_esw_mode(dev); We can't call it while holding intf_state_mutex because the mode set path takes it internally, and switchdev mode may also create IB representors. Also, devl_register() only covers the first registration. The mlx5 call in mlx5_load_one_devl_locked() is for reload/fw reset recovery kind of flows. In those flows devlink is already registered, so devl_register() is not called again, but the driver state was rebuilt and we may need to apply the default again.Call it from reload too, right?Yes, that was my first thought: apply it from devl_register() for the first registration and from devlink_reload() after a successful DRIVER_REINIT. That covers the clean devlink reload path but....(see bellow)quoted
quoted
Same for reload, fw reset and pci recovery in general. If the driver tears down and rebuilds eswitch related state, the place to apply the default is in that driver's reinit flow, not in devl_register(). When I went over the other drivers, the right place was not always the same as devlink registration. I'm not an expert in any of them, so I hope I got the details right, but for example octeontx2 AF needs sr-iov and the representor switch state to be initialized first. nfp can do it after app/vNIC init while the devlink lock is already held. liquidio should do it only after dropping the PCI device lock.Idk, perhaps do it from devlink_post_register_work of some kind? That would allow you to have the same locking ordering as a userspace call. I thought about a workqueue too, it was actually my first idea. The problem is that then we race with userspace. In the mlx5 version here the default is applied while the devlink lock is still held, before userspace can come in and issue its own eswitch set. If we defer it to post-register work, the devlink instance is already visible and userspace can get there first and then we might change the user configuration.Figure that out and expose to user by setting xa_mark only after the work is done? This is doable.I agree that if devlink can keep the instance hidden/unavailable until the post register work is done, that solves the initial userspace race. The other part is the reinit/recovery case. For that I think devlink core needs some explicit indication from the driver that the device is now in reinit. Something like (at least that's the code I had initially, but something along those lines): void devl_dev_reinit_begin(struct devlink *devlink); void devl_dev_reinit_end(struct devlink *devlink); void devl_dev_reinit_abort(struct devlink *devlink); The core can then mark the instance as temporarily unavailable/in reinit between begin/end, and the relevant lookup/dump paths, for example devlink_get_from_attrs_lock() and devlink_nl_inst_iter_dumpit(), can reject or skip it while reinit is in progress. devlink_reload() can probably mark this state by itself around DRIVER_REINIT.I believe this is orthogonal to the problem you are trying to solve in this patchset. Not sure why you bring it in to the conversation...
I brought it up because I was also thinking about reinit/recovery flows, but I guess I can tackle that later. For now I can focus on the generic devlink path, move drivers to register devlink only after the device is ready. Then devlink core can apply the default before exposing the instance to userspace. I think it is better to fix the ordering for all devlink drivers, not only the ones that support eswitch mode set. That gives us a consistent model and makes future defaults easier. Reload can be handled from devlink after successful DRIVER_REINIT. Does this sound ok? Mark
quoted
Then mlx5 would look more or less like: devl_lock(devlink); devl_dev_reinit_begin(devlink); ret = mlx5_load_one_devl_locked(dev, recovery); if (!ret) devl_dev_reinit_end(devlink); else devl_dev_reinit_abort(devlink); devl_unlock(devlink); This gives devlink core a way to know that the devlink instance is registered, but should not be used by userspace at the moment. It also allows keeping the default/config apply logic in devlink instead of adding driver specific calls to apply it in each init path. But this still means the generic solution needs some driver help. Drivers need to register devlink at a point where the post-register default apply is safe, and full reinit paths need to be marked with this begin/end API.quoted
quoted
Also, the bigger issue for mlx5 is not only initial registration or devlink reload. Some recovery paths, pci resume, and fw reset flows rebuild the driver state without going through devlink at all. I did not find a clean way for devlink core to infer all those points by itself.If you don't obey current configuration for example in pci resume, it is bug and you should fix it. All these flows should obey current eswitch mode configuration.I agree that the device should come back according to the intended high level policy. But I don't think full reinit can be treated as restoring the whole previous runtime state. There may be user created steering rules and other objects which the driver cannot keep or replay. Today full reinit brings the device back to a clean initialized state, and that is intentional. So the split I have in mind is: - full runtime state is not preserved across full reinit; - high level devlink policy/configuration should be applied when the device is initialized again; - the command line default should not blindly override a later explicit userspace eswitch mode selection. I am not against moving this into devlink core, and I am willing to work on it. But before I rework the series, I want to make sure we agree on the direction. As I see it, doing this cleanly needs a devlink state like "registered but unavailable/in reinit", plus driver annotations for the reinit paths. If this is not the direction you want, I prefer to know now rather than spend time on a version that will be rejected anyway. Markquoted
quoted
To handle that from devlink I would still need to add some api for the driver to tell devlink "I just reinitialized, apply the default now". but nce I had that driver call , it felt simpler and clearer to let the driver call the helper directly at the points where it knows eswitch mode is safe. I agree that handling all of this inside devlink would be the better option. I just couldn't make it work in a clean way. Markquoted
quoted
Markquoted
quoted
Also, since this knob is only about eswitch mode, I don't think we need to touch every devlink driver. Drivers that don't implement eswitch_mode_set() would just ignore it anyway. The follow-up only wires the default into drivers that actually support changing eswitch mode. Markquoted
quoted
return 0; err_register:@@ -1538,6 +1554,7 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)goto err_attach; mutex_unlock(&dev->intf_state_mutex); + mlx5_devl_apply_default_esw_mode(dev); return 0; err_attach: -- 2.44.0