Re: [PATCH net V2 2/4] net/mlx5: Fix deadlock between devlink lock and esw->wq
From: Cosmin Ratiu <hidden>
Date: 2026-02-02 14:48:32
Also in:
linux-rdma, lkml
On Thu, 2026-01-29 at 15:40 -0800, Jakub Kicinski wrote:
> On Thu, 29 Jan 2026 10:33:40 +0000 Cosmin Ratiu wrote:
> > > This is quite an ugly hack, is there no way to avoid the flush and
> > > let the work discover that what it was supposed to do is no longer
> > > needed?
> >
> > Not possible, unfortunately. I stared at it for quite a while. The wq
> > is flushed because the esw is being unconfigured, which removes data
> > structs the work handler uses. Flushing the work is required,
> > otherwise we'll run into worse issues.
>
> And having a refcount on (I presume) struct mlx5_esw_functions so that
> work can hold a ref is not an option? Are you planning to revisit this
> in -next?

Currently, mlx5_eswitch_disable_locked (with the devlink lock held)
waits for esw_vfs_changed_event_handler to finish. The event handler
needs to acquire the same lock and load/unload all VFs, which touches
the entire esw. I don't currently see how to use reference counting on
the esw to avoid waiting for the handler.

But we can have a deeper look as part of an internal task to improve
this. For now, please accept the V3 fix (about to be sent) with the
current approach because we couldn't find a better way.

Cosmin.
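
For anyone following the thread, the lock dependency described above can be
sketched in userspace C. This is only an illustration, not the driver code:
pthreads stand in for the kernel primitives, and fake_devlink_lock,
fake_vfs_changed_handler and handler_blocked_during_flush are hypothetical
names made up for this sketch. The handler uses trylock instead of a blocking
lock so that the demo terminates and can report the problem rather than hang.

```c
/* Userspace sketch only: pthreads stand in for the kernel primitives. */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the devlink instance lock. */
static pthread_mutex_t fake_devlink_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Hypothetical stand-in for the work handler. The real handler would block
 * on the lock; trylock is used here so the demo finishes and reports what
 * happened instead of hanging.
 */
static void *fake_vfs_changed_handler(void *arg)
{
	int *blocked = arg;
	int rc = pthread_mutex_trylock(&fake_devlink_lock);

	if (rc == 0) {
		/* Lock was free: the VF load/unload work would run here. */
		pthread_mutex_unlock(&fake_devlink_lock);
		*blocked = 0;
	} else {
		/* EBUSY: a blocking lock at this point would never return. */
		*blocked = (rc == EBUSY);
	}
	return NULL;
}

/*
 * Hypothetical stand-in for the disable path: take the lock, then wait for
 * the handler (pthread_join plays the role of flushing the workqueue).
 * Returns 1 if the handler found the lock held during the wait.
 */
int handler_blocked_during_flush(void)
{
	pthread_t worker;
	int blocked = 0;
	int rc;

	rc = pthread_mutex_lock(&fake_devlink_lock);
	assert(rc == 0);
	rc = pthread_create(&worker, NULL, fake_vfs_changed_handler, &blocked);
	assert(rc == 0);
	rc = pthread_join(worker, NULL);	/* the "flush" */
	assert(rc == 0);
	pthread_mutex_unlock(&fake_devlink_lock);
	return blocked;
}
```

Calling handler_blocked_during_flush() returns 1: the waiter holds the lock
for the whole duration of the wait, so a handler that blocks on that lock can
never finish. That is the shape of the deadlock the patch addresses, under
the simplifying assumptions stated above.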