Re: [PATCH net-next 00/13] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 1/2
From: Shay Drori <hidden>
Date: 2026-05-28 09:19:46
Also in:
linux-rdma, lkml
On 28/05/2026 1:08, Jacob Keller wrote:
> On 5/27/2026 5:54 AM, Tariq Toukan wrote:
>> Hi,
>>
>> This series enables Socket Direct single netdev to operate in
>> switchdev mode with shared FDB.
>> See detailed feature description by Shay below.
>>
>> Regards,
>> Tariq
>>
>> This series enables Socket Direct single netdev to operate in
>> switchdev mode with shared FDB. SD single netdev combines multiple
>> PCI functions behind a single netdev interface. To support switchdev
>> offloads, these functions must participate in virtual LAG (shared
>> FDB).
>>
>> Design
>>
>> Rather than introducing a separate LAG instance for SD, this series
>> integrates SD secondary devices into the existing LAG structure
>> (priv.lag) created at probe time. Each lag_func entry carries a
>> group_id field that identifies its SD group membership (0 means not
>> part of any SD group).
>>
>> An xarray mark (XA_MARK_PORT) distinguishes physical port entries
>> from SD secondaries, enabling a single unified iterator that filters
>> by group:
>>
>> - MLX5_LAG_FILTER_PORTS: iterate port-level entries only (existing
>>   behavior, used by bonding, FW LAG commands, v2p_map)
>> - MLX5_LAG_FILTER_ALL: iterate all devices including SD secondaries
>>   (used by MPESW shared FDB across all devices)
>> - specific group_id: iterate only devices in that SD group (used by
>>   per-group SD shared FDB operations)
>>
>> Existing callers use mlx5_ldev_for_each() which maps to
>> MLX5_LAG_FILTER_PORTS, preserving current behavior for non-SD
>> configurations.
>>
>> Lifecycle and ownership
>>
>> The SD LAG lifecycle is tied to the SD group, not to bonding events:
>>
>> 1. At PCI probe, mlx5_lag_add_mdev() creates the LAG structure
>>    (priv.lag) for each LAG-capable PF. e.g.: SD primary devices
>>
>> 2. During mlx5_sd_init(), after the SD group is fully formed (primary
>>    and secondaries paired), sd_lag_init() registers the secondary
>>    devices into the primary's existing priv.lag by calling
>>    mlx5_ldev_add_mdev() with the SD group_id. The primary's lag_func
>>    also gets its group_id set. No separate LAG instance is created.
>>
>> 3. After all the devices in SD group transition to switchdev,
>>    mlx5_lag_shared_fdb_create() is invoked with the group_id to
>>    create a software-only shared FDB scoped to that SD group. This
>>    sets sd_fdb_active on all lag_func entries in the group. No FW LAG
>>    commands are issued since SD devices share the same physical port.
>>
>> 4. If MPESW (multi-port eswitch) is enabled on top of SD groups, the
>>    per-group SD shared FDB is torn down first, then MPESW shared FDB
>>    is created spanning all devices (ports + SD secondaries) using
>>    MLX5_LAG_FILTER_ALL. On MPESW disable, per-group SD shared FDB is
>>    restored.
>>
>> 5. On SD teardown (mlx5_sd_cleanup or device unbind),
>>    sd_lag_cleanup() removes secondaries from priv.lag and clears the
>>    primary's group_id. The LAG structure itself is not destroyed.
>>
>> The sd_fdb_active flag is set on all lag_func entries in a group (not
>> just the primary), so any device can detect the SD shared FDB state
>> during lag_disable_change teardown without needing to look up peer
>> entries.
>>
>> SD shared FDB is a pure software construct -- unlike regular LAG
>> modes (ROCE, SRIOV, MPESW), it does not issue FW
>> create_lag/destroy_lag commands. The software vport LAG for SD is
>> implemented via eswitch egress ACL bounce rules, managed by the IB
>> layer through mlx5_eth_lag_init(). And the software LAG demux is
>> implemented via steering rules that utilize new destination, VHCA_RX.
>
> I appreciate the overall details on the lifecycle and ownership. That
> made it easier to follow the patches and understand the changes.
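
For readers following the thread, the group-filtered iteration from the Design section can be pictured with a small user-space sketch. This is only an illustration under assumptions, not the driver code: a plain array stands in for the xarray, a bool stands in for XA_MARK_PORT, and every sketch_* name and both sentinel values are hypothetical (the thread does not give the real values of MLX5_LAG_FILTER_PORTS or MLX5_LAG_FILTER_ALL).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sentinels; the real constants' values are not in the thread. */
#define SKETCH_FILTER_PORTS (-1)
#define SKETCH_FILTER_ALL   (-2)

/* Simplified stand-in for a lag_func entry. */
struct sketch_lag_func {
	bool is_port;	/* stands in for the XA_MARK_PORT xarray mark */
	int group_id;	/* 0 means not part of any SD group */
};

/* Decide whether one entry is visited under the given filter. */
static bool sketch_filter_match(const struct sketch_lag_func *f, int filter)
{
	if (filter == SKETCH_FILTER_ALL)
		return true;		/* ports and SD secondaries */
	if (filter == SKETCH_FILTER_PORTS)
		return f->is_port;	/* port-level entries only */
	/* Otherwise the filter is a specific SD group_id (> 0). */
	return filter > 0 && f->group_id == filter;
}

/* Count the entries a filtered walk would visit. */
static size_t sketch_filter_count(const struct sketch_lag_func *funcs,
				  size_t n, int filter)
{
	size_t hits = 0;

	assert(funcs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (sketch_filter_match(&funcs[i], filter))
			hits++;
	return hits;
}
```

With two ports that are each the primary of an SD group with one secondary, a PORTS walk would visit the two primaries, an ALL walk would visit all four entries, and a walk with a specific group_id would visit that group's primary and secondary.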
>
>> Patches
>>
>> Infrastructure (patches 1, 5-6):
>> - Factor out shared FDB code into a dedicated file
>> - Extend lag_func with group_id and sd_fdb_active fields; add
>>   XA_MARK_PORT and unified iterator with group_id filter
>> - Extend shared FDB API with group_id parameter
>>
>> E-Switch preparation (patches 2-3):
>> - Align eswitch disable sequence ordering
>> - Move devcom init from TC to eswitch layer
>>
>> SD group management (patches 4, 7-9):
>> - Replace peer count check with direct peer lookup
>> - Register SD secondaries in the existing LAG at SD init time
>> - Block RoCE and VF LAG for SD devices
>> - Block multipath LAG for SD devices
>>
>> Switchdev integration (patch 10):
>> - Keep netdev resources local in switchdev mode
>>
>> Steering (patches 11-12):
>> - Track peer flow slots with bitmap for selective peer flow deletion
>> - Enable TC flow steering for SD LAG
>>
>> Enablement (patch 13):
>> - Verify unique vhca_id count for cross-VHCA RQT
>
> The patch 13 being the "enablement" is a bit confusing to me since I
> had trouble understanding how the patch description is "enabling" the
> socket direct stuff.. But the description does say "part 1/2" so I am
> guessing thats addressed in part 2?

Thanks for the review.
You are right, the word "enablement" here in the cover letter is a bit
confusing... :(
This commit prepares the RQT layer for SD-over-DPU, which will also be
enabled by the series. In an SD-over-DPU configuration, a device's
vhca_id ends up failing the old range-based check.
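
To illustrate the difference between a range-based check and a unique-count check, here is a minimal user-space sketch. It rests on assumptions: the exact form of the old check is not shown in this thread, so the range test below is only a guess at its shape, and the limit, the sketch_* names and the sample ids are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit on how many distinct VHCAs one RQT may reference. */
#define SKETCH_MAX_VHCAS 4

/*
 * Range-style check (assumed shape of the old logic): every id must be
 * numerically below the limit, so a single large id is rejected.
 */
static bool sketch_vhca_ids_in_range(const uint16_t *ids, size_t n)
{
	assert(ids != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (ids[i] >= SKETCH_MAX_VHCAS)
			return false;
	return true;
}

/*
 * Count-style check: collect the distinct ids and compare how many
 * there are against the limit, regardless of their numeric values.
 */
static bool sketch_vhca_ids_unique_count_ok(const uint16_t *ids, size_t n)
{
	uint16_t seen[SKETCH_MAX_VHCAS];
	size_t nseen = 0;

	assert(ids != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		size_t j;

		for (j = 0; j < nseen; j++)
			if (seen[j] == ids[i])
				break;
		if (j < nseen)
			continue;	/* id already counted */
		if (nseen == SKETCH_MAX_VHCAS)
			return false;	/* one distinct id too many */
		seen[nseen++] = ids[i];
	}
	return true;
}
```

In this sketch, a set of ids such as { 0x105, 0x105, 0x2, 0x2 } fails the range-style check because 0x105 is above the limit, yet passes the count-style check because only two distinct ids are present.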

>> Shay Drory (13):
>>   net/mlx5: LAG, factor out shared FDB code into dedicated file
>>   net/mlx5: E-Switch, align disable sequence with switchdev-to-legacy
>>     transition
>>   net/mlx5: E-Switch, move devcom init from TC to eswitch layer
>>   net/mlx5: LAG, replace peer count check with direct peer lookup
>>   net/mlx5: LAG, prepare for SD device integration
>>   net/mlx5: LAG, extend shared FDB API with group_id filter
>>   net/mlx5: SD, introduce Socket Direct LAG
>>   net/mlx5: LAG, block RoCE and VF LAG for SD devices
>>   net/mlx5: LAG, block multipath LAG for SD devices
>>   net/mlx5: SD, keep netdev resources on same PF in switchdev mode
>>   net/mlx5e: TC, track peer flow slots with bitmap
>>   net/mlx5e: TC, enable steering for SD LAG
>>   net/mlx5e: Verify unique vhca_id count instead of range
>>
>>  .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
>>  .../net/ethernet/mellanox/mlx5/core/en/rqt.c  |  27 +-
>>  .../ethernet/mellanox/mlx5/core/en/tc_priv.h  |   7 +
>>  .../net/ethernet/mellanox/mlx5/core/en_tc.c   |  83 ++--
>>  .../net/ethernet/mellanox/mlx5/core/eswitch.h |  11 +-
>>  .../mellanox/mlx5/core/eswitch_offloads.c     |  26 ++
>>  .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 429 ++++++++++--------
>>  .../net/ethernet/mellanox/mlx5/core/lag/lag.h | 100 +++-
>>  .../net/ethernet/mellanox/mlx5/core/lag/mp.c  |   4 +
>>  .../ethernet/mellanox/mlx5/core/lag/mpesw.c   |  28 +-
>>  .../mellanox/mlx5/core/lag/shared_fdb.c       | 233 ++++++++++
>>  .../net/ethernet/mellanox/mlx5/core/lib/sd.c  | 227 +++++++--
>>  .../net/ethernet/mellanox/mlx5/core/lib/sd.h  |  23 +
>>  .../net/ethernet/mellanox/mlx5/core/main.c    |   3 +-
>>  14 files changed, 914 insertions(+), 289 deletions(-)
>>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c
>>
>> base-commit: aa064a614efcfa4c300609d1f01134e99a12ad10