
Re: [PATCH net-next 00/13] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 1/2

From: Shay Drori <hidden>
Date: 2026-05-28 09:19:46
Also in: linux-rdma, lkml


On 28/05/2026 1:08, Jacob Keller wrote:
On 5/27/2026 5:54 AM, Tariq Toukan wrote:
quoted
Hi,

This series enables Socket Direct single netdev to operate in switchdev
mode with shared FDB. See detailed feature description by Shay below.

Regards,
Tariq


This series enables Socket Direct single netdev to operate in switchdev
mode with shared FDB. SD single netdev combines multiple PCI functions
behind a single netdev interface. To support switchdev offloads, these
functions must participate in virtual LAG (shared FDB).

Design

Rather than introducing a separate LAG instance for SD, this series
integrates SD secondary devices into the existing LAG structure
(priv.lag) created at probe time. Each lag_func entry carries a
group_id field that identifies its SD group membership (0 means not
part of any SD group). An xarray mark (XA_MARK_PORT) distinguishes
physical port entries from SD secondaries, enabling a single unified
iterator that filters by group:

   - MLX5_LAG_FILTER_PORTS: iterate port-level entries only (existing
     behavior, used by bonding, FW LAG commands, v2p_map)
   - MLX5_LAG_FILTER_ALL: iterate all devices including SD secondaries
     (used by MPESW shared FDB across all devices)
   - specific group_id: iterate only devices in that SD group (used by
     per-group SD shared FDB operations)

Existing callers use mlx5_ldev_for_each() which maps to
MLX5_LAG_FILTER_PORTS, preserving current behavior for non-SD
configurations.
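
A minimal sketch of the three filter modes, in plain userspace C rather
than kernel code. All names here (SKETCH_FILTER_PORTS, SKETCH_FILTER_ALL,
struct sketch_lag_func, sketch_filter_match, sketch_count) are
hypothetical, and the is_port boolean only stands in for the XA_MARK_PORT
xarray mark; the real definitions in lag.h may look quite different.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical filter values; chosen negative so they cannot collide
 * with a real (positive) group_id in this sketch. */
#define SKETCH_FILTER_PORTS (-1)
#define SKETCH_FILTER_ALL   (-2)

struct sketch_lag_func {
	bool is_port;  /* stands in for the XA_MARK_PORT xarray mark */
	int group_id;  /* 0 means not part of any SD group */
};

/* Decide whether one entry is visited under the given filter. */
static bool sketch_filter_match(const struct sketch_lag_func *f, int filter)
{
	if (filter == SKETCH_FILTER_ALL)
		return true;            /* ports and SD secondaries */
	if (filter == SKETCH_FILTER_PORTS)
		return f->is_port;      /* port-level entries only */
	/* Otherwise the filter is a specific SD group_id. */
	return f->group_id != 0 && f->group_id == filter;
}

/* Count the entries a filtered walk over the table would visit. */
static size_t sketch_count(const struct sketch_lag_func *tbl, size_t n,
			   int filter)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++)
		if (sketch_filter_match(&tbl[i], filter))
			hits++;
	return hits;
}
```

With a table holding two SD groups (one primary port plus one secondary
each) and one plain port, the PORTS filter would visit three entries, ALL
would visit five, and each group_id would visit two.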

Lifecycle and ownership

The SD LAG lifecycle is tied to the SD group, not to bonding events:

1. At PCI probe, mlx5_lag_add_mdev() creates the LAG structure
    (priv.lag) for each LAG-capable PF, e.g. SD primary devices.

2. During mlx5_sd_init(), after the SD group is fully formed (primary
    and secondaries paired), sd_lag_init() registers the secondary
    devices into the primary's existing priv.lag by calling
    mlx5_ldev_add_mdev() with the SD group_id. The primary's lag_func
    also gets its group_id set. No separate LAG instance is created.

3. After all the devices in the SD group transition to switchdev,
    mlx5_lag_shared_fdb_create() is invoked with the group_id to create
    a software-only shared FDB scoped to that SD group. This sets
    sd_fdb_active on all lag_func entries in the group. No FW LAG
    commands are issued since SD devices share the same physical port.

4. If MPESW (multi-port eswitch) is enabled on top of SD groups, the
    per-group SD shared FDB is torn down first, then MPESW shared FDB is
    created spanning all devices (ports + SD secondaries) using
    MLX5_LAG_FILTER_ALL. On MPESW disable, per-group SD shared FDB is
    restored.

5. On SD teardown (mlx5_sd_cleanup or device unbind), sd_lag_cleanup()
    removes secondaries from priv.lag and clears the primary's group_id.
    The LAG structure itself is not destroyed.

The sd_fdb_active flag is set on all lag_func entries in a group (not
just the primary), so any device can detect the SD shared FDB state
during lag_disable_change teardown without needing to look up peer
entries.
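
The point about setting the flag on every entry can be illustrated with
a small userspace sketch. The struct and function names below are
hypothetical and there is no locking or xarray here; only the group_id
and sd_fdb_active field names come from the description above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a lag_func entry. */
struct sketch_fdb_func {
	int group_id;        /* 0 means not part of any SD group */
	bool sd_fdb_active;  /* mirrors the per-entry flag described above */
};

/* Set or clear the flag on every entry of one SD group.
 * Returns the number of entries updated. */
static size_t sketch_set_sd_fdb_active(struct sketch_fdb_func *tbl, size_t n,
				       int group_id, bool active)
{
	size_t i, updated = 0;

	if (group_id == 0)
		return 0;  /* 0 is "no group", nothing to mark */

	for (i = 0; i < n; i++) {
		if (tbl[i].group_id == group_id) {
			tbl[i].sd_fdb_active = active;
			updated++;
		}
	}
	return updated;
}

/* Teardown-side check: a device only inspects its own entry and does
 * not need to look up the primary or any peer. */
static bool sketch_sd_fdb_is_active(const struct sketch_fdb_func *self)
{
	return self->sd_fdb_active;
}
```

Because the flag is duplicated on every member, the check at teardown is
a single field read on the caller's own entry, at the cost of one extra
walk over the group when the shared FDB is created or destroyed.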

SD shared FDB is a pure software construct -- unlike regular LAG modes
(ROCE, SRIOV, MPESW), it does not issue FW create_lag/destroy_lag
commands. The software vport LAG for SD is implemented via eswitch
egress ACL bounce rules, managed by the IB layer through
mlx5_eth_lag_init(). The software LAG demux is implemented via
steering rules that use a new destination, VHCA_RX.

I appreciate the overall details on the lifecycle and ownership. That
made it easier to follow the patches and understand the changes.
quoted
Patches

Infrastructure (patches 1, 5-6):
   - Factor out shared FDB code into a dedicated file
   - Extend lag_func with group_id and sd_fdb_active fields;
     add XA_MARK_PORT and unified iterator with group_id filter
   - Extend shared FDB API with group_id parameter

E-Switch preparation (patches 2-3):
   - Align eswitch disable sequence ordering
   - Move devcom init from TC to eswitch layer

SD group management (patches 4, 7-9):
   - Replace peer count check with direct peer lookup
   - Register SD secondaries in the existing LAG at SD init time
   - Block RoCE and VF LAG for SD devices
   - Block multipath LAG for SD devices

Switchdev integration (patch 10):
   - Keep netdev resources local in switchdev mode

Steering (patches 11-12):
   - Track peer flow slots with bitmap for selective peer flow deletion
   - Enable TC flow steering for SD LAG

Enablement (patch 13):
   - Verify unique vhca_id count for cross-VHCA RQT

Patch 13 being the "enablement" is a bit confusing to me, since I had
trouble understanding how the patch description is "enabling" the socket
direct stuff. But the description does say "part 1/2", so I am guessing
that's addressed in part 2?

Thanks for the review.

The word "enablement" in the cover letter is indeed a bit confusing... :(
This commit prepares the RQT layer for SD-over-DPU, which will also be
enabled by the series. In an SD-over-DPU configuration, a device's
vhca_id ends up failing the old range-based check.
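
To illustrate the difference between a range-based check and a
unique-count check, here is a rough userspace sketch. The exact form of
the old check in rqt.c is not shown in this thread, so the range check
below is only an assumption about its general shape, and all function
names and the max_ids parameter are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed shape of a range-style check: every vhca_id must fall
 * below max_ids, regardless of how many distinct ids are used. */
static bool sketch_range_check(const uint16_t *ids, size_t n,
			       unsigned int max_ids)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (ids[i] >= max_ids)
			return false;
	return true;
}

/* Count distinct vhca_ids; quadratic, which is fine for a handful
 * of devices. */
static size_t sketch_count_unique(const uint16_t *ids, size_t n)
{
	size_t i, j, unique = 0;

	for (i = 0; i < n; i++) {
		bool seen = false;

		for (j = 0; j < i; j++) {
			if (ids[j] == ids[i]) {
				seen = true;
				break;
			}
		}
		if (!seen)
			unique++;
	}
	return unique;
}

/* Count-style check: only the number of distinct ids is limited,
 * not their numeric values. */
static bool sketch_unique_check(const uint16_t *ids, size_t n,
				unsigned int max_ids)
{
	return sketch_count_unique(ids, n) <= max_ids;
}
```

Under these assumptions, the ids {0, 7, 0, 7} with a limit of two would
fail the range-style check (7 is out of range) but pass the count-style
check (only two distinct ids are in use).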
quoted
Shay Drory (13):
   net/mlx5: LAG, factor out shared FDB code into dedicated file
   net/mlx5: E-Switch, align disable sequence with switchdev-to-legacy
     transition
   net/mlx5: E-Switch, move devcom init from TC to eswitch layer
   net/mlx5: LAG, replace peer count check with direct peer lookup
   net/mlx5: LAG, prepare for SD device integration
   net/mlx5: LAG, extend shared FDB API with group_id filter
   net/mlx5: SD, introduce Socket Direct LAG
   net/mlx5: LAG, block RoCE and VF LAG for SD devices
   net/mlx5: LAG, block multipath LAG for SD devices
   net/mlx5: SD, keep netdev resources on same PF in switchdev mode
   net/mlx5e: TC, track peer flow slots with bitmap
   net/mlx5e: TC, enable steering for SD LAG
   net/mlx5e: Verify unique vhca_id count instead of range

  .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
  .../net/ethernet/mellanox/mlx5/core/en/rqt.c  |  27 +-
  .../ethernet/mellanox/mlx5/core/en/tc_priv.h  |   7 +
  .../net/ethernet/mellanox/mlx5/core/en_tc.c   |  83 ++--
  .../net/ethernet/mellanox/mlx5/core/eswitch.h |  11 +-
  .../mellanox/mlx5/core/eswitch_offloads.c     |  26 ++
  .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 429 ++++++++++--------
  .../net/ethernet/mellanox/mlx5/core/lag/lag.h | 100 +++-
  .../net/ethernet/mellanox/mlx5/core/lag/mp.c  |   4 +
  .../ethernet/mellanox/mlx5/core/lag/mpesw.c   |  28 +-
  .../mellanox/mlx5/core/lag/shared_fdb.c       | 233 ++++++++++
  .../net/ethernet/mellanox/mlx5/core/lib/sd.c  | 227 +++++++--
  .../net/ethernet/mellanox/mlx5/core/lib/sd.h  |  23 +
  .../net/ethernet/mellanox/mlx5/core/main.c    |   3 +-
  14 files changed, 914 insertions(+), 289 deletions(-)
  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c


base-commit: aa064a614efcfa4c300609d1f01134e99a12ad10
  