Thread (15 messages) 15 messages, 1 author, 4h ago
HOTtoday
Revisions (10)
  1. v1 [diff vs current]
  2. v2 [diff vs current]
  3. v3 [diff vs current]
  4. v4 [diff vs current]
  5. v5 [diff vs current]
  6. v6 [diff vs current]
  7. v7 [diff vs current]
  8. v8 [diff vs current]
  9. v9 [diff vs current]
  10. v10 current

[PATCH net-next V10 00/14] devlink and mlx5: Support cross-function rate scheduling

From: Tariq Toukan <tariqt@nvidia.com>
Date: 2026-07-01 07:33:56
Also in: linux-doc, linux-kselftest, linux-rdma, lkml

Hi,

This series by Cosmin adds support for cross-function rate scheduling in
devlink and mlx5.
See detailed explanation by Cosmin below [0].

Regards,
Tariq

[0]
devlink objects support rate management for TX scheduling, which
involves maintaining a tree of rate nodes that corresponds to TX
schedulers in hardware. 'man devlink-rate' has the full details.

The tree of rate nodes is maintained per devlink object, protected by
the devlink lock.

There exists hardware capable of instantiating TX scheduling trees
spanning multiple functions of the same physical device (and thus
devlink objects) and therefore the current API and locking scheme is
insufficient.

This patch series changes the devlink rate implementation and API to
allow supporting such hardware and managing TX scheduling trees across
multiple functions of a physical device.

Modeling this requires having devlink rate nodes with parents in other
devlink objects. A naive approach that relies on the current
one-lock-per-devlink model is impossible, as it would require in some
cases acquiring multiple devlink locks in the correct order.

The solution proposed in this patch series makes use of the recently
introduced shared devlink instance [1] to manage rate hierarchy changes
across multiple functions.

V1 of this patch series was sent a long time ago [2], using a different
approach of storing rates in a shared rate domain with special locking
rules. This new approach uses standard devlink instances and nesting.

The first part of the series adds support to devlink rates for
maintaining the rate tree across multiple functions.

The second part changes the mlx5 implementation to make use of this (and
cleans up remnants of the previous approach, involving rate domains).

The neat part about using the shared devlink object is that it works for
SFs as well, which are already nested in their parent PF instances. So
with this series, complex scheduling trees spanning multiple SFs across
multiple PFs of the same NIC can now be supported.

---

[1] https://lore.kernel.org/all/20260312100407.551173-1-jiri@resnulli.us/T/#u (local)
[2] https://lore.kernel.org/netdev/20250213180134.323929-1-tariqt@nvidia.com/ (local)
[3] https://lore.kernel.org/netdev/20260325152801.236343-1-jiri@resnulli.us/ (local)

Comments from internal Sashiko review:
When a rate node is linked to a parent on a different devlink device, should
the rate-get reply and dump operations also include the parent's device
information?
Currently, parent-dev is only added to the request attributes for rate-set
and rate-new, but it appears to be omitted from the serialization logic and
the rate-get reply definition. Does this render the cross-device
relationship invisible to user space monitoring tools?
[CR] Preexisting problem with devlink rate API, the rate-get command
specifies only a subset of output attributes. May be fixed separately.
quoted
diff --git a/net/devlink/rate.c b/net/devlink/rate.c
index 5436811cfbbd..2b30329a1394 100644
--- a/net/devlink/rate.c
+++ b/net/devlink/rate.c
@@ -660,9 +660,11 @@ int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)

 int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *rate_devlink, *devlink = devlink_nl_ctx(info)->devlink;
+	struct devlink_nl_ctx *ctx = devlink_nl_ctx(info);
+	struct devlink *devlink = ctx->devlink;
While ctx is now retrieved, does the rest of the function ignore
ctx->parent_devlink?
It looks like the command handlers never retrieve or use the parent devlink,
and unconditionally use the child's devlink for lookups and assignments. Will
this cause cross-device parent assignments to silently fail or return -ENODEV
since they search for the parent node exclusively on the child's device?
[CR] False positive, it's used in devlink_nl_rate_parent_node_set().
Additionally, looking at how the parent devlink is prepared in
__devlink_nl_pre_doit() (which is called via
DEVLINK_NL_FLAG_OPTIONAL_PARENT_DEV):
__devlink_nl_pre_doit()
    if (parent_dev && attrs[DEVLINK_ATTR_PARENT_DEV]) {
        parent_devlink = devlink_get_parent_from_attrs_lock(net, attrs);
        ...
        devlink_nl_ctx(info)->parent_devlink = parent_devlink;
        /* Drop the parent devlink lock but don't release the reference.
         * This will keep it alive until the end of the request.
         */
        devl_unlock(parent_devlink);
    }
To actually implement the cross-device feature in the command handlers, the
parent's rate list would need to be accessed. Since the parent devlink lock
is explicitly dropped here, wouldn't accessing the parent's rate list
without its lock violate devlink locking rules and lead to data races?
[CR] Locking changes are intentional. All rate changes for anything in
the rate group now acquire the root devlink lock, where rates are
stored.
quoted
  * devl_rate_nodes_destroy - destroy all devlink rate nodes on device
  * @devlink: devlink instance
  *
- * Unset parent for all rate objects and destroy all rate nodes
- * on specified device.
+ * Unset parent for all rate objects involving this device and destroy all rate
+ * nodes on it.
  */
 void devl_rate_nodes_destroy(struct devlink *devlink)
 {
-	const struct devlink_ops *ops = devlink->ops;
 	struct devlink_rate *devlink_rate, *tmp;
+	const struct devlink_ops *ops;
+	struct devlink *rate_devlink;

 	devl_assert_locked(devlink);
+	rate_devlink = devl_rate_lock(devlink);

-	list_for_each_entry(devlink_rate, &devlink->rate_list, list) {
-		if (!devlink_rate->parent)
+	list_for_each_entry(devlink_rate, &rate_devlink->rate_list, list) {
+		if (!devlink_rate->parent ||
+		    (devlink_rate->devlink != devlink &&
+		     devlink_rate->parent->devlink != devlink))
 			continue;

+		/* This could destroy rate objects on other devlinks in the
+		 * same hierarchy under 'rate_devlink'. This is safe because
+		 * the shared common ancestor is locked so there can be no
+		 * other concurrent rate operations on devlink_rate->devlink.
+		 */
+		ops = devlink_rate->devlink->ops;
Does this code invoke devlink_ops callbacks on other devlink instances
without holding their instance lock?
While the comment indicates the shared common ancestor lock prevents concurrent
rate operations, the child device's instance lock (devlink_rate->devlink) is
not held.
Could this violate the devlink locking contract if the driver is executing
concurrent non-rate operations on the child instance? Because devlink
drivers rely on the instance lock to serialize access to their internal
structures, could invoking these callbacks unlocked cause data corruption or
trigger lockdep splats if the driver asserts devl_assert_locked()?
[CR] As mentioned in the comment, concurrent ops which touch rates
_will_ eventually acquire the shared devlink lock, so they're safe. We
don't care about other ops, since we only touching rates here.
quoted
@@ -321,22 +307,19 @@ static int esw_qos_create_rate_limit_element(struct mlx5_esw_sched_node *node,
 	return esw_qos_node_create_sched_element(node, sched_ctx, extack);
 }

-static u32 esw_qos_calculate_min_rate_divider(struct mlx5_eswitch *esw,
-					      struct mlx5_esw_sched_node *parent)
+static u32
+esw_qos_calculate_min_rate_divider(struct mlx5_esw_sched_node *parent)
 {
-	struct list_head *nodes = parent ? &parent->children : &esw->qos.domain->nodes;
-	u32 fw_max_bw_share = MLX5_CAP_QOS(esw->dev, max_tsar_bw_share);
+	u32 fw_max_bw_share = MLX5_CAP_QOS(parent->esw->dev, max_tsar_bw_share);
This is a pre-existing issue, but does this division risk a divide-by-zero
panic regression?
If the hardware does not support bandwidth sharing, max_tsar_bw_share
could be 0. If a user subsequently sets a non-zero tx_share (minimum rate
guarantee), max_guarantee becomes greater than 0, resulting in
max_guarantee / fw_max_bw_share.
Should we validate whether fw_max_bw_share is non-zero before performing
the division?
[CR] Preexisting, also not really possible for fw to give a value of 0
for the divisor.
quoted
@@ -665,7 +631,6 @@ static int esw_qos_create_tc_arbiter_sched_elem(
 		struct netlink_ext_ack *extack)
 {
 	u32 tsar_ctx[MLX5_ST_SZ_DW(scheduling_context)] = {};
-	u32 tsar_parent_ix;
 	void *attr;
This is a pre-existing issue, but is there a hardware scheduling element
leak when transitioning a node to a TC arbiter?
When esw_qos_tc_arbiter_scheduling_setup() calls
esw_qos_create_tc_arbiter_sched_elem(node, extack), it provisions a new
hardware element and immediately overwrites node->ix with its ID.
If the subsequent step esw_qos_create_vports_tc_nodes(node, extack) fails,
it jumps to err_tc_node_create and restores node->ix = curr_ix, but it
doesn't appear to tear down the newly created scheduling element. Does this
permanently leak the newly provisioned hardware element in the firmware?
[CR] False alarm, the call to esw_qos_node_destroy_sched_element()
destroys the new temporary scheduling element on failure.
quoted
@@ -1319,11 +1288,9 @@ static int esw_qos_switch_tc_arbiter_node_to_vports(
 	struct mlx5_esw_sched_node *node,
 	struct netlink_ext_ack *extack)
 {
This is a pre-existing issue, but what happens if the rollback path fails
during vport migration in esw_qos_switch_vports_node_to_tc_arbiter()?
If an error occurs, it attempts to rollback by calling
esw_qos_switch_vport_tcs_to_vport(). If this rollback fails internally
(e.g. due to ENOMEM in esw_qos_vport_update()), the vport is
successfully restored to the doomed TC arbiter node, and the error
is silently ignored.
Then, the caller esw_qos_node_enable_tc_arbitration() executes
esw_qos_tc_arbiter_scheduling_teardown(), which frees the TC arbiter's
VPORTS_TC_TSAR nodes without freeing the attached vport VPORT_TC nodes.
Does this leak the vport TC nodes and leave them holding use-after-free
pointers to their freed parents, leading to memory corruption when the
vport is later disabled?
[CR] Preexisting, also hard to hit. Requires a failure mid-way when
switching nodes to TC arbitration AND a failure during rollback. Fixing
it outside the scope of this series.
quoted
@@ -1888,7 +1855,9 @@ mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport,
 		err = mlx5_esw_qos_vport_enable(vport, type, parent, 0, 0,
 						extack);
 	} else if (vport->qos.sched_node) {
-		err = esw_qos_vport_update_parent(vport, parent, extack);
+		err = esw_qos_vport_update_parent(vport,
+						  parent ? : esw->qos.root,
+						  extack);
Can this sequence cause a use-after-free regression on esw->qos.root
when updating a vport's QoS configuration?
If a vport is the only QoS user in the E-Switch, esw->qos.refcnt is 1.
If its QoS configuration is updated, esw_qos_vport_update() is called
with the parent argument pointing to esw->qos.root (since this patch
changes the default parent from NULL to esw->qos.root).
Inside esw_qos_vport_update(), mlx5_esw_qos_vport_disable_locked() drops
the vport's QoS reference, decrementing esw->qos.refcnt to 0. This
triggers esw_qos_destroy(), freeing esw->qos.root.
The function then calls mlx5_esw_qos_vport_enable(..., parent, ...) passing
the stale parent pointer. Because parent is no longer NULL, it doesn't
fetch the newly re-allocated root node, and instead passes the freed
pointer to __esw_qos_alloc_node(), resulting in a use-after-free.
[CR] False alarm. The finding confuses esw_qos_vport_disable() and
mlx5_esw_qos_vport_disable_locked(). esw_qos_vport_disable() doesn't
touch the refcnt.
quoted
@@ -1859,13 +1840,15 @@ mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport,
 						  extack);
 	}
-	esw_qos_unlock(esw);
+
 	return err;
 }

 void mlx5_esw_qos_vport_clear_parent(struct mlx5_vport *vport)
 {
+	esw_qos_lock(vport->dev);
 	mlx5_esw_qos_vport_update_parent(vport, NULL, NULL);
+	esw_qos_unlock(vport->dev);
 }
Could this lead to a recursive mutex deadlock during VF teardown on older
hardware where shd is NULL?
In the teardown path, mlx5_eswitch_unload_pf_vf_vport() already acquires
esw->state_lock. It then proceeds to call:
mlx5_eswitch_unload_pf_vf_vport()
  mlx5_eswitch_unload_vport()
    mlx5_esw_offloads_unload_rep()
      mlx5_esw_offloads_devlink_port_unregister()
        mlx5_esw_qos_vport_clear_parent()
Since this patch changes mlx5_esw_qos_vport_clear_parent() to
unconditionally call esw_qos_lock(), which falls back to acquiring
esw->state_lock when dev->shd is NULL, won't this result in an attempt to
re-acquire the non-recursive state_lock that is already held by the
teardown process?
[CR] False alarm. state_lock isn't held during VF teardown.
quoted
@@ -839,13 +940,17 @@ void devl_rate_nodes_destroy(struct devlink *devlink)
 		refcount_dec(&devlink_rate->parent->refcnt);
 		devlink_rate->parent = NULL;
When unsetting the parent of a rate object in devl_rate_nodes_destroy(),
this patch allows it to happen to cross-device child nodes (where
devlink_rate->devlink != devlink). Since the child's devlink instance is
still active, shouldn't its state change (losing its parent) be
communicated to user space?
Omitting devlink_rate_notify(devlink_rate, DEVLINK_CMD_RATE_NEW) here might
leave user space monitors out of sync, incorrectly believing the active child
is still attached to the deleted parent node.
[CR] Preexisting, devl_rate_nodes_destroy() didn't send notifications.
quoted
+static struct devlink *devl_rate_lock(struct devlink *devlink)
 {
-	return devlink;
+	struct devlink *rate_devlink = devlink, *parent;
+
+	devl_assert_locked(devlink);
+
+	while (rate_devlink->ops &&
+	       rate_devlink->ops->supported_cross_device_rate_nodes) {
+		parent = devlink_nested_in_get_lock(rate_devlink);
+		if (!parent)
+			break;
Can this lead to list corruption if the parent devlink is unregistered?
If the parent is unregistered concurrently (for example, a PF is unregistered
before VF teardown), devlink_nested_in_get_lock() will return NULL because of
this internal check:
devlink_nested_in_get_lock() {
    ...
    if (devl_is_registered(devlink))
        return devlink;
    ...
    return NULL;
}
Since parent is NULL, this loop breaks and returns the child devlink lock
instead of the parent's lock. Later, in functions like devl_rate_leaf_destroy(),
list_del(&devlink_rate->list) is called on the rate node.
Because the rate node is chained into the parent's rate_list, does this
modify the parent's list concurrently without holding its lock?
[CR] This misunderstands the nesting hierarchy. VFs are not nested in
PFs, PFs are nested in shd, which is reference counted (=> outlives all
PFs) and keeps all rate nodes for all PFs.
This is a pre-existing issue, but does devlink_nl_rate_del_doit()
unconditionally free rate nodes even if the driver rejects the deletion?
Looking at devlink_nl_rate_del_doit() in net/devlink/rate.c:
devlink_nl_rate_del_doit() {
    ...
    err = devlink->ops->rate_node_del(rate_node, rate_node->priv,
                                      info->extack);
    if (rate_node->parent)
        refcount_dec(&rate_node->parent->refcnt);
    list_del(&rate_node->list);
    kfree(rate_node->name);
    kfree(rate_node);
    ...
}
If the hardware or driver cannot fulfill the request and ops->rate_node_del()
returns an error like -EBUSY, the function ignores the error regarding memory
management and proceeds to kfree() the rate node.
Could this leave the driver with a dangling pointer to the freed rate node,
triggering a use-after-free later?
[CR] Preexisting, it is intentional that the node is removed no matter
what the rate_node_del op manages to do.
quoted
@@ -243,7 +243,29 @@ devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
+	err = nla_parse_nested(tb, maxtype, attrs[DEVLINK_ATTR_PARENT_DEV],
+			       devlink_dl_parent_dev_nl_policy, NULL);
Could this nla_parse_nested() call pass down an extended ACK pointer instead
of NULL?
When passing NULL for the extack parameter, if the nested attribute is
malformed (such as a missing NLA_F_NESTED flag), userspace will receive a
silent -EINVAL without a descriptive error message.
Because devlink_get_parent_from_attrs_lock() is called by
__devlink_nl_pre_doit() which has access to the genl_info structure and
its extack pointer, passing it down to this function would allow for
reporting accurate validation errors to userspace.
[CR] Valid, but quite a lot of plumbing is necessary for extack to be
used here. Out of scope of the series.
This is a pre-existing issue, but does the error path in
devlink_nl_rate_new_doit() leak the parent's reference count?
If devlink_nl_rate_set() successfully sets the parent (which increments
parent->refcnt) but fails on a subsequent attribute validation like tc_bw,
it returns an error.
The cleanup path in devlink_nl_rate_new_doit() handles the error:
err_rate_set:
    ops->rate_node_del(rate_node, rate_node->priv, info->extack);
err_node_new:
    kfree(rate_node->name);
err_strdup:
    kfree(rate_node);
Because this path frees the node but never decrements the parent's refcnt,
does this permanently elevate the parent's refcount, falsely indicating it
has children and preventing it from ever being deleted?
[CR] Preexisting, will be fixed separately with a net patch.
quoted
@@ -30,14 +30,42 @@ devlink_rate_leaf_get_from_info(struct devlink *devlink, struct genl_info *info)
 	return devlink_rate ?: ERR_PTR(-ENODEV);
 }

+/* Repeatedly walks the nested devlink chain while cross device rate nodes are
+ * supported and finds the topmost instance where rates should be stored.
+ * That instance is locked, referenced and returned.
+ * When cross device rate nodes aren't supported the original devlink instance
+ * is returned.
+ */
 static struct devlink *devl_rate_lock(struct devlink *devlink)
 {
-	return devlink;
+	struct devlink *rate_devlink = devlink, *parent;
+
+	devl_assert_locked(devlink);
+
+	while (rate_devlink->ops &&
+	       rate_devlink->ops->supported_cross_device_rate_nodes) {
+		parent = devlink_nested_in_get_lock(rate_devlink);
+		if (!parent)
+			break;
+		if (rate_devlink != devlink) {
+			/* Unlock intermediate instances. */
+			devl_unlock(rate_devlink);
+			devlink_put(rate_devlink);
+		}
+		rate_devlink = parent;
+	}
+	return rate_devlink;
 }
Now that devl_rate_lock() returns the shared ancestor, does this cause
devlink_rates_check() to skip child rate nodes?
In net/devlink/rate.c:devlink_rates_check(), the loop filters nodes by the
devlink pointer:
	rate_devlink = devl_rate_lock(devlink);
	list_for_each_entry(devlink_rate, &rate_devlink->rate_list, list)
		if (devlink_rate->devlink == devlink &&
		    (!rate_filter || rate_filter(devlink_rate))) {
			...
			err = -EBUSY;
			break;
		}
Because of this check, if a user creates an SF rate node and then triggers an
eswitch mode change to legacy on the PF devlink, devlink_rates_check() skips
the SF node and returns 0 instead of -EBUSY.
Could this allow mlx5_devlink_eswitch_mode_set() to proceed and unconditionally
free the QoS domain in mlx5_esw_qos_cleanup(), while the child SF devlink and
its software rate node remain active?
If so, a subsequent operation on the active child SF rate node (like
DEVLINK_CMD_RATE_DEL) would invoke mlx5_esw_devlink_rate_node_del() and call
esw_qos_lock(), leading to a NULL pointer dereference when accessing
esw->qos.domain->lock.
[CR] False alarm, no driver uses supported_cross_device_rate_nodes yet
so this code path isn't taken. QoS domains are removed in a later patch
in the series before the cross device flag is enabled, so this finding
is moot.
quoted
 static int mlx5_esw_qos_max_link_speed_get(struct mlx5_core_dev *mdev, u32 *link_speed_max,
-					   bool take_rtnl,
 					   struct netlink_ext_ack *extack)
 {
 	int err;

-	if (!mlx5_lag_is_active(mdev))
+	if (!mlx5_lag_is_active(mdev) ||
+	    mlx5_lag_query_bond_speed(mdev, link_speed_max) < 0 ||
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is a pre-existing issue, but does mlx5_lag_query_bond_speed() have a race
condition when accessing the ldev->tracker structure?
The mlx5_lag_query_bond_speed() function reads ldev->tracker.bond_speed_mbps
while holding lag_lock. However, mlx5_lag_netdev_event() modifies
ldev->tracker holding only rtnl_lock, without holding lag_lock:
drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c:mlx5_lag_netdev_event() {
    ...
    ldev->tracker = tracker;
    ...
}
Concurrently, mlx5_ldev_add_netdev() modifies fields inside it under lag_lock,
meaning its updates can be overwritten and lost by the read-modify-write in
mlx5_lag_netdev_event().
Since this patch now heavily relies on mlx5_lag_query_bond_speed() for QoS
max link speed, could this expose readers to observing torn or inconsistent
tracker states?
[CR] Preexisting, no impact. Either a value of 0 is read, or the correct one.
quoted
@@ -458,10 +461,17 @@ esw_qos_vport_create_sched_element(struct mlx5_esw_sched_node *vport_node,
 	MLX5_SET(scheduling_context, sched_ctx, max_average_bw,
 		 vport_node->max_rate);
+	if (vport->dev != dev) {
+		/* The port is assigned to a node on another eswitch. */
+		MLX5_SET(vport_element, attr, eswitch_owner_vhca_id_valid,
+			 true);
+		MLX5_SET(vport_element, attr, eswitch_owner_vhca_id,
+			 MLX5_CAP_GEN(vport->dev, vhca_id));
Does this code silently truncate the vhca_id if it is >= 2048?
The hardware structure mlx5_ifc_vport_element_bits defines
eswitch_owner_vhca_id as an 11-bit field, but MLX5_CAP_GEN(..., vhca_id)
returns a 16-bit identifier. The MLX5_SET macro will silently apply a
mask of 0x7ff. Should there be a bounds check returning an error like
-EOVERFLOW here, or does the hardware actually support a 16-bit field?
[CR] vhca_id >= 2048 would imply a NIC with that many physical ports,
which is only possible in theory. Not worth addressing.
quoted
@@ -1221,8 +1260,15 @@ static int esw_qos_vport_update(struct mlx5_vport *vport,
 	if (err)
 		return err;

-	if (curr_type == SCHED_NODE_TYPE_TC_ARBITER_TSAR && curr_type == type)
+	if (curr_type == SCHED_NODE_TYPE_TC_ARBITER_TSAR && curr_type == type) {
 		esw_qos_tc_arbiter_get_bw_shares(vport_node, curr_tc_bw);
Does this condition skip the preservation and restoration of TC bandwidth
shares when transitioning between different node types?
When a vport transitions away from SCHED_NODE_TYPE_TC_ARBITER_TSAR
(e.g., to a regular SCHED_NODE_TYPE_VPORT), curr_type != type evaluates to
true, causing the function to skip saving the current tc_bw shares.
esw_qos_vport_disable() then unconditionally clears vport_node->tc_bw to
zero.
If the subsequent esw_qos_vport_enable() fails, the rollback path attempts
to restore the old TC_ARBITER_TSAR state.
However, because curr_type != type, the code skips restoring the bandwidth
shares via esw_qos_set_tc_arbiter_bw_shares().
Are the vport's bandwidth guarantees permanently erased in this failure
scenario?
[CR] Pre-existing, unlikely and low-impact: If tc-bw settings fail,
there will be an error and an opportunity for the user to retry.
quoted
@@ -1990,6 +1990,13 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw, bool clear_vf)
 		 esw->esw_funcs.num_vfs, esw->esw_funcs.num_ec_vfs, esw->enabled_vports);
 	mlx5_eswitch_invalidate_wq(esw);
quoted
+
+	if (esw->mode == MLX5_ESWITCH_OFFLOADS) {
+		struct devlink *devlink = priv_to_devlink(esw->dev);
+
+		devl_rate_nodes_destroy(devlink);
+	}
+
Can this reordering cause a use-after-free if leaf unparenting fails?
Devlink's devl_rate_nodes_destroy() calls ops->rate_leaf_parent_set(..., NULL)
which maps to mlx5_esw_qos_vport_update_parent(). Devlink ignores the return
value of this callback.
If the hardware operation fails (e.g., due to a firmware timeout), the driver
leaves vport->qos.sched_node->parent pointing to the original parent node.
Devlink then unconditionally proceeds to destroy and free the rate node.
Later, when mlx5_eswitch_unload_vf_vports() executes, it calls
mlx5_esw_qos_vport_disable(), which reaches esw_qos_vport_disable():
drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c:esw_qos_vport_disable() {
    ...
	list_del_init(&vport_node->entry);
	esw_qos_normalize_min_rate(vport_node->esw, vport_node->parent, extack);
    ...
}
Will this dereference the freed vport_node->parent, resulting in list
corruption or a use-after-free?
[CR] This is a preexisting problem, brought to light by the reordering
of group destruction before leaf destruction. It's extremely unlikely,
requiring the firmware command to reparent a vport to its root to fail.
Fixing this properly requires multiple patches and will be pursued after
this series.
quoted
@@ -2039,6 +2040,9 @@ void mlx5_eswitch_disable_locked(struct mlx5_eswitch *esw)
 		 esw->mode == MLX5_ESWITCH_LEGACY ? "LEGACY" : "OFFLOADS",
 		 esw->esw_funcs.num_vfs, esw->esw_funcs.num_ec_vfs, esw->enabled_vports);

+	if (esw->mode == MLX5_ESWITCH_OFFLOADS)
+		devl_rate_nodes_destroy(devlink);
+
Does this identical reordering in the locked disable path suffer from the
same unparenting failure use-after-free described above?
[CR] Same comment as above. QoS improvements for the error paths will
follow.

V10:
- Added a comment in devl_rate_nodes_destroy clarifying locking.
- Expanded 'supported_cross_device_rate_nodes' comment with locking
  expectations.
- Simplified rate locking by only keeping the common ancestor locked.
- Removed devlink_nested_in_get_locked and devlink_nested_in_put_unlock.
- devlink_nl_rate_parent_node_set iterates over the proper rate list.
- Refactored mlx5 locking given dev->shd is now optional (after [3]).
- Fixed a bug in pruning introduced by the root node patch.
- Fixed a bug on failure when detaching a node from parent.
- Clarified expectations for shared devlink rate storage.
- Fixed incorrect net namespace when listing shared instances.

V9:
https://lore.kernel.org/netdev/20260326065949.44058-1-tariqt@nvidia.com/ (local)

Cosmin Ratiu (14):
  devlink: Update nested instance locking comment
  devlink: Add a helper for getting a nested-in instance
  devlink: Migrate from info->user_ptr to info->ctx
  devlink: Decouple rate storage from associated devlink object
  devlink: Add parent dev to devlink API
  devlink: Allow parent dev for rate-set and rate-new
  devlink: Allow rate node parents from other devlinks
  net/mlx5: qos: Use mlx5_lag_query_bond_speed to query LAG speed
  net/mlx5: qos: Refactor vport QoS cleanup
  net/mlx5: qos: Model the root node in the scheduling hierarchy
  net/mlx5: qos: Remove qos domains and use shd
  net/mlx5: qos: Support cross-device tx scheduling
  selftests: drv-net: Add test for cross-esw rate scheduling
  net/mlx5: Document devlink rates

 Documentation/netlink/specs/devlink.yaml      |  30 +-
 .../networking/devlink/devlink-port.rst       |   2 +
 Documentation/networking/devlink/index.rst    |   8 +-
 Documentation/networking/devlink/mlx5.rst     |  33 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   1 +
 .../mellanox/mlx5/core/esw/devlink_port.c     |   1 -
 .../net/ethernet/mellanox/mlx5/core/esw/qos.c | 605 +++++++++---------
 .../net/ethernet/mellanox/mlx5/core/esw/qos.h |   3 -
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  27 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  18 +-
 include/net/devlink.h                         |   9 +
 include/uapi/linux/devlink.h                  |   2 +
 net/devlink/core.c                            |  20 +-
 net/devlink/dev.c                             |  16 +-
 net/devlink/devl_internal.h                   |  20 +
 net/devlink/dpipe.c                           |  14 +-
 net/devlink/health.c                          |  12 +-
 net/devlink/linecard.c                        |   4 +-
 net/devlink/netlink.c                         |  82 ++-
 net/devlink/netlink_gen.c                     |  24 +-
 net/devlink/netlink_gen.h                     |   8 +
 net/devlink/param.c                           |   4 +-
 net/devlink/port.c                            |  18 +-
 net/devlink/rate.c                            | 331 +++++++---
 net/devlink/region.c                          |   6 +-
 net/devlink/resource.c                        |  14 +-
 net/devlink/sb.c                              |  22 +-
 net/devlink/trap.c                            |  12 +-
 .../testing/selftests/drivers/net/hw/Makefile |   1 +
 .../drivers/net/hw/devlink_rate_cross_esw.py  | 296 +++++++++
 30 files changed, 1132 insertions(+), 511 deletions(-)
 create mode 100755 tools/testing/selftests/drivers/net/hw/devlink_rate_cross_esw.py


base-commit: 1c664ec4b9ea827b609d296921ed5bad8a40a158
-- 
2.44.0
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help