Re: [RFC bpf-next v1 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff

From: Amery Hung <hidden>
Date: 2025-09-04 17:26:28
Also in: bpf

On Thu, Aug 28, 2025 at 9:23 AM Dragos Tatulea [off-list ref] wrote:

On Wed, Aug 27, 2025 at 08:44:24PM -0700, Amery Hung wrote:

quoted

On Wed, Aug 27, 2025 at 6:45 AM Dragos Tatulea [off-list ref] wrote:

quoted

On Mon, Aug 25, 2025 at 12:39:12PM -0700, Amery Hung wrote:

quoted

[...]

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index b8c609d91d11..c5173f1ccb4e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c

@@ -1725,16 +1725,17 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
                           struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
 {
      struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
+     struct mlx5e_wqe_frag_info *pwi, *head_wi = wi;
      struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
-     struct mlx5e_wqe_frag_info *head_wi = wi;
      u16 rx_headroom = rq->buff.headroom;
      struct mlx5e_frag_page *frag_page;
      struct skb_shared_info *sinfo;
-     u32 frag_consumed_bytes;
+     u32 frag_consumed_bytes, i;
      struct bpf_prog *prog;
      struct sk_buff *skb;
      dma_addr_t addr;
      u32 truesize;
+     u8 nr_frags;
      void *va;

      frag_page = wi->frag_page;

@@ -1775,14 +1776,26 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
      prog = rcu_dereference(rq->xdp_prog);
      if (prog && mlx5e_xdp_handle(rq, prog, mxbuf)) {
              if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
-                     struct mlx5e_wqe_frag_info *pwi;
+                     pwi = head_wi;
+                     while (pwi->frag_page->netmem != sinfo->frags[0].netmem && pwi < wi)
+                             pwi++;

Is this trying to skip counting the frags for the linear part? If yes,
don't understand the reasoning. If not, I don't follow the code.

AFAIU frags have to be counted for the linear part + sinfo->nr_frags.
Frags could be less after xdp program execution, but the linear part is
still there.

This is to search the first frag after xdp runs because I thought it
is possible that the first frag (head_wi+1) might be released by
bpf_xdp_pull_data() and then the frag will start from head_wi+2.

After sleeping on it a bit, it seems it is not possible as there is
not enough room in the linear to completely pull PAGE_SIZE byte of
data from the first frag to the linear area. Is this correct?

Right. AFAIU the usable linear part is smaller due to headroom and
tailroom.

[...]

quoted

              if (unlikely(!skb)) {
                      mlx5e_page_release_fragmented(rq->page_pool,

@@ -2102,20 +2124,25 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
              mlx5e_page_release_fragmented(rq->page_pool, &wi->linear_page);

              if (xdp_buff_has_frags(&mxbuf->xdp)) {
-                     struct mlx5e_frag_page *pagep;
+                     struct mlx5e_frag_page *pagep = head_page;
+
+                     truesize = nr_frags * PAGE_SIZE;

I am not sure that this is accurate. The last fragment might be smaller
than page size. It should be aligned to BIT(rq->mpwqe.log_stride_sz).

According to the truesize calculation in
mlx5e_skb_from_cqe_mpwrq_nonlinear() just before mlx5e_xdp_handle().
After the first frag, the frag_offset is always 0 and
pg_consumed_bytes will be PAGE_SIZE. Therefore the last page also
consumes a page, no?

My understanding is that the last pg_consumed_bytes will be a byte_cnt
that is smaller than PAGE_SIZE as there is a min operation.

The remaining byte_cnt will then be aligned to
BIT(rq->mpwqe.log_stride_sz), which is PAGE_SHIFT if there is xdp
program (per mlx5e_mpwqe_get_log_stride_size()). Therefore, it still
adds a page to truesize.

I will change to ALIGN(..., BIT(rq->mpwqe.log_stride_sz)) to be
consistent with existing code.

quoted

If the last page has variable size, I wonder how can
bpf_xdp_adjust_tail() handle a dynamic tailroom.

That is a good point. So this can stay as is I guess.

quoted

bpf_xdp_adjust_tail()
requires a driver to specify a static frag size (the maximum size a
frag can grow) when calling __xdp_rxq_info_reg(), which seem to be a
page in mlx5.

This is an issue raised by Nimrod as well. Currently striding rq sets
rxq->frag_size to 0. It is set to PAGE_SIZE only in legacy rq mode.

I see.

quoted

                      /* sinfo->nr_frags is reset by build_skb, calculate again. */
-                     xdp_update_skb_shared_info(skb, frag_page - head_page,
+                     xdp_update_skb_shared_info(skb, nr_frags,
                                                 sinfo->xdp_frags_size, truesize,
                                                 xdp_buff_is_frag_pfmemalloc(
                                                      &mxbuf->xdp));

-                     pagep = head_page;
-                     do
+                     while (pagep->netmem != sinfo->frags[0].netmem && pagep < frag_page)
+                             pagep++;
+
+                     for (i = 0; i < nr_frags; i++, pagep++)
                              pagep->frags++;
-                     while (++pagep < frag_page);
+
+                     headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len, sinfo->xdp_frags_size);
+                     __pskb_pull_tail(skb, headlen);
              }
-             __pskb_pull_tail(skb, headlen);

What happens when there are no more frags? (bpf_xdp_frags_shrink_tail()
shrinked them out). Is that at all possible?

It is possible for bpf_xdp_frags_shrink_tail() to release all frags.
There is no limit of how much they can shrink. If there is linear
data, the kfunc allows shrinking data_end until ETH_HLEN. Before this
patchset, it could trigger a BUG_ON in __pskb_pull_tail(). After this
set, the driver will pass a empty skb to the upper layer.

I see what you mean.

quoted

For bpf_xdp_pull_data(), in the case of mlx5, I think it is only
possible to release all frags when the first and only frag contains
less than 256 bytes, which is the free space in the linear page.

Why would only 256 bytes be free in the linear area? My understanding
is that we have PAGE_SIZE - headroom - tailroom which should be more?

mlx5e_skb_from_cqe_mpwrq_nonlinear() currently sets xdp->frame_sz to
be XDP_PACKET_HEADROOM (256) + MLX5E_RX_MAX_HEAD (256) + sizeof(struct
skb_shared_info), so only 256 is available to xdp programs to pull
data in.

quoted

In general, I think the code would be nicer if it would do a rewind of
the end pointer based on the diff between the old and new nr_frags.

Not sure if I get this. Do you mean calling __pskb_pull_tail() some
how based on the difference between sinfo->nr_frags and nr_frags?

Thanks for reviewing the patch!

I was suggesting an approach for the whole patch that might be cleaner.

Roll back frag_page to the last used fragment after program execution:

        frag_page -= old_nr_frags - new_nr_frags;

... and after that you won't need to touch the frag counting loops
and the xdp_update_skb_shared_info().

Got it. This makes the change quite cleaner.

Thanks,
Dragos

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help