Re: [RFC bpf-next v1 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff

From: Amery Hung <hidden>
Date: 2025-08-28 03:44:36
Also in: bpf

On Wed, Aug 27, 2025 at 6:45 AM Dragos Tatulea [off-list ref] wrote:

On Mon, Aug 25, 2025 at 12:39:12PM -0700, Amery Hung wrote:

quoted

xdp programs can change the layout of an xdp_buff through
bpf_xdp_adjust_tail(), bpf_xdp_adjust_head(). Therefore, the driver
cannot assume the size of the linear data area nor fragments. Fix the
bug in mlx5e driver by generating skb according to xdp_buff layout.

Good find! Thanks for tackling this Amery.

quoted

Currently, when handling multi-buf xdp, the mlx5e driver assumes the
layout of an xdp_buff to be unchanged. That is, the linear data area
continues to be empty and the fragments remains the same.

This is true only for striding rq xdp. Legacy rq xdp puts the header
in the linear part.

quoted

This may
cause the driver to generate erroneous skb or triggering a kernel
warning. When an xdp program added linear data through
bpf_xdp_adjust_head() the linear data will be ignored as
mlx5e_build_linear_skb() builds an skb with empty linear data and then
pull data from fragments to fill the linear data area. When an xdp
program has shrunk the nonlinear data through bpf_xdp_adjust_tail(),
the delta passed to __pskb_pull_tail() may exceed the actual nonlinear
data size and trigger the BUG_ON in it.

To fix the issue, first build the skb with linear data area matching
the xdp_buff. Then, call __pskb_pull_tail() to fill the linear data for
up to MLX5E_RX_MAX_HEAD bytes. In addition, recalculate nr_frags and
truesize after xdp program runs.

The ordering here seems misleading. AFAIU recalculating nr_frags happens
first.

quoted

Fixes: f52ac7028bec ("net/mlx5e: RX, Add XDP multi-buffer support in Striding RQ")
Signed-off-by: Amery Hung <redacted>
---
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 59 ++++++++++++++-----
 1 file changed, 43 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index b8c609d91d11..c5173f1ccb4e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c

@@ -1725,16 +1725,17 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
                           struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
 {
      struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
+     struct mlx5e_wqe_frag_info *pwi, *head_wi = wi;
      struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
-     struct mlx5e_wqe_frag_info *head_wi = wi;
      u16 rx_headroom = rq->buff.headroom;
      struct mlx5e_frag_page *frag_page;
      struct skb_shared_info *sinfo;
-     u32 frag_consumed_bytes;
+     u32 frag_consumed_bytes, i;
      struct bpf_prog *prog;
      struct sk_buff *skb;
      dma_addr_t addr;
      u32 truesize;
+     u8 nr_frags;
      void *va;

      frag_page = wi->frag_page;

@@ -1775,14 +1776,26 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
      prog = rcu_dereference(rq->xdp_prog);
      if (prog && mlx5e_xdp_handle(rq, prog, mxbuf)) {
              if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
-                     struct mlx5e_wqe_frag_info *pwi;
+                     pwi = head_wi;
+                     while (pwi->frag_page->netmem != sinfo->frags[0].netmem && pwi < wi)
+                             pwi++;

Is this trying to skip counting the frags for the linear part? If yes,
don't understand the reasoning. If not, I don't follow the code.

AFAIU frags have to be counted for the linear part + sinfo->nr_frags.
Frags could be less after xdp program execution, but the linear part is
still there.

This is to search the first frag after xdp runs because I thought it
is possible that the first frag (head_wi+1) might be released by
bpf_xdp_pull_data() and then the frag will start from head_wi+2.

After sleeping on it a bit, it seems it is not possible as there is
not enough room in the linear to completely pull PAGE_SIZE byte of
data from the first frag to the linear area. Is this correct?

quoted

-                     for (pwi = head_wi; pwi < wi; pwi++)
+                     for (i = 0; i < sinfo->nr_frags; i++, pwi++)
                              pwi->frag_page->frags++;

Why not:

Will fix it as well as other similar places.

        pwi = head_wi;
        for (int i = 0; i < (sinfo->nr_frags + 1); i++, pwi++)
                pwi->frag_page->frags++;

quoted

              }
              return NULL; /* page/packet was consumed by XDP */
      }

+     nr_frags = sinfo->nr_frags;

This makes sense. You are using this in xdp_update_skb_shared_info()
below.

quoted

+     pwi = head_wi + 1;
+
+     if (prog) {

You could do here: if (unlikely(sinfo->nr_frags != nr_frags).

Got it.

quoted

+             truesize = sinfo->nr_frags * frag_info->frag_stride;
+

Ack. Recalculating truesize.

quoted

+             while (pwi->frag_page->netmem != sinfo->frags[0].netmem && pwi < wi)
+                     pwi++;

Why is this needed here?

quoted

+     }

quoted

      skb = mlx5e_build_linear_skb(
              rq, mxbuf->xdp.data_hard_start, rq->buff.frame0_sz,
              mxbuf->xdp.data - mxbuf->xdp.data_hard_start,

@@ -1796,12 +1809,12 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi

      if (xdp_buff_has_frags(&mxbuf->xdp)) {
              /* sinfo->nr_frags is reset by build_skb, calculate again. */
-             xdp_update_skb_shared_info(skb, wi - head_wi - 1,
+             xdp_update_skb_shared_info(skb, nr_frags,
                                         sinfo->xdp_frags_size, truesize,
                                         xdp_buff_is_frag_pfmemalloc(
                                              &mxbuf->xdp));

-             for (struct mlx5e_wqe_frag_info *pwi = head_wi + 1; pwi < wi; pwi++)
+             for (i = 0; i < nr_frags; i++, pwi++)
                      pwi->frag_page->frags++;

Why not pull the pwi assignmet to head_wi + 1 up from the for scope and use i
with i < nr_frags condition?

quoted

@@ -2073,12 +2086,18 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
      }

      if (prog) {
+             u8 nr_frags;
+             u32 len, i;
+
              if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
                      if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
-                             struct mlx5e_frag_page *pfp;
+                             struct mlx5e_frag_page *pagep = head_page;
+
+                             while (pagep->netmem != sinfo->frags[0].netmem && pagep < frag_page)
+                                     pagep++;

Why do you need this?

quoted

-                             for (pfp = head_page; pfp < frag_page; pfp++)
-                                     pfp->frags++;
+                             for (i = 0; i < sinfo->nr_frags; i++)
+                                     pagep->frags++;

This looks good here but with pfp = head_page. head_page should point to the first
frag. The linear part is in wi->linear_page.

quoted

                              wi->linear_page.frags++;
                      }

@@ -2087,9 +2106,12 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
                      return NULL; /* page/packet was consumed by XDP */
              }

+             len = mxbuf->xdp.data_end - mxbuf->xdp.data;
+             nr_frags = sinfo->nr_frags;
+
              skb = mlx5e_build_linear_skb(
                      rq, mxbuf->xdp.data_hard_start, linear_frame_sz,
-                     mxbuf->xdp.data - mxbuf->xdp.data_hard_start, 0,
+                     mxbuf->xdp.data - mxbuf->xdp.data_hard_start, len,
                      mxbuf->xdp.data - mxbuf->xdp.data_meta);

This makes sense.

quoted

              if (unlikely(!skb)) {
                      mlx5e_page_release_fragmented(rq->page_pool,

@@ -2102,20 +2124,25 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
              mlx5e_page_release_fragmented(rq->page_pool, &wi->linear_page);

              if (xdp_buff_has_frags(&mxbuf->xdp)) {
-                     struct mlx5e_frag_page *pagep;
+                     struct mlx5e_frag_page *pagep = head_page;
+
+                     truesize = nr_frags * PAGE_SIZE;

I am not sure that this is accurate. The last fragment might be smaller
than page size. It should be aligned to BIT(rq->mpwqe.log_stride_sz).

According to the truesize calculation in
mlx5e_skb_from_cqe_mpwrq_nonlinear() just before mlx5e_xdp_handle().
After the first frag, the frag_offset is always 0 and
pg_consumed_bytes will be PAGE_SIZE. Therefore the last page also
consumes a page, no?

If the last page has variable size, I wonder how can
bpf_xdp_adjust_tail() handle a dynamic tailroom. bpf_xdp_adjust_tail()
requires a driver to specify a static frag size (the maximum size a
frag can grow) when calling __xdp_rxq_info_reg(), which seem to be a
page in mlx5.

quoted

                      /* sinfo->nr_frags is reset by build_skb, calculate again. */
-                     xdp_update_skb_shared_info(skb, frag_page - head_page,
+                     xdp_update_skb_shared_info(skb, nr_frags,
                                                 sinfo->xdp_frags_size, truesize,
                                                 xdp_buff_is_frag_pfmemalloc(
                                                      &mxbuf->xdp));

-                     pagep = head_page;
-                     do
+                     while (pagep->netmem != sinfo->frags[0].netmem && pagep < frag_page)
+                             pagep++;
+
+                     for (i = 0; i < nr_frags; i++, pagep++)
                              pagep->frags++;
-                     while (++pagep < frag_page);
+
+                     headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len, sinfo->xdp_frags_size);
+                     __pskb_pull_tail(skb, headlen);
              }
-             __pskb_pull_tail(skb, headlen);

What happens when there are no more frags? (bpf_xdp_frags_shrink_tail()
shrinked them out). Is that at all possible?

It is possible for bpf_xdp_frags_shrink_tail() to release all frags.
There is no limit of how much they can shrink. If there is linear
data, the kfunc allows shrinking data_end until ETH_HLEN. Before this
patchset, it could trigger a BUG_ON in __pskb_pull_tail(). After this
set, the driver will pass a empty skb to the upper layer.

For bpf_xdp_pull_data(), in the case of mlx5, I think it is only
possible to release all frags when the first and only frag contains
less than 256 bytes, which is the free space in the linear page.

In general, I think the code would be nicer if it would do a rewind of
the end pointer based on the diff between the old and new nr_frags.

Not sure if I get this. Do you mean calling __pskb_pull_tail() some
how based on the difference between sinfo->nr_frags and nr_frags?

Thanks for reviewing the patch!

Thanks,
Dragos

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help