Re: [xdp-hints] [PATCH bpf-next v2 6/8] mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: 2022-11-23 21:55:04
Also in:
bpf
On Wed, Nov 23, 2022 at 11:52:12AM -0800, sdf@google.com wrote:
On 11/23, Jakub Kicinski wrote:quoted
On Wed, 23 Nov 2022 10:26:41 -0800 Stanislav Fomichev wrote:quoted
quoted
This embedding trick works for drivers that put xdp_buff on the stack, but mlx5 supports XSK zerocopy, which uses the xsk_buff_pool for allocating them. This makes it a bit awkward to do the same thingthere;quoted
quoted
and since it's probably going to be fairly common to do something like this, how about we just add a 'void *drv_priv' pointer to struct xdp_buff that the drivers can use? The xdp_buff already takes up afullquoted
quoted
cache line anyway, so any data stuffed after it will spill over to anewquoted
quoted
one; so I don't think there's much difference performance-wise.I guess the alternative is to extend xsk_buff_pool with some new argument for xdp_buff tailroom? (so it can kmalloc(sizeof(xdp_buff) + xdp_buff_tailroom)) But it seems messy because there is no way of knowing what the target device's tailroom is, so it has to be a user setting :-/ I've started with a priv pointer in xdp_buff initially, it seems fine to go back. I'll probably convert veth/mlx4 to the same mode as well to avoid having different approaches in different places..quoted
Can we not do this please? Add 16B of "private driver space" after the xdp_buff in xdp_buff_xsk (we have 16B to full cacheline), the
It is time to jump the hints train I guess:D We have 8 bytes left in the cacheline that xdp_buff occupies - pahole output below shows that cb spans through two cachelines. Did you mean something else though?
quoted hunk ↗ jump to hunk
quoted
drivers decide how they use it. Drivers can do BUILD_BUG_ON() for their expected size and cast that to whatever struct they want. This is how various offloads work, the variable size tailroom would be an over design IMO.quoted
And this way non XSK paths can keep its normal typing.Good idea, prototyped below, lmk if it that's not what you had in mind. struct xdp_buff_xsk { struct xdp_buff xdp; /* 0 56 */ u8 cb[16]; /* 56 16 */ /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */ dma_addr_t dma; /* 72 8 */ dma_addr_t frame_dma; /* 80 8 */ struct xsk_buff_pool * pool; /* 88 8 */ u64 orig_addr; /* 96 8 */ struct list_head free_list_node; /* 104 16 */ /* size: 120, cachelines: 2, members: 7 */ /* last cacheline: 56 bytes */ }; Toke, I can try to merge this into your patch + keep your SoB (or feel free to try this and retest yourself, whatever works).diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.hb/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h index bc2d9034af5b..837bf103b871 100644--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h@@ -44,6 +44,11 @@ (MLX5E_XDP_INLINE_WQE_MAX_DS_CNT * MLX5_SEND_WQE_DS - \ sizeof(struct mlx5_wqe_inline_seg)) +struct mlx5_xdp_cb { + struct mlx5_cqe64 *cqe; + struct mlx5e_rq *rq; +}; + struct mlx5e_xsk_param; int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param*xsk); bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page,diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.cb/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c index c91b54d9ff27..84d23b2da7ce 100644--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c@@ -5,6 +5,7 @@ #include "en/xdp.h" #include <net/xdp_sock_drv.h> #include <linux/filter.h> +#include <linux/build_bug.h> /* RX data path */@@ -286,8 +287,14 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(structmlx5e_rq *rq, u32 cqe_bcnt) { struct xdp_buff *xdp = wi->au->xsk; + struct mlx5_xdp_cb *cb; struct bpf_prog *prog; + BUILD_BUG_ON(sizeof(struct mlx5_xdp_cb) > XSKB_CB_SIZE); + cb = xp_get_cb(xdp); + cb->cqe = NULL /*cqe*/; + cb->rq = rq;
I believe that these could be set once at a setup time within a pool - take a look at xsk_pool_set_rxq_info(). This will save us cycles so that we will skip assignments per each processed xdp_buff. AF_XDP ZC performance comes in a major part from the fact that thanks to xsk_buff_pool we have less work to do per each processed buffer.
quoted hunk ↗ jump to hunk
+ /* wi->offset is not used in this function, because xdp->data and the * DMA address point directly to the necessary place. Furthermore, the * XSK allocator allocates frames per packet, instead of pages, sodiff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h index f787c3f524b0..b298590429e7 100644 --- a/include/net/xsk_buff_pool.h +++ b/include/net/xsk_buff_pool.h@@ -19,8 +19,11 @@ struct xdp_sock; struct device; struct page; +#define XSKB_CB_SIZE 16 + struct xdp_buff_xsk { struct xdp_buff xdp; + u8 cb[XSKB_CB_SIZE]; /* Private area for the drivers to use. */ dma_addr_t dma; dma_addr_t frame_dma; struct xsk_buff_pool *pool;@@ -143,6 +146,11 @@ static inline dma_addr_t xp_get_frame_dma(structxdp_buff_xsk *xskb) return xskb->frame_dma; } +static inline void *xp_get_cb(struct xdp_buff *xdp) +{ + return (void *)xdp + offsetof(struct xdp_buff_xsk, cb); +}
This should have a wrapper in include/net/xdp_sock_drv.h that drivers will call. Generally I think this should fly but I'm not sure about cb being 16 bytes.
+ void xp_dma_sync_for_cpu_slow(struct xdp_buff_xsk *xskb); static inline void xp_dma_sync_for_cpu(struct xdp_buff_xsk *xskb) {quoted
quoted
quoted
I'll send my patch to add support to mlx5 (using the drv_priv pointer approach) separately.Saw them, thanks! Will include them in v3+.