Re: [PATCH v2 1/4] net: introduce helper sendpages_ok()
From: Sagi Grimberg <sagi@grimberg.me>
Date: 2024-06-04 13:01:21
Also in:
ceph-devel, linux-block, linux-nvme
Subsystem:
block layer, nvm express driver, the rest · Maintainers:
Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg, Linus Torvalds
On 04/06/2024 11:24, Sagi Grimberg wrote:
On 04/06/2024 7:27, Christoph Hellwig wrote:quoted
On Tue, Jun 04, 2024 at 12:27:06AM +0300, Sagi Grimberg wrote:quoted
quoted
quoted
I still don't understand how a page in the middle of a contiguous range ends up coming from the slab while others don't.I haven't investigate the origin of the IO yet. I suspect the first 2 pages are the superblocks of the raid (mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the bitmap.Well, if these indeed are different origins and just *happen* to be a mixture of slab originated pages and non-slab pages combined together in a single bio of a bvec entry, I'd suspect that it would be more beneficial to split the bvec (essentially not allow bio_add_page to append the page to tail bvec depending on a queue limit (similar to how we handle sg gaps).So you want to add a PageSlab check to bvec_try_merge_page? That sounds fairly expensive..The check needs to happen somewhere apparently, and given that it will be gated by a queue flag only request queues that actually needed will suffer, but they will suffer anyways...
Something like the untested patch below: --
diff --git a/block/bio.c b/block/bio.c
index 53f608028c78..e55a4184c0e6 100644
--- a/block/bio.c
+++ b/block/bio.c@@ -18,6 +18,7 @@ #include <linux/highmem.h> #include <linux/blk-crypto.h> #include <linux/xarray.h> +#include <linux/net.h> #include <trace/events/block.h> #include "blk.h"
@@ -960,6 +961,9 @@ bool bvec_try_merge_hw_page(struct request_queue *q, struct bio_vec *bv, return false; if (len > queue_max_segment_size(q) - bv->bv_len) return false; + if (q->limits.splice_pages && + sendpage_ok(bv->bv_page) ^ sendpage_ok(page)) + return false; return bvec_try_merge_page(bv, page, len, offset, same_page); }
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a7e820840cf7..82e2719acb9c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c@@ -1937,6 +1937,7 @@ static void nvme_set_ctrl_limits(struct nvme_ctrl *ctrl, lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1; lim->max_segment_size = UINT_MAX; lim->dma_alignment = 3; + lim->splice_pages = ctrl->splice_pages; } static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 3f3e26849b61..d9818330e236 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h@@ -398,6 +398,7 @@ struct nvme_ctrl { enum nvme_ctrl_type cntrltype; enum nvme_dctype dctype; + bool splice_pages }; static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 02076b8cb4d8..618b8f20206a 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c@@ -2146,6 +2146,12 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new) if (error) goto out_stop_queue; + /* + * we want to prevent contig pages with conflicting splice-ability with + * respect to the network transmission + */ + ctrl->splice_pages = true; + nvme_unquiesce_admin_queue(ctrl); error = nvme_init_ctrl_finish(ctrl, false);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 69c4f113db42..ec657ddad2a4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h@@ -331,6 +331,12 @@ struct queue_limits { * due to possible offsets. */ unsigned int dma_alignment; + + /* + * Drivers that use MSG_SPLICE_PAGES to send the bvec over the
network, + * will need either bvec entry contig pages spliceable or not. + */ + bool splice_pages; }; typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx, -- What I now see is that we will check PageSlab twice (bvec last index and append page) and skb_splice_from_iter checks it again... How many times check we check this :) Would be great if the network stack can just check it once and fallback to page copy...