Thread (30 messages) 30 messages, 6 authors, 2024-06-06

Re: [PATCH v2 1/4] net: introduce helper sendpages_ok()

From: Sagi Grimberg <sagi@grimberg.me>
Date: 2024-06-04 13:01:21
Also in: ceph-devel, linux-block, linux-nvme
Subsystem: block layer, nvm express driver, the rest · Maintainers: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg, Linus Torvalds


On 04/06/2024 11:24, Sagi Grimberg wrote:

On 04/06/2024 7:27, Christoph Hellwig wrote:
quoted
On Tue, Jun 04, 2024 at 12:27:06AM +0300, Sagi Grimberg wrote:
quoted
quoted
quoted
I still don't understand how a page in the middle of a contiguous 
range ends
up coming from the slab while others don't.
I haven't investigate the origin of the IO
yet. I suspect the first 2 pages are the superblocks of the raid
(mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the 
bitmap.
Well, if these indeed are different origins and just *happen* to be a
mixture
of slab originated pages and non-slab pages combined together in a 
single
bio of a bvec entry,
I'd suspect that it would be more beneficial to split the bvec 
(essentially
not allow bio_add_page
to append the page to tail bvec depending on a queue limit (similar 
to how
we handle sg gaps).
So you want to add a PageSlab check to bvec_try_merge_page? That sounds
fairly expensive..
The check needs to happen somewhere apparently, and given that it will 
be gated by a queue flag
only request queues that actually needed will suffer, but they will 
suffer anyways...
Something like the untested patch below:
--
diff --git a/block/bio.c b/block/bio.c
index 53f608028c78..e55a4184c0e6 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -18,6 +18,7 @@
  #include <linux/highmem.h>
  #include <linux/blk-crypto.h>
  #include <linux/xarray.h>
+#include <linux/net.h>

  #include <trace/events/block.h>
  #include "blk.h"
@@ -960,6 +961,9 @@ bool bvec_try_merge_hw_page(struct request_queue *q, 
struct bio_vec *bv,
                 return false;
         if (len > queue_max_segment_size(q) - bv->bv_len)
                 return false;
+       if (q->limits.splice_pages &&
+           sendpage_ok(bv->bv_page) ^ sendpage_ok(page))
+                       return false;
         return bvec_try_merge_page(bv, page, len, offset, same_page);
  }
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a7e820840cf7..82e2719acb9c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1937,6 +1937,7 @@ static void nvme_set_ctrl_limits(struct nvme_ctrl 
*ctrl,
         lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1;
         lim->max_segment_size = UINT_MAX;
         lim->dma_alignment = 3;
+       lim->splice_pages = ctrl->splice_pages;
  }

  static bool nvme_update_disk_info(struct nvme_ns *ns, struct 
nvme_id_ns *id,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 3f3e26849b61..d9818330e236 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -398,6 +398,7 @@ struct nvme_ctrl {

         enum nvme_ctrl_type cntrltype;
         enum nvme_dctype dctype;
+       bool splice_pages
  };

  static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 02076b8cb4d8..618b8f20206a 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2146,6 +2146,12 @@ static int nvme_tcp_configure_admin_queue(struct 
nvme_ctrl *ctrl, bool new)
         if (error)
                 goto out_stop_queue;

+       /*
+        * we want to prevent contig pages with conflicting 
splice-ability with
+        * respect to the network transmission
+        */
+       ctrl->splice_pages = true;
+
         nvme_unquiesce_admin_queue(ctrl);

         error = nvme_init_ctrl_finish(ctrl, false);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 69c4f113db42..ec657ddad2a4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -331,6 +331,12 @@ struct queue_limits {
          * due to possible offsets.
          */
         unsigned int            dma_alignment;
+
+       /*
+        * Drivers that use MSG_SPLICE_PAGES to send the bvec over the 
network,
+        * will need either bvec entry contig pages spliceable or not.
+        */
+       bool                    splice_pages;
  };

  typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,
--

What I now see is that we will check PageSlab twice (bvec last index and 
append page)
and skb_splice_from_iter checks it again... How many times check we 
check this :)

Would be great if the network stack can just check it once and fallback 
to page copy...
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help