Thread: 15 messages, 5 authors, 2025-09-06

Re: [External] Re: [PATCH] iomap: allow iomap using the per-cpu bio cache

From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Date: 2025-09-06 04:31:53
Also in: linux-fsdevel, linux-xfs

Fengnan Chang [off-list ref] writes:
Ritesh Harjani [off-list ref] wrote on Wed, Aug 27, 2025 at 01:26:
Fengnan Chang [off-list ref] writes:
Christoph Hellwig [off-list ref] wrote on Mon, Aug 25, 2025 at 17:21:
On Mon, Aug 25, 2025 at 04:51:27PM +0800, Fengnan Chang wrote:
No restrictions for now; I think we can enable this by default.
Maybe a better solution is to modify bio.c? Let me do some tests first.
If there are other implications to consider for using the per-cpu bio
cache by default, then maybe we can first get the iomap optimizations in
for at least the REQ_ALLOC_CACHE users, and later see whether this
can be enabled by default for other users too.
Unless someone else thinks otherwise.

The reason I am thinking this: the per-cpu bio cache is limited, so if
everyone uses it for their bio submission, we may not get the best
performance where it is needed. That might require us to come up with a
different approach.
Agreed. If everyone uses it for their bio submission, we cannot get the
best performance.
Any kind of numbers you see where this makes a difference, including
the workloads, would also be very valuable here.
I'm testing random direct read performance on io_uring+ext4 and comparing
it to io_uring+raw blkdev. io_uring+ext4 is quite poor, and I'm trying to
improve it. I found that ext4 behaves quite differently from blkdev when
running bio_alloc_bioset. It's because the blkdev path uses the per-cpu
bio cache, but the ext4 path does not. So I made this modification.
I am assuming you meant to say - DIO with iouring+raw_blkdev uses the
per-cpu bio cache, whereas iouring+(ext4/xfs) does not use it.
Hence you added this patch which will enable the use of it - which
should also improve the performance of iouring+(ext4/xfs).
Yes. DIO+iouring+raw_blkdev vs DIO+iouring+(ext4/xfs).
That makes sense to me.
My test command is:
/fio/t/io_uring -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1 -X1 -n1 -P1 -t0
/data01/testfile
Without this patch:
BW is 1950MB
with this patch
BW is 2001MB.
I guess here you meant BW: XXXX MB/s
Ok. That's around a 2.6% improvement. Is that what you were expecting to
see too? Is that because you were testing with -p0 (non-polled I/O)?
I don't have a quantitative target; 2.6% seems reasonable.
It is not related to -p0: with -p1, the improvement is about 3.1%.
Why can't we get a 5-6% improvement? I think the biggest bottlenecks are
in ext4/xfs, mostly in ext4_es_lookup_extent.
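
As a quick check of the figures quoted in this thread, the relative gain
follows directly from the two bandwidth numbers reported above (1950 and
2001, units as posted). A minimal sketch in Python; the function and
variable names are illustrative only and are not taken from the patch:

```python
def percent_improvement(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed in percent."""
    return (after - before) / before * 100.0


# Bandwidth figures as reported in the thread (unit as posted there).
baseline_bw = 1950.0  # without the patch
patched_bw = 2001.0   # with the patch

gain = percent_improvement(baseline_bw, patched_bw)
print(f"{gain:.1f}% improvement")  # prints "2.6% improvement"
```

This matches the "around 2.6%" figure given for the -p0 run.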
Sure, thanks for sharing the details.
Could you add the performance improvement numbers, along with the
io_uring cmd you shared above, to the commit message in v2?

With that please feel free to add:

Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Looking at the numbers here [1] & [2], I was hoping this could give
maybe around 5-6% improvement ;)

[1]: https://lore.kernel.org/io-uring/cover.1666347703.git.asml.silence@gmail.com/
[2]: https://lore.kernel.org/all/20220806152004.382170-3-axboe@kernel.dk/


-ritesh