Re: [PATCH v5 07/14] iomap: track pending read bytes more optimally

From: Brian Foster <hidden>
Date: 2025-10-27 12:12:14
Also in: gfs2, linux-doc, linux-fsdevel, linux-xfs

On Fri, Oct 24, 2025 at 02:55:20PM -0700, Joanne Koong wrote:

On Fri, Oct 24, 2025 at 12:48 PM Joanne Koong [off-list ref] wrote:

quoted

On Fri, Oct 24, 2025 at 10:10 AM Brian Foster [off-list ref] wrote:

quoted

On Fri, Oct 24, 2025 at 09:25:13AM -0700, Joanne Koong wrote:

quoted

On Thu, Oct 23, 2025 at 5:01 PM Joanne Koong [off-list ref] wrote:

quoted

On Thu, Oct 23, 2025 at 12:30 PM Brian Foster [off-list ref] wrote:

quoted

On Thu, Sep 25, 2025 at 05:26:02PM -0700, Joanne Koong wrote:

quoted

Instead of incrementing read_bytes_pending for every folio range read in
(which requires acquiring the spinlock to do so), set read_bytes_pending
to the folio size when the first range is asynchronously read in, keep
track of how many bytes total are asynchronously read in, and adjust
read_bytes_pending accordingly after issuing requests to read in all the
necessary ranges.

iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
value for pending bytes necessarily indicates the folio is in the bio.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
---

Hi Joanne,

I was throwing some extra testing at the vfs-6.19.iomap branch since the
little merge conflict thing with iomap_iter_advance(). I end up hitting
what appears to be a lockup on XFS with 1k FSB (-bsize=1k) running
generic/051. It reproduces fairly reliably within a few iterations or so
and seems to always stall during a read for a dedupe operation:

task:fsstress        state:D stack:0     pid:12094 tgid:12094 ppid:12091  task_flags:0x400140 flags:0x00080003
Call Trace:
 <TASK>
 __schedule+0x2fc/0x7a0
 schedule+0x27/0x80
 io_schedule+0x46/0x70
 folio_wait_bit_common+0x12b/0x310
 ? __pfx_wake_page_function+0x10/0x10
 ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
 filemap_read_folio+0x85/0xd0
 ? __pfx_xfs_vm_read_folio+0x10/0x10 [xfs]
 do_read_cache_folio+0x7c/0x1b0
 vfs_dedupe_file_range_compare.constprop.0+0xaf/0x2d0
 __generic_remap_file_range_prep+0x276/0x2a0
 generic_remap_file_range_prep+0x10/0x20
 xfs_reflink_remap_prep+0x22c/0x300 [xfs]
 xfs_file_remap_range+0x84/0x360 [xfs]
 vfs_dedupe_file_range_one+0x1b2/0x1d0
 ? remap_verify_area+0x46/0x140
 vfs_dedupe_file_range+0x162/0x220
 do_vfs_ioctl+0x4d1/0x940
 __x64_sys_ioctl+0x75/0xe0
 do_syscall_64+0x84/0x800
 ? do_syscall_64+0xbb/0x800
 ? avc_has_perm_noaudit+0x6b/0xf0
 ? _copy_to_user+0x31/0x40
 ? cp_new_stat+0x130/0x170
 ? __do_sys_newfstat+0x44/0x70
 ? do_syscall_64+0xbb/0x800
 ? do_syscall_64+0xbb/0x800
 ? clear_bhb_loop+0x30/0x80
 ? clear_bhb_loop+0x30/0x80
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fe6bbd9a14d
RSP: 002b:00007ffde72cd4e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000068 RCX: 00007fe6bbd9a14d
RDX: 000000000a1394b0 RSI: 00000000c0189436 RDI: 0000000000000004
RBP: 00007ffde72cd530 R08: 0000000000001000 R09: 000000000a11a3fc
R10: 000000000001d6c0 R11: 0000000000000246 R12: 000000000a12cfb0
R13: 000000000a12ba10 R14: 000000000a14e610 R15: 0000000000019000
 </TASK>

It wasn't immediately clear to me what the issue was so I bisected and
it landed on this patch. It kind of looks like we're failing to unlock a
folio at some point and then tripping over it later..? I can kill the
fsstress process but then the umount ultimately gets stuck tossing
pagecache [1], so the mount still ends up stuck indefinitely. Anyways,
I'll poke at it some more but I figure you might be able to make sense
of this faster than I can.

Brian

Hi Brian,

Thanks for your report and the repro instructions. I will look into
this and report back what I find.

This is the fix:

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4e6258fdb915..aa46fec8362d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c

@@ -445,6 +445,9 @@ static void iomap_read_end(struct folio *folio,

size_t bytes_pending)
                bool end_read, uptodate;
                size_t bytes_accounted = folio_size(folio) - bytes_pending;

+               if (!bytes_accounted)
+                       return;
+
                spin_lock_irq(&ifs->state_lock);


What I missed was that if all the bytes in the folio are non-uptodate
and need to read in by the filesystem, then there's a bug where the
read will be ended on the folio twice (in iomap_read_end() and when
the filesystem calls iomap_finish_folio_write(), when only the
filesystem should end the read), which does 2 folio unlocks which ends
up locking the folio. Looking at the writeback patch that does a
similar optimization [1], I miss the same thing there.

Makes sense.. though a short comment wouldn't hurt in there. ;) I found
myself a little confused by the accounted vs. pending naming when
reading through that code. If I follow correctly, the intent is to refer
to the additional bytes accounted to read_bytes_pending via the init
(where it just accounts the whole folio up front) and pending refers to
submitted I/O.

Presumably that extra accounting doubly serves as the typical "don't
complete the op before the submitter is done processing" extra
reference, except in this full submit case of course. If so, that's
subtle enough in my mind that a sentence or two on it wouldn't hurt..

I will add some a comment about this :) That's a good point about the
naming, maybe "bytes_submitted" and "bytes_unsubmitted" is a lot less
confusing than "bytes_pending" and "bytes_accounted".

Thinking about this some more, bytes_unsubmitted sounds even more
confusing, so maybe bytes_nonsubmitted or bytes_not_submitted. I'll
think about this some more but kept it as pending/accounted for now.

bytes_submitted sounds better than pending to me, not sure about
unsubmitted or whatever. As long as there's a sentence or two that
explains what accounted means in the end helper, though, that seems
reasonable enough to me.

Brian

The fix for this bug is here [1].

Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/20251024215008.3844068-1-joannelkoong@gmail.com/ (local)

quoted

Thanks,
Joanne

quoted

I'll fix up both. Thanks for catching this and bisecting it down to
this patch. Sorry for the trouble.

No prob. Thanks for the fix!

Brian

quoted

Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/20251009225611.3744728-4-joannelkoong@gmail.com/ (local)

quoted

Thanks,
Joanne

quoted

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help