Re: [RFC PATCH 0/8] xfs: single block atomic writes for buffered IO

From: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Date: 2025-11-20 10:37:47
Also in: linux-block, linux-ext4, linux-fsdevel, linux-mm, linux-xfs, lkml

On Tue, Nov 18, 2025 at 07:51:27AM +1100, Dave Chinner wrote:

On Mon, Nov 17, 2025 at 10:59:55AM +0000, John Garry wrote:

On 16/11/2025 08:11, Dave Chinner wrote:

This patch set focuses on HW accelerated single block atomic writes with
buffered IO, to get some early reviews on the core design.

What hardware acceleration? Hardware atomic writes do not make
IO faster; they only change IO failure semantics in certain corner
cases.

I think that he references using REQ_ATOMIC-based bio vs xfs software-based
atomic writes (which reuse the CoW infrastructure). And the former is
considerably faster from my testing (for DIO, obvs). But the latter has not
been optimized.

Hi Dave,
Thanks for the review and insights.

Going through the discussions in previous emails and this email, I
understand that there are 2 main points/approaches that you've
mentioned:

1. Using COW extents to track atomic ranges
  - Discussed inline below.

2. Using write-through for RWF_ATOMIC buffered-IO (Suggested in [1])
  - [1] https://lore.kernel.org/linux-ext4/aRmHRk7FGD4nCT0s@dread.disaster.area/
  - I will respond inline in the above thread.

For DIO, REQ_ATOMIC IO will generally be faster than the software
fallback because no page cache interactions or data copy is required
by the DIO REQ_ATOMIC fast path.

But we are considering buffered writes, which *must* do a data copy,
and so the behaviour and performance differential of doing a COW vs
trying to force writeback to do REQ_ATOMIC IO is going to be much
different.

Consider the way atomic buffered writes have been implemented
in writeback: turning off all folio and IO merging.  This means
writeback efficiency of atomic writes is going to be horrendous
compared to COW writes that don't use REQ_ATOMIC.

Yes, I agree that it is a bit of an overkill.

Further, REQ_ATOMIC buffered writes need to turn off delayed
allocation because if you can't allocate aligned extents then the
atomic write can *never* be performed. Hence we have to allocate up
front where we can return errors to userspace immediately, rather
than just reserve space and punt allocation to writeback. i.e. we
have to avoid the situation where we have dirty "atomic" data in the
page cache that cannot be written because physical allocation fails.
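
As a rough illustration of the up-front allocation argument, here is a
userspace toy model, not XFS code. The names toy_alloc_aligned(),
free_map and NBLOCKS are invented for this sketch, and it assumes the
atomic write needs a contiguous run of free blocks whose start is a
multiple of its length. The point is only that failure surfaces as an
error at write() time rather than at writeback:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 64	/* hypothetical size of the toy block device */

/*
 * Toy model: find 'count' contiguous free blocks whose start is a
 * multiple of 'count', mark them allocated and return the start block.
 * Returns -ENOSPC when no aligned run exists, so the caller can fail
 * the write immediately instead of leaving dirty "atomic" data in
 * cache that can never be written.
 */
static long toy_alloc_aligned(bool free_map[NBLOCKS], size_t count)
{
	if (count == 0 || count > NBLOCKS)
		return -EINVAL;

	/* Only aligned start positions are candidates. */
	for (size_t start = 0; start + count <= NBLOCKS; start += count) {
		bool all_free = true;

		for (size_t i = 0; i < count; i++) {
			if (!free_map[start + i]) {
				all_free = false;
				break;
			}
		}
		if (!all_free)
			continue;

		for (size_t i = 0; i < count; i++)
			free_map[start + i] = false;
		return (long)start;
	}
	return -ENOSPC;
}
```

With delayed allocation the equivalent of this call would only run at
writeback, where there is no syscall left to return -ENOSPC to.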

The likely outcome of turning off delalloc is that it further
degrades buffered atomic write writeback efficiency because it
removes the ability for the filesystem to optimise physical locality
of writeback IO. e.g. adjacent allocation across multiple small
files or packing of random writes in a single file to allow them to
merge at the block layer into one big IO...

REQ_ATOMIC is a natural fit for DIO because DIO is largely a "one
write syscall, one physical IO" style interface. Buffered writes,
OTOH, completely decouple application IO from physical IO, and so
there is no real "atomic" connection between the data being written
into the page cache and the physical IO that is performed at some
time later.

This decoupling of physical IO is what brings all the problems and
inefficiencies. The filesystem being able to mark the RWF_ATOMIC
write range as a COW range at submission time creates a natural
"atomic IO" behaviour without requiring the page cache or writeback
to even care that the data needs to be written atomically.

From there, we optimise the COW IO path to record that
the new COW extent was created for the purpose of an atomic write.
Then when we go to write back data over that extent, the filesystem
can choose to do a REQ_ATOMIC write to do an atomic overwrite instead
of allocating a new extent and swapping the BMBT extent pointers at
IO completion time.

We really don't care if 4x16kB adjacent RWF_ATOMIC writes are
submitted as 1x64kB REQ_ATOMIC IO or 4 individual 16kB REQ_ATOMIC
IOs. The former is much more efficient from an IO perspective, and
the COW path can actually optimise for this because it can track the
atomic write ranges in cache exactly. If the range is larger than
what REQ_ATOMIC can handle (or is unaligned), we use COW writeback to
optimise for maximum writeback bandwidth, otherwise we use
REQ_ATOMIC to optimise for minimum writeback submission and
completion overhead...
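
To make the 4x16kB example concrete, here is a small userspace sketch
(not kernel code). struct toy_range, toy_merge_adjacent(),
toy_pick_path() and the limit values are all made up for illustration.
The eligibility rule assumed here is a power-of-two length within the
device's atomic write limits and an offset naturally aligned to that
length:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum toy_path { TOY_REQ_ATOMIC, TOY_COW };

struct toy_range {
	uint64_t off;	/* byte offset in the file */
	uint64_t len;	/* byte length */
};

/*
 * Merge tracked ranges that are sorted by offset and exactly adjacent.
 * Works in place and returns the new number of ranges.
 */
static size_t toy_merge_adjacent(struct toy_range *r, size_t n)
{
	size_t out = 0;

	for (size_t i = 0; i < n; i++) {
		if (out > 0 && r[out - 1].off + r[out - 1].len == r[i].off)
			r[out - 1].len += r[i].len;
		else
			r[out++] = r[i];
	}
	return out;
}

/* Pick REQ_ATOMIC when the range fits the assumed limits, else COW. */
static enum toy_path toy_pick_path(const struct toy_range *r,
				   uint64_t unit_min, uint64_t unit_max)
{
	bool pow2 = r->len && !(r->len & (r->len - 1));

	if (!pow2 || r->len < unit_min || r->len > unit_max)
		return TOY_COW;		/* too small, too large or odd-sized */
	if (r->off & (r->len - 1))
		return TOY_COW;		/* not naturally aligned */
	return TOY_REQ_ATOMIC;
}
```

Under these assumptions, four adjacent 16kB ranges starting at offset 0
merge into one 64kB range that goes out as a single REQ_ATOMIC IO when
the device maximum is 64kB, and falls back to COW writeback when the
maximum is 32kB or the merged range starts at an unaligned offset.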

Okay IIUC, you are suggesting that, instead of tracking the atomic
ranges in page cache and ifs, let's move that to the filesystem, for
example in XFS we can:

1. In write iomap_begin path, for RWF_ATOMIC, create a COW extent and
mark it as atomic. 

2. Carry on with the memcpy to folio and finish the write path.

3. During writeback, XFS can detect that there is a COW atomic
extent. It can then:
  3.1 See that it is an overwrite that can be done with REQ_ATOMIC
      directly.
  3.2 Else, finish the atomic IO in a software-emulated way, just like
      we do for direct IO currently.
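
Roughly, the flow I have in mind looks like the userspace model below.
All names (toy_cow_extent, TOY_EXT_ATOMIC, toy_write_begin(),
toy_writeback()) are hypothetical and none of this is an existing XFS
or iomap interface. The hw_can_overwrite argument stands in for
whatever check decides that the range fits the REQ_ATOMIC limits:

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_EXT_ATOMIC	(1u << 0)	/* hypothetical "created for RWF_ATOMIC" flag */

enum toy_wb_action {
	TOY_WB_PLAIN,		/* ordinary writeback, no atomicity requested */
	TOY_WB_REQ_ATOMIC,	/* step 3.1: direct overwrite with REQ_ATOMIC */
	TOY_WB_COW_REMAP,	/* step 3.2: write new blocks, remap at completion */
};

struct toy_cow_extent {
	uint64_t off;
	uint64_t len;
	unsigned int flags;
};

/* Step 1: the write path records a COW extent and tags it as atomic. */
static struct toy_cow_extent toy_write_begin(uint64_t off, uint64_t len,
					     bool rwf_atomic)
{
	struct toy_cow_extent ext = { .off = off, .len = len, .flags = 0 };

	if (rwf_atomic)
		ext.flags |= TOY_EXT_ATOMIC;
	return ext;
}

/* Step 3: writeback consults only the extent record, not the page cache. */
static enum toy_wb_action toy_writeback(const struct toy_cow_extent *ext,
					bool hw_can_overwrite)
{
	if (!(ext->flags & TOY_EXT_ATOMIC))
		return TOY_WB_PLAIN;
	return hw_can_overwrite ? TOY_WB_REQ_ATOMIC : TOY_WB_COW_REMAP;
}
```

Step 2 (the memcpy into the folio) is deliberately absent from the
model, since the page cache would not need to know about the flag.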

I believe the above example with XFS can also be extended to a FS like
ext4 without needing a COW range, as long as we can ensure that we always
meet the conditions for REQ_ATOMIC during writeback (for example, by using
bigalloc for aligned extents and being careful not to cross the atomic
write limits).

IOWs, I think that for XFS (and other COW-capable filesystems) we
should be looking at optimising the COW IO path to use REQ_ATOMIC
where appropriate to create a direct overwrite fast path for
RWF_ATOMIC buffered writes. This seems more natural and a lot less
intrusive than trying to blast through the page cache abstractions
to directly couple userspace IO boundaries to physical writeback IO
boundaries...

I agree that this approach avoids bloating the page cache and ifs layers
with RWF_ATOMIC implementation details. That being said, the task of
managing the atomic ranges is now pushed down to the FS and is no longer
generic, which might introduce friction in onboarding new FSes in the
future. Regardless, from the discussion, I believe at this point we are
okay to make that trade-off.

Let me take some time to look into the XFS COW paths and try to implement
this approach. Thanks for the suggestion!

Regards,
ojaswin

-Dave.
-- 
Dave Chinner
david@fromorbit.com
