Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag

From: "Darrick J. Wong" <djwong@kernel.org>
Date: 2023-10-09 21:05:40
Also in: lkml

On Tue, Oct 10, 2023 at 07:39:17AM +1100, Dave Chinner wrote:

On Mon, Oct 09, 2023 at 10:44:38AM -0700, Darrick J. Wong wrote:

quoted

On Fri, Sep 29, 2023 at 09:37:15AM +0000, John Garry wrote:

quoted

From: Himanshu Madhani <redacted>

Add RWF_ATOMIC flag description for pwritev2().

Signed-off-by: Himanshu Madhani <redacted>
#jpg: complete rewrite
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 man2/readv.2 | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

....

quoted

+For when regular files are opened with
+.BR open (2)
+but without
+.B O_SYNC
+or
+.B O_DSYNC
+and the
+.BR pwritev2()
+call is made without
+.B RWF_SYNC
+or
+.BR RWF_DSYNC
+set, the range metadata must already be flushed to storage and the data range
+must not be in unwritten state, shared, a preallocation, or a hole.

I think that we can drop all of these flags requirements, since the
contiguous small space allocation requirement means that the fs can
provide all-or-nothing writes even if metadata updates are needed:

If the file range is allocated and marked unwritten (i.e. a
preallocation), the ioend will clear the unwritten bit from the file
mapping atomically.  After a crash, the application sees either zeroes
or all the data that was written.

If the file range is shared, the ioend will map the COW staging extent
into the file atomically.  After a crash, the application sees either
the old contents from the old blocks, or the new contents from the new
blocks.

If the file range is a sparse hole, the directio setup will allocate
space and create an unwritten mapping before issuing the write bio.  The
rest of the process works the same as preallocations and has the same
behaviors.

If the file range is allocated and was previously written, the write is
issued and that's all that's needed from the fs.  After a crash, reads
of the storage device produce the old contents or the new contents.

This is exactly what I explained when reviewing the code that
rejected RWF_ATOMIC without O_DSYNC on metadata dirty inodes.

I'm glad we agree. :)

John, when you're back from vacation, can we get rid of this language
and all those checks under _is_dsync() in the iomap patch?

(That code is 100% the result of me handwaving and bellyaching 6 months
ago when the team was trying to get all the atomic writes bits working
prior to LSF and I was too burned out to think the xfs part through.
As a result, I decided that we'd only support strict overwrites for the
first iteration.)

quoted

Summarizing:

An (ATOMIC|SYNC) request provides the strongest guarantees (data
will not be torn, and all file metadata updates are persisted before
the write is returned to userspace.  Programs see either the old data or
the new data, even if there's a crash.

(ATOMIC|DSYNC) is less strong -- data will not be torn, and any file
updates for just that region are persisted before the write is returned.

(ATOMIC) is the least strong -- data will not be torn.  Neither the
filesystem nor the device make guarantees that anything ended up on
stable storage, but if it does, programs see either the old data or the
new data.

Yup, that makes sense to me.

Perhaps this ^^ is what we should be documenting here.

quoted

Maybe we should rename the whole UAPI s/atomic/untorn/...

Perhaps, though "torn writes" is nomenclature that nobody outside
storage and filesystem developers really knows about. All I ever
hear from userspace developers is "we want atomic/all-or-nothing
data writes"...

Fair 'enuf.

--D

-Dave.
-- 
Dave Chinner
david@fromorbit.com

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help