Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag
From: "Darrick J. Wong" <djwong@kernel.org>
Date: 2023-10-09 21:05:40
Also in:
lkml
On Tue, Oct 10, 2023 at 07:39:17AM +1100, Dave Chinner wrote:
On Mon, Oct 09, 2023 at 10:44:38AM -0700, Darrick J. Wong wrote:quoted
On Fri, Sep 29, 2023 at 09:37:15AM +0000, John Garry wrote:quoted
From: Himanshu Madhani <redacted> Add RWF_ATOMIC flag description for pwritev2(). Signed-off-by: Himanshu Madhani <redacted> #jpg: complete rewrite Signed-off-by: John Garry <john.g.garry@oracle.com> --- man2/readv.2 | 45 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+)....quoted
quoted
+For when regular files are opened with +.BR open (2) +but without +.B O_SYNC +or +.B O_DSYNC +and the +.BR pwritev2() +call is made without +.B RWF_SYNC +or +.BR RWF_DSYNC +set, the range metadata must already be flushed to storage and the data range +must not be in unwritten state, shared, a preallocation, or a hole.I think that we can drop all of these flags requirements, since the contiguous small space allocation requirement means that the fs can provide all-or-nothing writes even if metadata updates are needed: If the file range is allocated and marked unwritten (i.e. a preallocation), the ioend will clear the unwritten bit from the file mapping atomically. After a crash, the application sees either zeroes or all the data that was written. If the file range is shared, the ioend will map the COW staging extent into the file atomically. After a crash, the application sees either the old contents from the old blocks, or the new contents from the new blocks. If the file range is a sparse hole, the directio setup will allocate space and create an unwritten mapping before issuing the write bio. The rest of the process works the same as preallocations and has the same behaviors. If the file range is allocated and was previously written, the write is issued and that's all that's needed from the fs. After a crash, reads of the storage device produce the old contents or the new contents.This is exactly what I explained when reviewing the code that rejected RWF_ATOMIC without O_DSYNC on metadata dirty inodes.
I'm glad we agree. :) John, when you're back from vacation, can we get rid of this language and all those checks under _is_dsync() in the iomap patch? (That code is 100% the result of me handwaving and bellyaching 6 months ago when the team was trying to get all the atomic writes bits working prior to LSF and I was too burned out to think the xfs part through. As a result, I decided that we'd only support strict overwrites for the first iteration.)
quoted
Summarizing: An (ATOMIC|SYNC) request provides the strongest guarantees (data will not be torn, and all file metadata updates are persisted before the write is returned to userspace. Programs see either the old data or the new data, even if there's a crash. (ATOMIC|DSYNC) is less strong -- data will not be torn, and any file updates for just that region are persisted before the write is returned. (ATOMIC) is the least strong -- data will not be torn. Neither the filesystem nor the device make guarantees that anything ended up on stable storage, but if it does, programs see either the old data or the new data.Yup, that makes sense to me.
Perhaps this ^^ is what we should be documenting here.
quoted
Maybe we should rename the whole UAPI s/atomic/untorn/...Perhaps, though "torn writes" is nomenclature that nobody outside storage and filesystem developers really knows about. All I ever hear from userspace developers is "we want atomic/all-or-nothing data writes"...
Fair 'enuf. --D
-Dave. -- Dave Chinner david@fromorbit.com