Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag
From: John Garry <john.g.garry@oracle.com>
Date: 2023-10-24 12:35:49
Also in:
lkml
On 09/10/2023 22:05, Darrick J. Wong wrote:
quoted
quoted
If the file range is a sparse hole, the directio setup will allocate space and create an unwritten mapping before issuing the write bio. The rest of the process works the same as preallocations and has the same behaviors. If the file range is allocated and was previously written, the write is issued and that's all that's needed from the fs. After a crash, reads of the storage device produce the old contents or the new contents.This is exactly what I explained when reviewing the code that rejected RWF_ATOMIC without O_DSYNC on metadata dirty inodes.I'm glad we agree. 😄 John, when you're back from vacation, can we get rid of this language and all those checks under _is_dsync() in the iomap patch? (That code is 100% the result of me handwaving and bellyaching 6 months ago when the team was trying to get all the atomic writes bits working prior to LSF and I was too burned out to think the xfs part through. As a result, I decided that we'd only support strict overwrites for the first iteration.)
So this following additive code in iomap_dio_bio_iter() should be dropped: ----8<-----
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c@@ -275,10 +275,11 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
struct iomap_dio *dio)
{
...
@@ -292,6 +293,13 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
!bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
return -EINVAL;
+ if (atomic_write && !iocb_is_dsync(dio->iocb)) {
+ if (iomap->flags & IOMAP_F_DIRTY)
+ return -EIO;
+ if (iomap->type != IOMAP_MAPPED)
+ return -EIO;
+ }
+
---->8-----
ok?
quoted
quoted
Summarizing: An (ATOMIC|SYNC) request provides the strongest guarantees (data will not be torn, and all file metadata updates are persisted before the write is returned to userspace. Programs see either the old data or the new data, even if there's a crash. (ATOMIC|DSYNC) is less strong -- data will not be torn, and any file updates for just that region are persisted before the write is returned. (ATOMIC) is the least strong -- data will not be torn. Neither the filesystem nor the device make guarantees that anything ended up on stable storage, but if it does, programs see either the old data or the new data.Yup, that makes sense to me.Perhaps this ^^ is what we should be documenting here.quoted
quoted
Maybe we should rename the whole UAPI s/atomic/untorn/...Perhaps, though "torn writes" is nomenclature that nobody outside storage and filesystem developers really knows about. All I ever hear from userspace developers is "we want atomic/all-or-nothing data writes"...Fair 'enuf.
Thanks, John