Re: [RFC][PATCH] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA

From: Amir Goldstein <amir73il@gmail.com>
Date: 2019-06-01 07:22:04
Also in: linux-btrfs, linux-ext4, linux-fsdevel, linux-xfs

On Sat, Jun 1, 2019 at 1:46 AM Dave Chinner [off-list ref] wrote:

On Fri, May 31, 2019 at 12:41:36PM -0400, Theodore Ts'o wrote:

quoted

On Fri, May 31, 2019 at 06:21:45PM +0300, Amir Goldstein wrote:

quoted

What do you think of:

"AT_ATOMIC_DATA (since Linux 5.x)
A filesystem which accepts this flag will guarantee that if the linked file
name exists after a system crash, then all of the data written to the file
and all of the file's metadata at the time of the linkat(2) call will be
visible.

".... will be visible after the the file system is remounted".  (Never
hurts to be explicit.)

quoted

The way to achieve this guarantee on old kernels is to call fsync (2)
before linking the file, but doing so will also results in flushing of
volatile disk caches.

A filesystem which accepts this flag does NOT
guarantee that any of the file hardlinks will exist after a system crash,
nor that the last observed value of st_nlink (see stat (2)) will persist."

This is I think more precise:

    This guarantee can be achieved by calling fsync(2) before linking
    the file, but there may be more performant ways to provide these
    semantics.  In particular, note that the use of the AT_ATOMIC_DATA
    flag does *not* guarantee that the new link created by linkat(2)
    will be persisted after a crash.

So here's the *implementation* problem I see with this definition of
AT_ATOMIC_DATA. After linkat(dirfd, name, AT_ATOMIC_DATA), there is
no guarantee that the data is on disk or that the link is present.

However:

        linkat(dirfd, name, AT_ATOMIC_DATA);
        fsync(dirfd);

Suddenly changes all that.

i.e. when we fsync(dirfd) we guarantee that "name" is present in the
directory and because we used AT_ATOMIC_DATA it implies that the
data pointed to by "name" must be present on disk. IOWs, what was
once a pure directory sync operation now *must* fsync all the child
inodes that have been linkat(AT_ATOMIC_DATA) since the last time the
direct has been made stable.

IOWs, the described AT_ATOMIC_DATA "we don't have to write the data
during linkat() go-fast-get-out-of-gaol-free" behaviour isn't worth
the pixels it is written on - it just moves all the complexity to
directory fsync, and that's /already/ a behavioural minefield.

Where does it say we don't have to write the data during linkat()?
I was only talking about avoid FLUSH/FUA caused by fsync().
I wrote in commit message:
"First implementation of AT_ATOMIC_DATA is expected to be
filemap_write_and_wait() for xfs/ext4 and probably fdatasync for btrfs."

I failed to convey the reasoning for this flag to you.
It is *not* about the latency of the "atomic link" for the calling thread
It is about not interfering with other workloads running at the same time.

IMO, the "work-around" of forcing filesystems to write back
destination inodes during a link() operation is just nasty and will
just end up with really weird performance anomalies occurring in
production systems. That's not really a solution, either, especially
as it is far, far faster for applications to use AIO_FSYNC and then
on the completion callback run a normal linkat() operation...

Hence, if I've understood these correctly, then I'll be recommending
that XFS follows this:

quoted

We should also document that a file system which does not implement
this flag MUST return EINVAL if it is passed this flag to linkat(2).

and returns -EINVAL to these flags because we do not have the change
tracking infrastructure to handle these directory fsync semantics.
I also suspect that, even if we could track this efficiently, we
can't do the flushing atomically because of locking order
constraints between directories, regular files, pages in the page
cache, etc.

That is not at all what I had in mind for XFS with the flag.

Given that we can already use AIO to provide this sort of ordering,
and AIO is vastly faster than synchronous IO, I don't see any point
in adding complex barrier interfaces that can be /easily implemented
in userspace/ using existing AIO primitives. You should start
thinking about expanding libaio with stuff like
"link_after_fdatasync()" and suddenly the whole problem of
filesystem data vs metadata ordering goes away because the
application directly controls all ordering without blocking and
doesn't need to care what the filesystem under it does....

OK. I can work with that. It's not that simple, but I will reply on
your next email, where you wrote more about this alternative.

Thanks,
Amir.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help