Re: [RFC][PATCH] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA

From: Amir Goldstein <amir73il@gmail.com>
Date: 2019-06-03 06:17:33
Also in: linux-btrfs, linux-ext4, linux-fsdevel, linux-xfs

Actually, one of my use cases is "atomic rename" of files with
no data (looking for atomicity w.r.t xattr and mtime), so this "atomic rename"
thread should not be interfering with other workloads at all.

Which should already be guaranteed because a) rename is supposed to be
atomic, and b) metadata ordering requirements in journalled
filesystems. If they lose xattrs across rename, there's something
seriously wrong with the filesystem implementation.  I'm really not
sure what you think filesystems are actually doing with metadata
across rename operations....

Dave,

We are going in circles so much that my head is spinning.
I don't blame anyone for having a hard time keeping up with the plot, because
it spans many threads and subjects, so let me re-iterate:

- I *do* know that rename provides me the needed "metadata barrier"
  w.r.t. xattr on xfs/ext4 today.
- I *do* know the sync_file_range()+rename() callback provides the
  "data barrier" I need on xfs/ext4 today.
- I *do* use this internal fs knowledge in my applications
- I even fixed up sync_file_range() per your suggestion, so I won't need to use
  the FIEMAP_FLAG_SYNC hack
- An attempt from CrashMonkey developers to document this behavior was
  "shot down" for many justified reasons
- Without any documentation or an explicit API with a clean guarantee, users
  cannot write efficient applications without being aware of the filesystem
  underneath and follow that filesystem development to make sure behavior
  has not changed
- The most recent proposal I have made at LSF, based on Jan's suggestion, is
  to change nothing in filesystem implementation, but use a new *explicit* verb
  to communicate the expectation of the application, so that filesystems are
  free to change behavior in the future in the absence of the new verb
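
For readers following along, the sync_file_range()+rename() pattern from the list above can be sketched roughly as follows. This is only an illustration: the helper name replace_file() and the temporary-file convention are made up here, and the ordering it relies on is the undocumented present-day xfs/ext4 behavior discussed in this thread, not a guarantee.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace `path` with `len` bytes from `buf`,
 * staging the data in `tmp` first.
 *
 * ASSUMPTION: on present xfs/ext4, waiting for data writeback with
 * sync_file_range() and then calling rename() behaves as a "data barrier".
 * Nothing documents this; other filesystems may behave differently.
 *
 * Returns 0 on success, -1 on error (errno set by the failing call).
 */
static int replace_file(const char *tmp, const char *path,
                        const void *buf, size_t len)
{
    assert(tmp != NULL && path != NULL);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Write the whole buffer, tolerating short writes. */
    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0)
            goto fail;
        p += n;
        left -= (size_t)n;
    }

    /*
     * Start writeback of the dirty pages and wait for it to finish.
     * offset 0 with nbytes 0 means "through end of file". Unlike fsync(),
     * this does not flush the journal or the device cache.
     */
    if (sync_file_range(fd, 0, 0,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) < 0)
        goto fail;

    if (close(fd) < 0) {
        unlink(tmp);
        return -1;
    }

    /* Expose the new contents under the final name. */
    if (rename(tmp, path) < 0) {
        unlink(tmp);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp);
    return -1;
}
```

A caller would use something like replace_file("conf.tmp", "conf", data, size), with both names on the same filesystem so that rename() does not fail with EXDEV.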

Once again, ATOMIC_METADATA is a noop in present xfs/ext4.
ATOMIC_DATA is sync_file_range() in present xfs/ext4.
The APIs I *need* from the kernel *do* exist, but the filesystem developers
(except xfs) are not willing to document the guarantee that the existing
interfaces provide in the present.
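
To make the mapping above concrete, here is a userspace sketch of what the two proposed flags would amount to on present xfs/ext4. The X_AT_ATOMIC_* constants, their values, and the link_atomic() wrapper are all hypothetical: the flags exist only in this RFC, so the sketch interprets them itself and never passes them to linkat().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * HYPOTHETICAL flag values. AT_ATOMIC_METADATA and AT_ATOMIC_DATA are only
 * proposed in this RFC and are not defined by any kernel header.
 */
#define X_AT_ATOMIC_METADATA 0x1
#define X_AT_ATOMIC_DATA     0x2

/*
 * Hypothetical wrapper emulating the proposed semantics as described for
 * present xfs/ext4:
 *   ATOMIC_METADATA -> nothing extra to do
 *   ATOMIC_DATA     -> sync_file_range() on the source before linking
 * Returns 0 on success, -1 on error with errno set.
 */
static int link_atomic(const char *oldpath, const char *newpath, int flags)
{
    assert(oldpath != NULL && newpath != NULL);

    if (flags & ~(X_AT_ATOMIC_METADATA | X_AT_ATOMIC_DATA)) {
        errno = EINVAL;
        return -1;
    }

    if (flags & X_AT_ATOMIC_DATA) {
        int fd = open(oldpath, O_RDONLY);
        if (fd < 0)
            return -1;

        /* Write out and wait for the file's data; 0/0 means whole file. */
        int rc = sync_file_range(fd, 0, 0,
                                 SYNC_FILE_RANGE_WAIT_BEFORE |
                                 SYNC_FILE_RANGE_WRITE |
                                 SYNC_FILE_RANGE_WAIT_AFTER);
        int saved = errno;
        close(fd);
        if (rc < 0) {
            errno = saved;
            return -1;
        }
    }

    /* X_AT_ATOMIC_METADATA: no additional work on present xfs/ext4. */

    return linkat(AT_FDCWD, oldpath, AT_FDCWD, newpath, 0);
}
```

The point of an explicit flag is that the application states its expectation in the call itself. On a filesystem where the implicit ordering does not hold, the body of such a wrapper (or the kernel implementation) would have to do more work, while callers stay unchanged.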

[...]

So, in the interests of /informed debate/, please implement what you
want using batched AIO_FSYNC + rename/linkat completion callback and
measure what it achieves. Then implement a sync_file_range/linkat
thread pool that provides the same functionality to the application
(i.e. writeback concurrency in userspace) and measure it. Then we
can discuss what the relative overhead is with numbers and can
perform analysis to determine what the cause of the performance
differential actually is.

Fair enough.

Neither of these things requires kernel modifications, but you need
to provide the evidence that existing APIs are insufficient.

APIs are sufficient if I know which filesystem I am running on.
btrfs needs a different set of syscalls to get the same thing done.

Indeed, we now have the new async io_uring stuff that can run async
sync_file_range calls, so you probably need to benchmark replacing
AIO_FSYNC with that interface as well. This new API likely does
exactly what you want without the journal/device cache flush
overhead of AIO_FSYNC....

Indeed, I am keeping a close watch on io_uring.

Thanks,
Amir.
