Re: [RFC][PATCH] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA
From: Amir Goldstein <amir73il@gmail.com>
Date: 2019-06-03 06:17:33
Also in:
linux-btrfs, linux-ext4, linux-fsdevel, linux-xfs
> > Actually, one of my use cases is "atomic rename" of files with no data (looking for atomicity w.r.t. xattr and mtime), so this "atomic rename" thread should not be interfering with other workloads at all.
>
> Which should already be guaranteed because a) rename is supposed to be atomic, and b) metadata ordering requirements in journalled filesystems. If they lose xattrs across rename, there's something seriously wrong with the filesystem implementation. I'm really not sure what you think filesystems are actually doing with metadata across rename operations....
Dave,

We are going in circles so much that my head is spinning. I don't blame anyone for having a hard time keeping up with the plot, because it spans many threads and subjects, so let me re-iterate:

- I *do* know that rename provides me the needed "metadata barrier" w.r.t. xattr on xfs/ext4 today.
- I *do* know the sync_file_range()+rename() callback provides the "data barrier" I need on xfs/ext4 today.
- I *do* use this internal fs knowledge in my applications.
- I even fixed up sync_file_range() per your suggestion, so I won't need to use the FIEMAP_FLAG_SYNC hack.
- An attempt from CrashMonkey developers to document this behavior was "shot down" for many justified reasons.
- Without any documentation or an explicit API with a clean guarantee, users cannot write efficient applications without being aware of the filesystem underneath and following that filesystem's development to make sure behavior has not changed.
- The most recent proposal I have made in LSF, based on Jan's suggestion, is to change nothing in filesystem implementation, but use a new *explicit* verb to communicate the expectation of the application, so that filesystems are free to change behavior in the future in the absence of the new verb.

Once again, ATOMIC_METADATA is a noop in present xfs/ext4. ATOMIC_DATA is sync_file_range() in present xfs/ext4. The APIs I *need* from the kernel *do* exist, but the filesystem developers (except xfs) are not willing to document the guarantee that the existing interfaces provide in the present.

[...]
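For readers following along, here is a minimal sketch of the sync_file_range()+rename() pattern referred to above. The helper name replace_file_ordered() and its parameters are hypothetical and exist only for illustration. The ordering it relies on is the present xfs/ext4 behavior discussed in this thread, not a documented guarantee, and the sketch does not make the rename itself durable.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not an existing API): write buf to tmp_path, start
 * and wait for writeback of the data with sync_file_range(), then rename
 * tmp_path over final_path.
 * Returns 0 on success, -1 on failure with errno set by the failing call.
 */
int replace_file_ordered(const char *tmp_path, const char *final_path,
                         const void *buf, size_t len)
{
    int saved;
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Write the whole buffer, handling short writes. */
    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += n;
        left -= (size_t)n;
    }

    /*
     * The "data barrier" step: offset 0 with nbytes 0 covers the file to
     * its end. Unlike fsync(), this does not flush the journal or the
     * device write cache.
     */
    if (sync_file_range(fd, 0, 0,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) < 0)
        goto fail;

    if (close(fd) < 0) {
        saved = errno;
        unlink(tmp_path);
        errno = saved;
        return -1;
    }

    /* The "metadata barrier" step: atomically replace the old name. */
    if (rename(tmp_path, final_path) < 0) {
        saved = errno;
        unlink(tmp_path);
        errno = saved;
        return -1;
    }
    return 0;

fail:
    saved = errno;
    close(fd);
    unlink(tmp_path);
    errno = saved;
    return -1;
}
```

A caller would use it as replace_file_ordered("foo.tmp", "foo", data, data_len), with both paths on the same filesystem. An application that cannot assume xfs/ext4 would have to fall back to fsync() before the rename, which is exactly the cost the proposed explicit verb is meant to avoid.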
> So, in the interests of /informed debate/, please implement what you want using batched AIO_FSYNC + rename/linkat completion callback and measure what it achieves. Then implement a sync_file_range/linkat thread pool that provides the same functionality to the application (i.e. writeback concurrency in userspace) and measure it. Then we can discuss what the relative overhead is with numbers and can perform analysis to determine what the cause of the performance differential actually is.
Fair enough.
> Neither of these things requires kernel modifications, but you need to provide the evidence that existing APIs are insufficient.
APIs are sufficient if I know which filesystem I am running on. btrfs needs a different set of syscalls to get the same thing done.
> Indeed, we now have the new async ioring stuff that can run async sync_file_range calls, so you probably need to benchmark replacing AIO_FSYNC with that interface as well. This new API likely does exactly what you want without the journal/device cache flush overhead of AIO_FSYNC....
Indeed, I am keeping a close watch on io_uring.

Thanks,
Amir.