Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
From: Jeff Layton <hidden>
Date: 2017-03-30 11:24:00
Also in:
linux-btrfs, linux-fsdevel, linux-nfs, linux-xfs, lkml
On Thu, 2017-03-30 at 10:41 +1100, Dave Chinner wrote:
On Wed, Mar 29, 2017 at 01:54:31PM -0400, Jeff Layton wrote:quoted
On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote:quoted
On Tue 21-03-17 14:46:53, Jeff Layton wrote:quoted
On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:quoted
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:quoted
On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:quoted
- It's durable; the above comparison still works if there were reboots between the two i_version checks. - I don't know how realistic this is--we may need to figure out if there's a weaker guarantee that's still useful. Do filesystems actually make ctime/mtime/i_version changes atomically with the changes that caused them? What if a change attribute is exposed to an NFS client but doesn't make it to disk, and then that value is reused after reboot?Yeah, there could be atomicity there. If we bump i_version, we'll mark the inode dirty and I think that will end up with the new i_version at least being journalled before __mark_inode_dirty returns.So you think the filesystem can provide the atomicity? In more detail:Sorry, I hit send too quickly. That should have read: "Yeah, there could be atomicity issues there." I think providing that level of atomicity may be difficult, though maybe there's some way to make the querying of i_version block until the inode update has been journalled?Just to complement what Dave said from ext4 side - similarly as with XFS ext4 doesn't guarantee atomicity unless fsync() has completed on the file. Until that you can see arbitrary combination of data & i_version after the crash. We do take care to keep data and metadata in sync only when there are security implications to that (like exposing uninitialized disk blocks) and if not, we are as lazy as we can to improve performance...Yeah, I think what we'll have to do here is ensure that those filesystems do an fsync prior to reporting the i_version getattr codepath. It's not pretty, but I don't see a real alternative.I think that's even more problematic. ->getattr currently runs completely unlocked for performance reasons - it's racy w.r.t. to ongoing modifications to begin with, so /nothing/ that is returned to userspace via stat/statx can be guaranteed to be "coherent". Linus will be very unhappy if you make his git workload (which is /very/ stat heavy) run slower by adding any sort of locking in this hot path. Even if we did put an fsync() into ->getattr() (and dealt with all the locking issues that entails), by the time the statx syscall returns to userspace the i_version value may not match the data/metadata in the inode(*). IOWs, by the time i_version gets to userspace, it is out of date and any use of it for data versioning from userspace is going to be prone to race conditions. Cheers, Dave. (*) fiemap has exactly the same "stale the moment internal fs locks are released" race conditions, which is why it cannot safely be used for mapping holes when copying file data....
FWIW, I'm not terribly worried about atomicity for all of the reasons you describe. My main concern is reusing an i_version value that has already been handed out when the inode is now in an entirely different state. To that end, I was only considering doing an fsync iff STATX_VERSION was requested. If it wasn't we wouldn't need to do one. A lot of ->getattr implementations already call filemap_write_and_wait, but you're correct that flushing out the metadata is different matter. It'd be nice to avoid that if we can. -- Jeff Layton [off-list ref]