Thread (87 messages) 87 messages, 8 authors, 2017-10-30

Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

From: Jeff Layton <hidden>
Date: 2017-03-30 11:24:00
Also in: linux-btrfs, linux-fsdevel, linux-nfs, linux-xfs, lkml

On Thu, 2017-03-30 at 10:41 +1100, Dave Chinner wrote:
On Wed, Mar 29, 2017 at 01:54:31PM -0400, Jeff Layton wrote:
quoted
On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote:
quoted
On Tue 21-03-17 14:46:53, Jeff Layton wrote:
quoted
On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
quoted
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
quoted
On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
quoted
- It's durable; the above comparison still works if there were reboots
  between the two i_version checks.
	- I don't know how realistic this is--we may need to figure out
	  if there's a weaker guarantee that's still useful.  Do
	  filesystems actually make ctime/mtime/i_version changes
	  atomically with the changes that caused them?  What if a
	  change attribute is exposed to an NFS client but doesn't make
	  it to disk, and then that value is reused after reboot?
Yeah, there could be atomicity there. If we bump i_version, we'll mark
the inode dirty and I think that will end up with the new i_version at
least being journalled before __mark_inode_dirty returns.
So you think the filesystem can provide the atomicity?  In more detail:
Sorry, I hit send too quickly. That should have read:

"Yeah, there could be atomicity issues there."

I think providing that level of atomicity may be difficult, though
maybe there's some way to make the querying of i_version block until
the inode update has been journalled?
Just to complement what Dave said from ext4 side - similarly as with XFS
ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
Until that you can see arbitrary combination of data & i_version after the
crash. We do take care to keep data and metadata in sync only when there
are security implications to that (like exposing uninitialized disk blocks)
and if not, we are as lazy as we can to improve performance...
Yeah, I think what we'll have to do here is ensure that those
filesystems do an fsync prior to reporting the i_version getattr
codepath. It's not pretty, but I don't see a real alternative.
I think that's even more problematic. ->getattr currently runs
completely unlocked for performance reasons - it's racy w.r.t. to
ongoing modifications to begin with, so /nothing/ that is returned
to userspace via stat/statx can be guaranteed to be "coherent".
Linus will be very unhappy if you make his git workload (which is
/very/ stat heavy) run slower by adding any sort of locking in this
hot path.

Even if we did put an fsync() into ->getattr() (and dealt with all
the locking issues that entails), by the time the statx syscall
returns to userspace the i_version value may not match the
data/metadata in the inode(*).  IOWs, by the time i_version gets
to userspace, it is out of date and any use of it for data
versioning from userspace is going to be prone to race conditions.

Cheers,

Dave.

(*) fiemap has exactly the same "stale the moment internal fs
locks are released" race conditions, which is why it cannot safely
be used for mapping holes when copying file data....
FWIW, I'm not terribly worried about atomicity for all of the reasons
you describe. My main concern is reusing an i_version value that has
already been handed out when the inode is now in an entirely different
state. To that end, I was only considering doing an fsync iff
STATX_VERSION was requested. If it wasn't we wouldn't need to do one.

A lot of ->getattr implementations already call filemap_write_and_wait,
but you're correct that flushing out the metadata is different matter. 
It'd be nice to avoid that if we can.

-- 
Jeff Layton [off-list ref]
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help