Re: [PATCH 0/9] add ext4 per-inode DAX flag

From: Jan Kara <jack@suse.cz>
Date: 2017-09-08 09:48:59
Also in: linux-xfs, lkml, nvdimm

On Fri 08-09-17 09:25:43, Dave Chinner wrote:

On Thu, Sep 07, 2017 at 04:19:00PM -0600, Ross Zwisler wrote:

On Fri, Sep 08, 2017 at 08:12:01AM +1000, Dave Chinner wrote:

On Thu, Sep 07, 2017 at 03:51:48PM -0600, Ross Zwisler wrote:

On Thu, Sep 07, 2017 at 03:26:10PM -0600, Andreas Dilger wrote:

However, I wonder if this could
be prevented at runtime, and only allow S_DAX to be set when the inode is
first instantiated, and wouldn't be allowed to change after that?  Setting
or clearing the per-inode DAX flag might still be allowed, but it wouldn't
be enabled until the inode is next fetched into cache?  Similarly, for
inodes that have conflicting features (e.g. inline data or encryption)
would not be allowed to enable S_DAX.

Ooh, this seems interesting.  This would ensure that S_DAX transitions
couldn't ever race with I/Os or mmaps().  I had some other ideas for how to
handle this, but I think your idea is more promising. :)

IMO, that's an awful admin interface - it can't be done on demand
(i.e. when needed) because we can't force an inode to be evicted
from the cache. And then we have the "why the hell did that just
change" problem if an inode is evicted due to memory pressure and
then immediately reinstantiated by the running workload. That's a
recipe for driving admins insane...

I guess with this solution we'd need:

a) A good way of letting the user detect the state where they had set the DAX
inode flag, but that it wasn't yet in use by the inode.

b) A reliable way of flushing the inode from the filesystem cache, so that the
next time an open() happens they get the new behavior.  The way I usually do
this is via umount/remount, but there is probably already a way to do this?

Not if it's referenced. And if it's not referenced, then the only
hammer we have is Brutus^Wdrop_caches. That's not an option for
production machines.

Neat idea, but one I'd already thought of and discarded as "not
practical from an admin perspective".

Okay, so other ideas (which you have also probably already thought of) include:

1) Just return -EBUSY if anyone tries to change the DAX flag of an inode with
open mappings or any open file handles.

You have to have an open fd to change the flag. :)

Yeah, open file handles don't matter and we can serialize against IO in
progress, that's not a big deal. Established mappings are difficult to deal
with.

To prevent TOCTOU races we'd have to
do some additional locking while actually changing the flag.

I think that makes sense - the fundamental problem is that the
mappings are different between dax and non-dax, and that we can't
properly lock out page faults to prevent sending a racing
page fault down the wrong path.

2) Be more drastic and follow the flow of ext4 file based encryption, only
allowing the inode flag to be set by an admin on an empty directory.  Files in
that directory will inherit it when they are created, and we don't provide a
way to clear.  If you want your file to not use DAX, move it to a different
directory (which I think for ext4 encryption turns it into a new inode).

Seems like the wrong model to me - moving application data files
is a PITA because you've also got to change the app config to point
at the new location...

Agreed.

Other ideas?

IMO, we need to fix the page fault path so we don't look at inode
flags to determine processing behaviour during the fault. Fault
processing as DAX or non-dax needs to be determined by the page
fault code and communicated to the fs via the vmf as the contents
of the vmf for a dax fault can be invalid for a non-dax fault. Fixing
that problem (i.e. make DAX a property of the mapping and
instantiate it from the inode only at mmap() time) means all the
page fault vs inode flag race problems go away and we have a model
that is much more robust if we want to expand it in future.

In fact, the real problem is only with the .page_mkwrite and .pfn_mkwrite
callbacks. For those, the setup of 'vmf' differs. For .fault or .huge_fault the
vmf is the same regardless of whether we do a DAX or non-DAX fault. But it seems
difficult to me to determine DAX / non-DAX fault in the vmf, since the locks
necessary to stabilize the S_DAX flag are acquired only in filesystem-specific
handlers (and the locks themselves are fs specific).

So the only way I see of dealing safely with these races is careful
checking in .page_mkwrite and .pfn_mkwrite after necessary locks are
obtained and bail out doing nothing if state is inconsistent. VM will retry
the fault and we'll get to the correct handler next time.
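
To illustrate the recheck-and-bail-out pattern, here is a minimal userspace
sketch in C. It is not kernel code: the pthread mutex merely stands in for
whatever fs-specific lock stabilizes S_DAX, and every name in it
(model_inode, model_fault, model_mkwrite and so on) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Userspace model only: every name here is hypothetical, not a kernel API. */
enum model_fault_result { MODEL_FAULT_DONE, MODEL_FAULT_RETRY };

struct model_inode {
	pthread_mutex_t lock;	/* stands in for the fs-specific lock */
	bool dax;		/* stands in for S_DAX */
};

#define MODEL_INODE_INIT(dax_on) \
	{ .lock = PTHREAD_MUTEX_INITIALIZER, .dax = (dax_on) }

/* What the fault path assumed when it chose a handler. */
struct model_fault {
	bool set_up_for_dax;
};

/*
 * Snapshot the flag. The lock is held only long enough to read it, so the
 * snapshot can be stale by the time the handler runs.
 */
struct model_fault model_fault_prepare(struct model_inode *inode)
{
	struct model_fault f;

	assert(inode != NULL);
	pthread_mutex_lock(&inode->lock);
	f.set_up_for_dax = inode->dax;
	pthread_mutex_unlock(&inode->lock);
	return f;
}

/* Flip the flag under the same lock the handler takes. */
void model_inode_set_dax(struct model_inode *inode, bool dax)
{
	pthread_mutex_lock(&inode->lock);
	inode->dax = dax;
	pthread_mutex_unlock(&inode->lock);
}

/*
 * Handler: take the lock, recheck the flag against what the fault was set
 * up for, and bail out without doing anything if the two disagree.
 */
enum model_fault_result model_mkwrite(struct model_inode *inode,
				      const struct model_fault *f)
{
	enum model_fault_result ret = MODEL_FAULT_DONE;

	pthread_mutex_lock(&inode->lock);
	if (inode->dax != f->set_up_for_dax)
		ret = MODEL_FAULT_RETRY;
	/* else: the real handler would do its work here, under the lock */
	pthread_mutex_unlock(&inode->lock);
	return ret;
}
```

A caller that gets MODEL_FAULT_RETRY prepares the fault again and so picks up
the current flag, which models the VM retrying the fault.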

But if we disallow any mappings when switching S_DAX flag, then all the
above is moot and there can be no races... We just have to be sure to block
new mappings of the file while switching the flag.
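
A rough userspace sketch of that alternative, again with hypothetical names
and a pthread mutex standing in for the real locking: the flag change is
refused with -EBUSY while any mapping exists, and because mmap takes the same
lock, no new mapping can appear in the middle of the switch.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Userspace model only: every name here is hypothetical, not a kernel API. */
struct model_file {
	pthread_mutex_t lock;		/* serializes mmap against flag changes */
	unsigned int nr_mappings;	/* established mappings of the file */
	bool dax;			/* stands in for S_DAX */
};

#define MODEL_FILE_INIT \
	{ .lock = PTHREAD_MUTEX_INITIALIZER, .nr_mappings = 0, .dax = false }

/* Establish a mapping; waits if a flag switch is in progress. */
int model_file_mmap(struct model_file *f)
{
	pthread_mutex_lock(&f->lock);
	f->nr_mappings++;
	pthread_mutex_unlock(&f->lock);
	return 0;
}

/* Tear down one mapping. */
void model_file_munmap(struct model_file *f)
{
	pthread_mutex_lock(&f->lock);
	assert(f->nr_mappings > 0);
	f->nr_mappings--;
	pthread_mutex_unlock(&f->lock);
}

/*
 * Switch the flag only if nothing is mapped. Holding the lock across the
 * check and the update is what keeps new mappings out during the switch.
 */
int model_file_set_dax(struct model_file *f, bool dax)
{
	int ret = 0;

	pthread_mutex_lock(&f->lock);
	if (f->nr_mappings > 0)
		ret = -EBUSY;
	else
		f->dax = dax;
	pthread_mutex_unlock(&f->lock);
	return ret;
}
```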

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
