Thread (33 messages) 33 messages, 5 authors, 2019-11-11

Re: [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag

From: Jan Kara <jack@suse.cz>
Date: 2019-11-08 13:46:10
Also in: linux-fsdevel, linux-xfs, lkml

On Fri 08-11-19 14:12:38, Jan Kara wrote:
On Mon 21-10-19 15:49:31, Ira Weiny wrote:
quoted
On Mon, Oct 21, 2019 at 11:45:36AM +1100, Dave Chinner wrote:
quoted
On Sun, Oct 20, 2019 at 08:59:35AM -0700, ira.weiny@intel.com wrote:
That, fundamentally, is the issue here - it's not setting/clearing
the DAX flag that is the issue, it's doing a swap of the
mapping->a_ops while there may be other code using that ops
structure.

IOWs, if there is any code anywhere in the kernel that
calls an address space op without holding one of the three locks we
hold here (i_rwsem, MMAPLOCK, ILOCK) then it can race with the swap
of the address space operations.

By limiting the address space swap to file sizes of zero, we rule
out the page fault path (mmap of a zero length file segv's with an
access beyond EOF on the first read/write page fault, right?).
Yes I checked that and thought we were safe here...
quoted
However, other aops callers that might run unlocked and do the wrong
thing if the aops pointer is swapped between check of the aop method
existing and actually calling it even if the file size is zero?

A quick look shows that FIBMAP (ioctl_fibmap())) looks susceptible
to such a race condition with the current definitions of the XFS DAX
aops. I'm guessing there will be others, but I haven't looked
further than this...
I'll check for others and think on what to do about this.  ext4 will have the
same problem I think.  :-(
Just as a datapoint, ext4 is bold and sets inode->i_mapping->a_ops on
existing inodes when switching journal data flag and so far it has not
blown up. What we did to deal with issues Dave describes is that we
introduced percpu rw-semaphore guarding switching of aops and then inside
problematic functions redirect callbacks in the right direction under this
semaphore. Somewhat ugly but it seems to work.
Thinking about this some more, perhaps this scheme could be actually
transformed in something workable. We could have a global (or maybe per-sb
but I'm not sure it's worth it) percpu rwsem and we could transform aops
calls into:

percpu_down_read(aops_rwsem);
do_call();
percpu_up_read(aops_rwsem);

With some macro magic it needn't be even that ugly.

								Honza
-- 
Jan Kara [off-list ref]
SUSE Labs, CR
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help