Re: [PATCH v6 4/7] dax: add support for fsync/msync | linux-ext4

quoted

On Tue, Jan 05, 2016 at 09:20:47AM -0800, Dan Williams wrote:
On Tue, Jan 5, 2016 at 9:12 AM, Ross Zwisler
[off-list ref] wrote:
On Tue, Jan 05, 2016 at 12:13:58PM +0100, Jan Kara wrote:
On Wed 23-12-15 12:39:17, Ross Zwisler wrote:
To properly handle fsync/msync in an efficient way DAX needs to track dirty
pages so it is able to flush them durably to media on demand.

The tracking of dirty pages is done via the radix tree in struct
address_space.  This radix tree is already used by the page writeback
infrastructure for tracking dirty pages associated with an open file, and
it already has support for exceptional (non struct page*) entries.  We
build upon these features to add exceptional entries to the radix tree for
DAX dirty PMD or PTE pages at fault time.

Signed-off-by: Ross Zwisler <redacted>
...
+static int dax_writeback_one(struct block_device *bdev,
+           struct address_space *mapping, pgoff_t index, void *entry)
+{
+   struct radix_tree_root *page_tree = &mapping->page_tree;
+   int type = RADIX_DAX_TYPE(entry);
+   struct radix_tree_node *node;
+   struct blk_dax_ctl dax;
+   void **slot;
+   int ret = 0;
+
+   spin_lock_irq(&mapping->tree_lock);
+   /*
+    * Regular page slots are stabilized by the page lock even
+    * without the tree itself locked.  These unlocked entries
+    * need verification under the tree lock.
+    */
+   if (!__radix_tree_lookup(page_tree, index, &node, &slot))
+           goto unlock;
+   if (*slot != entry)
+           goto unlock;
+
+   /* another fsync thread may have already written back this entry */
+   if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+           goto unlock;
+
+   radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
+
+   if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
+           ret = -EIO;
+           goto unlock;
+   }
+
+   dax.sector = RADIX_DAX_SECTOR(entry);
+   dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
+   spin_unlock_irq(&mapping->tree_lock);
+
+   /*
+    * We cannot hold tree_lock while calling dax_map_atomic() because it
+    * eventually calls cond_resched().
+    */
+   ret = dax_map_atomic(bdev, &dax);
+   if (ret < 0)
+           return ret;
+
+   if (WARN_ON_ONCE(ret < dax.size)) {
+           ret = -EIO;
+           dax_unmap_atomic(bdev, &dax);
+           return ret;
+   }
+
+   spin_lock_irq(&mapping->tree_lock);
+   /*
+    * We need to revalidate our radix entry while holding tree_lock
+    * before we do the writeback.
+    */
Do we really need to revalidate here? dax_map_atomic() makes sure the addr
& size is still part of the device. I guess you are concerned that due to
truncate or similar operation those sectors needn't belong to the same file
anymore but we don't really care about flushing sectors for someone else,
do we?

Otherwise the patch looks good to me.
Yep, the concern is that we could have somehow raced against a truncate
operation while we weren't holding the tree_lock, and that now the address we
are about to flush belongs to another file or is unallocated by the
filesystem.

I agree that this should be non-destructive - if you think the additional
check and locking isn't worth the overhead, I'm happy to take it out.  I don't
have a strong opinion either way.
My concern is whether flushing potentially invalid virtual addresses
is problematic on some architectures.  Maybe it's just FUD, but it's
less work in my opinion to just revalidate the address versus auditing
each arch for this concern.
I don't think that the addresses have the potential of being invalid from the
driver's point of view - we are still holding a reference on the block queue
via dax_map_atomic(), so we should be protected against races vs block device
removal.  I think the only question is whether it is okay to flush an address
that we know to be valid from the block device's point of view, but which the
filesystem may have truncated from being allocated to our inode.

Does that all make sense?

At a minimum we can change the comment to not say "We need to" and
instead say "TODO: are all archs ok with flushing potentially invalid
addresses?"
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help