Thread (14 messages) 14 messages, 5 authors, 2015-02-23

Re: + ext4-add-dax-functionality.patch added to -mm tree

From: Dave Chinner <david@fromorbit.com>
Date: 2015-02-18 21:55:48
Also in: linux-xfs

On Wed, Feb 18, 2015 at 11:40:09AM +0100, Jan Kara wrote:
On Tue 17-02-15 08:37:45, Matthew Wilcox wrote:
quoted
On Tue, Feb 17, 2015 at 09:52:00AM +0100, Jan Kara wrote:
quoted
quoted
quoted
This got added to fix a problem that Dave Chinner pointed out.  We need
the allocated extent to either be zeroed (as ext2 does), or marked as
unwritten (ext4, XFS) so that a racing read/page fault doesn't return
uninitialized data.  If it's marked as unwritten, we need to convert it
to a written extent after we've initialised the contents.  We use the
b_end_io() callback to do this, and it's called from the DAX code, not in
softirq context.
  OK, I see. But I didn't find where ->b_end_io gets called from dax code
(specifically I don't see it anywhere in dax_do_IO() or dax_io()). Can you
point me please?
For faults, we call it in dax_insert_mapping(), the very last thing
before returning in the fault path.  The normal I/O path gets to use
the dio_iodone_t for the same purpose.
  I see. I didn't think of races with reads (hum, I actually wonder whether
we don't have this data exposure problem for ext4 for mmapped write into
a hole vs direct read as well). So I guess we do need those unwritten
extent dances after all (or we would need to have a page covering hole when
writing to it via mmap but I guess unwritten extent dances are somewhat
more standard).
Right, that was the reason for doing it that way - it leveraged all
the existing methods we have for avoiding data exposure races in
XFS. but it's also not just for races - it's for ensuring that if we
crash between the allocation and the write to the persistent store
we don't expose the underlying contents when the system next comes
up.

These problems were found long ago on XFS and that's one of the
reasons why all direct IO block allocation uses unwritten extents -
until the data is on disk there is a window for stale data exposure
and things like crashing systems running high throughput IO are
very good at exposing such small windows of exposure. Hence I
thought it best to have DAX close that hole for everyone.
So in that case you need ext4_get_block_write() in the above call to
do_dax_fault() (note that we still do need ext4 version of the fault
function like above due to lock ranking and retry on ENOSPC). And please
comment about the read races at that call site so that we have that
subtlety documented somewhere.
Actually, I thought it was obvious that  ;)
quoted
quoted
quoted
Also abusing b_end_io of a phony buffer for that looks ugly to me (we are
trying to get away from passing phony bh around and this would entangle us
even more into that mess). Normally I would think that end_io() callback
passed into dax_do_io() should perform necessary conversions and for
dax_fault() we could do necessary conversions inside foofs_page_mkwrite()...
Dave sees to be the one trying the hardest to get rid of the phony BHs
... and it was his idea to (ab)use b_end_io for this.  The problem with
doing the conversion in ext4_page_mkwrite() is that we don't know at
that point whether the BH is unwritten or not.
  I see, thanks for explanation. So it would be enough to pass a bit of
information (unwritten / written) to a caller of do_dax_fault() and that
can then call conversion function. IMO doing that (either in a return value
or in a separate argument of do_dax_fault()) would be cleaner and more
readable than playing with b_private and b_end_io... Thoughts?
I'm happy for a better mechanism to be thought up. The current one
was expedient, but not pretty. The need for the end_io function was
because unwritten conversion needed to happen after the page was
zeroed. If we change dax_fault() to also take a "end_io" function
callback (already takes a get_blocks callback), then we can avoid
the need to use the phony bh for this purpose. i.e filesystems that
allocate unwritten extents can pass a completion function

I've got to update the XFS DAX patches I have for the next merge
cycle, so I'll take another at it when I page that information back
into my brain.

And speaking of phony BHs, the pnfs block layout changes introduce
an struct iomap and a "map_blocks" method to the export_ops in
exportfs.h. This is the model what we should be using instead of
phony BHs for block mapping/allocation operations...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help