Re: Subtle races between DAX mmap fault and write path | linux-xfs

On Fri, Jul 29, 2016 at 05:53:07PM -0700, Dan Williams wrote:
On Fri, Jul 29, 2016 at 5:12 PM, Dave Chinner [off-list ref] wrote:
....
So what you are saying is that on an ADR machine, we have these
domains w.r.t. power fail:

cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media

|-------------volatile-------------------|-----persistent--------------|

because anything that gets to the IMC is guaranteed to be flushed to
stable media on power fail.

But on a posted-write-buffer system, we have this:

cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media

|-------------volatile-------------------------------------------|--persistent--|

IOWs, only things already posted to the media via REQ_FLUSH are
considered stable on persistent media.  What happens in this case
when power fails during a media update? Incomplete writes?
Yes, power failure during a media update will end up with incomplete
writes on an 8-byte boundary.
So we'd see that from the point of view of a torn single sector
write. Ok, so we better limit DAX to CRC enabled filesystems to
ensure these sorts of events are always caught by the filesystem.
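To make the torn-write case concrete, here is a minimal userspace sketch, not kernel code. It assumes a 512-byte sector, a write that stops on an 8-byte boundary as described above, and a CRC32C checksum kept separately from the data. The names `torn_write()` and `sector_is_valid()` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u
#define TEAR_UNIT   8u   /* assumed tear granularity, per the thread */

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t crc32c(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/*
 * Hypothetical model of power failing mid-update: only the first
 * `done` bytes of the new sector reach the media, the rest keeps
 * its old contents.
 */
void torn_write(uint8_t *media, const uint8_t *new_data, size_t done)
{
    assert(done <= SECTOR_SIZE && done % TEAR_UNIT == 0);
    memcpy(media, new_data, done);
}

/* Returns 1 if the sector matches the checksum recorded for it. */
int sector_is_valid(const uint8_t *media, uint32_t expected_crc)
{
    return crc32c(media, SECTOR_SIZE) == expected_crc;
}
```

Calling `torn_write()` with `done` smaller than `SECTOR_SIZE` leaves a sector that is partly new and partly old; `sector_is_valid()` then rejects it against the checksum of the new data, which is the kind of detection a CRC enabled filesystem relies on.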

Or have we somehow ended up with the fucked up situation where
dax_do_io() writes are (effectively) immediately persistent and
untracked by internal infrastructure, whilst mmap() writes
require internal dirty tracking and fsync() to flush caches via
writeback?
dax_do_io() writes are not immediately persistent.  They bypass the
cpu-cache and cpu-write-buffer and are ready to be flushed to media
by REQ_FLUSH or power-fail on an ADR system.
IOWs, on an ADR system a write is /effectively/ immediately persistent
because if power fails ADR guarantees it will be flushed to stable
media, while on a posted write system it is volatile and will be
lost. Right?
Right.
Thanks for the clarification.

If we track the dirty blocks from write in the radix tree like we
do for mmap, then we can just use a normal memcpy() in dax_do_io(),
getting rid of the slow cache bypass that is currently run. Radix
tree updates are much less expensive than a slow memcpy of large
amounts of data, and fsync can then take care of persistence, just
like we do for mmap.
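The idea above (plain memcpy() plus dirty tracking, with the flush deferred to fsync) can be sketched in userspace as follows. This is an illustration under stated assumptions, not the DAX implementation: a 64-bit bitmap stands in for the radix tree, the region is a fixed 4096 bytes, and `flush_line()` is a hypothetical placeholder for a clwb/clflush of one cache line that only counts calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHELINE   64u
#define REGION_SIZE 4096u
#define NLINES      (REGION_SIZE / CACHELINE)   /* 64 lines, one bit each */

struct pmem_region {
    uint8_t       data[REGION_SIZE];
    uint64_t      dirty;     /* bitmap standing in for the radix tree */
    unsigned long flushes;   /* number of per-line flushes issued */
};

/* Hypothetical stand-in for flushing one cache line to media. */
void flush_line(struct pmem_region *r, size_t line)
{
    (void)line;
    r->flushes++;
}

/* Write path: ordinary cached memcpy(), then mark the touched lines. */
void tracked_write(struct pmem_region *r, size_t off,
                   const void *src, size_t len)
{
    assert(len > 0 && off < REGION_SIZE && len <= REGION_SIZE - off);
    memcpy(r->data + off, src, len);

    size_t first = off / CACHELINE;
    size_t last = (off + len - 1) / CACHELINE;

    for (size_t line = first; line <= last; line++)
        r->dirty |= (uint64_t)1 << line;
}

/* fsync path: flush only the dirty lines, return how many there were. */
size_t tracked_sync(struct pmem_region *r)
{
    size_t flushed = 0;

    for (size_t line = 0; line < NLINES; line++) {
        if (r->dirty & ((uint64_t)1 << line)) {
            flush_line(r, line);
            flushed++;
        }
    }
    r->dirty = 0;
    return flushed;
}
```

The write path only pays for the memcpy() and a few bit operations; the cost of making data persistent moves entirely into `tracked_sync()`, which is the trade-off being proposed.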
If we go this route to increase the amount of dirty-data tracking in
the radix it raises the priority of one of the items on the backlog;
namely, determine the crossover point where wbinvd of the entire cache
is faster than a clflush / clwb loop.
Actually, I'd look at it from the other perspective - at what point
does fine-grained dirty tracking run faster than the brute force
flush? If the gains are only marginal, then we need to question
whether fine grained tracking is worth the complexity at all...
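The crossover question can be framed as a simple cost comparison. The sketch below assumes a linear model in which a per-line flush loop costs `per_line_cost` per dirty line and a whole-cache flush has a fixed cost `whole_cache_cost`. Both numbers are hypothetical inputs that would have to be measured on real hardware; the model also ignores side effects of a whole-cache flush, such as refilling the cache afterwards.

```c
#include <assert.h>
#include <stddef.h>

enum flush_strategy { FLUSH_PER_LINE, FLUSH_WHOLE_CACHE };

/*
 * Smallest number of dirty lines at which the per-line loop costs at
 * least as much as one whole-cache flush, under the linear model
 * described above.
 */
size_t crossover_lines(double per_line_cost, double whole_cache_cost)
{
    assert(per_line_cost > 0.0 && whole_cache_cost >= 0.0);

    size_t n = (size_t)(whole_cache_cost / per_line_cost);

    if ((double)n * per_line_cost < whole_cache_cost)
        n++;    /* round up to the first count that reaches the cost */
    return n;
}

/* Pick the cheaper strategy for a given amount of dirty data. */
enum flush_strategy choose_flush(size_t dirty_lines, double per_line_cost,
                                 double whole_cache_cost)
{
    if (dirty_lines >= crossover_lines(per_line_cost, whole_cache_cost))
        return FLUSH_WHOLE_CACHE;
    return FLUSH_PER_LINE;
}
```

If typical fsync calls sit well above the crossover returned here, fine-grained tracking buys little over the brute-force flush, which is the comparison being asked for.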

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
