Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag

From: Dave Chinner <david@fromorbit.com>
Date: 2016-02-25 21:40:11

On Thu, Feb 25, 2016 at 01:08:28PM -0800, Phil Terry wrote:

On 02/25/2016 12:15 PM, Dave Chinner wrote:


On Thu, Feb 25, 2016 at 02:11:49PM -0500, Jeff Moyer wrote:


Jeff Moyer [off-list ref] writes:


The big issue we have right now is that we haven't made the DAX/pmem
infrastructure work correctly and reliably for general use.  Hence
adding new APIs to work around cases where we haven't yet provided
correct behaviour, let alone optimised for performance, is, quite
frankly, a clear case of premature optimisation.

Again, I see the two things as separate issues.  You need both.
Implementing MAP_SYNC doesn't mean we don't have to solve the bigger
issue of making existing applications work safely.

I want to add one more thing to this discussion, just for the sake of
clarity.  When I talk about existing applications and pmem, I mean
applications that already know how to detect and recover from torn
sectors.  Any application that assumes hardware does not tear sectors
should be run on a file system layered on top of the btt.

Which turns off DAX, and hence makes this a moot discussion because
mmap is then buffered through the page cache and hence applications
*must use msync/fsync* to provide data integrity. Which also makes
them safe to use with DAX if we have a working fsync.
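
To make the msync/fsync requirement concrete, here is a minimal userspace
sketch of a store through a shared mapping followed by an explicit
msync(). The helper name write_durably() and the behaviour of resizing
the file to exactly len bytes are assumptions made for illustration;
they are not taken from the thread or from any patch under discussion.

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy `len` bytes to the start of `path` through a
 * MAP_SHARED mapping, then call msync(MS_SYNC) so the data is written back
 * before returning. The file is resized to exactly `len` bytes.
 * Returns 0 on success, -1 on any error.
 */
int write_durably(const char *path, const void *data, size_t len)
{
    if (len == 0)
        return -1;              /* mmap() rejects zero-length mappings */

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(map, data, len);     /* the store alone gives no durability */

    /* Without this call the application has no integrity guarantee. */
    int rc = msync(map, len, MS_SYNC);

    munmap(map, len);
    close(fd);
    return rc;
}
```

An application that already follows this pattern on a page-cache-backed
mapping makes the same calls on a DAX mapping; whether those calls are
sufficient there depends on the kernel having a working fsync/msync for
DAX, which is the point being made above.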

Keep in mind that existing storage technologies tear filesystem data
writes, too, because user data writes are filesystem block sized and
not atomic at the device level (i.e.  typical is 512 byte sector, 4k
filesystem block size, so there are 7 points in a single write where
a tear can occur on a crash).
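
The count of 7 follows from simple arithmetic: a 4k block spans 8
sectors of 512 bytes, and a tear can fall on any of the 7 boundaries
between them. A small sketch of that calculation, where the function
name tear_points() is made up for illustration:

```c
#include <stddef.h>

/*
 * Number of internal sector boundaries in one filesystem block, i.e. the
 * places where a crash can leave earlier sectors new and later sectors old.
 * Returns 0 when the sizes are not usable (zero sector size, block smaller
 * than a sector, or block not a whole number of sectors).
 */
size_t tear_points(size_t block_size, size_t sector_size)
{
    if (sector_size == 0 || block_size < sector_size ||
        block_size % sector_size != 0)
        return 0;

    return block_size / sector_size - 1;
}
```

With block_size 4096 and sector_size 512 this yields 7; when the block
size equals the sector size it yields 0, which is the only case where a
user data write is atomic at the device level.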

Is that really true? Storage to date is on the PCIE/SATA etc IO
chain. The locks and application crash scenarios when traversing
down this chain are such that the device will not have its DMA
programmed until the whole 4K etc page is flushed to memory, pinned

Has nothing to do with DMA semantics. Storage devices we have to
deal with have volatile write caches, and we can't assume anything
about what they write when power fails except that single sector
writes are atomic.

In both cases, btt is not indirecting the buffer (as for a DMA
master IO type device) but is simply using the same pmem api
primitives to manage its own meta data about the filesystem writes
to detect and recover from tears after the event. In what sense is
DAX disabled for this?

BTT is, IIRC, using writeahead logging to stage every IO into pmem
so that after a crash the entire write can be recovered and replayed
to overwrite any torn sectors. This requires buffering at page cache
level, as direct writes to the pmem will not get logged. Hence DAX
cannot be used on BTT devices. Indeed:

static const struct block_device_operations btt_fops = {
        .owner =                THIS_MODULE,
        .rw_page =              btt_rw_page,
        .getgeo =               btt_getgeo,
        .revalidate_disk =      nvdimm_revalidate_disk,
};

There's no .direct_access method implemented for btt devices, so
it's clear that filesystems on BTT devices cannot enable DAX.

So I think (please correct me if I'm wrong) but actually the
hardware/firmware guys have been fixing the torn sector problem for

I was not talking about torn /sectors/. I was talking about a user
data write being made up of *multiple sectors*, and so there is no
atomicity guarantee for a user data write on existing storage when
the filesystem block size (user data IO size) is larger than the
device sector size. 
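
The distinction can be shown with a toy simulation. It assumes, as
stated earlier in this mail, that each single-sector write is atomic,
and models a crash partway through a multi-sector block write. All
names and the in-memory "media" buffer are hypothetical; this is not
how any real driver is structured.

```c
#include <stddef.h>
#include <string.h>

enum {
    SECTOR_SIZE = 512,
    BLOCK_SIZE = 4096,
    SECTORS_PER_BLOCK = BLOCK_SIZE / SECTOR_SIZE
};

/*
 * Simulate a block write interrupted by a crash: only the first
 * `sectors_done` sectors of `new_data` reach `media`. Each sector is
 * copied whole, so no individual sector is ever torn.
 */
void write_block_with_crash(unsigned char *media,
                            const unsigned char *new_data,
                            size_t sectors_done)
{
    if (sectors_done > SECTORS_PER_BLOCK)
        sectors_done = SECTORS_PER_BLOCK;
    memcpy(media, new_data, sectors_done * SECTOR_SIZE);
}

/*
 * A block is torn when it matches neither the complete old contents nor
 * the complete new contents, i.e. it is a mix of the two.
 */
int block_is_torn(const unsigned char *media,
                  const unsigned char *old_data,
                  const unsigned char *new_data)
{
    int matches_old = memcmp(media, old_data, BLOCK_SIZE) == 0;
    int matches_new = memcmp(media, new_data, BLOCK_SIZE) == 0;
    return !matches_old && !matches_new;
}
```

A crash after 0 or 8 sectors leaves the block intact (all old or all
new); a crash after 1 to 7 sectors leaves a torn block even though
every sector in it is individually consistent.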

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
