Re: Subtle races between DAX mmap fault and write path

From: Dan Williams <hidden>
Date: 2016-08-01 04:39:47
Also in: linux-fsdevel, linux-xfs, nvdimm

On Sun, Jul 31, 2016 at 9:07 PM, Dave Chinner [off-list ref] wrote:

On Sun, Jul 31, 2016 at 08:13:23PM -0700, Keith Packard wrote:

quoted

Dave Chinner [off-list ref] writes:

quoted

So we'd see that from the point of view of a torn single sector
write. Ok, so we better limit DAX to CRC enabled filesystems to
ensure these sorts of events are always caught by the filesystem.

Which is the same lack of guarantee that we already get on rotating
media. Flash media seems to work harder to provide sector atomicity; I
guess that's a feature?

No, that's not the case. Existing sector based storage guarantees
atomic sector writes - rotating or solid state. I haven't seen one a
corruption caused by a torn sector write in 15 years.  BTT layer was
written for pmem to provide this guarantee, but you can't use DAX
through that layer.

In XFS, the only place we really care about torn sector writes in
the journal - we have CRCs there to detect such problems and CRC
enabled filesystems will prevent recovery of checkpoints containing
torn writes. Non-CRC enable filesystems just warn and continue on
their merry way (compatibility reasons - older kernels don't emit
CRCs in log writes), so we really do need to restrict DAX to CRC
enabled filesystems.

quoted

The alternative is to hide metadata behind a translation layer, and
while that can be done for lame file systems,

No. These "lame" filesystems just need to use their existing
metadata buffering layer to hide this and the journalling subsystem
should protects against torn metadata writes.

quoted

I'd like to see the raw
hardware capabilities exposed and then make free software that
constructs a reliable system on top of that.

Not as easy as it sounds.  A couple of weeks ago I tried converting
the XFS journal to use DAX to avoid the IO layer and resultant
memcpy(), and I found out that there's a major problem with using
DAX for metadata access on existing filesystems: pmem is not
physically contiguous. I found that out the hard way - the ramdisk
is DAX capable, but each 4k page ithat is allocated to it has a
different memory address.

Filesystems are designed around the fact that the block address
space they are working with is contiguous - that sector N and sector
N+1 can be accessed in the same IO. This is not true for
direct access of pmem - while the sectors might be logically
contiguous, the physical memory that is directly accessed is not.
i.e. when we cross a page boundary, the next page could be on a
different node, on a different device (e.g.  RAID0), etc.
Traditional remapping devices (i.e. DM, md, etc) hide the physical
discontiguities from the filesystem - the present a contiguous LBA
and remap/split/combine under the covers where the filesystem is not
aware of it at all.

The reason this is important is that if the filesystem has metadata
constructs larger than a single page it can't use DAX to access them
as a single object because they may lay across a physical
discontiguity in the memory map.  Filesystems aren't usually exposed
to this - sectors of a block device are indexed by "Logical Block
Address" for good reason - the LBA address space is supposed to hide
the physical layout of the storage from the layer above the block
device.

OTOH, DAX directly exposes the physical layout to the filesytem.
And because it's DAX-based pmem and not cached struct pages, we
can't run vm_map_ram() to virtually map the range we need to see as
a contiguous range, as we do in XFS for large objects such as directory
blocks and log buffers. For other large objects such as inode
clusters, we can directly map each page as the objects within the
clusters are page aligned and never overlap page boundaries, but
that only works for inode and dquot buffers. Hence DAX as it stands
makes it extremely difficult to "retrofit" DAX into all aspects of
existing fileystems because exposing physical discontiguities breaks
code that assumes they don't exist.

On this specific point about page remapping, the administrator can
configure struct pages for pmem and you can detect whether they are
present in the filesystem with pfn_t_has_page().  I.e. you could
require pages be present for XFS, if that helps...

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help