Thread: 38 messages, 9 authors, 2016-12-21

Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

From: Nicholas Piggin <npiggin@gmail.com>
Date: 2016-09-13 01:53:11
Also in: kvm, linux-fsdevel, lkml, nvdimm

On Tue, 13 Sep 2016 07:34:36 +1000
Dave Chinner [off-list ref] wrote:
On Mon, Sep 12, 2016 at 06:05:07PM +1000, Nicholas Piggin wrote:
On Mon, 12 Sep 2016 00:51:28 -0700
Christoph Hellwig [off-list ref] wrote:
  
On Mon, Sep 12, 2016 at 05:25:15PM +1000, Oliver O'Halloran wrote:  
What are the problems here? Is this a matter of existing filesystems
being unable/unwilling to support this or is it just fundamentally
broken?    
It's a fundamentally broken model.  See Dave's post that actually was
sent slightly earlier than mine for the list of required items, which
is fairly unrealistic.  You could probably try to architect a file
system for it, but I doubt it would gain much traction.  
It's not fundamentally broken, it just doesn't fit existing
filesystems well.

Dave's post of requirements is also wrong. A filesystem does not have
to guarantee all that; it only has to guarantee that this is the case
for a given block after it has a mapping and the page fault returns.
Other operations can be supported by invalidating mappings, etc.
Sure, but filesystems are completely unaware of what is mapped at
any given time, or what constraints that mapping might have. Trying
to make filesystems aware of per-page mapping constraints seems like
I'm not sure what you mean. The filesystem can hand out mappings
and fault them in itself. It can invalidate them.

a fairly significant layering violation based on a flawed
assumption. i.e. that operations on other parts of the file do not
affect the block that requires immutable metadata.

e.g. an extent operation in some other area of the file can cause a
tip-to-root extent tree split or merge, and that moves the metadata
that points to the mapped block that we've told userspace "doesn't
need fsync".  We now need an fsync to ensure that the metadata is
consistent on disk again, even though that block has not physically
been moved.
You don't, because the filesystem can invalidate existing mappings
and do the right thing when they are faulted in again. That's the
big^Wmedium hammer approach that can cope with most problems.
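
To make the invalidate-and-refault idea concrete, here is a toy model of
it. Everything in it (ToyFS, its fields and method names) is hypothetical
and invented for illustration; it is not how XFS or any real filesystem
is structured. It only shows the invariant being argued for: while any
block is mapped under the "no fsync" promise, no metadata is dirty.

```python
class ToyFS:
    """Hypothetical toy model of the 'invalidate, then refault' approach."""

    def __init__(self):
        self.allocated = set()       # blocks that have an extent
        self.mapped = set()          # blocks mapped with the "no fsync" promise
        self.metadata_dirty = False  # True while extent metadata is not durable

    def invalidate_mappings(self):
        # The hammer: drop every userspace mapping of the file.
        self.mapped.clear()

    def change_metadata(self, block):
        # Any extent change may move metadata that points at other blocks,
        # so all mappings are dropped before the metadata is touched.
        self.invalidate_mappings()
        self.allocated.add(block)
        self.metadata_dirty = True

    def sync_metadata(self):
        # Stand-in for committing the extent tree to stable storage.
        self.metadata_dirty = False

    def fault(self, block):
        # The fault completes only after the metadata is durable.
        if block not in self.allocated:
            self.change_metadata(block)
        if self.metadata_dirty:
            self.sync_metadata()
        self.mapped.add(block)

    def invariant_holds(self):
        # No block may be mapped while metadata is dirty.
        return not (self.mapped and self.metadata_dirty)
```

In this model, faulting block 1 while block 0 is mapped drops the block 0
mapping first; the application refaults block 0 on its next access, by
which time the metadata is clean again.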

But let me understand your example in the absence of that.

- Application mmaps a file, faults in block 0
- FS allocates block, creates mappings, syncs metadata, sets "no fsync"
  flag for that block, and completes the fault.
- Application writes some data to block 0, completes userspace flushes

* At this point, a crash must return with the above data (or newer).

- Application starts writing more stuff into block 0
- Concurrently, fault in block 1
- FS starts to allocate, splits trees including mappings to block 0

* Crash

Is that right? How does your filesystem lose data before the sync
point?
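
For reference, the application side of the first three steps above might
look roughly like the sketch below. It is an assumption-laden stand-in:
it runs on an ordinary temporary file rather than a DAX mapping, it
treats one page as one block, and it uses mmap.flush(), which goes
through msync(), where a real pmem application would use CPU cache flush
instructions. The function names write_block0 and demo are made up for
this example.

```python
import mmap
import os
import tempfile

BLOCK = mmap.PAGESIZE  # assumption: one filesystem block == one page


def write_block0(path, payload):
    """Map two blocks of path, store payload in block 0, flush, read back."""
    with open(path, "r+b") as f:
        f.truncate(2 * BLOCK)                 # room for block 0 and block 1
        with mmap.mmap(f.fileno(), 2 * BLOCK) as m:
            m[0:len(payload)] = payload       # first store faults in block 0
            m.flush(0, BLOCK)                 # stand-in for "userspace flushes"
            return bytes(m[0:len(payload)])


def demo():
    """Run the sequence on a temporary file and return what was read back."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        return write_block0(path, b"hello")
    finally:
        os.remove(path)
```

The point marked with * in the list corresponds to the return from
m.flush(): after that, the application believes the data is durable
without ever having called fsync.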
IOWs, the immutable data block updates are now not
ordered correctly w.r.t. other updates done to the file, especially
when we consider crash recovery....

All this will expose is an unfixable problem with ordering of stable
data + metadata operations and their synchronisation. As such, it
seems like nothing but a major cluster-fuck to try to do mapping
specific, per-block immutable metadata - it adds major complexity
and even more intractable problems.

Yes, we /could/ try to solve this but, quite frankly, it's far
easier to change the broken PMEM programming model assumptions than
it is to implement what you are suggesting. Or to do what Christoph
suggested and just use a wrapper around something like device
mapper to hand out chunks of unchanging, static pmem to
applications...
If there is any huge complexity or unsolved problem, it is in XFS.
The conceptual problem is simple.
