Thread (10 messages) 10 messages, 5 authors, 2018-11-02

Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

From: Jan Kara <hidden>
Date: 2018-10-18 14:55:55
Also in: linux-ext4, linux-fsdevel, linux-mm, linux-xfs, nvdimm

On Thu 18-10-18 11:25:10, Dave Chinner wrote:
On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
quoted
MAP_SYNC
- file system guarantees that metadata required to reach faulted-in file
  data is consistent on media before a write fault is completed.  A
  side-effect is that the page cache will not be used for
  writably-mapped pages.
I think you are conflating current implementation with API
requirements - MAP_SYNC doesn't guarantee anything about page cache
use. The man page definition simply says "supported only for files
supporting DAX" and that it provides certain data integrity
guarantees. It does not define the implementation.

We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour,
because that's the only way we can currently provide the required
behaviour to userspace. However, if a filesystem can use the page
cache to provide the required functionality, then it's free to do
so.

i.e. if someone implements a pmem-based page cache, MAP_SYNC data
integrity could be provided /without DAX/ by any filesystem using
that persistent page cache. i.e. MAP_SYNC really only requires
mmap() of CPU addressable persistent memory - it does not require
DAX. Right now, however, the only way to get this functionality is
through a DAX capable filesystem on dax capable storage.

And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it
COWs new pages in pmem and attaches them a special per-inode cache
on clean->dirty transition. Then on data sync, background writeback
or crash recovery, it migrates them from the cache into the file map
proper via atomic metadata pointer swaps.

IOWs, NOVA provides the correct MAP_SYNC semantics by using a
separate persistent per-inode write cache to provide the correct
crash recovery semantics for MAP_SYNC.
Corect. NOVA would be able to provide MAP_SYNC semantics without DAX. But
effectively it will be also able to provide MAP_DIRECT semantics, right?
Because there won't be DRAM between app and persistent storage and I don't
think COW tricks or other data integrity methods are that interesting for
the application. Most users of O_DIRECT are concerned about getting close
to media speed performance and low DRAM usage...
quoted
and what I think Dan had proposed:

mmap flag, MAP_DIRECT
- file system guarantees that page cache will not be used to front storage.
  storage MUST be directly addressable.  This *almost* implies MAP_SYNC.
  The subtle difference is that a write fault /may/ not result in metadata
  being written back to media.
SIimilar to O_DIRECT, these semantics do not allow userspace apps to
replace msync/fsync with CPU cache flush operations. So any
application that uses this mode still needs to use either MAP_SYNC
or issue msync/fsync for data integrity.

If the app is using MAP_DIRECT, the what do we do if the filesystem
can't provide the required semantics for that specific operation? In
the case of O_DIRECT, we fall back to buffered IO because it has the
same data integrity semantics as O_DIRECT and will always work. It's
just slower and consumes more memory, but the app continues to work
just fine.

Sending SIGBUS to apps when we can't perform MAP_DIRECT operations
without using the pagecache seems extremely problematic to me.  e.g.
an app already has an open MAP_DIRECT file, and a third party
reflinks it or dedupes it and the fs has to fall back to buffered IO
to do COW operations. This isn't the app's fault - the kernel should
just fall back transparently to using the page cache for the
MAP_DIRECT app and just keep working, just like it would if it was
using O_DIRECT read/write.
There's another option of failing reflink / dedupe with EBUSY if the file
is mapped with MAP_DIRECT and the filesystem cannot support relink &
MAP_DIRECT together. But there are downsides to that as well.
The point I'm trying to make here is that O_DIRECT is a /hint/, not
a guarantee, and it's done that way to prevent applications from
being presented with transient, potentially fatal error cases
because a filesystem implementation can't do a specific operation
through the direct IO path.

IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a
guarantee. Over time we'll end up with filesystems that can
guarantee that MAP_DIRECT is always going to use DAX, in the same
way we have filesystems that guarantee O_DIRECT will always be
O_DIRECT (e.g. XFS). But if we decide that MAP_DIRECT must guarantee
no page cache will ever be used, then we are basically saying
"filesystems won't provide MAP_DIRECT even in common, useful cases
because they can't provide MAP_DIRECT in all cases." And that
doesn't seem like a very good solution to me.
These are good points. I'm just somewhat wary of the situation where users
will map files with MAP_DIRECT and then the machine starts thrashing
because the file got reflinked and thus pagecache gets used suddently.
With O_DIRECT the fallback to buffered IO is quite rare (at least for major
filesystems) so usually people just won't notice. If fallback for
MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.

								Honza
-- 
Jan Kara [off-list ref]
SUSE Labs, CR
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help