Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
From: Jan Kara <jack@suse.cz>
Date: 2017-08-14 12:48:03
Also in:
linux-api, linux-fsdevel, lkml, nvdimm
On Sun 13-08-17 13:31:45, Dan Williams wrote:
> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig [off-list ref] wrote:
> > That being said, I think we absolutely should support RDMA memory registrations for DAX mappings. I'm just not sure how S_IOMAP_IMMUTABLE helps with that.
> >
> > We'll want a MAP_SYNC | MAP_POPULATE to make sure all the blocks are populated and all ptes are set up. Second, we need to make sure get_user_pages works, which for now means we'll need a struct page mapping for the region (which will be really annoying for PCIe mappings, like the upcoming NVMe persistent memory region), and we need to guarantee that the extent mapping won't change while get_user_pages holds the pages inside it. I think that is true due to side effects even with the current DAX code, but we'll need to make it explicit.
> >
> > And maybe that's where we need to converge - "sealing" the extent map makes sense as such a temporary measure that is not persisted on disk, which automatically gets released when the holding process exits, because we sort of already do this implicitly. It might also make sense to have explicitly breakable seals similar to what I do for the pNFS blocks kernel server, as any userspace RDMA file server would also need those semantics.
>
> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
>
> 1/ only succeed if the fault can be satisfied without page cache
>
> 2/ only install a pte for the fault if it can do so without triggering block map updates
>
> So, I think it would still end up setting an inode flag to make xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping active. However, it would not record that state in the on-disk metadata and it would automatically clear at munmap time.
>
> That should be enough to support the host-persistent-memory and NVMe-persistent-memory use cases (provided we have struct page for NVMe). Although, we need more safety infrastructure in the NVMe case, where we would need to software-manage I/O coherence.
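
For illustration only, the bookkeeping described above could be modelled in user space roughly as follows. This is a minimal sketch, assuming a per-inode count of active MAP_DIRECT-style mappings; every name in it (toy_inode, toy_bmap_write, toy_map_direct, toy_unmap_direct, toy_direct_fault) is hypothetical, the error code is an arbitrary choice, and none of it is code from the patch set. It models only condition 2/ (no block map updates); the page cache condition 1/ is not represented.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical in-memory inode state; not the kernel's struct inode. */
struct toy_inode {
    int  direct_maps;    /* number of active MAP_DIRECT-style mappings */
    long mapped_blocks;  /* blocks already present in the block map */
};

/*
 * Stand-in for a block map update (the role xfs_bmapi_write() plays in
 * the discussion): refuse to change the map while any direct mapping
 * is active. The state lives only in memory, never on disk.
 */
static int toy_bmap_write(struct toy_inode *ino, long new_blocks)
{
    if (ino->direct_maps > 0)
        return -ETXTBSY;  /* assumption: error code picked for illustration */
    ino->mapped_blocks += new_blocks;
    return 0;
}

/* mmap-time hook: mark the inode as having one more direct mapping. */
static void toy_map_direct(struct toy_inode *ino)
{
    ino->direct_maps++;
}

/* munmap-time hook: the restriction clears once the last mapping goes. */
static void toy_unmap_direct(struct toy_inode *ino)
{
    if (ino->direct_maps > 0)
        ino->direct_maps--;
}

/*
 * Fault path: succeed only if the block is already mapped, i.e. the
 * fault can be satisfied without a block map update.
 */
static bool toy_direct_fault(const struct toy_inode *ino, long block)
{
    return block >= 0 && block < ino->mapped_blocks;
}
```

In this model a fault on an already-allocated block succeeds, a fault beyond the allocated range fails instead of allocating, and toy_bmap_write() starts succeeding again as soon as the last mapping is dropped.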

Hum, this proposal (and the problems you are trying to deal with) seems very similar to Peter Zijlstra's mpin() proposal from 2014 [1], just moved to the DAX area (and so additionally complicated by the fact that filesystems now have to care). The patch set was not merged, due to lack of interest I think, but it looked sensible, and the proposed API would make sense for more than just DAX, so maybe it would be better than a MAP_DIRECT flag?

[1] https://lwn.net/Articles/600502/

								Honza
-- 
Jan Kara [off-list ref]
SUSE Labs, CR