Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems
From: Dan Williams <hidden>
Date: 2016-02-11 20:58:38
Also in:
linux-ext4, linux-fsdevel, linux-xfs, lkml, nvdimm
On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner [off-list ref] wrote: [..]
quoted
It seems to me we need to modify the metadata i/o paths to bypass the page cache,

XFS doesn't use the block device page cache for its metadata - it has its own internal metadata cache structures and uses get_pages or heap memory to back its metadata. But that doesn't make mixing DAX and pages in the block device mapping tree sane.

What you are missing here is that the underlying architecture of journalling filesystems means they can't use DAX for their metadata. Modifications have to be buffered, because they have to be written to the journal first before they are written back in place. IOWs, we need to buffer changes in volatile memory for some time, and that means we can't use DAX during transactional modifications.

And to put the final nail in that coffin, metadata in XFS can be discontiguous multi-block objects - in those situations we vmap the underlying pages so they appear to the code to be a contiguous buffer, and that's something we can't do with DAX....
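As a rough illustration of the ordering constraint described above, here is a toy user-space model of "buffer in volatile memory, journal first, then write back in place". It is only a sketch under stated assumptions: it is not XFS or kernel code, and every name in it (toy_media, toy_buf, toy_modify, toy_journal_commit, toy_writeback) is hypothetical.

```c
#include <assert.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16
#define TOY_NBLOCKS    4

/* Toy "persistent" media: in-place blocks plus a journal area. */
struct toy_media {
    char home[TOY_NBLOCKS][TOY_BLOCK_SIZE];    /* final in-place locations */
    char journal[TOY_NBLOCKS][TOY_BLOCK_SIZE]; /* journalled copies */
    int  journal_valid[TOY_NBLOCKS];           /* 1 once a copy is committed */
};

/* Volatile buffer holding one modified metadata block. */
struct toy_buf {
    int  blkno;
    int  dirty;
    char data[TOY_BLOCK_SIZE];
};

/* Step 1: modify only the volatile buffer; the media is not touched. */
void toy_modify(struct toy_buf *b, int blkno, const char *val)
{
    assert(blkno >= 0 && blkno < TOY_NBLOCKS);
    b->blkno = blkno;
    memset(b->data, 0, TOY_BLOCK_SIZE);
    strncpy(b->data, val, TOY_BLOCK_SIZE - 1);
    b->dirty = 1;
}

/* Step 2: commit the buffered change to the journal area. */
void toy_journal_commit(struct toy_media *m, const struct toy_buf *b)
{
    memcpy(m->journal[b->blkno], b->data, TOY_BLOCK_SIZE);
    m->journal_valid[b->blkno] = 1;
}

/*
 * Step 3: write back in place. Refused (returns -1) unless the change
 * has already been committed to the journal; returns 0 on success.
 */
int toy_writeback(struct toy_media *m, struct toy_buf *b)
{
    if (!m->journal_valid[b->blkno])
        return -1;
    memcpy(m->home[b->blkno], b->data, TOY_BLOCK_SIZE);
    m->journal_valid[b->blkno] = 0;
    b->dirty = 0;
    return 0;
}
```

In this model a direct (DAX-style) store would correspond to writing into home[] during step 1, which is exactly the in-place update the journal ordering is meant to prevent.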
Sorry, I wasn't clear. When I said "bypass page cache" I meant a solution similar to commit d1a5f2b4d8a1 ("block: use DAX for partition table reads"). However, I suspect that is broken if the filesystem is not ready to see a new page allocated for every I/O. I assume one thread will want to insert a page in the radix tree for another thread to find/manipulate before metadata gets written back to storage.
quoted
or teach the fsync code how to flush populated data pages out of the radix.

That doesn't solve the problem. Filesystems free and reallocate filesystem blocks without intermediate block device mapping invalidation calls, so what is one minute a data block accessed by DAX may become a metadata block that is accessed via buffered IO. It all goes to crap very quickly....

However, I'd say fsync is not the place to address this. This block device cache aliasing issue is supposed to be what unmap_underlying_metadata() solves, right?
I'll take a look at this. Right now I'm trying to implement the "clear block-device-inode S_DAX on fs mount" approach. My concern though is that we need to disable block device mmap while a filesystem is mounted... Maybe I don't need to worry because it's already the case that a mmap of the raw device may not see the most up to date data for a file that has dirty fs-page-cache data.
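To make that last point concrete, here is a toy user-space model of the staleness: a buffered write dirties a cached copy, and a direct read of the underlying block does not see it until writeback. This is only a sketch and not kernel code. The names (alias_dev, alias_cached_write, alias_direct_read, alias_writeback) are hypothetical and stand in for the filesystem page cache and a raw-device mapping respectively.

```c
#include <assert.h>
#include <string.h>

#define ALIAS_BLK 16

/* One block with two views: the media itself and a cached copy. */
struct alias_dev {
    char media[ALIAS_BLK];   /* what a raw-device mapping would see */
    char cached[ALIAS_BLK];  /* page-cache-style copy owned by the fs */
    int  cache_dirty;        /* 1 if cached[] is newer than media[] */
};

/* Buffered write: updates only the cached copy and marks it dirty. */
void alias_cached_write(struct alias_dev *d, const char *val)
{
    memset(d->cached, 0, ALIAS_BLK);
    strncpy(d->cached, val, ALIAS_BLK - 1);
    d->cache_dirty = 1;
}

/* Direct read: looks at the media, bypassing the cached copy. */
const char *alias_direct_read(const struct alias_dev *d)
{
    return d->media;
}

/* Writeback: flush the dirty cached copy to the media. */
void alias_writeback(struct alias_dev *d)
{
    if (d->cache_dirty) {
        memcpy(d->media, d->cached, ALIAS_BLK);
        d->cache_dirty = 0;
    }
}
```

Until alias_writeback() runs, alias_direct_read() keeps returning the old contents, which mirrors the raw-device mmap case above.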