Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
From: Ross Zwisler <hidden>
Date: 2015-11-16 19:48:46
Also in:
linux-fsdevel, linux-mm, linux-xfs, lkml, nvdimm
On Mon, Nov 16, 2015 at 09:28:59AM -0800, Dan Williams wrote:
> On Mon, Nov 16, 2015 at 6:05 AM, Jan Kara [off-list ref] wrote:
> > On Mon 16-11-15 14:37:14, Jan Kara wrote:
> [..]
> > But a question: Won't it be better to do sfence + pcommit only in response
> > to REQ_FLUSH request and don't do it after each write? I'm not sure how
> > expensive these instructions are but in theory it could be a performance
> > win, couldn't it? For filesystems this is enough wrt persistency
> > guarantees...
>
> We would need to gather the performance data... The expectation is
> that the cache flushing is more expensive than the sfence + pcommit.

I think we should revisit the idea of removing wmb_pmem() from the I/O path in
both the PMEM driver and in DAX, and just relying on the REQ_FUA/REQ_FLUSH path
to do wmb_pmem() for all cases. This was brought up in the thread dealing with
the "big hammer" fsync/msync patches as well.

https://lkml.org/lkml/2015/11/3/730

I think we can all agree from the start that wmb_pmem() will have a nonzero
cost, both because of the PCOMMIT and because of the ordering caused by the
sfence. If it's possible to avoid doing it on each I/O, I think that would be
a win.

So, here would be our new flows:

PMEM I/O:
    write I/O(s) to the driver
        PMEM I/O writes the data using non-temporal stores

    REQ_FUA/REQ_FLUSH to the PMEM driver
        wmb_pmem() to order all previous writes and flushes, and to
        PCOMMIT the dirty data durably to the DIMMs

DAX I/O:
    write I/O(s) to the DAX layer
        write the data using regular stores (eventually to be replaced
        with non-temporal stores)

        flush the data with wb_cache_pmem() (removed when we use
        non-temporal stores)

    REQ_FUA/REQ_FLUSH to the PMEM driver
        wmb_pmem() to order all previous writes and flushes, and to
        PCOMMIT the dirty data durably to the DIMMs

DAX msync/fsync:
    writes happen to DAX mmaps from userspace

    DAX fsync/msync
        all dirty pages are written back using wb_cache_pmem()

    REQ_FUA/REQ_FLUSH to the PMEM driver
        wmb_pmem() to order all previous writes and flushes, and to
        PCOMMIT the dirty data durably to the DIMMs

DAX/PMEM zeroing (suggested by Dave: https://lkml.org/lkml/2015/11/2/772):
    PMEM driver receives zeroing request
        writes a bunch of zeroes using non-temporal stores

    REQ_FUA/REQ_FLUSH to the PMEM driver
        wmb_pmem() to order all previous writes and flushes, and to
        PCOMMIT the dirty data durably to the DIMMs

Having all these flows wait to do wmb_pmem() in the PMEM driver in response to
REQ_FUA/REQ_FLUSH has several advantages:

1) The work done and guarantees provided after each step closely match the
   normal block I/O to disk case. This means that the existing algorithms used
   by filesystems to make sure that their metadata is ordered properly and
   synced at a known time should all work the same.

2) By delaying wmb_pmem() until REQ_FUA/REQ_FLUSH time we can potentially do
   many I/Os at different levels, and order them all with a single wmb_pmem().
   This should result in a performance win.

Is there any reason why this wouldn't work or wouldn't be a good idea?
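
To make the proposed flows a bit more concrete, here is a rough userspace model
of the idea. It is only a sketch, not driver code: struct pmem_model,
model_write(), model_flush(), model_wmb_pmem() and model_demo() are all
hypothetical names made up for illustration, and the "barrier" is modelled as a
plain memcpy() from an in-flight buffer to a durable one. The point is just
that write I/Os stop paying for the barrier, and a single REQ_FUA/REQ_FLUSH
covers everything issued before it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_SIZE 64

/* Hypothetical userspace model of the proposed flow; not kernel code. */
struct pmem_model {
	unsigned char in_flight[MODEL_SIZE]; /* stores issued, durability not yet guaranteed */
	unsigned char durable[MODEL_SIZE];   /* contents guaranteed to be on the DIMMs */
	unsigned long writes;                /* write I/Os handled */
	unsigned long barriers;              /* modelled wmb_pmem() calls */
};

/* Stand-in for wmb_pmem(): everything issued so far becomes durable. */
static void model_wmb_pmem(struct pmem_model *p)
{
	memcpy(p->durable, p->in_flight, MODEL_SIZE);
	p->barriers++;
}

/*
 * Write I/O: copy the data but do NOT issue the barrier.
 * Returns 0 on success, -1 if the range falls outside the model.
 */
static int model_write(struct pmem_model *p, size_t off,
		       const void *buf, size_t len)
{
	if (off > MODEL_SIZE || len > MODEL_SIZE - off)
		return -1;
	memcpy(p->in_flight + off, buf, len);
	p->writes++;
	return 0;
}

/* REQ_FUA/REQ_FLUSH: one barrier covers every write issued before it. */
static void model_flush(struct pmem_model *p)
{
	model_wmb_pmem(p);
}

/* Three writes followed by one flush; returns the number of barriers used. */
static unsigned long model_demo(void)
{
	struct pmem_model p;

	memset(&p, 0, sizeof(p));
	model_write(&p, 0, "aaaa", 4);
	model_write(&p, 4, "bbbb", 4);
	model_write(&p, 8, "cccc", 4);
	model_flush(&p);

	/* After the flush, everything that was written is durable. */
	assert(memcmp(p.durable, p.in_flight, MODEL_SIZE) == 0);
	return p.barriers;
}
```

In this model, model_demo() handles three writes with one barrier, where the
per-I/O scheme would have used three. Whether that shows up as a real win
depends on the actual cost of sfence + PCOMMIT, which we would still need to
measure.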