Thread: 39 messages, 6 authors, 2015-11-20

Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling

From: Ross Zwisler <hidden>
Date: 2015-11-16 19:48:46
Also in: linux-fsdevel, linux-mm, linux-xfs, lkml, nvdimm

On Mon, Nov 16, 2015 at 09:28:59AM -0800, Dan Williams wrote:
> On Mon, Nov 16, 2015 at 6:05 AM, Jan Kara [off-list ref] wrote:
> > On Mon 16-11-15 14:37:14, Jan Kara wrote:
> > [..]
> > > But a question: Won't it be better to do sfence + pcommit only in response
> > > to REQ_FLUSH request and don't do it after each write? I'm not sure how
> > > expensive these instructions are but in theory it could be a performance
> > > win, couldn't it? For filesystems this is enough wrt persistency
> > > guarantees...
>
> We would need to gather the performance data...  The expectation is
> that the cache flushing is more expensive than the sfence + pcommit.

I think we should revisit the idea of removing wmb_pmem() from the I/O path in
both the PMEM driver and in DAX, and just relying on the REQ_FUA/REQ_FLUSH
path to do wmb_pmem() for all cases.  This was brought up in the thread
dealing with the "big hammer" fsync/msync patches as well.

https://lkml.org/lkml/2015/11/3/730

I think we can all agree from the start that wmb_pmem() will have a nonzero
cost, both because of the PCOMMIT and because of the ordering caused by the
sfence.  If it's possible to avoid doing it on each I/O, I think that would be
a win.

So, here would be our new flows:

PMEM I/O:
	write I/O(s) to the driver
		PMEM I/O writes the data using non-temporal stores

	REQ_FUA/REQ_FLUSH to the PMEM driver
		wmb_pmem() to order all previous writes and flushes, and to
		PCOMMIT the dirty data durably to the DIMMs

DAX I/O:
	write I/O(s) to the DAX layer
		write the data using regular stores (eventually to be replaced
		with non-temporal stores)

		flush the data with wb_cache_pmem() (removed when we use
		non-temporal stores)

	REQ_FUA/REQ_FLUSH to the PMEM driver
		wmb_pmem() to order all previous writes and flushes, and to
		PCOMMIT the dirty data durably to the DIMMs

DAX msync/fsync:
	writes happen to DAX mmaps from userspace

	DAX fsync/msync
		all dirty pages are written back using wb_cache_pmem()

	REQ_FUA/REQ_FLUSH to the PMEM driver
		wmb_pmem() to order all previous writes and flushes, and to
		PCOMMIT the dirty data durably to the DIMMs
	
DAX/PMEM zeroing (suggested by Dave: https://lkml.org/lkml/2015/11/2/772):
	PMEM driver receives zeroing request
		writes a bunch of zeroes using non-temporal stores

	REQ_FUA/REQ_FLUSH to the PMEM driver
		wmb_pmem() to order all previous writes and flushes, and to
		PCOMMIT the dirty data durably to the DIMMs
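
To make the common shape of these flows concrete, here is a minimal userspace sketch in C. It is a toy model, not kernel code: the names model_pmem, model_write() and model_flush() are hypothetical, and the "inflight" buffer only stands in for data that has been stored but is not yet guaranteed durable. On real hardware the data may reach the DIMMs earlier; the model tracks only what is guaranteed after each step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy userspace model, not kernel code. All names are hypothetical. */
#define MODEL_PMEM_SIZE 64

struct model_pmem {
	unsigned char media[MODEL_PMEM_SIZE];    /* contents guaranteed durable */
	unsigned char inflight[MODEL_PMEM_SIZE]; /* stored, not yet committed */
	size_t pending;   /* writes issued since the last flush */
	size_t barriers;  /* number of modeled wmb_pmem() calls */
};

static void model_init(struct model_pmem *dev)
{
	memset(dev, 0, sizeof(*dev));
}

/* Write path: store the data, but issue no barrier. */
static int model_write(struct model_pmem *dev, size_t off,
		       const void *buf, size_t len)
{
	if (off > MODEL_PMEM_SIZE || len > MODEL_PMEM_SIZE - off)
		return -1;
	memcpy(dev->inflight + off, buf, len);
	dev->pending++;
	return 0;
}

/*
 * REQ_FUA/REQ_FLUSH path: a single modeled wmb_pmem() makes every
 * earlier write durable, whichever flow produced it.
 */
static void model_flush(struct model_pmem *dev)
{
	memcpy(dev->media, dev->inflight, MODEL_PMEM_SIZE);
	dev->barriers++;
	dev->pending = 0;
}
```

In this model, two calls to model_write() followed by one model_flush() leave both writes in "media" with a barrier count of one, which is the behavior every flow above relies on.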

Having all these flows wait to do wmb_pmem() in the PMEM driver in response to
REQ_FUA/REQ_FLUSH has several advantages:

1) The work done and guarantees provided after each step closely match the
normal block I/O to disk case.  This means that the existing algorithms used
by filesystems to make sure that their metadata is ordered properly and synced
at a known time should all work the same.

2) By delaying wmb_pmem() until REQ_FUA/REQ_FLUSH time we can potentially do
many I/Os at different levels, and order them all with a single wmb_pmem().
This should result in a performance win.
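
As a rough illustration of point 2, the sketch below counts how many wmb_pmem()-style barriers a request stream would need under each policy. It is only a counting model with hypothetical names (op_kind, barriers_per_write(), barriers_deferred()); it says nothing about the actual cost of each barrier, which would still have to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request types for a counting model; not kernel code. */
enum op_kind { OP_WRITE, OP_FLUSH };

/* Current scheme: every write is followed by its own barrier. */
static size_t barriers_per_write(const enum op_kind *ops, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (ops[i] == OP_WRITE)
			count++;
	return count;
}

/* Proposed scheme: one barrier per REQ_FUA/REQ_FLUSH request. */
static size_t barriers_deferred(const enum op_kind *ops, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (ops[i] == OP_FLUSH)
			count++;
	return count;
}
```

For a stream of five writes with two flushes in between, the per-write policy needs five barriers and the deferred policy needs two. The gap grows with the number of writes batched between flushes.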

Is there any reason why this wouldn't work or wouldn't be a good idea?
