Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling

From: Ross Zwisler <hidden>
Date: 2015-11-16 19:48:46
Also in: linux-fsdevel, linux-mm, linux-xfs, lkml, nvdimm

On Mon, Nov 16, 2015 at 09:28:59AM -0800, Dan Williams wrote:

On Mon, Nov 16, 2015 at 6:05 AM, Jan Kara [off-list ref] wrote:

On Mon 16-11-15 14:37:14, Jan Kara wrote:

[..]

But a question: Won't it be better to do sfence + pcommit only in response
to REQ_FLUSH request and don't do it after each write? I'm not sure how
expensive these instructions are but in theory it could be a performance
win, couldn't it? For filesystems this is enough wrt persistency
guarantees...

We would need to gather the performance data...  The expectation is
that the cache flushing is more expensive than the sfence + pcommit.

I think we should revisit the idea of removing wmb_pmem() from the I/O path in
both the PMEM driver and in DAX, and just relying on the REQ_FUA/REQ_FLUSH
path to do wmb_pmem() for all cases.  This was brought up in the thread
dealing with the "big hammer" fsync/msync patches as well.

https://lkml.org/lkml/2015/11/3/730

I think we can all agree from the start that wmb_pmem() will have a nonzero
cost, both because of the PCOMMIT and because of the ordering caused by the
sfence.  If it's possible to avoid doing it on each I/O, I think that would be
a win.

So, here would be our new flows:

PMEM I/O:
	write I/O(s) to the driver
		PMEM I/O writes the data using non-temporal stores

	REQ_FUA/REQ_FLUSH to the PMEM driver
		wmb_pmem() to order all previous writes and flushes, and to
		PCOMMIT the dirty data durably to the DIMMs

DAX I/O:
	write I/O(s) to the DAX layer
		write the data using regular stores (eventually to be replaced
		with non-temporal stores)

		flush the data with wb_cache_pmem() (no longer needed once we
		use non-temporal stores)

	REQ_FUA/REQ_FLUSH to the PMEM driver
		wmb_pmem() to order all previous writes and flushes, and to
		PCOMMIT the dirty data durably to the DIMMs

DAX msync/fsync:
	writes happen to DAX mmaps from userspace

	DAX fsync/msync
		all dirty pages are written back using wb_cache_pmem()

	REQ_FUA/REQ_FLUSH to the PMEM driver
		wmb_pmem() to order all previous writes and flushes, and to
		PCOMMIT the dirty data durably to the DIMMs
	
DAX/PMEM zeroing (suggested by Dave: https://lkml.org/lkml/2015/11/2/772):
	PMEM driver receives zeroing request
		writes a bunch of zeroes using non-temporal stores

	REQ_FUA/REQ_FLUSH to the PMEM driver
		wmb_pmem() to order all previous writes and flushes, and to
		PCOMMIT the dirty data durably to the DIMMs
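
What all four flows share is that the write path only stores data, and a
single wmb_pmem() at REQ_FUA/REQ_FLUSH time makes everything before it
durable.  Below is a rough user-space sketch of that contract.  It is a toy
model, not kernel code: struct toy_pmem and the toy_*() helpers are made-up
names, and the "crash" is deliberately pessimistic in assuming that nothing
written since the last flush survives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLOCKS 8

/*
 * Hypothetical stand-in for a PMEM device.  "pending" models data that has
 * been stored but not yet ordered and committed; "durable" models data that
 * would survive power loss.
 */
struct toy_pmem {
	unsigned char pending[TOY_BLOCKS];
	unsigned char durable[TOY_BLOCKS];
	bool dirty[TOY_BLOCKS];
	unsigned int barriers;	/* number of modeled wmb_pmem() calls */
};

static void toy_init(struct toy_pmem *dev)
{
	memset(dev, 0, sizeof(*dev));
}

/* Write path in the proposed flows: store the data, issue no barrier. */
static void toy_write(struct toy_pmem *dev, size_t block, unsigned char val)
{
	assert(block < TOY_BLOCKS);
	dev->pending[block] = val;
	dev->dirty[block] = true;
}

/* REQ_FUA/REQ_FLUSH path: one modeled wmb_pmem() commits all prior writes. */
static void toy_flush(struct toy_pmem *dev)
{
	for (size_t i = 0; i < TOY_BLOCKS; i++) {
		if (dev->dirty[i]) {
			dev->durable[i] = dev->pending[i];
			dev->dirty[i] = false;
		}
	}
	dev->barriers++;
}

/* Modeled power loss: assume anything not yet flushed is lost. */
static void toy_crash(struct toy_pmem *dev)
{
	for (size_t i = 0; i < TOY_BLOCKS; i++) {
		dev->pending[i] = dev->durable[i];
		dev->dirty[i] = false;
	}
}
```

In this model, two toy_write() calls followed by one toy_flush() leave both
blocks durable at the cost of a single barrier, while a toy_write() followed
directly by toy_crash() is lost.  That is the same guarantee a filesystem
gets from a disk with a volatile write cache.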

Having all these flows wait to do wmb_pmem() in the PMEM driver in response to
REQ_FUA/REQ_FLUSH has several advantages:

1) The work done and guarantees provided after each step closely match the
normal block I/O to disk case.  This means that the existing algorithms used
by filesystems to make sure that their metadata is ordered properly and synced
at a known time should all work the same.

2) By delaying wmb_pmem() until REQ_FUA/REQ_FLUSH time we can potentially do
many I/Os at different levels, and order them all with a single wmb_pmem().
This should result in a performance win.
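
To put a number on point 2, here is a small counting sketch.  It assumes a
batch of writes followed by one REQ_FUA/REQ_FLUSH, and that the flush path
issues its own wmb_pmem() in both schemes.  toy_barrier_count() is a
hypothetical helper for illustration only; it counts barriers and says
nothing about what each one actually costs.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Count the modeled wmb_pmem() calls needed for nr_writes writes followed
 * by a single REQ_FUA/REQ_FLUSH request.
 */
static unsigned int toy_barrier_count(unsigned int nr_writes,
				      bool barrier_per_write)
{
	unsigned int barriers = 0;

	/* Keep the final increment below from overflowing. */
	assert(nr_writes < UINT_MAX);

	for (unsigned int i = 0; i < nr_writes; i++) {
		/* The data store itself would happen here. */
		if (barrier_per_write)
			barriers++;	/* wmb_pmem() after every write */
	}

	/* The flush request at the end always issues one barrier. */
	barriers++;

	return barriers;
}
```

Under these assumptions, 1000 writes cost 1001 barriers when each write does
its own wmb_pmem(), and 1 barrier when they are all deferred to the flush.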

Is there any reason why this wouldn't work or wouldn't be a good idea?
