Re: Subtle races between DAX mmap fault and write path | linux-ext4

On Tue, 2016-08-09 at 09:12 +1000, Dave Chinner wrote:
On Fri, Aug 05, 2016 at 07:58:33PM +0000, Boylston, Brian wrote:
Dave Chinner wrote on 2016-08-05:
[ cut to just the important points ]
On Thu, Aug 04, 2016 at 06:40:42PM +0000, Kani, Toshimitsu wrote:
On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
If I drop the fsync from the
buffered IO path, bandwidth remains the same but runtime
drops to 0.55-0.57s, so again the buffered IO write path is
faster than DAX while doing more work.
I do not think the test results are relevant on this point
because both buffered and dax write() paths use uncached copy
to avoid clflush.  The buffered path uses cached copy to the
page cache and then uses uncached copy to PMEM via writeback.
Therefore, the buffered IO path also benefits from using
uncached copy to avoid clflush.
Except that I tested without the writeback path for buffered IO,
so there was a direct comparison for single cached copy vs single
uncached copy.

The undeniable fact is that a write() with a single cached copy
with all the overhead of dirty page tracking is /faster/ than a
much shorter, simpler IO path that uses an uncached copy. That's
what the numbers say....

Cached copy (rep movsq) is slightly faster than uncached copy,
Not according to Boaz - he claims that uncached is 20% faster
than cached. How about you two get together, do some benchmarking
and get your story straight, eh?

and should be used for writing to the page cache.  For writing
to PMEM, however, an additional clflush can be expensive, and
allocating cachelines for PMEM evicts the application's
cachelines.
I keep hearing people tell me why cached copies are slower, but
no-one is providing numbers to back up their statements. The only
numbers we have are the ones I've published showing that cached copies
w/ full dirty tracking are faster than uncached copies w/o dirty
tracking.

Show me the numbers that back up your statements, then I'll
listen to you.
Here are some numbers for a particular scenario, and the code is
below.

Time (in seconds) to copy a 16KiB buffer 1M times to a 4MiB NVDIMM
buffer (1M total memcpy()s).  For the cached+clflush case, the
flushes are done every 4MiB (which seems slightly faster than
flushing every 16KiB):

                  NUMA local    NUMA remote
Cached+clflush      13.5           37.1
movnt                1.0            1.3 
So let's put that in memory bandwidth terms. You wrote 16GB to the
NVDIMM.  That means:

                  NUMA local    NUMA remote
Cached+clflush      1.2GB/s         0.43GB/s
movnt              16.0GB/s         12.3GB/s

That smells wrong.  The DAX code (using movnt) is not 1-2 orders of
magnitude faster than a page cache copy, so I don't believe your
benchmark reflects what I'm proposing.

What I think you're getting wrong is that we are not doing a clflush
after every 16k write when we use the page cache, nor will we do
that if we use cached copies, dirty tracking and clflush on fsync().
As I mentioned before, we do not use clflush on the write path.  So,
your tests did not issue clflush at all.

IOWs, the correct equivalent "cached + clflush" loop to a volatile
copy with dirty tracking + fsync would be:

	dstp = dst;
	while (nloops--) {		// exactly nloops copies
		memcpy(dstp, src, src_sz);	// pwrite();
		dstp += src_sz;
	}
	pmem_persist(dst, dstsz);	// fsync();

i.e. the cache flushes occur only at the user-defined
synchronisation point, not on every syscall.
Brian's test is (16 KiB pwrite + fsync) repeated 1M times.  It compared
two approaches for a 16 KiB persistent write.  I do not
consider it wrong, but it indicates that cached copy + clflush leads
to much higher overhead when sync'd at a finer granularity.

I agree that the total overhead should be lower when the clflush is
done all at once, since it only has to evict as much as the cache size.

Yes, if you want to make your copy slow and safe, use O_SYNC to
trigger clflush on every write() call - that's what we do for
existing storage and the mechanisms are already there; we just need
the dirty tracking to optimise it.
Perhaps you are referring to flushing the disk write cache?  I do not
think clflush, as an x86 instruction, is used for existing storage.

Put simply: we should only care about cache flush synchronisation at
user defined data integrity synchronisation points. That's the IO
model the kernel has always exposed to users, and pmem storage is no
different.
Thanks,
-Toshi
