RE: Subtle races between DAX mmap fault and write path

From: Boylston, Brian <hidden>
Date: 2016-08-08 13:02:31
Also in: linux-fsdevel, linux-xfs, nvdimm

Jan Kara wrote on 2016-08-08:

On Fri 05-08-16 19:58:33, Boylston, Brian wrote:

quoted

Dave Chinner wrote on 2016-08-05:

quoted

[ cut to just the important points ]
On Thu, Aug 04, 2016 at 06:40:42PM +0000, Kani, Toshimitsu wrote:

quoted

On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:

quoted

If I drop the fsync from the
buffered IO path, bandwidth remains the same but runtime drops to
0.55-0.57s, so again the buffered IO write path is faster than DAX
while doing more work.

I do not think the test results are relevant on this point because both the
buffered and DAX write() paths use an uncached copy to avoid clflush.  The
buffered path uses a cached copy to the page cache and then an uncached copy to
PMEM via writeback.  Therefore, the buffered IO path also benefits from using
an uncached copy to avoid clflush.

Except that I tested without the writeback path for buffered IO, so
there was a direct comparison for single cached copy vs single
uncached copy.

The undeniable fact is that a write() with a single cached copy with
all the overhead of dirty page tracking is /faster/ than a much
shorter, simpler IO path that uses an uncached copy. That's what the
numbers say....

quoted

Cached copy (rep movq) is slightly faster than uncached copy,

Not according to Boaz - he claims that uncached is 20% faster than
cached. How about you two get together, do some benchmarking and get
your story straight, eh?

quoted

and should be
used for writing to the page cache.  For writing to PMEM, however, the additional
clflush can be expensive, and allocating cachelines for PMEM evicts the
application's cachelines.

I keep hearing people tell me why cached copies are slower, but
no-one is providing numbers to back up their statements. The only
numbers we have are the ones I've published showing that cached copies w/
full dirty tracking are faster than uncached copies w/o dirty tracking.

Show me the numbers that back up your statements, then I'll listen
to you.

Here are some numbers for a particular scenario, and the code is below.

Time (in seconds) to copy a 16KiB buffer 1M times to a 4MiB NVDIMM buffer
(1M total memcpy()s).  For the cached+clflush case, the flushes are done
every 4MiB (which seems slightly faster than flushing every 16KiB):

                  NUMA local    NUMA remote
Cached+clflush      13.5           37.1
movnt                1.0            1.3
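
The measurement code itself is not reproduced in this part of the thread. As a
rough sketch of the comparison described above, assuming an x86-64 CPU with
SSE2 and using an ordinary memory buffer as a stand-in for the NVDIMM mapping,
the two paths could be timed as follows. The names run_copies(), copy_movnt()
and flush_range() are hypothetical helpers for illustration; they are not
libpmem functions and this is not the code that produced the numbers above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_clflush, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define CACHELINE 64  /* assumed cache line size */

/* Issue CLFLUSH for every cache line overlapping [addr, addr + len). */
static void flush_range(const void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
	uintptr_t end = (uintptr_t)addr + len;

	for (; p < end; p += CACHELINE)
		_mm_clflush((const void *)p);
}

/* Copy with non-temporal (cache-bypassing) stores, then SFENCE. */
static void copy_movnt(void *dst, const void *src, size_t len)
{
	__m128i *d = dst;
	const __m128i *s = src;

	/* _mm_stream_si128 needs a 16-byte aligned destination. */
	assert((uintptr_t)dst % 16 == 0 && len % 16 == 0);
	for (size_t i = 0; i < len / 16; i++)
		_mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
	_mm_sfence();
}

/*
 * Copy src into dst `iters` times, wrapping around when dst is full.
 * use_movnt != 0: non-temporal copy, no explicit flush.
 * use_movnt == 0: memcpy(), then flush all of dst each time it fills up.
 * Returns the elapsed wall-clock time in seconds.
 */
double run_copies(char *dst, size_t dst_len, const char *src, size_t src_len,
		  size_t iters, int use_movnt)
{
	struct timespec t0, t1;
	size_t off = 0;

	assert(src_len > 0 && dst_len % src_len == 0);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (size_t i = 0; i < iters; i++) {
		if (use_movnt)
			copy_movnt(dst + off, src, src_len);
		else
			memcpy(dst + off, src, src_len);
		off += src_len;
		if (off == dst_len) {
			if (!use_movnt)
				flush_range(dst, dst_len);
			off = 0;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

To approximate the shape of the test in the message, dst would be a 4MiB
mapping, src a 16KiB buffer, and iters roughly one million. Results on DRAM
should not be expected to match the NVDIMM figures in the table.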

Thanks for the test Brian. But looking at the current source of libpmem
this seems to be comparing apples to oranges. Let me explain the details
below:

quoted

In the code below, pmem_persist() does the CLFLUSH(es) on the given range,
and pmem_memcpy_persist() does non-temporal MOVs with an SFENCE:

Yes. libpmem does what you describe above and the name
pmem_memcpy_persist() is thus currently misleading because it is not
guaranteed to be persistent with the current implementation of DAX in
the kernel.

It is important to know which kernel version and which filesystem you
used for the test to be able to judge the details, but generally pmem_persist()
does properly tell the filesystem to flush all metadata associated with the
file, commit open transactions etc. That's the full cost of persistence.

I used NVML 1.1 for the measurements.  In this version and with the hardware
that I used, the pmem_persist() flow is:

  pmem_persist()
    pmem_flush()
      Func_flush() == flush_clflush
        CLFLUSH
    pmem_drain()
      Func_predrain_fence() == predrain_fence_empty
        no-op

So, I don't think that pmem_persist() does anything to cause the filesystem
to flush metadata as it doesn't make any system calls?

pmem_memcpy_persist() makes sure the data writes have reached persistent
storage but nothing guarantees associated metadata changes have reached
persistent storage as well.

While metadata is certainly important, my goal with this specific test was
to measure the "raw" performance of cached+flush vs uncached, without
anything else in the way.

To assure that, fsync() (or pmem_persist()
if you wish) is currently the only way from userspace.

Perhaps you mean pmem_msync() here?  pmem_msync() calls msync(), but
pmem_persist() does not.
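
For comparison, the system-call route can be sketched with plain POSIX calls.
This is a minimal illustration that assumes a regular file descriptor and does
not use libpmem; copy_and_msync() is a hypothetical helper name. The point is
only that msync() enters the kernel, which gives the filesystem a chance to
act, whereas a user-space cache flush does not.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the first map_len bytes of fd shared, copy len bytes from src to
 * offset off in the mapping, and ask the kernel to write the mapping back
 * with msync(MS_SYNC). Returns 0 on success, -1 on error.
 */
int copy_and_msync(int fd, size_t map_len, size_t off,
		   const void *src, size_t len)
{
	char *map;
	int rc;

	assert(off + len <= map_len);
	map = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return -1;

	memcpy(map + off, src, len);          /* ordinary cached stores */
	rc = msync(map, map_len, MS_SYNC);    /* system call, blocks until done */

	munmap(map, map_len);
	return rc;
}
```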

At which point
you've lost most of the advantages of using movnt. Ross is researching
possibilities for a more efficient userspace implementation, but
currently there are none.

Apart from the current performance discussion, if the metadata for a file
is already established (file created, space allocated by explicit write()s,
and everything synced), then if I map it and do pmem_memcpy_persist(),
are there any "ongoing" metadata updates that would need to be flushed
(besides timestamps)?
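
The setup assumed in that question could look roughly like the sketch below:
the file is created, its blocks are allocated by explicit write()s, and the
result is fsync()ed before any mapping takes place. create_preallocated() is a
hypothetical helper name, and the sketch only illustrates the precondition; it
does not answer whether later stores through a mapping trigger further
metadata updates.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create (or truncate) path, allocate `size` bytes by writing zeros, and
 * fsync() the file. Returns an open read/write descriptor, or -1 on error.
 * Note: making the directory entry itself durable would additionally need
 * an fsync() on the parent directory, which is omitted here.
 */
int create_preallocated(const char *path, size_t size)
{
	char zeros[4096];
	size_t done = 0;
	int fd;

	assert(size > 0);
	memset(zeros, 0, sizeof zeros);

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	while (done < size) {
		size_t chunk = size - done < sizeof zeros ? size - done
							  : sizeof zeros;
		ssize_t n = write(fd, zeros, chunk);

		if (n <= 0) {
			close(fd);
			return -1;
		}
		done += (size_t)n;
	}

	if (fsync(fd) != 0) {   /* flush data and file metadata */
		close(fd);
		return -1;
	}
	return fd;
}
```

The returned descriptor can then be passed to mmap() with MAP_SHARED, after
which the stores in question go to blocks that were allocated and synced
beforehand.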


Brian
