RE: Subtle races between DAX mmap fault and write path
From: Boylston, Brian <hidden>
Date: 2016-08-06 21:33:33
Also in: linux-fsdevel, linux-xfs, nvdimm
Dave Chinner wrote on 2016-08-05:
> [ cut to just the important points ]
>
> On Thu, Aug 04, 2016 at 06:40:42PM +0000, Kani, Toshimitsu wrote:
>> On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
>>> If I drop the fsync from the buffered IO path, bandwidth remains
>>> the same but runtime drops to 0.55-0.57s, so again the buffered IO
>>> write path is faster than DAX while doing more work.
>>
>> I do not think the test results are relevant on this point because
>> both buffered and dax write() paths use uncached copy to avoid
>> clflush. The buffered path uses cached copy to the page cache and
>> then use uncached copy to PMEM via writeback. Therefore, the
>> buffered IO path also benefits from using uncached copy to avoid
>> clflush.
>
> Except that I tested without the writeback path for buffered IO, so
> there was a direct comparison for single cached copy vs single
> uncached copy.
>
> The undenial fact is that a write() with a single cached copy with
> all the overhead of dirty page tracking is /faster/ than a much
> shorter, simpler IO path that uses an uncached copy. That's what the
> numbers say....
>
>> Cached copy (req movq) is slightly faster than uncached copy,
>
> Not according to Boaz - he claims that uncached is 20% faster than
> cached. How about you two get together, do some benchmarking and get
> your story straight, eh?
>
>> and should be used for writing to the page cache. For writing to
>> PMEM, however, additional clflush can be expensive, and allocating
>> cachelines for PMEM leads to evict application's cachelines.
>
> I keep hearing people tell me why cached copies are slower, but
> no-one is providing numbers to back up their statements. The only
> numbers we have are the ones I've published showing cached copies w/
> full dirty tracking is faster than uncached copy w/o dirty tracking.
>
> Show me the numbers that back up your statements, then I'll listen
> to you.

Here are some numbers for a particular scenario, and the code is below.
Time (in seconds) to copy a 16KiB buffer 1M times to a 4MiB NVDIMM buffer
(1M total memcpy()s). For the cached+clflush case, the flushes are done
every 4MiB (which seems slightly faster than flushing every 16KiB):

                    NUMA local    NUMA remote
  Cached+clflush       13.5          37.1
  movnt                 1.0           1.3

In the code below, pmem_persist() does the CLFLUSH(es) on the given range,
and pmem_memcpy_persist() does non-temporal MOVs with an SFENCE:

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <libpmem.h>

/*
 * gcc -Wall -O2 -m64 -mcx16 -o memcpyperf memcpyperf.c -lpmem
 *
 * Not sure if -mcx16 allows gcc to use faster memcpy bits?
 */

/*
 * our source buffer.  we'll copy this much at a time.
 * align it so that memcpy() doesn't have to do anything funny.
 */
char __attribute__((aligned(0x100))) src[4 * 4096];

int
main(
    int    argc,
    char*  argv[]
    )
{
    char*   path;
    char    mode;
    int     nloops;
    char*   dstbase;
    size_t  dstsz;
    int     ispmem;
    int     cpysz;
    char*   dst;

    if (argc != 4) {
        fprintf(stderr, "ERROR: usage: "
                "memcpyperf [cached | nt] PATH NLOOPS\n");
        exit(1);
    }

    /* only the first letter of the mode matters: 'c' is cached */
    mode = argv[1][0];
    path = argv[2];
    nloops = atoi(argv[3]);

    /* map the whole existing file; its size comes back in dstsz */
    dstbase = pmem_map_file(path, 0, 0, 0, &dstsz, &ispmem);
    if (NULL == dstbase) {
        perror(path);
        exit(1);
    }

    if (!ispmem)
        fprintf(stderr, "WARNING: %s is not pmem\n", path);

    if (dstsz < sizeof(src))
        cpysz = dstsz;
    else
        cpysz = sizeof(src);

    /* dstsz is a size_t, so print it with %zu */
    fprintf(stderr, "INFO: dst %p src %p dstsz %zu cpysz %d\n",
            dstbase, src, dstsz, cpysz);

    dst = dstbase;
    while (nloops--) {
        if (mode == 'c') {
            memcpy(dst, src, cpysz);
            /*
             * we could do the clflush here on the 16KiB we just
             * wrote, but with a 4MiB file (dst buffer) and 16KiB
             * src buffer, it seems slightly faster to flush the
             * entire 4MiB below
             */
            //pmem_persist(dst, cpysz);
        }
        else {
            pmem_memcpy_persist(dst, src, cpysz);
        }

        dst += cpysz;

        /* wrap to the start once the next copy would not fit */
        if ((dst + cpysz) - dstbase > dstsz) {
            dst = dstbase;
            /* see note above */
            if (mode == 'c')
                pmem_persist(dst, dstsz);
        }
    }

    exit(0);

} /* main() */

Sample runs:
$ numactl -N0 time -p ./memcpyperf c /pmem0/brian/cpt 1000000
INFO: dst 0x7f3f1a000000 src 0x601200 dstsz 4194304 cpysz 16384
real 13.53
user 13.53
sys 0.00
$ numactl -N0 time -p ./memcpyperf n /pmem0/brian/cpt 1000000
INFO: dst 0x7f2b54600000 src 0x601200 dstsz 4194304 cpysz 16384
real 1.04
user 1.04
sys 0.00
$ numactl -N1 time -p ./memcpyperf c /pmem0/brian/cpt 1000000
INFO: dst 0x7f8f8c200000 src 0x601200 dstsz 4194304 cpysz 16384
real 37.13
user 37.15
sys 0.00
$ numactl -N1 time -p ./memcpyperf n /pmem0/brian/cpt 1000000
INFO: dst 0x7f77f7400000 src 0x601200 dstsz 4194304 cpysz 16384
real 1.24
user 1.24
sys 0.00
Brian