Re: Subtle races between DAX mmap fault and write path
From: Dan Williams <hidden>
Date: 2016-07-30 00:53:09
Also in: linux-fsdevel, linux-xfs, nvdimm
On Fri, Jul 29, 2016 at 5:12 PM, Dave Chinner [off-list ref] wrote:
> On Fri, Jul 29, 2016 at 07:44:25AM -0700, Dan Williams wrote:
>> On Thu, Jul 28, 2016 at 7:21 PM, Dave Chinner [off-list ref] wrote:
>>> On Thu, Jul 28, 2016 at 10:10:33AM +0200, Jan Kara wrote:
>>>> On Thu 28-07-16 08:19:49, Dave Chinner wrote:
>>>> [..]
>>>> So DAX doesn't need flushing to maintain a consistent view of the data, but it does need flushing to make sure fsync(2) results in data written via mmap reaching persistent storage.
>>>
>>> I thought this all changed with the removal of the pcommit instruction and wmb_pmem() going away. Isn't it now a platform requirement that dirty cache lines over persistent memory ranges are either guaranteed to be flushed to persistent storage on power fail or when required by REQ_FLUSH?
>>
>> No, nothing automates cache flushing. The path of a write is:
>>
>> cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media
>>
>> The ADR mechanism and the wpq-flush facility flush data through the imc (integrated memory controller) to media. dax_do_io() gets writes to the imc, but we still need a posted-write-buffer flush mechanism to guarantee data makes it out to media.
>
> So what you are saying is that on an ADR machine, we have these domains w.r.t. power fail:
>
> cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media
> |-------------volatile-------------------|-----persistent--------------|
>
> because anything that gets to the IMC is guaranteed to be flushed to stable media on power fail.
>
> But on a posted-write-buffer system, we have this:
>
> cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media
> |-------------volatile-------------------------------------------|--persistent--|
>
> IOWs, only things already posted to the media via REQ_FLUSH are considered stable on persistent media. What happens in this case when power fails during a media update? Incomplete writes?
Yes, power failure during a media update will end up with incomplete writes on an 8-byte boundary.
>>> Or have we somehow ended up with the fucked up situation where dax_do_io() writes are (effectively) immediately persistent and untracked by internal infrastructure, whilst mmap() writes require internal dirty tracking and fsync() to flush caches via writeback?
>>
>> dax_do_io() writes are not immediately persistent. They bypass the cpu-cache and cpu-write-buffer and are ready to be flushed to media by REQ_FLUSH or power-fail on an ADR system.
>
> IOWs, on an ADR system a write is /effectively/ immediately persistent because if power fails ADR guarantees it will be flushed to stable media, while on a posted write system it is volatile and will be lost. Right?
Right.
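To make the two cases above concrete, here is a minimal C sketch of the persistence domains as described in this thread. The names `write_stage`, `pmem_platform` and `survives_power_fail()` are made up for illustration and are not kernel interfaces; the model only encodes the two diagrams quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Stages a store passes through, in the order given in the thread. */
enum write_stage {
    STAGE_CPU_CACHE,
    STAGE_CPU_WRITE_BUFFER,
    STAGE_BUS,
    STAGE_IMC,
    STAGE_IMC_WRITE_BUFFER,
    STAGE_MEDIA,
};

/* Hypothetical platform descriptor: does ADR drain the imc on power fail? */
struct pmem_platform {
    bool has_adr;
};

/* Returns true if data that has reached `stage` survives a power failure. */
static bool survives_power_fail(const struct pmem_platform *p,
                                enum write_stage stage)
{
    assert(p);
    if (p->has_adr)
        return stage >= STAGE_IMC;  /* ADR: anything at or past the imc */
    return stage == STAGE_MEDIA;    /* posted-write: only data on media */
}
```

Under this model a dax_do_io() write that has reached the imc is durable only when `has_adr` is set, which is exactly the hardware-dependent behaviour being questioned.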
> If so, that's even worse than just having mmap/write behave differently - now writes will behave differently depending on the specific hardware installed. I think this makes it even more important for the DAX code to hide this behaviour from the filesystems by treating everything as volatile.
The symmetry does sound appealing...
> If we track the dirty blocks from write in the radix tree like we do for mmap, then we can just use a normal memcpy() in dax_do_io(), getting rid of the slow cache bypass that is currently run. Radix tree updates are much less expensive than a slow memcpy of large amounts of data, and fsync can then take care of persistence, just like we do for mmap.
If we go this route to increase the amount of dirty-data tracking in the radix tree, it raises the priority of one of the items on the backlog; namely, determine the crossover point where wbinvd of the entire cache is faster than a clflush / clwb loop.
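As a rough illustration of what "crossover point" means here, the sketch below uses a simple linear cost model: a per-line flush loop costs `per_line_ns` for each dirty line, and a whole-cache flush costs a fixed `whole_cache_ns`. Both fields, the 64-byte line size and the function names are assumptions for the example. Real costs would have to be measured per platform, and the model ignores the cost of refilling hot data after a whole-cache writeback and invalidate.

```c
#include <assert.h>
#include <stdint.h>

#define CACHELINE_SIZE 64u  /* assumed line size */

/* Hypothetical, platform-measured costs in nanoseconds. */
struct flush_costs {
    uint64_t per_line_ns;     /* one clflush/clwb on one line */
    uint64_t whole_cache_ns;  /* one whole-cache flush (wbinvd) */
};

/* Smallest number of dirty lines at which the loop is no longer cheaper. */
static uint64_t crossover_lines(const struct flush_costs *c)
{
    assert(c && c->per_line_ns > 0);
    return (c->whole_cache_ns + c->per_line_ns - 1) / c->per_line_ns;
}

/* Nonzero if flushing the whole cache is predicted to be no slower. */
static int prefer_whole_cache_flush(const struct flush_costs *c,
                                    uint64_t dirty_bytes)
{
    uint64_t lines = (dirty_bytes + CACHELINE_SIZE - 1) / CACHELINE_SIZE;
    return lines >= crossover_lines(c);
}
```

With placeholder costs of 100 ns per line and 1 ms for the whole cache, the model puts the crossover at 10000 dirty lines (640000 bytes). Those numbers are only there to exercise the arithmetic and are not measurements.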
> We should just make the design assumption that all persistent memory is volatile, track where we dirty it in all paths, and use the fastest volatile memcpy primitives available to us in the IO path. We'll end up with a faster fastpath than if we use CPU cache bypass copies, dax_do_io() and mmap will be coherent and synchronised, and fsync() will have the same requirements and overhead regardless of the way the application modifies the pmem or the hardware platform used to implement the pmem.
I like the direction, but I'd still want to measure where/whether it's actually faster, given that the writes may have evicted hot data, and given the amortized cost of the cache flushing loop.
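For reference, the "treat everything as volatile" scheme discussed above can be sketched in user-space C as follows. This is a toy model, not the DAX implementation: a per-cache-line bitmap stands in for the radix tree, a counter stands in for the clwb/clflush instructions, and `pmem_region`, `region_write()` and `region_fsync()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE    64u    /* assumed cache line size */
#define REGION_LINES 1024u  /* toy region: 64 KiB */

struct pmem_region {
    uint8_t data[REGION_LINES * LINE_SIZE];
    uint8_t dirty[REGION_LINES / 8];  /* one bit per line (radix tree stand-in) */
    unsigned long flushed_lines;      /* counts simulated line flushes */
};

/* Write path: plain cached memcpy, then record which lines were dirtied. */
static void region_write(struct pmem_region *r, size_t off,
                         const void *src, size_t len)
{
    assert(len > 0 && off <= sizeof(r->data) && len <= sizeof(r->data) - off);
    memcpy(r->data + off, src, len);
    for (size_t line = off / LINE_SIZE; line <= (off + len - 1) / LINE_SIZE; line++)
        r->dirty[line / 8] |= (uint8_t)(1u << (line % 8));
}

/* fsync path: flush every dirty line and clear its bit; returns lines flushed. */
static unsigned long region_fsync(struct pmem_region *r)
{
    unsigned long n = 0;
    for (size_t line = 0; line < REGION_LINES; line++) {
        uint8_t bit = (uint8_t)(1u << (line % 8));
        if (r->dirty[line / 8] & bit) {
            /* a real implementation would flush this cache line here */
            r->dirty[line / 8] &= (uint8_t)~bit;
            n++;
        }
    }
    r->flushed_lines += n;
    return n;
}
```

In this model a 100-byte write at offset 60 dirties three lines, the next fsync flushes exactly those three, and a second fsync has nothing to do. Whether the cached copy plus deferred flush beats a cache-bypassing copy is the measurement question raised above.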