Thread: 31 messages, 4 authors, 2021-12-15

Re: [PATCH 4/5] dax: remove the copy_from_iter and copy_to_iter methods

From: Vivek Goyal <vgoyal@redhat.com>
Date: 2021-12-15 16:01:19
Also in: dm-devel, linux-fsdevel, linux-s390, nvdimm

On Wed, Dec 15, 2021 at 10:30:50AM +0000, Stefan Hajnoczi wrote:
On Tue, Dec 14, 2021 at 03:32:43PM -0500, Vivek Goyal wrote:
On Tue, Dec 14, 2021 at 08:41:30AM -0800, Dan Williams wrote:
On Tue, Dec 14, 2021 at 6:23 AM Vivek Goyal [off-list ref] wrote:
On Mon, Dec 13, 2021 at 09:23:18AM +0100, Christoph Hellwig wrote:
On Sun, Dec 12, 2021 at 06:44:26AM -0800, Dan Williams wrote:
On Fri, Dec 10, 2021 at 6:17 AM Vivek Goyal [off-list ref] wrote:
Going forward, I am wondering whether virtiofs should use the flushcache
version as well. What if the host filesystem is using DAX and maps
persistent memory pfns directly into the qemu address space? I have never
tested that.

Right now we are relying on applications to do fsync/msync on virtiofs
for data persistence.
This sounds like it would need coordination with a paravirtualized
driver that can indicate whether the host side is pmem or not, like
the virtio_pmem driver. However, if the guest sends any fsync/msync
you would still need to go explicitly cache flush any dirty page
because you can't necessarily trust that the guest did that already.
Do we?  The application can't really know what backend it is on, so
it sounds like the current virtiofs implementation doesn't really, does it?
Agreed that the application does not know what backend it is on. So
virtiofs just offers the regular POSIX API, where applications have to do
fsync/msync for data persistence. No support for mmap(MAP_SYNC). We don't
offer the persistent memory programming model on virtiofs. That's not the
expectation. DAX is used only to bypass the guest page cache.

With this assumption, I think we might not have to use the flushcache
version at all, even if the shared filesystem is on persistent memory on
the host.

- We mmap() host files into the qemu address space. So any DAX store in
  virtiofs should dirty the corresponding pages in the host page cache,
  and when an fsync()/msync() comes later, it should flush all the data
  to PMEM.

- In case of file-extending writes, virtiofs falls back to the regular
  FUSE_WRITE path (and does not use DAX), and in that case the host pmem
  driver should make sure writes are flushed to pmem immediately.
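
As an illustration of the model in the two points above, here is a minimal
user-space sketch of what an application is expected to do: store through a
shared mapping, then call msync() as the persistence point. The helper name
write_and_sync() and the file path are made up for the example, and nothing
in it is virtiofs-specific. It uses neither MAP_SYNC nor user-space cache
flushes.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: store msg into a file through a shared mapping and
 * make it durable with msync(). Returns 0 on success, -1 on failure.
 */
int write_and_sync(const char *path, const char *msg)
{
	size_t len = strlen(msg);
	int fd, ret = -1;
	char *map;

	if (len == 0)
		return -1;	/* mmap() rejects zero-length mappings */

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	/* Size the file first so the stores below are not extending writes. */
	if (ftruncate(fd, (off_t)len) != 0)
		goto out_close;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	memcpy(map, msg, len);	/* plain store, only dirties the mapping */

	if (msync(map, len, MS_SYNC) == 0)	/* the persistence point */
		ret = 0;

	munmap(map, len);
out_close:
	close(fd);
	return ret;
}
```

Sizing the file with ftruncate() before mapping keeps the stores inside the
existing file size, so the sketch exercises the mmap path rather than the
file-extending write path.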

Is there any other path I am missing? If not, it looks like we might not
have to use the flushcache version in virtiofs at all, as long as we are
not offering guest applications user space flushes and MAP_SYNC support.

We still might have to use the machine check safe variant though, as loads
might generate a synchronous machine check. What's not clear to me is
whether this MC safe variant should be used only in case of PMEM or in
case of non-PMEM as well.
It should be used on any memory address that can throw an exception on
load, which is any physical address, in paths that can tolerate
memcpy() returning an error code (most I/O paths) and that can tolerate
slower copy performance on older platforms that do not support MC
recovery with fast string operations. To date that's only PMEM users.
OK, so basically the latest CPUs can do fast string operations with MC
recovery, so using the MC safe variant is not a problem there.

Then there is a range of CPUs which can do MC recovery but only with
slower versions of memcpy, and that's where the issue is.

So if we knew that the virtiofs DAX window is backed by a pmem device,
then we should always use the MC safe variant, even if it means paying
the price of the slow version for the sake of correctness.

But if we are not using pmem on the host, then there is no point in
using the MC safe variant.

IOW.

	if (virtiofs_backed_by_pmem) {
		use_mc_safe_version
	} else {
		use_non_mc_safe_version
	}
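
A slightly more concrete sketch of that selection, written as plain
user-space C so it can be compiled standalone. Everything here is
hypothetical: struct dax_window, its backed_by_pmem flag, and both copy
helpers are stand-ins, not existing virtiofs or kernel interfaces. The
assumed convention is that a copy helper returns the number of bytes it
could not copy, with 0 meaning success.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed convention: return the number of bytes NOT copied (0 = success). */
typedef size_t (*dax_copy_fn)(void *dst, const void *src, size_t len);

/* Stand-in for a plain copy with no machine check recovery. */
size_t copy_plain(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	return 0;
}

/* Stand-in for an MC safe copy; a real one could return a short count. */
size_t copy_mc_safe_stub(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	return 0;
}

/* Hypothetical per-device state; virtiofs has no such flag today. */
struct dax_window {
	bool backed_by_pmem;
};

/* Pick the copy routine once, based on what backs the window. */
dax_copy_fn dax_window_pick_copy(const struct dax_window *win)
{
	return win->backed_by_pmem ? copy_mc_safe_stub : copy_plain;
}
```

Choosing the routine once per device rather than per copy keeps the check
out of the I/O path, but it only works if the backing type is known up
front.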

Now the question is, how do we know whether the virtiofs DAX window is
backed by pmem or not? I checked the virtio_pmem driver and that does not
seem to communicate anything like that. It just communicates the start of
the range and the size of the range, nothing else.

I don't have a full handle on the stack of modules of virtio_pmem, but my
guess is it probably is using the MC safe version always (because it does
not know anything about the backing storage).

/me will definitely not like to pay the penalty of a slower memcpy if the
virtiofs device is not backed by pmem.
Reads from the page cache handle machine checks (filemap_read() ->
raw_copy_to_user()). I think virtiofs should therefore always handle
machine checks when reading from the DAX Window.
IIUC, raw_copy_to_user() does not handle recovery from a machine check. For
example, it can call copy_user_enhanced_fast_string() if the CPU supports
X86_FEATURE_ERMS. But the equivalent machine check safe version is
copy_mc_enhanced_fast_string() instead.

Hence, I don't think reading from the page cache is using machine check
safe variants by default. The copy_mc_to_user() path has to be taken
explicitly for machine check safe variants. And currently only the pmem
driver seems to take it, by calling _copy_mc_to_iter().
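
To make the difference concrete, here is a user-space simulation of what a
caller has to deal with once it opts into an MC safe copy: the copy can
stop early and report how many bytes were left. As I understand it, that
matches the return convention of the kernel's copy_mc helpers, but
sim_copy_mc(), sim_dax_read() and the poison_off parameter are invented
for this sketch and no real machine check is involved.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/*
 * Simulated MC safe copy. poison_off marks the first source byte that
 * would raise a machine check; pass (size_t)-1 for "no poison".
 * Returns the number of bytes NOT copied, 0 on full success.
 */
size_t sim_copy_mc(void *dst, const void *src, size_t len, size_t poison_off)
{
	size_t ok = poison_off < len ? poison_off : len;

	memcpy(dst, src, ok);
	return len - ok;
}

/*
 * Read-path wrapper: report the bytes that did arrive, or -EIO if not even
 * the first byte could be read.
 */
ssize_t sim_dax_read(void *dst, const void *src, size_t len, size_t poison_off)
{
	size_t rem = sim_copy_mc(dst, src, len, poison_off);

	if (rem == len && len != 0)
		return -EIO;
	return (ssize_t)(len - rem);
}
```

A plain memcpy()-style copy has no way to express the short-read and -EIO
cases, which is why the MC safe path has to be chosen explicitly by the
caller.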

Thanks
Vivek

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization