Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce

From: Jason Gunthorpe <jgg@ziepe.ca>
Date: 2018-11-19 19:27:07
Also in: linux-crypto, linux-doc, linux-rdma, lkml

On Mon, Nov 19, 2018 at 02:17:21PM -0500, Jerome Glisse wrote:

On Mon, Nov 19, 2018 at 11:53:33AM -0700, Jason Gunthorpe wrote:

On Mon, Nov 19, 2018 at 01:42:16PM -0500, Jerome Glisse wrote:

On Mon, Nov 19, 2018 at 11:27:52AM -0700, Jason Gunthorpe wrote:

On Mon, Nov 19, 2018 at 11:48:54AM -0500, Jerome Glisse wrote:

Just to comment on this: any InfiniBand driver which uses umem and does
not have ODP (here ODP for me means listening to mmu notifiers, so all
InfiniBand drivers except mlx5) will be affected by the same issue AFAICT.

AFAICT there is nothing special happening after fork() inside any of
those drivers. So if the parent creates a umem MR before fork() and
programs the hardware with it, then after fork() the parent might start
using a new page for the umem range while the old memory is used by the
child. The reverse is also true (parent using old memory and child new
memory). Bottom line: you cannot predict which memory the child or the
parent will use for the range after fork().

So whether you look at the child or the parent, what the hw will use
for the MR is unlikely to match what the CPU uses for the same virtual
address. In other words:

Before fork:
    CPU parent: virtual addr ptr1 -> physical address = 0xCAFE
    HARDWARE:   virtual addr ptr1 -> physical address = 0xCAFE

Case 1:
    CPU parent: virtual addr ptr1 -> physical address = 0xCAFE
    CPU child:  virtual addr ptr1 -> physical address = 0xDEAD
    HARDWARE:   virtual addr ptr1 -> physical address = 0xCAFE

Case 2:
    CPU parent: virtual addr ptr1 -> physical address = 0xBEEF
    CPU child:  virtual addr ptr1 -> physical address = 0xCAFE
    HARDWARE:   virtual addr ptr1 -> physical address = 0xCAFE
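
The CPU side of the cases above can be observed from user space. The following is a minimal sketch, not taken from the thread: the helper name `child_view_after_parent_write` and the pipe-based synchronization are illustrative assumptions. It shows only that parent and child stop sharing the page once one of them writes after fork(). It does not show which physical page a device would keep using.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): the parent stores `before`
 * at ptr1, forks, then overwrites it with `after`. The child waits for
 * that write and reports the byte it still sees at ptr1.
 * Returns the child's byte, or -1 on error.
 */
int child_view_after_parent_write(unsigned char before, unsigned char after)
{
    long page = sysconf(_SC_PAGESIZE);
    int go[2];
    int status;

    unsigned char *ptr1 = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr1 == MAP_FAILED)
        return -1;
    if (pipe(go) != 0) {
        munmap(ptr1, (size_t)page);
        return -1;
    }
    ptr1[0] = before;

    pid_t pid = fork();
    if (pid < 0) {
        close(go[0]);
        close(go[1]);
        munmap(ptr1, (size_t)page);
        return -1;
    }
    if (pid == 0) {
        char c;
        close(go[1]);
        /* Block until the parent has done its post-fork write. */
        if (read(go[0], &c, 1) != 1)
            _exit(255);
        _exit(ptr1[0]);            /* report what the child sees */
    }

    close(go[0]);
    ptr1[0] = after;               /* write after fork(): breaks sharing */
    if (write(go[1], "x", 1) != 1)
        status = -1;
    close(go[1]);

    if (waitpid(pid, &status, 0) < 0) {
        munmap(ptr1, (size_t)page);
        return -1;
    }
    munmap(ptr1, (size_t)page);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `child_view_after_parent_write(0xCA, 0xBE)` returns the byte the child observes, which is the pre-fork value since the parent's write lands in a page the child no longer shares.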

IIRC this is solved in IB by automatically calling
madvise(MADV_DONTFORK) before creating the MR.

MADV_DONTFORK
  .. This is useful to prevent copy-on-write semantics from changing the
  physical location of a page if the parent writes to it after a
  fork(2) ..

This would work around the issue but it is not transparent, i.e. a
range marked with DONTFORK no longer behaves as expected from the
application's point of view.

Do you know what the difference is? The man page really gives no
hint..

Does it sometimes unmap the pages during fork?

It is handled in kernel/fork.c, look for DONTCOPY; basically it just
leaves an empty page table in the child process so the child will have
to fault in new pages. This also means that the child will get 0 as the
initial value for all memory addresses under DONTCOPY/DONTFORK, which
breaks the application's expectation of what fork() does.

Hum, I wonder why this API was selected then..

I actually wonder if the kernel is a bit broken here, we have the same
problem with O_DIRECT and other stuff, right?

No it is not, O_DIRECT is fine. The only corner case I can think
of with O_DIRECT is one thread launching an O_DIRECT that writes
to private anonymous memory (other O_DIRECT cases do not matter)
while another thread calls fork(); then what the child gets can be
undefined, i.e. either it gets the data from before the O_DIRECT
finishes or it gets the result of the O_DIRECT. But this is really
what you should expect when doing such a thing without synchronization.

So O_DIRECT is fine.

?? How can O_DIRECT be fine but RDMA not? They use exactly the same
get_user_pages flow, right? Can we do what O_DIRECT does in RDMA and
be fine too?

AFAIK the only difference is the length of the race window. You'd have
to fork and fault during the shorter time O_DIRECT has get_user_pages
open.

Really, if I have a get_user_pages FOLL_WRITE on a page and we fork,
then shouldn't the COW immediately be broken during the fork?

The kernel can't guarantee that an ongoing DMA will not write to those
pages, and it breaks the fork semantics to write to both processes.

Fixing that would incur a high cost: need to grow struct page, need
to copy potentially gigabytes of memory during fork() ... this would be
a serious performance regression for many folks just to work around an
abuse by device drivers. So I don't think anything on that front would
be welcome.

Why? Keep track in each mm if there are any active get_user_pages
FOLL_WRITE pages in the mm, if yes then sweep the VMAs and fix the
issue for the FOLL_WRITE pages.

John is already working on being able to detect pages under GUP, so it
seems like a small step..

Since nearly all cases of fork don't have a GUP FOLL_WRITE active
there would be no performance hit.

umem without proper ODP and VFIO are the only bad users I know of (for
VFIO you can argue that it is part of the API contract and thus that
it is not an abuse, but it is not spelled out loud in the documentation).
I have been trying to push back on people trying to push things that
would make the same mistake, or at least making sure they understand
what is happening.

It is something we have to live with and support for the foreseeable
future.

What really needs to happen is people fixing their hardware and doing
the right thing (good software engineer versus evil hardware engineer ;))

Even ODP is no panacea, there are performance problems. What we really
need is CAPI-like stuff, so you will tell Intel to redesign the CPU??
:)

Jason
