Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

[RFC PATCH v3 0/2] mempinfd: Add new syscall to provide memory pin · Zhou Wang <wangzhou1@hisilicon.com> · 2021-02-07
[RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Zhou Wang <wangzhou1@hisilicon.com> · 2021-02-07
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Matthew Wilcox <willy@infradead.org> · 2021-02-07
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Song Bao Hua (Barry Song) <hidden> · 2021-02-07
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Matthew Wilcox <willy@infradead.org> · 2021-02-08
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Song Bao Hua (Barry Song) <hidden> · 2021-02-08
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · David Hildenbrand <hidden> · 2021-02-08
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Song Bao Hua (Barry Song) <hidden> · 2021-02-08
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · David Hildenbrand <hidden> · 2021-02-08
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Song Bao Hua (Barry Song) <hidden> · 2021-02-08
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · David Rientjes <rientjes@google.com> · 2021-02-08
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Song Bao Hua (Barry Song) <hidden> · 2021-02-08
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Zhou Wang <wangzhou1@hisilicon.com> · 2021-02-09
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Arnd Bergmann <arnd@kernel.org> · 2021-02-07
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Zhou Wang <wangzhou1@hisilicon.com> · 2021-02-09
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Andy Lutomirski <luto@amacapital.net> · 2021-02-07
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Zhou Wang <wangzhou1@hisilicon.com> · 2021-02-09
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Greg KH <gregkh@linuxfoundation.org> · 2021-02-09
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Zhou Wang <wangzhou1@hisilicon.com> · 2021-02-09
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Greg KH <gregkh@linuxfoundation.org> · 2021-02-09
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Zhou Wang <wangzhou1@hisilicon.com> · 2021-02-09
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Matthew Wilcox <willy@infradead.org> · 2021-02-10
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · David Hildenbrand <hidden> · 2021-02-08
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Jason Gunthorpe <jgg@ziepe.ca> · 2021-02-08
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Song Bao Hua (Barry Song) <hidden> · 2021-02-08
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Jason Gunthorpe <jgg@ziepe.ca> · 2021-02-08
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Song Bao Hua (Barry Song) <hidden> · 2021-02-09
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Jason Gunthorpe <jgg@ziepe.ca> · 2021-02-09
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Song Bao Hua (Barry Song) <hidden> · 2021-02-10
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Jason Gunthorpe <jgg@ziepe.ca> · 2021-02-10
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · Song Bao Hua (Barry Song) <hidden> · 2021-02-10
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin · David Hildenbrand <hidden> · 2021-02-11
[RFC PATCH v3 2/2] selftests/vm: add mempinfd test · Zhou Wang <wangzhou1@hisilicon.com> · 2021-02-07

From: Matthew Wilcox <willy@infradead.org>
Date: 2021-02-07 21:35:29
Also in: linux-api, linux-arm-kernel, linux-iommu, linux-mm

On Sun, Feb 07, 2021 at 04:18:03PM +0800, Zhou Wang wrote:

SVA(share virtual address) offers a way for device to share process virtual
address space safely, which makes more convenient for user space device
driver coding. However, IO page faults may happen when doing DMA
operations. As the latency of IO page fault is relatively big, DMA
performance will be affected severely when there are IO page faults.

From a long term view, DMA performance will be not stable.

In high-performance I/O cases, accelerators might want to perform
I/O on a memory without IO page faults which can result in dramatically
increased latency. Current memory related APIs could not achieve this
requirement, e.g. mlock can only avoid memory to swap to backup device,
page migration can still trigger IO page fault.

Well ... we have two requirements.  The application wants to not take
page faults.  The system wants to move the application to a different
NUMA node in order to optimise overall performance.  Why should the
application's desires take precedence over the kernel's desires?  And why
should it be done this way rather than by the sysadmin using numactl to
lock the application to a particular node?

+struct mem_pin_container {
+	struct xarray array;
+	struct mutex lock;
+};

I don't understand what the lock actually protects.

+struct pin_pages {
+	unsigned long first;
+	unsigned long nr_pages;
+	struct page **pages;
+};

I don't think you need 'first', and I think you can embed the pages
array into this struct, removing one allocation.
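
A minimal userspace sketch of the single-allocation layout suggested above, assuming a C99 flexible array member. Here `struct page` is only an opaque stand-in and `pin_pages_alloc()` is a hypothetical helper, not something from the patch; in-kernel code would more likely size the allocation with struct_size() and allocate with kvmalloc() instead of calloc().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct page;	/* opaque stand-in for the kernel's struct page */

/* 'first' is dropped; the pointer array lives at the tail of the struct. */
struct pin_pages {
	unsigned long nr_pages;
	struct page *pages[];	/* flexible array member */
};

/* Hypothetical helper: one allocation covers both header and array. */
static struct pin_pages *pin_pages_alloc(unsigned long nr_pages)
{
	struct pin_pages *p;

	/* Reject counts whose total size would overflow size_t. */
	if (nr_pages > (SIZE_MAX - sizeof(*p)) / sizeof(p->pages[0]))
		return NULL;

	p = calloc(1, sizeof(*p) + nr_pages * sizeof(p->pages[0]));
	if (p)
		p->nr_pages = nr_pages;
	return p;
}
```

With this layout a single free() (kvfree() in the kernel) releases everything, so the separate vfree(p->pages) goes away.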

+	xa_for_each(&priv->array, idx, p) {
+		unpin_user_pages(p->pages, p->nr_pages);
+		xa_erase(&priv->array, p->first);
+		vfree(p->pages);
+		kfree(p);
+	}
+
+	mutex_destroy(&priv->lock);
+	xa_destroy(&priv->array);

If you just called xa_erase() on every element of the array, you don't need
to call xa_destroy().

+	if (!can_do_mlock())
+		return -EPERM;

You check for can_do_mlock(), but you don't account the pages to this
rlimit.
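
To illustrate what accounting against the rlimit means, here is a userspace model of the bookkeeping, not kernel code. `struct lock_account` and both helpers are hypothetical names, and PAGE_SHIFT of 12 is an assumption; in the kernel an existing helper such as account_locked_vm() would probably be the better fit, with the limit taken from RLIMIT_MEMLOCK.

```c
#include <stdbool.h>

#define PAGE_SHIFT	12	/* assumption: 4 KiB pages */

/* Hypothetical per-process bookkeeping for pinned/locked memory. */
struct lock_account {
	unsigned long locked_pages;	/* pages currently charged */
	unsigned long limit_bytes;	/* stands in for RLIMIT_MEMLOCK */
};

/* Charge nr_pages; refuse if that would exceed the limit. */
static bool charge_locked_pages(struct lock_account *acct,
				unsigned long nr_pages)
{
	unsigned long limit_pages = acct->limit_bytes >> PAGE_SHIFT;

	/* Written to avoid overflow in locked_pages + nr_pages. */
	if (nr_pages > limit_pages ||
	    acct->locked_pages > limit_pages - nr_pages)
		return false;

	acct->locked_pages += nr_pages;
	return true;
}

/* Undo a previous successful charge, e.g. on unpin or on error. */
static void uncharge_locked_pages(struct lock_account *acct,
				  unsigned long nr_pages)
{
	acct->locked_pages -= nr_pages;
}
```

The point is that the charge happens before pinning and is undone on every unpin and error path; checking can_do_mlock() alone never updates the count.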

+	first = (addr->addr & PAGE_MASK) >> PAGE_SHIFT;

You don't need to mask off the bits; the shift will remove them.

+	last = ((addr->addr + addr->size - 1) & PAGE_MASK) >> PAGE_SHIFT;

DIV_ROUND_UP()?
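
A small userspace sketch of the arithmetic behind the two comments above. This is only one possible reading of the DIV_ROUND_UP() hint: compute the page count directly rather than a 'last' index. The macros mirror the usual kernel definitions, PAGE_SHIFT of 12 is an assumption, and both helper names are hypothetical.

```c
#define PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Index of the page containing addr; the shift drops the offset bits. */
static unsigned long first_page(unsigned long addr)
{
	return addr >> PAGE_SHIFT;
}

/* Number of pages touched by [addr, addr + size), for size > 0. */
static unsigned long nr_pages_spanned(unsigned long addr, unsigned long size)
{
	unsigned long offset = addr & ~PAGE_MASK;	/* offset in first page */

	return DIV_ROUND_UP(offset + size, PAGE_SIZE);
}
```

For example, a 2-byte buffer starting at the last byte of a page spans two pages, which a plain size / PAGE_SIZE would miss.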

+	pages = vmalloc(nr_pages * sizeof(struct page *));

kvmalloc().  vmalloc() always allocates at least a page, so we want to
use kmalloc() if the size is small.  Also, use array_size() -- I know this
can't overflow, but let's be clear.
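
For readers unfamiliar with array_size(): it multiplies two sizes and saturates to SIZE_MAX on overflow, so the allocator rejects the request instead of returning a short buffer. A userspace analogue might look like the following; `array_size_sat()` is a hypothetical name and `__builtin_mul_overflow()` assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace analogue of array_size(): saturate to SIZE_MAX on overflow. */
static size_t array_size_sat(size_t n, size_t elem_size)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, elem_size, &bytes))
		return SIZE_MAX;	/* no allocator will satisfy this size */
	return bytes;
}
```

In the patch this would presumably become something like kvmalloc(array_size(nr_pages, sizeof(struct page *)), GFP_KERNEL), freed with kvfree().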

+	ret = pin_user_pages_fast(addr->addr & PAGE_MASK, nr_pages,
+				  flags | FOLL_LONGTERM, pages);
+	if (ret != nr_pages) {
+		pr_err("mempinfd: Failed to pin page\n");

No.  You mustn't let the user generate messages to syslog just by
passing garbage to a syscall.

+	ret = xa_insert(&priv->array, p->first, p, GFP_KERNEL);
+	if (ret)
+		goto unpin_pages;

Hmm.  So we can't pin two ranges which start at the same address, but we
can pin two overlapping ranges.  Is that OK?
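
To make the question concrete, here is a userspace sketch of the two cases. xa_insert() fails with -EBUSY when the index is already occupied, so keying on the first page index only catches identical starts. `struct pin_range` and both predicates are hypothetical names used for illustration.

```c
#include <stdbool.h>

/* A pinned range in page units, keyed by 'first' as in the patch. */
struct pin_range {
	unsigned long first;
	unsigned long nr_pages;
};

/* What keying the xarray on 'first' actually detects. */
static bool same_key(const struct pin_range *a, const struct pin_range *b)
{
	return a->first == b->first;
}

/* What it does not detect: two half-open ranges sharing any page. */
static bool ranges_overlap(const struct pin_range *a,
			   const struct pin_range *b)
{
	return a->first < b->first + b->nr_pages &&
	       b->first < a->first + a->nr_pages;
}
```

Pages 10-13 and pages 12-15 have different keys, so both inserts succeed even though pages 12 and 13 are pinned twice.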
