Thread (67 messages), 6 authors, 2020-01-07

Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

From: Dan Williams <hidden>
Date: 2019-12-21 00:32:33
Also in: bpf, dri-devel, kvm, linux-block, linux-doc, linux-fsdevel, linux-kselftest, linux-media, linux-mm, linux-rdma, lkml, netdev

On Fri, Dec 20, 2019 at 5:34 AM Jason Gunthorpe wrote:
On Thu, Dec 19, 2019 at 01:13:54PM -0800, John Hubbard wrote:
quoted
On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
quoted
On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
quoted
On 12/19/19 5:26 AM, Leon Romanovsky wrote:
quoted
On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
quoted
Hi,

This implements an API naming change (put_user_page*() -->
unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
extends that tracking to a few select subsystems. More subsystems will
be added in follow up work.
Hi John,

The patchset generates kernel panics in our IB testing. In our tests, we
allocated a single memory block and registered multiple MRs using that
block.

The possible bad flow is:
   ib_umem_get() ->
    pin_user_pages_fast(FOLL_WRITE) ->
     internal_get_user_pages_fast(FOLL_WRITE) ->
      gup_pgd_range() ->
       gup_huge_pd() ->
        gup_hugepte() ->
         try_grab_compound_head() ->
Hi Leon,

Thanks very much for the detailed report! So we're overflowing...

At first look, this seems likely to be hitting a weak point in the
GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
(there's a writeup in Documentation/core-api/pin_user_pages.rst, lines
99-121). Basically it's pretty easy to overflow the page->_refcount
with huge pages if the pages have a *lot* of subpages.

We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
Considering that establishing these pins is entirely under user
control, we can't have a limit here.
There's already a limit, it's just a much larger one. :) What does "no limit"
really mean, numerically, to you in this case?
I guess I mean 'hidden limit' - hitting the limit and failing would
be manageable.

I think 7 is probably too low, though we are not using 1GB huge
pages, only 2MB.
What about RDMA to 1GB-hugetlbfs and 1GB-device-dax mappings?
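
For reference, the "about 7 pins" figure quoted above can be reproduced with
a little arithmetic. The sketch below is not kernel code. It assumes that
GUP_PIN_COUNTING_BIAS is 1024 (the value is not stated in this thread), that
page->_refcount is a signed 32-bit counter, and that pinning every 4KB
subpage of a huge page adds one bias per subpage to the head page's
refcount. The helper name max_full_pins() is hypothetical, and references
held for other reasons are ignored.

```c
#include <stdint.h>

/* Assumed value of GUP_PIN_COUNTING_BIAS; not stated in this thread. */
#define PIN_BIAS 1024ULL

/* Assumption: the refcount is a signed 32-bit counter. */
#define REFCOUNT_MAX ((uint64_t)INT32_MAX)

/*
 * Hypothetical helper: how many times can all subpages of one huge page
 * be pinned before the head page's refcount would overflow, if each
 * subpage pin adds PIN_BIAS to the head page?
 */
static uint64_t max_full_pins(uint64_t huge_page_bytes, uint64_t subpage_bytes)
{
	uint64_t subpages = huge_page_bytes / subpage_bytes;
	uint64_t refs_per_full_pin = subpages * PIN_BIAS;

	return REFCOUNT_MAX / refs_per_full_pin;
}
```

Under these assumptions, max_full_pins(1ULL << 30, 4096) evaluates to 7 for
a 1GB huge page, while max_full_pins(2ULL << 20, 4096) evaluates to 4095 for
a 2MB huge page, which is consistent with the 2MB case leaving more headroom
than the 1GB case.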