Thread (67 messages) 67 messages, 6 authors, 2020-01-07

Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

From: Dan Williams <hidden>
Date: 2019-12-21 00:33:31
Also in: bpf, dri-devel, kvm, linux-block, linux-doc, linux-fsdevel, linux-kselftest, linux-media, linux-mm, linux-rdma, lkml, netdev

On Fri, Dec 20, 2019 at 1:22 AM Jan Kara [off-list ref] wrote:
On Thu 19-12-19 12:30:31, John Hubbard wrote:
quoted
On 12/19/19 5:26 AM, Leon Romanovsky wrote:
quoted
On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
quoted
Hi,

This implements an API naming change (put_user_page*() -->
unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
extends that tracking to a few select subsystems. More subsystems will
be added in follow up work.
Hi John,

The patchset generates kernel panics in our IB testing. In our tests, we
allocated single memory block and registered multiple MRs using the single
block.

The possible bad flow is:
  ib_umem_geti() ->
   pin_user_pages_fast(FOLL_WRITE) ->
    internal_get_user_pages_fast(FOLL_WRITE) ->
     gup_pgd_range() ->
      gup_huge_pd() ->
       gup_hugepte() ->
        try_grab_compound_head() ->
Hi Leon,

Thanks very much for the detailed report! So we're overflowing...

At first look, this seems likely to be hitting a weak point in the
GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
(there's a writeup in Documentation/core-api/pin_user_page.rst, lines
99-121). Basically it's pretty easy to overflow the page->_refcount
with huge pages if the pages have a *lot* of subpages.

We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
Do you have any idea how many pins (repeated pins on the same page, which
it sounds like you have) might be involved in your test case,
and the huge page and system page sizes? That would allow calculating
if we're likely overflowing for that reason.

So, ideas and next steps:

1. Assuming that you *are* hitting this, I think I may have to fall back to
implementing the "deferred" part of this design, as part of this series, after
all. That means:

  For the pin/unpin calls at least, stop treating all pages as if they are
  a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page.
  That's not how it works now, and the need to hand back a huge array of
  subpages is part of the problem. This affects the callers too, so it's not
  a super quick change to make. (I was really hoping not to have to do this
  yet.)
Does that mean that you would need to make all GUP users huge page aware?
Otherwise I don't see how what you suggest would work... And I don't think
making all GUP users huge page aware is realistic (effort-wise) or even
wanted (maintenance overhead in all those places).

I believe there might be also a different solution for this: For
transparent huge pages, we could find a space in 'struct page' of the
second page in the huge page for proper pin counter and just account pins
there so we'd have full width of 32-bits for it.
That would require THP accounting for dax pages. It is something that
was probably going to be needed, but this would seem to force the
issue.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help