Thread: 29 messages, 7 authors, 2021-10-20

Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

From: Jason Gunthorpe <jgg@ziepe.ca>
Date: 2021-10-18 18:26:03
Also in: amd-gfx, dri-devel, linux-mm, linux-xfs, nvdimm

On Sun, Oct 17, 2021 at 11:35:35AM -0700, Dan Williams wrote:
> > DAX is stuffing arrays of 4k pages into the PUD/PMDs. Aligning with
> > THP would make using normal refcounting much simpler. I looked at
> > teaching the mm core to deal with page arrays - it is certainly
> > doable, but it is quite inefficient and ugly mm code.
>
> THP does not support PUD, and neither does FSDAX, so it's only PMDs we
> need to worry about.

device-dax uses PUD, along with TTM, they are the only places. I'm not
sure TTM is a real place though.

> > So, can we fix DAX and TTM - the only uses of PUD/PMDs I could find?
> >
> > Joao has a series that does this to device-dax:
> >
> > https://lore.kernel.org/all/20210827145819.16471-1-joao.m.martins@oracle.com/
>
> That assumes there's never any need to fracture a huge page which
> FSDAX could not support unless the filesystem was built with 2MB block
> size.

As I understand things, something like FSDAX post-folio should
generate maximal compound pages for extents in the page cache that are
physically contiguous.

A high order folio can be placed in any lower order in the page
tables, so we never have to fracture it, unless the underlying pages
are moved around - which requires an unmap_mapping_range() cycle..
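
To illustrate the point about placing a high order folio at any lower order, here is a minimal userspace sketch. It is not kernel code: struct toy_folio and the toy_* helpers are made up for illustration, and it assumes 4 KiB base pages, a naturally aligned folio, and the x86-64 orders of 9 for a PMD and 18 for a PUD. It shows that an entry of any order up to the folio's order points at an aligned chunk inside the same folio, and that every PFN in the folio resolves to the same head.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12u	/* assumption: 4 KiB base pages */

/* Toy stand-in for a compound page / folio; not the kernel's struct. */
struct toy_folio {
	uint64_t head_pfn;	/* first page frame, aligned to 1 << order */
	unsigned int order;	/* folio covers 1 << order base pages */
};

/*
 * PFN that a page-table entry of size (1 << map_order) pages would
 * point at for a byte offset into the folio. Mapping at a lower order
 * only changes which aligned chunk is referenced; the folio is not split.
 */
static uint64_t toy_entry_pfn(const struct toy_folio *f, uint64_t offset,
			      unsigned int map_order)
{
	uint64_t idx = offset >> TOY_PAGE_SHIFT;

	assert(map_order <= f->order);
	assert(idx < (1ull << f->order));

	/* Round down to the start of the map_order-sized chunk. */
	idx &= ~((1ull << map_order) - 1);
	return f->head_pfn + idx;
}

/* Head PFN for any PFN inside the folio, relying on natural alignment. */
static uint64_t toy_head_pfn(const struct toy_folio *f, uint64_t pfn)
{
	assert(pfn >= f->head_pfn);
	assert(pfn < f->head_pfn + (1ull << f->order));
	return pfn & ~((1ull << f->order) - 1);
}
```

With an order-18 folio, the same offset can be reached through a PTE-, PMD- or PUD-sized entry, and toy_head_pfn() returns the same head for all three, which is the place a normal compound-page refcount would be taken.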

> > Assuming changing FSDAX is hard.. How would DAX people feel about just
> > deleting the PUD/PMD support until it can be done with compound pages?
>
> There are end users that would notice the PMD regression, and I think
> FSDAX PMDs with proper compound page metadata is on the same order of
> work as fixing the refcount.

Hmm, I don't know.. I sketched out the refcount stuff and the code is
OK but ugly and will add a conditional to some THP cases.

On the other hand, making THP unmap cases a bit slower is probably a
net win compared to making put_page a bit slower, considering unmap
is already quite heavy.

> > 4) Ask what the pgmap owner wants to do:
> >
> >     if (head->pgmap->deny_foll_longterm)
> >           return FAIL
>
> The pgmap itself does not know, but the "holder" could specify this
> policy.

Here I imagine the thing that creates the pgmap would specify the
policy it wants. In most cases the policy is tightly coupled to what
the free function in the provided dev_pagemap_ops does..

> Which is in line with the 'dax_holder_ops' concept being introduced
> for reverse mapping support. I.e. when the FS claims the dax-device
> it can specify at that point that it wants to forbid longterm.

Which is a reasonable refinement if we think there are cases where two
nvdimm users would want different things.
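
As a rough illustration of the two variants discussed above (policy fixed by whoever creates the pgmap, versus policy set later by a holder), here is a toy userspace model. None of it is existing kernel API: the toy_* types and functions are hypothetical, and deny_foll_longterm is only the field name from the pseudocode quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; these are NOT the kernel's struct definitions. */
struct toy_pgmap {
	/* Hypothetical policy bit, named after the quoted pseudocode. */
	bool deny_foll_longterm;
};

struct toy_page {
	struct toy_pgmap *pgmap;	/* NULL for ordinary, non-ZONE_DEVICE memory */
};

enum toy_gup_result { TOY_GUP_OK = 0, TOY_GUP_FAIL = -1 };

/* Variant 1: the check consults the policy stored in the pgmap. */
static enum toy_gup_result toy_try_longterm_pin(const struct toy_page *head)
{
	assert(head != NULL);
	if (head->pgmap && head->pgmap->deny_foll_longterm)
		return TOY_GUP_FAIL;
	return TOY_GUP_OK;
}

/*
 * Variant 2 (the "holder" refinement): whoever claims the device,
 * e.g. a filesystem, sets the policy at claim time.
 */
static void toy_holder_claim(struct toy_pgmap *pgmap, bool forbid_longterm)
{
	assert(pgmap != NULL);
	pgmap->deny_foll_longterm = forbid_longterm;
}
```

In this model the check itself is identical in both variants; the only difference is who writes the flag and when, which is why letting the holder decide reads as a refinement rather than a different mechanism.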

Anyhow, I'm wondering about a way forward. There are many balls in the
air, all linked:
 - Joao's compound page support for device_dax and more
 - Alex's DEVICE_COHERENT
 - The refcount normalization
 - Removing the pgmap test from GUP
 - Removing the need for the PUD/PMD/PTE special bit
 - Removing the need for the PUD/PMD/PTE devmap bit
 - Removing PUD/PMD vma_is_special
 - folios for fsdax
 - shootdown for fsdax

Frankly I'm leery of seeing more ZONE_DEVICE users crop up that depend on
the current semantics as that will only make it even harder to fix..

I think it would be good to see Joao's compound page support move
ahead..

So.. Does anyone want to work on finishing this patch series? I can
give some guidance on how I think it should work at least.

Jason