Thread (54 messages) 54 messages, 10 authors, 2024-09-06

Re: [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages

From: David Hildenbrand <hidden>
Date: 2024-07-02 11:02:34
Also in: linux-arm-kernel, linux-cxl, linux-doc, linux-ext4, linux-fsdevel, linux-mm, linux-xfs, lkml, nvdimm

On 02.07.24 12:19, Alistair Popple wrote:
David Hildenbrand [off-list ref] writes:
quoted
On 27.06.24 02:54, Alistair Popple wrote:
quoted
Currently DAX folio/page reference counts are managed differently to
normal pages. To allow these to be managed the same as normal pages
introduce dax_insert_pfn_pud. This will map the entire PUD-sized folio
and take references as it would for a normally mapped page.
This is distinct from the current mechanism, vmf_insert_pfn_pud,
which
simply inserts a special devmap PUD entry into the page table without
holding a reference to the page for the mapping.
Do we really have to involve mapcounts/rmap for daxfs pages at this
point? Or is this only "to make it look more like other pages" ?
The aim of the series is make FS DAX and other ZONE_DEVICE pages look
like other pages, at least with regards to the way they are refcounted.

At the moment they are not refcounted - instead their refcounts are
basically statically initialised to one and there are all these special
cases and functions requiring magic PTE bits (pXX_devmap) to do the
special DAX reference counting. This then adds some cruft to manage
pgmap references and to catch the 2->1 page refcount transition. All
this just goes away if we manage the page references the same as other
pages (and indeed we already manage DEVICE_PRIVATE and COHERENT pages
the same as normal pages).

So I think to make this work we at least need the mapcounts.
We only really need the mapcounts if we intend to do something like 
folio_mapcount() == folio_ref_count() to detect unexpected folio 
references, and if we have to have things like folio_mapped() working. 
For now that was not required, that's why I am asking.

Background also being that in a distant future folios will be decoupled 
more from other compound pages, and only folios (or "struct anon_folio" 
/ "struct file_folio") would even have mapcounts.

For example, most stuff we map (and refcount!) via vm_insert_page() 
really must stop involving mapcounts. These won't be "ordinary" 
mapcount-tracked folios in the future, they are simply some refcounted 
pages some ordinary driver allocated.

For FS-DAX, if we'll be using the same "struct file_folio" approach as 
for ordinary pageache memory, then this is the right thing to do here.

quoted
I'm asking this because:

(A) We don't support mixing PUD+PMD mappings yet. I have plans to change
     that in the future, but for now you can only map using a single PUD
     or by PTEs. I suspect that's good enoug for now for dax fs?
Yep, that's all we support.
quoted
(B) As long as we have subpage mapcounts, this prevents vmemmap
     optimizations [1]. Is that only used for device-dax for now and are
     there no plans to make use of that for fs-dax?
I don't have any plans to. This is purely focussed on refcounting pages
"like normal" so we can get rid of all the DAX special casing.
quoted
(C) We managed without so far :)
Indeed, although Christoph has asked repeatedly ([1], [2] and likely
others) that this gets fixed and I finally got sick of it coming up
everytime I need to touch something with ZONE_DEVICE pages :)

Also it removes the need for people to understand the special DAX page
recounting scheme and ends up removing a bunch of cruft as a bonus:

  59 files changed, 485 insertions(+), 869 deletions(-)
I'm not challenging the refcounting scheme. I'm purely asking about 
mapcount handling, which is something related but different.
And that's before I clean up all the pgmap reference handling. It also
removes the pXX_trans_huge and pXX_leaf distinction. So we managed, but
things could be better IMHO.
Again, all nice things.
quoted
Having that said, with folio->_large_mapcount things like
folio_mapcount() are no longer terribly slow once we weould PTE-map a
PUD-sized folio.

Also, all ZONE_DEVICE pages should currently be marked PG_reserved,
translating to "don't touch the memmap". I think we might want to
tackle that first.
Missed to add a pointer to [2].
Ok. I'm keen to get this series finished and I don't quite get the
connection here, what needs to change there?
include/linux/page-flags.h

"PG_reserved is set for special pages. The "struct page" of such a page 
should in general not be touched (e.g. set dirty) except by its owner. 
Pages marked as PG_reserved include:

...

- Device memory (e.g. PMEM, DAX, HMM)
"

I think we already entered that domain with other ZONE_DEVICE pages 
being returned from vm_normal_folio(), unfortunately. But that really 
must be cleaned up for these pages to not look special anymore.

Agreed that it likely is something that is not blocking this series.

-- 
Cheers,

David / dhildenb
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help