Thread (41 messages) 41 messages, 8 authors, 2024-12-17

Re: [PATCH v3 00/25] fs/dax: Fix ZONE_DEVICE page reference counts

From: Alistair Popple <apopple@nvidia.com>
Date: 2024-12-16 00:55:39
Also in: linux-arm-kernel, linux-cxl, linux-doc, linux-ext4, linux-fsdevel, linux-mm, linux-xfs, lkml, nvdimm

On Sat, Dec 14, 2024 at 04:22:58PM +0100, David Hildenbrand wrote:
On 14.12.24 02:39, Dan Williams wrote:
quoted
[ add akpm and sfr for next steps ]

Alistair Popple wrote:
quoted
Main updates since v2:

  - Rename the DAX specific dax_insert_XXX functions to vmf_insert_XXX
    and have them pass the vmf struct.

  - Seperate out the device DAX changes.

  - Restore the page share mapping counting and associated warnings.

  - Rework truncate to require file-systems to have previously called
    dax_break_layout() to remove the address space mapping for a
    page. This found several bugs which are fixed by the first half of
    the series. The motivation for this was initially to allow the FS
    DAX page-cache mappings to hold a reference on the page.

    However that turned out to be a dead-end (see the comments on patch
    21), but it found several bugs and I think overall it is an
    improvement so I have left it here.

Device and FS DAX pages have always maintained their own page
reference counts without following the normal rules for page reference
counting. In particular pages are considered free when the refcount
hits one rather than zero and refcounts are not added when mapping the
page.

Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
mechanism for allowing GUP to hold references on the page (see
get_dev_pagemap). However there doesn't seem to be any reason why FS
DAX pages need their own reference counting scheme.

By treating the refcounts on these pages the same way as normal pages
we can remove a lot of special checks. In particular pXd_trans_huge()
becomes the same as pXd_leaf(), although I haven't made that change
here. It also frees up a valuable SW define PTE bit on architectures
that have devmap PTE bits defined.

It also almost certainly allows further clean-up of the devmap managed
functions, but I have left that as a future improvment. It also
enables support for compound ZONE_DEVICE pages which is one of my
primary motivators for doing this work.
So this is feeling ready for -next exposure, and ideally merged for v6.14. I
see the comments from John and Bjorn and that you were going to respin for
that, but if it's just those details things they can probably be handled
incrementally.

Alistair, are you ready for this to hit -next?
Yeah, I'm pretty happy with the series now. It "feels" right.

There's a couple of dumb build bot errors, so I was going to respin to fix
those as well. I got caught up with a few other things so was just letting this
sit awaiting feedback, but I should be able to post a respin early this week.
quoted
As for which tree...

Andrew, we could take this through -mm, but my first instinct would be to try
to take it through nvdimm.git mainly to offload any conflict wrangling work and
small fixups which are likely to be an ongoing trickle.

However, I am not going to put up much of a fight if others prefer this go
through -mm.

Thoughts?
I'm in the process of preparing v2 of [1] that will result in conflicts with
this series in the rmap code (in particular [PATCH v3 14/25] huge_memory:
Allow mappings of PUD sized pages).

I'll be away for 2 weeks over Christmas, but I assume I'll manage to post v2
shortly.

Which reminds me that I still have to take a closer look at some things in
this series :) Especially also #14 regarding accounting.

I wonder if we could split out the rmap changes in #14, and have that patch
simply in two trees? No idea.
I could split out the first half (patches 1 - 8) into a series to go via
nvdimm.git, because they are actually standalone clean ups that I think are
worthwhile anyway.

The remainder are more -mm focussed. However they do depend on the fs/dax
cleanups in the first half so the trick would be making sure Andrew only takes
them if the nvdimm.git changes have made it into -next. I'm happy with either
approach, so let me know if I should split the series or not.

 - Alistair
[1]
https://lore.kernel.org/all/20240829165627.2256514-1-david@redhat.com/T/#u (local)

-- 
Cheers,

David / dhildenb
  
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help