Thread (78 messages) 78 messages, 11 authors, 2025-07-02

Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings

From: Peter Xu <peterx@redhat.com>
Date: 2025-06-18 16:56:11
Also in: linux-mm, lkml
Subsystem: memory management - thp (transparent huge page), the rest · Maintainers: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Linus Torvalds

On Tue, Jun 17, 2025 at 07:36:08PM -0400, Peter Xu wrote:
On Tue, Jun 17, 2025 at 08:18:07PM -0300, Jason Gunthorpe wrote:
quoted
On Tue, Jun 17, 2025 at 04:56:13PM -0400, Peter Xu wrote:
quoted
On Mon, Jun 16, 2025 at 08:00:11PM -0300, Jason Gunthorpe wrote:
quoted
On Mon, Jun 16, 2025 at 06:06:23PM -0400, Peter Xu wrote:
quoted
Can I understand it as a suggestion to pass in a bitmask into the core mm
API (e.g. keep the name of mm_get_unmapped_area_aligned()), instead of a
constant "align", so that core mm would try to allocate from the largest
size to smaller until it finds some working VA to use?
I don't think you need a bitmask.

Split the concerns, the caller knows what is inside it's FD. It only
needs to provide the highest pgoff aligned folio/pfn within the FD.
Ultimately I even dropped this hint.  I found that it's not really
get_unmapped_area()'s job to detect over-sized pgoffs.  It's mmap()'s job.
So I decided to avoid this parameter as of now.
Well, the point of the pgoff is only what you said earlier, to adjust
the starting alignment so the pgoff aligned high order folios/pfns
line up properly.
I meant "highest pgoff" that I dropped.

We definitely need the pgoff to make it work.  So here I dropped "highest
pgoff" passed from the caller because I decided to leave such check to the
mmap() hook later.
quoted
quoted
quoted
The mm knows what leaf page tables options exist. It should try to
align to the closest leaf page table size that is <= the FD's max
aligned folio.
So again IMHO this is also not per-FD information, but needs to be passed
over from the driver for each call.
It is per-FD in the sense that each FD is unique and each range of
pgoff could have a unique maximum.
 
quoted
Likely the "order" parameter appeared in other discussions to imply a
maximum supported size from the driver side (or, for a folio, but that is
definitely another user after this series can land).
Yes, it is the only information the driver can actually provide and
comes directly from what it will install in the VMA.
quoted
So far I didn't yet add the "order", because currently VFIO definitely
supports all max orders the system supports.  Maybe we can add the order
when there's a real need, but maybe it won't happen in the near
future?
The purpose of the order is to prevent over alignment and waste of
VMA. Your technique to use the length to limit alignment instead is
good enough for VFIO but not very general.
Yes that's also something I didn't like.  I think I'll just go ahead and
add the order parameter, then use it in previous patch too.
So I changed my mind, slightly.  I can still have the "order" parameter to
make the API cleaner (even if it'll be a pure overhead.. because all
existing caller will pass in PUD_SIZE as of now), but I think I'll still
stick with the ifdef in patch 4, as I mentioned here:

https://lore.kernel.org/all/aFGMG3763eSv9l8b@x1.local/ (local)

The problem is I just noticed yet again that exporting
huge_mapping_get_va_aligned() for all configs doesn't make sense.  At least
it'll need something like this to make !MMU compile for VFIO, while this is
definitely some ugliness I also want to avoid..

===8<===
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 59fdafb1034b..f40a8fb64eaa 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -548,7 +548,11 @@ static inline unsigned long
 huge_mapping_get_va_aligned(struct file *filp, unsigned long addr,
                unsigned long len, unsigned long pgoff, unsigned long flags)
 {
+#ifdef CONFIG_MMU
        return mm_get_unmapped_area(current->mm, filp, addr, len, pgoff, flags);
+#else
+       return 0;
+#endif
 }

 static inline bool
===8<===

The issue is still mm_get_unmapped_area() is only exported on CONFIG_MMU,
so we need to special case that for huge_mapping_get_va_aligned(), and here
for !THP && !MMU.

Besides the ugliness, it's also about how to choose a default value to
return when mm_get_unmapped_area() isn't available.

I gave it a defalut value (0) as example, but I don't even thnk that 0
makes sense.  It would (if ever triggerable from any caller on !MMU) mean
it will return 0 directly to __get_unmapped_area() and further do_mmap()
(of !MMU code, which will come down from ksys_mmap_pgoff() of nommu.c) will
take that addr=0 to be the addr to mmap.. that sounds wrong.

There's just no way to provide a sane default value for !MMU.

So going one step back: huge_mapping_get_va_aligned() (or whatever name we
prefer) doesn't make sense to be exported always, but only when CONFIG_MMU.
It should follow the same way we treat mm_get_unmapped_area().

Here it also goes back to the question on why !MMU even support mmap():

https://www.kernel.org/doc/Documentation/nommu-mmap.txt

So, for the case of v4l driver (v4l2_m2m_get_unmapped_area that I used to
quote, which only defines in !MMU and I used to misread..), for example,
it's really a minimal mmap() support on ucLinux and that's all about that.
My gut feeling is the noMMU use case more or less abused the current
get_unmapped_area() hook to provide the physical addresses, so as to make
mmap() work even on ucLinux.

It's for sure not a proof that we should have huge_mapping_get_va_aligned()
or mm_get_unmapped_area() availalbe even for !MMU.  That's all about VAs
and that do not exist in !MMU as a concept.

Thanks,

-- 
Peter Xu
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help