Thread (21 messages) 21 messages, 6 authors, 2023-09-19

Re: [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall

From: "Arnd Bergmann" <arnd@arndb.de>
Date: 2023-09-09 16:25:21
Also in: linux-arch, linux-cxl, lkml

On Fri, Sep 8, 2023, at 09:46, Gregory Price wrote:
On Sat, Sep 09, 2023 at 10:03:33AM +0200, Arnd Bergmann wrote:
quoted
quoted
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 22bc6bc147f8..6860675a942f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -821,6 +821,11 @@ asmlinkage long sys_move_pages(pid_t pid, unsigned 
long nr_pages,
 				const int __user *nodes,
 				int __user *status,
 				int flags);
+asmlinkage long sys_move_phys_pages(unsigned long nr_pages,
+				    const void __user * __user *pages,
+				    const int __user *nodes,
+				    int __user *status,
+				    int flags);
The prototype looks good from a portability point of view,
i.e. I made sure this is portable across CONFIG_COMPAT
architectures etc.

What I'm not sure about is whether the representation of physical
memory pages as a 'void *' is a good choice, I can see potential
problems with this:

- it's not really a pointer, but instead a shifted PFN number
  in the implementation

- physical addresses may be wider than pointers on 32-bit
  architectures with CONFIG_PHYS_ADDR_T_64BIT
Hm, good points.

I tried to keep the code close to move_pages for the sake of
familiarity and ease of review, but the physical address length
is not something i'd considered, and I'm not sure how exactly
we would handle CONFIG_PHYS_ADDR_T_64BIT.  If you have an idea,
I'm happy to run with it.
I think a pointer to '__u64' is the most appropriate here,
that is compatible between 32-bit and 64-bit architectures
and covers all addresses until we get architectures with
128-bit addressing.

Thinking about it more, I noticed an existing bug in
both sys_move_pages and your current version of
sys_move_phys_pages: the 'pages' array is in fact not
suitable for compat tasks and needs an is_compat_task
check to load a 32-bit address from compat userspace on
the "if (get_user(p, pages + i))" line.
on address vs pfn:

Would using PFNs cause issues with portability of userland code? User
code that gets access to a physical address would have to convert
that to a PFN, which would be kernel-defined.  That could be easy
enough if the kernel exposed the shift size somewhere.
I don't think we currently use PFN anywhere in the syscall
ABI, so this introduces a new basic type into uapi headers that
is currently kernel internal.

A 32-bit PFN is always sufficient to represent all physical
addresses on native 32-bit kernel, but not necessarily for
compat user space on 64-bit kernels with an address space wider
than 44 bits (16 terabyte address range).

For the interface it would also require the same quirk
for compat tasks that I pointed out above that is missing
in sys_move_pages today.

A minor quirk of PFN values is that they require knowing
the page size to convert addresses, but I suppose you need
that anyway.
quoted
I'm also not sure where the user space caller gets the
physical addresses from, are those not intentionally kept
hidden from userspace?
There are presently 4 places (that I know of), and 1 that is being
proposed here in the near future

1) Generally: /proc/pid/pagemap can be used to do page table
     translations.  I think this is only really useful for testing, 
     since if you have the virtual address and pid you would use
     move_pages, but it's certainly useful for testing this.

2) X86:  IBS (AMD) and PEBS (Intel) can be configured to return physical
     address information instead of virtual address information. In fact
     you can configure it to give you both the virtual and physical
     address for a process.
Ah right. I see these already require CAP_SYS_ADMIN permissions
because of the risk of rowhammer or speculative execution attacks,
so I suppose users of move_phys_pages() would need additional
permissions compared to move_pages() to actually use that information.
3) zoneinfo:  /proc/zoneinfo exposes the start PFN of zones

4) /sys/kernel/mm/page_idle:  A way to query whether a PFN is idle.
     While itself seemingly not useful, if the goal is to migrate
     less-used pages to "lower tiers" of memory, then you can query
     the bitmap, directly recover idle PFNs, and attempt to migrate
     them (which may fail, for a variety of reasons).

     https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html


5) CXL (Proposed): a CXL memory device on the PCIe bus may provide
     hot/cold information about its memory.  If a heatmap is provided,
     for example, it would have to be a device address (0-based) or a
     physical address (some base defined by the kernel and programmed
     into device decoders such that it can convert them to 0-based).

     This is presently being proposed but has not been agreed upon yet.
These do not seem to be problematic from the ASLR perspective, so
I guess it may still be useful without CAP_SYS_ADMIN.

     Arnd
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help