Thread (21 messages) 21 messages, 6 authors, 2023-09-19

Re: [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall

From: Gregory Price <hidden>
Date: 2023-09-09 16:26:25
Also in: linux-arch, linux-cxl, lkml

On Sat, Sep 09, 2023 at 10:03:33AM +0200, Arnd Bergmann wrote:
On Thu, Sep 7, 2023, at 09:54, Gregory Price wrote:
quoted
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index 2d0b1bd866ea..25db6d71af0c 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -457,3 +457,4 @@
 450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	i386	cachestat		sys_cachestat
 452	i386	fchmodat2		sys_fchmodat2
+454	i386	move_phys_pages		sys_move_phys_pages
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index 1d6eee30eceb..9676f2e7698c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -375,6 +375,7 @@
 451	common	cachestat		sys_cachestat
 452	common	fchmodat2		sys_fchmodat2
 453	64	map_shadow_stack	sys_map_shadow_stack
+454	common	move_phys_pages		sys_move_phys_pages
Doing only x86 is fine for review purposes, but note that
once there is consensus on actually merging it and the syscall
number, you should do the same for all architectures. I think
most commonly we do one patch to add the code and another
patch to hook it up to all the syscall.tbl files and the
include/uapi/asm-generic/unistd.h file.
I'd looked through a few prior patches and that's what I observed so I
tried to follow that.  For the RFC i think it only made sense to
integrate it with the system i'm actually testing on, but I'll
eventually need to test it on ARM and such.

Noted.
quoted
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 22bc6bc147f8..6860675a942f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -821,6 +821,11 @@ asmlinkage long sys_move_pages(pid_t pid, unsigned 
long nr_pages,
 				const int __user *nodes,
 				int __user *status,
 				int flags);
+asmlinkage long sys_move_phys_pages(unsigned long nr_pages,
+				    const void __user * __user *pages,
+				    const int __user *nodes,
+				    int __user *status,
+				    int flags);
The prototype looks good from a portability point of view,
i.e. I made sure this is portable across CONFIG_COMPAT
architectures etc.

What I'm not sure about is whether the representation of physical
memory pages as a 'void *' is a good choice, I can see potential
problems with this:

- it's not really a pointer, but instead a shifted PFN number
  in the implementation

- physical addresses may be wider than pointers on 32-bit
  architectures with CONFIG_PHYS_ADDR_T_64BIT
Hm, good points.

I tried to keep the code close to move_pages for the sake of
familiarity and ease of review, but the physical address length
is not something i'd considered, and I'm not sure how exactly
we would handle CONFIG_PHYS_ADDR_T_64BIT.  If you have an idea,
I'm happy to run with it.

on address vs pfn:

Would using PFNs cause issues with portability of userland code? User
code that gets access to a physical address would have to convert
that to a PFN, which would be kernel-defined.  That could be easy
enough if the kernel exposed the shift size somewhere.

I can see arguments for PFN as opposed to physical address for
portability sake.  This doesn't handle the 64-bit phys address
configuration issue, of course.

In the case where a user already has a PFN (e.g. via mm/page_idle),
defining it as a PFN interface would certainly make more sense.

Mayhaps handling both cases would be useful, depending on the source
information (see below for more details).
I'm also not sure where the user space caller gets the
physical addresses from, are those not intentionally kept
hidden from userspace?

     Arnd
There are presently 4 places (that I know of), and 1 that is being
proposed here in the near future

1) Generally: /proc/pid/pagemap can be used to do page table
     translations.  I think this is only really useful for testing, 
     since if you have the virtual address and pid you would use
     move_pages, but it's certainly useful for testing this.

2) X86:  IBS (AMD) and PEBS (Intel) can be configured to return physical
     address information instead of virtual address information. In fact
     you can configure it to give you both the virtual and physical
     address for a process.

3) zoneinfo:  /proc/zoneinfo exposes the start PFN of zones

4) /sys/kernel/mm/page_idle:  A way to query whether a PFN is idle.
     While itself seemingly not useful, if the goal is to migrate
     less-used pages to "lower tiers" of memory, then you can query
     the bitmap, directly recover idle PFNs, and attempt to migrate
     them (which may fail, for a variety of reasons).

     https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html


5) CXL (Proposed): a CXL memory device on the PCIe bus may provide
     hot/cold information about its memory.  If a heatmap is provided,
     for example, it would have to be a device address (0-based) or a
     physical address (some base defined by the kernel and programmed
     into device decoders such that it can convert them to 0-based).

     This is presently being proposed but has not been agreed upon yet.

~Gregory
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help