Re: [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall
From: "Arnd Bergmann" <arnd@arndb.de>
Date: 2023-09-09 16:25:21
Also in:
linux-arch, linux-cxl, lkml
On Fri, Sep 8, 2023, at 09:46, Gregory Price wrote:
On Sat, Sep 09, 2023 at 10:03:33AM +0200, Arnd Bergmann wrote:quoted
quoted
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 22bc6bc147f8..6860675a942f 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h@@ -821,6 +821,11 @@ asmlinkage long sys_move_pages(pid_t pid, unsignedlong nr_pages, const int __user *nodes, int __user *status, int flags); +asmlinkage long sys_move_phys_pages(unsigned long nr_pages, + const void __user * __user *pages, + const int __user *nodes, + int __user *status, + int flags);The prototype looks good from a portability point of view, i.e. I made sure this is portable across CONFIG_COMPAT architectures etc. What I'm not sure about is whether the representation of physical memory pages as a 'void *' is a good choice, I can see potential problems with this: - it's not really a pointer, but instead a shifted PFN number in the implementation - physical addresses may be wider than pointers on 32-bit architectures with CONFIG_PHYS_ADDR_T_64BITHm, good points. I tried to keep the code close to move_pages for the sake of familiarity and ease of review, but the physical address length is not something i'd considered, and I'm not sure how exactly we would handle CONFIG_PHYS_ADDR_T_64BIT. If you have an idea, I'm happy to run with it.
I think a pointer to '__u64' is the most appropriate here, that is compatible between 32-bit and 64-bit architectures and covers all addresses until we get architectures with 128-bit addressing. Thinking about it more, I noticed an existing bug in both sys_move_pages and your current version of sys_move_phys_pages: the 'pages' array is in fact not suitable for compat tasks and needs an is_compat_task check to load a 32-bit address from compat userspace on the "if (get_user(p, pages + i))" line.
on address vs pfn: Would using PFNs cause issues with portability of userland code? User code that gets access to a physical address would have to convert that to a PFN, which would be kernel-defined. That could be easy enough if the kernel exposed the shift size somewhere.
I don't think we currently use PFN anywhere in the syscall ABI, so this introduces a new basic type into uapi headers that is currently kernel internal. A 32-bit PFN is always sufficient to represent all physical addresses on native 32-bit kernel, but not necessarily for compat user space on 64-bit kernels with an address space wider than 44 bits (16 terabyte address range). For the interface it would also require the same quirk for compat tasks that I pointed out above that is missing in sys_move_pages today. A minor quirk of PFN values is that they require knowing the page size to convert addresses, but I suppose you need that anyway.
quoted
I'm also not sure where the user space caller gets the physical addresses from, are those not intentionally kept hidden from userspace?There are presently 4 places (that I know of), and 1 that is being proposed here in the near future 1) Generally: /proc/pid/pagemap can be used to do page table translations. I think this is only really useful for testing, since if you have the virtual address and pid you would use move_pages, but it's certainly useful for testing this. 2) X86: IBS (AMD) and PEBS (Intel) can be configured to return physical address information instead of virtual address information. In fact you can configure it to give you both the virtual and physical address for a process.
Ah right. I see these already require CAP_SYS_ADMIN permissions because of the risk of rowhammer or speculative execution attacks, so I suppose users of move_phys_pages() would need additional permissions compared to move_pages() to actually use that information.
3) zoneinfo: /proc/zoneinfo exposes the start PFN of zones
4) /sys/kernel/mm/page_idle: A way to query whether a PFN is idle.
While itself seemingly not useful, if the goal is to migrate
less-used pages to "lower tiers" of memory, then you can query
the bitmap, directly recover idle PFNs, and attempt to migrate
them (which may fail, for a variety of reasons).
https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
5) CXL (Proposed): a CXL memory device on the PCIe bus may provide
hot/cold information about its memory. If a heatmap is provided,
for example, it would have to be a device address (0-based) or a
physical address (some base defined by the kernel and programmed
into device decoders such that it can convert them to 0-based).
This is presently being proposed but has not been agreed upon yet.
These do not seem to be problematic from the ASLR perspective, so
I guess it may still be useful without CAP_SYS_ADMIN.
Arnd