Thread: 21 messages, 6 authors, 2023-09-19

Re: [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall

From: "Andy Lutomirski" <luto@kernel.org>
Date: 2023-09-19 18:52:55
Also in: linux-arch, linux-cxl, lkml


On Tue, Sep 19, 2023, at 11:20 AM, Gregory Price wrote:
> On Tue, Sep 19, 2023 at 10:59:33AM -0700, Andy Lutomirski wrote:
>> I'm not complaining about the name.  I'm objecting to the semantics.
>>
>> Apparently you have a system to collect usage statistics of physical addresses, but you have no idea what those pages map to (without crawling /proc or /sys, anyway).  But that means you have no idea when the logical contents of those pages *change*.  So you fundamentally have a nasty race: anything else that swaps or migrates those pages will mess up your statistics, and you'll start trying to migrate the wrong thing.
>
> How does this change if I use virtual address based migration?
>
> I could do sampling based on virtual address (page faults, IBS/PEBS,
> whatever), and by the time I make a decision, the kernel could have
> migrated the data or even my task from Node A to Node B.  The sample I
> took is now stale, and I could make a poor migration decision.

The window is a lot narrower. If you’re sampling by VA, you collect stats and associate them with the logical page (the tuple (mapping, VA), for example).  The kernel can do this without races from its page fault handlers.  If you sample based on PA, you fundamentally race against anything that does migration.
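
To make the difference concrete, here is a toy model of the race. It is only a sketch: `ToyMemory`, the frame numbers, and the page names are all hypothetical, and nothing here corresponds to a real kernel interface. Hotness is counted twice, once keyed by physical frame and once keyed by the logical page (mapping, VA), and then something unrelated migrates the pages before a decision is made.

```python
from collections import Counter


class ToyMemory:
    """Toy model: logical pages (mapping, va) backed by physical frame numbers."""

    def __init__(self):
        self.frame_of = {}  # logical page -> physical frame number

    def map(self, page, pfn):
        self.frame_of[page] = pfn

    def migrate(self, page, new_pfn):
        # The logical page keeps its identity; only the backing frame changes.
        self.frame_of[page] = new_pfn

    def page_at(self, pfn):
        # Reverse lookup: which logical page currently lives in this frame?
        for page, frame in self.frame_of.items():
            if frame == pfn:
                return page
        return None


def simulate():
    mem = ToyMemory()
    hot = ("heap", 0x1000)    # hypothetical hot logical page
    cold = ("cache", 0x2000)  # hypothetical cold logical page
    mem.map(hot, 100)
    mem.map(cold, 200)

    pa_hits = Counter()  # statistics keyed by physical frame
    va_hits = Counter()  # statistics keyed by logical page
    for _ in range(1000):
        pa_hits[mem.frame_of[hot]] += 1
        va_hits[hot] += 1

    # Before the sampler acts, something else moves the hot page away and
    # reuses its old frame for the cold page.
    mem.migrate(hot, 300)
    mem.migrate(cold, 100)

    hottest_pfn = pa_hits.most_common(1)[0][0]
    chosen_by_pa = mem.page_at(hottest_pfn)       # whatever is in the frame now
    chosen_by_va = va_hits.most_common(1)[0][0]   # still the same logical page
    return hot, cold, chosen_by_pa, chosen_by_va


if __name__ == "__main__":
    hot, cold, by_pa, by_va = simulate()
    print("PA-keyed stats pick:", by_pa)
    print("VA-keyed stats pick:", by_va)
```

In this model the PA-keyed counter ends up pointing at the cold page, because the frame it recorded was reused, while the counter keyed by logical page still identifies the hot one.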

> If I do move_pages(pid, some_virt_addr, some_node) and it migrates the
> page from NodeA to NodeB, then the device-side collection is likewise
> no longer valid.  This problem doesn't change because I used virtual
> address compared to physical address.

Sure it does, as long as you collect those samples when you migrate. And I think the kernel migrating to or from device memory (or more generally allocating and freeing device memory and possibly even regular memory) *should* be aware of whatever hotness statistics are in use.

> But if I have a 512GB memory device, and I can see a wide swath of that
> 512GB is hot, while a good chunk of my local DRAM is not - then I
> probably don't care *what* gets migrated up to DRAM, I just care that a
> vast majority of that hot data does.
>
> The goal here isn't 100% precision, you will never get there. The goal
> here is broad-scope performance enhancements of the overall system
> while minimizing the cost to compute the migration actions to be taken.
>
> I don't think the contents of the page are always relevant.  The entire
> concept here is to enable migration without caring about what programs
> are using the memory for - just so long as the memcgs and zoning are
> respected.

At the very least I think you need to be aware of page *size*.  And if you want to avoid excessive fragmentation, you probably also want to be aware of the boundaries of a logical allocation.
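
As a small illustration of the page-size point, the sketch below computes which physical range actually moves when a sampled address falls inside a page of a given size. The helper name `migration_unit` is hypothetical, and the 4 KiB and 2 MiB sizes are assumed (the common x86-64 base and huge page sizes); a real system may use others.

```python
PAGE_4K = 4 * 1024         # assumed base page size
PAGE_2M = 2 * 1024 * 1024  # assumed huge page size


def migration_unit(phys_addr, page_size):
    """Return the [start, end) physical range of the page containing phys_addr.

    page_size must be a power of two.
    """
    if page_size <= 0 or page_size & (page_size - 1):
        raise ValueError("page_size must be a power of two")
    start = phys_addr & ~(page_size - 1)
    return start, start + page_size


if __name__ == "__main__":
    sample = 0x203456  # hypothetical sampled physical address
    for size in (PAGE_4K, PAGE_2M):
        start, end = migration_unit(sample, size)
        print(f"page size {size:>8}: moves {hex(start)}..{hex(end)}")
```

The same sample implies moving 4 KiB in one case and 2 MiB in the other, so a caller that only sees the address cannot tell how much data a single request will move.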

I think that doing this entire process by PA, blind, from userspace will end up stuck in a not-so-good solution, and the ABI will be set in stone, and it will not be a great situation for long term maintainability or performance.