Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL
From: David Hildenbrand <hidden>
Date: 2020-07-08 08:26:56
Also in:
linux-mm, lkml, nvdimm
On 08.07.20 09:50, Dan Williams wrote:
On Wed, Jul 8, 2020 at 12:22 AM David Hildenbrand [off-list ref] wrote:quoted
On 08.07.20 07:27, Mike Rapoport wrote:quoted
On Tue, Jul 07, 2020 at 03:05:48PM -0700, Dan Williams wrote:quoted
On Tue, Jul 7, 2020 at 11:01 AM Mike Rapoport [off-list ref] wrote:quoted
On Tue, Jul 07, 2020 at 02:26:08PM +0200, David Hildenbrand wrote:quoted
On 07.07.20 14:13, Mike Rapoport wrote:quoted
On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:quoted
On Tue 07-07-20 13:59:15, Jia He wrote:quoted
This exports memory_add_physaddr_to_nid() for module driver to use. memory_add_physaddr_to_nid() is a fallback option to get the nid in case NUMA_NO_NID is detected. Suggested-by: David Hildenbrand <redacted> Signed-off-by: Jia He <redacted> --- arch/arm64/mm/numa.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c index aafcee3e3f7e..7eeb31740248 100644 --- a/arch/arm64/mm/numa.c +++ b/arch/arm64/mm/numa.c@@ -464,10 +464,11 @@ void __init arm64_numa_init(void) /* * We hope that we will be hotplugging memory on nodes we already know about, - * such that acpi_get_node() succeeds and we never fall back to this... + * such that acpi_get_node() succeeds. But when SRAT is not present, the node + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option. */ int memory_add_physaddr_to_nid(u64 addr) { - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr); return 0; } +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);Does it make sense to export a noop function? Wouldn't make more sense to simply make it static inline somewhere in a header? I haven't checked whether there is an easy way to do that sanely bu this just hit my eyes.We'll need to either add a CONFIG_ option or arch specific callback to make both non-empty (x86, powerpc, ia64) and empty (arm64, sh) implementations coexist ...Note: I have a similar dummy (return 0) patch for s390x lying around here.Then we'll call it a tie - 3:3 ;-)So I'd be happy to jump on the train of people wanting to export the ARM stub for this (and add a new ARM stub for phys_to_target_node()), but Will did have a plausibly better idea that I have been meaning to circle back to: http://lore.kernel.org/r/20200325111039.GA32109@willie-the-truck (local) ...i.e. iterate over node data to do the lookup. This would seem to work generically for multiple archs unless I am missing something?IIRC, only memory assigned to/onlined to a ZONE is represented in the pgdat node span. E.g., not offline memory blocks.So this dovetails somewhat with Will's idea. What if we populated node_data for "offline" ranges? I started there, but then saw ARCH_KEEP_MEMBLOCK and thought it would be safer to just teach phys_to_target_node() to use that rather than update other code paths to expect node_data might not always reflect online data.
We currently need a somewhat-accurate pgdat node span to detect when to
offline a node. See try_offline_node(). This works fairly reliable.
Shrinking the node span is currently fairly easy for !ZONE_DEVICE
memory, because we can rely on pfn_to_online_page() + pfn_to_nid(pfn).
See e.g., find_biggest_section_pfn().
If we glue growing/shrinking the node span to adding/removing of memory
(instead of e.g., onlining/offlining), we can no longer base shrinking
on memmap data. We would have to get the information ("how far can I
shrink the node span, is it empty?") from somewhere else. E.g.,
for_each_memory_block() - but that one does not cover ZONE_DEVICE. And
there are memory blocks which cover multiple nodes, in which case we
only store one of them ... unreliable.
This certainly needs more thought :/
quoted
Esp., when hotplugging + onlining consecutive memory, there won't really be any intersections in most cases if I am not wrong. It would not be "intersection" but rather "closest fit". With overlapping nodes it's even more unclear. Which one to pick?In the overlap case you get what you get. Some signal is better than the noise of a dummy function. The consequences of picking the wrong node might be that the kernel can't properly associate a memory range to its performance data tables in firmware, but then again firmware messed up with an overlapping node definition in the first instance.
I'd be curious if what we are trying to optimize here is actually worth optimizing. IOW, is there a well-known scenario where the dummy value on arm64 would be problematic and is worth the effort? I mean, in all performance relevant setups (ignoring hv_balloon/xen-balloon/prove_store(), which also use memory_add_physaddr_to_nid()), we should have a proper PXM/node specified by the hardware on memory hotadd. The fallback of memory_add_physaddr_to_nid() is not relevant in these scenarios. -- Thanks, David / dhildenb _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel