Thread: 19 messages, 4 authors, 2020-05-02

Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

From: Dan Williams <hidden>
Date: 2020-05-01 20:12:22
Also in: linux-acpi, linux-hyperv, linux-mm, linux-s390, lkml, nvdimm, virtualization, xen-devel

On Fri, May 1, 2020 at 12:18 PM David Hildenbrand [off-list ref] wrote:
On 01.05.20 20:43, Dan Williams wrote:
On Fri, May 1, 2020 at 11:14 AM David Hildenbrand [off-list ref] wrote:
On 01.05.20 20:03, Dan Williams wrote:
On Fri, May 1, 2020 at 10:51 AM David Hildenbrand [off-list ref] wrote:
On 01.05.20 19:45, David Hildenbrand wrote:
On 01.05.20 19:39, Dan Williams wrote:
On Fri, May 1, 2020 at 10:21 AM David Hildenbrand [off-list ref] wrote:
On 01.05.20 18:56, Dan Williams wrote:
On Fri, May 1, 2020 at 2:34 AM David Hildenbrand [off-list ref] wrote:
On 01.05.20 00:24, Andrew Morton wrote:
On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand [off-list ref] wrote:
Why does the firmware map support hotplug entries?
I assume:

The firmware memmap was added primarily for x86-64 kexec (and still, is
mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
get hotplugged on real HW, they get added to e820. Same applies to
memory added via HyperV balloon (unless memory is unplugged via
ballooning and you reboot ... then the e820 is changed as well). I assume
we wanted to be able to reflect that, to make kexec look like a real reboot.

This worked for a while. Then came dax/kmem. Now comes virtio-mem.


But I assume only Andrew can enlighten us.

@Andrew, any guidance here? Should we really add all memory to the
firmware memmap, even if this contradicts the existing
documentation? (especially, if the actual firmware memmap will *not*
contain that memory after a reboot)
For some reason that patch is misattributed - it was authored by
Shaohui Zheng [off-list ref], who hasn't been heard from in
a decade.  I looked through the email discussion from that time and I'm
not seeing anything useful.  But I wasn't able to locate Dave Hansen's
review comments.
Okay, thanks for checking. I think the documentation from 2008 is pretty
clear what has to be done here. I will add some of these details to the
patch description.

Also, now that I know that esp. kexec-tools already don't consider
dax/kmem memory properly (memory will not get dumped via kdump) and
won't really suffer from a name change in /proc/iomem, I will go back to
the MHP_DRIVER_MANAGED approach and
1. Don't create firmware memmap entries
2. Name the resource "System RAM (driver managed)"
3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.

This way, kernel users and user space can figure out that this memory
has different semantics and handle it accordingly - I think that was
what Eric was asking for.

Of course, open for suggestions.
I'm still more of a fan of this being communicated by "System RAM"
I was mentioning somewhere in this thread that "System RAM" inside a
hierarchy (like dax/kmem) will already be basically ignored by
kexec-tools. So, placing it inside a hierarchy makes it look
special already.

But after all, as we have to change kexec-tools either way, we can
directly go ahead and flag it properly as special (in case there will
ever be other cases where we could no longer distinguish it).
being parented especially because that tells you something about how
the memory is driver-managed and which mechanism might be in play.
This could be communicated to some degree via the resource hierarchy.

E.g.,

            [root@localhost ~]# cat /proc/iomem
            ...
            140000000-33fffffff : Persistent Memory
              140000000-1481fffff : namespace0.0
              150000000-33fffffff : dax0.0
                150000000-33fffffff : System RAM (driver managed)

vs.

            :/# cat /proc/iomem
            [...]
            140000000-333ffffff : virtio-mem (virtio0)
              140000000-147ffffff : System RAM (driver managed)
              148000000-14fffffff : System RAM (driver managed)
              150000000-157ffffff : System RAM (driver managed)

Good enough for my taste.
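
To make the hierarchy argument concrete, here is a minimal user-space sketch that recovers the parent of each resource from /proc/iomem-style text, using indentation only. The function name parse_iomem and the tuple layout are assumptions made for this illustration; this is not code from kexec-tools or the kernel. The "System RAM (driver managed)" name is the one proposed in this thread, not an established ABI.

```python
import re

# One /proc/iomem line: optional indent, "start-end : name" (addresses in hex).
_LINE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.*\S)\s*$")


def parse_iomem(text):
    """Return a list of (start, end, name, parent_name) tuples.

    Nesting is inferred from indentation; parent_name is None for
    resources that are not nested under any other parsed resource.
    """
    entries = []
    stack = []  # (indent, name) of the currently enclosing resources
    for line in text.splitlines():
        m = _LINE.match(line)
        if not m:
            continue  # skip shell prompts, "...", blank lines
        indent = len(m.group(1))
        start = int(m.group(2), 16)
        end = int(m.group(3), 16)
        name = m.group(4)
        # Drop siblings and deeper entries so the stack top is the parent.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1] if stack else None
        entries.append((start, end, name, parent))
        stack.append((indent, name))
    return entries
```

Fed the two listings above, this would report "dax0.0" as the parent in the first case and "virtio-mem (virtio0)" in the second, which is the distinction a tool needs in order to tell the two mechanisms apart.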
What about adding an optional /sys/firmware/memmap/X/parent attribute.
I really don't want any firmware memmap entries for something that is
not part of the firmware provided memmap. In addition,
/sys/firmware/memmap/ is still a fairly x86_64 specific thing. Only mips
and two arm configs enable it at all.

So, IMHO, /sys/firmware/memmap/ is definitely not the way to go.
I think that's a policy decision and policy decisions do not belong in
the kernel. Give the tooling the opportunity to decide whether System
RAM stays that way over a kexec. The parenthetical reference otherwise
looks out of place to me in the /proc/iomem output. What makes it
"driver managed" is how the kernel handles it, not how the kernel
names it.
At least, virtio-mem is different. It really *has to be handled* by the
driver. This is not a policy. It's how it works.
...but that's not necessarily how dax/kmem works.
Yes, and user space could still take that memory and add it to the
firmware memmap if it really wants to. It knows that it is special. It
can figure out that it belongs to a dax device using /proc/iomem.
Oh, and I don't see why "System RAM (driver managed)" would hinder any
policy in user space to still do what it thinks is the right thing to do
(e.g., for dax).

"System RAM (driver managed)" would mean: Memory is not part of the raw
firmware memmap. It was detected and added by a driver. Handle with
care, this is special.
Oh, no, I was more reacting to your, "don't update
/sys/firmware/memmap for the (driver managed) range" choice as being a
policy decision. It otherwise feels to me "System RAM (driver
managed)" adds confusion for casual users of /proc/iomem, and clued-in
tools already have the parent association to decide policy.
Not sure if I understand correctly, so bear with me :).

Adding or not adding stuff to /sys/firmware/memmap is not a policy
decision. If it's not part of the raw firmware-provided memmap, it has
nothing to do in /sys/firmware/memmap. That's what the documentation
from 2008 tells us.
It just occurs to me that there are valid cases for both wanting to
start over with driver managed memory with a kexec and keeping it in
the map.
Yes, there might be valid cases. My gut feeling is that in the general
case, you want to let the kexec kernel implement a policy/ let the user
in the new system decide.

But as I said, you can implement in kexec-tools whatever policy you
want. It has access to all information.
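
As a rough illustration of what such a user-space policy could look like, the sketch below decides per resource whether to carry it into the memory map handed to the kexec kernel. Everything here is hypothetical: the function name include_in_kexec_map, the keep_dax switch, and the prefix matching on parent names are assumptions for this example and do not describe how kexec-tools is implemented. It encodes the positions stated in this thread: nested plain "System RAM" is ignored, virtio-mem memory must be left to its driver, and dax/kmem memory is a choice.

```python
DRIVER_MANAGED = "System RAM (driver managed)"  # name proposed in this thread


def include_in_kexec_map(name, parent, keep_dax=False):
    """Hypothetical policy: should this /proc/iomem resource be passed on?

    name   -- resource name as shown in /proc/iomem
    parent -- name of the enclosing resource, or None if top-level
    """
    if name == "System RAM":
        # Only top-level System RAM counts; nested entries are skipped.
        return parent is None
    if name == DRIVER_MANAGED:
        if parent is not None and parent.startswith("virtio-mem"):
            # Has to be handled by the driver in the new kernel.
            return False
        if parent is not None and parent.startswith("dax"):
            # dax/kmem: left to the administrator's choice.
            return keep_dax
        return False  # unknown owner: be conservative
    return False  # not RAM at all
```

With keep_dax left at its default, nothing driver managed is passed on, which matches the "let the new system decide" default argued for above; setting keep_dax=True keeps only the dax-backed ranges.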
Right, so why is a new type needed if all the information is there by
other means?
Consider the case of EFI Special Purpose (SP) Memory that is
marked EFI Conventional Memory with the SP attribute. In that case the
firmware memory map marked it as conventional RAM, but the kernel
optionally marks it as System RAM vs Soft Reserved. The 2008 patch
simply does not consider that case. I'm not sure strict textualism
works for coding decisions.
I am no expert on that matter (esp EFI). But looking at the users of
firmware_map_add_early(), the single user is in arch/x86/kernel/e820.c.
So the single source of /sys/firmware/memmap is (besides hotplug) e820.

"'e820_table_firmware': the original firmware version passed to us by
the bootloader - not modified by the kernel. ... inform the user about
the firmware's notion of memory layout via /sys/firmware/memmap"
(arch/x86/kernel/e820.c)

How is the EFI Special Purpose (SP) Memory represented in e820?
/sys/firmware/memmap is really simple: just dump in e820. No policies IIUC.
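
For reference, a small sketch of how user space can read what /sys/firmware/memmap exposes. Each numbered entry directory there carries start, end and type attribute files; the helper name read_firmware_memmap and its root parameter are assumptions added for this example, so it can be pointed at a test directory instead of the real sysfs path.

```python
import os


def _read_attr(entry_dir, attr):
    # Each attribute is a one-line text file, e.g. "0x100000" or "System RAM".
    with open(os.path.join(entry_dir, attr)) as f:
        return f.read().strip()


def read_firmware_memmap(root="/sys/firmware/memmap"):
    """Return a sorted list of (start, end, type) from the firmware memmap."""
    entries = []
    for entry in os.listdir(root):
        entry_dir = os.path.join(root, entry)
        if not os.path.isdir(entry_dir):
            continue
        start = int(_read_attr(entry_dir, "start"), 16)
        end = int(_read_attr(entry_dir, "end"), 16)
        entries.append((start, end, _read_attr(entry_dir, "type")))
    return sorted(entries)
```

Comparing this list against the System RAM ranges in /proc/iomem is one way a tool could spot memory that was added after boot and is not part of the firmware-provided map.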
e820 now has a Soft Reserved translation for this which means "try to
reserve, but treat as System RAM is ok too". It seems generically
useful to me that the toggle for determining whether Soft Reserved or
System RAM shows up in /sys/firmware/memmap is a determination that
policy can make. The kernel need not preemptively block it.