Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
From: Alexander Graf <graf@amazon.com>
Date: 2024-06-18 11:48:00
Also in: lkml
On 18.06.24 12:21, Ard Biesheuvel wrote:
On Mon, 17 Jun 2024 at 23:01, Alexander Graf wrote:
On 17.06.24 22:40, Steven Rostedt wrote:
On Mon, 17 Jun 2024 09:07:29 +0200 Alexander Graf wrote:
Hey Steve, I believe we're talking about 2 different things :). Let me rephrase a bit and make a concrete example. Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command line option. The kernel now comes up and allocates a random chunk of memory that - by (admittedly good) chance - may be at the same physical location as before. Let's assume it deemed 0x1000000 as a good offset.

Note, it's not random. It picks from the top of available memory every time. But things can mess with it (see below).

Let's now assume you're running on a UEFI system. There, you always have non-volatile storage available to you even in the pre-boot phase. That means the kernel could create a UEFI variable that says "12M:4096:trace -> 0x1000000". The pre-boot phase takes all these UEFI variables and marks them as reserved. When you finally reach your command line parsing logic for reserve_mem=, you can flip all reservations that were not on the command line back to normal memory. That way you have pretty much guaranteed persistent memory regions, even with KASLR changing your memory layout across boots. The nice thing is that the above is an extension of what you've already built: Systems with UEFI simply get better guarantees that their regions persist.

This could be an added feature, but it is very architecture specific, and would likely need architecture specific updates.

It definitely would be an added feature, yes. But one that allows you to ensure persistence a lot more safely :).

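To make the UEFI variable idea above a bit more concrete, here is a rough Python model of the reconcile step. It is only a sketch of the proposal, not kernel code: the `persisted` dictionary stands in for the UEFI variables, and every function name, as well as the second "4M:4096:oops" entry, is made up for illustration.

```python
# Hypothetical model of the proposal; none of these names exist in the kernel.

def parse_size(text):
    """Parse a size such as '12M' or '4096' into bytes."""
    units = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


def parse_reserve_mem(arg):
    """Split 'size:align:name' (e.g. '12M:4096:trace') into a tuple."""
    size, align, name = arg.split(":")
    return parse_size(size), parse_size(align), name


def reconcile(persisted, cmdline_args):
    """Decide which pre-boot reservations survive command line parsing.

    persisted:    dict mapping 'size:align:name' -> physical address,
                  standing in for variables written on an earlier boot.
    cmdline_args: list of reserve_mem= values seen on this boot.

    Returns (kept, released, missing):
      kept     - name -> (address, size) for entries that stay reserved
      released - addresses flipped back to normal memory
      missing  - requests with no persisted address (need a new allocation)
    """
    kept, released, missing = {}, [], []
    wanted = set(cmdline_args)
    for key, addr in persisted.items():
        if key in wanted:
            size, _align, name = parse_reserve_mem(key)
            kept[name] = (addr, size)
        else:
            released.append(addr)
    for arg in cmdline_args:
        if arg not in persisted:
            missing.append(arg)
    return kept, released, missing


# Made-up state: "trace" matches the example above, "oops" is invented.
persisted = {"12M:4096:trace": 0x1000000, "4M:4096:oops": 0x3000000}
kept, released, missing = reconcile(persisted, ["12M:4096:trace", "1M:4096:new"])
print(kept)
print([hex(addr) for addr in released])
print(missing)
```

In this model a region keeps its address for as long as the same "size:align:name" string stays on the command line; anything else the pre-boot phase reserved is handed back.
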
Requirement: Need a way to reserve memory that will be at a consistent location for every boot, if the kernel and system are the same. Does not need to work if rebooting to a different kernel, or if the system can change the memory layout between boots. The reserved memory cannot be a hard-coded address, as the same kernel / command line needs to run on several different machines. The picked memory reservation just needs to be the same for a given machine, but may be

With KASLR enabled, doesn't this approach break too often to be reliable enough for the data you want to extract? Picking up the idea above, with a persistent variable we could even make KASLR avoid that reserved pstore region in its search for a viable KASLR offset.

I think I was hit by it once in all my testing. For our use case, the few times it fails to map are not going to affect what we need this for at all.

Once is pretty good. Do you know why? Also once out of how many runs? Is the randomness source not as random as it should be or are the number of bits for KASLR maybe so few on your target architecture that the odds of hitting anything become low? Do these same constraints hold true outside of your testing environment?

So I just ran it a hundred times in a loop. I added a patch to print the location of "_text". The loop was this:

    for i in `seq 100`; do
        ssh root@debiantesting-x86-64 "dmesg | grep -e 'text starts' -e 'mapped boot' >> text; grub-reboot '1>0'; sleep 0.5; reboot"
        sleep 25
    done

It searches dmesg for my added printk as well as the print of where the ring buffer was loaded in physical memory. It takes about 15 seconds to reboot, so I waited 25. The results are attached. I found that it was consistent 76 times, which means 1 out of 4 it's not. Funny enough, it broke whenever it loaded the kernel below 0x100000000. And then it would be off by a little. It was consistently at:

    0x27d000000

And when it failed, it was at 0x27ce00000.

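For scale, the two addresses and the 76-of-100 count reported above can be worked through in a few lines of Python. This is plain arithmetic on the quoted numbers, nothing more; the variable names are made up. The 0x100000000 threshold mentioned above is the 4 GiB boundary.

```python
GOOD = 0x27d000000   # address seen on the consistent boots
BAD = 0x27ce00000    # address seen when the mapping moved

delta = GOOD - BAD
print(hex(delta), delta // (1 << 20), "MiB")   # prints: 0x200000 2 MiB

consistent, total = 76, 100
moved = total - consistent
print(f"{moved} of {total} boots moved ({moved / total:.0%})")   # 24 of 100 boots moved (24%)

# The kernel load address below which the failures were observed.
assert 0x100000000 == 4 * (1 << 30)
```

So "off by a little" is exactly 2 MiB lower, and roughly one boot in four was affected in this run.
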
Note, when I used the e820 tables to do this, I never saw a failure. My assumption is that when it is below 0x100000000, something else gets allocated causing this to get pushed down.

Thinking about it again: What if you run the allocation super early (see arch/x86/boot/compressed/kaslr.c:handle_mem_options())?

That code is not used by EFI boot anymore.

In general, I would recommend (and have recommended) against these kinds of hacks in mainline, because, especially on x86, there is always someone that turns up with some kind of convoluted use case that gets broken if we try to change anything in the boot code.

I spent considerable time over the past year making the EFI/x86 boot code compatible with the new MS imposed requirements on PC boot firmware (related to secure boot and NX restrictions on memory mappings). This involved some radical refactoring of the boot sequence, including the KASLR logic. Adding fragile code there that will result in regressions observable to end users when it gets broken is really not what I'd like to see.

So I would personally prefer for this code not to go in at all. But if it does go in (and Steven has already agreed to this), it needs a giant disclaimer that it is best effort and may get broken inadvertently by changes that may seem unrelated.

Alright, happy to rest my case about making it more reliable for now then :).

IMHO the big fat disclaimer should be in the argument name. "reserve_mem" to me sounds like it actually guarantees a reservation - which it doesn't. Can we name it more along the lines of "debug" (to indicate it's not for production data) or "phoenix" (usually gets reborn out of ashes, but you can never know for sure): "debug_mem" / "phoenix_mem"?

If you stick to allocating only from top, you're effectively kernel version independent for your allocations because none of the kernel code ran yet and definitely KASLR independent because you're running deterministically before KASLR even gets allocated.

Allocating top down under EFI is almost guaranteed to result in problems, because that is how the EFI page allocator works as well. This means that allocations will move around depending on, e.g., whether some USB stick was inserted on the first boot and removed on the second, or whether your external display was on or off.

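A toy model may help picture the top-down concern above. The Python sketch below is not the EFI page allocator or memblock; the single free range, the 2 MiB firmware buffer and all names are assumptions chosen purely for illustration. It only shows that when two allocators both take from the top, an extra firmware allocation on one boot shifts a later 12 MiB reservation.

```python
MIB = 1 << 20


def align_down(addr, align):
    """Round addr down to a multiple of align."""
    return addr - (addr % align)


def alloc_top_down(free_ranges, size, align):
    """Take `size` bytes from the highest free range that can hold them.

    free_ranges is a list of (start, end) tuples, end exclusive. The list
    is updated in place and the chosen start address is returned.
    """
    order = sorted(range(len(free_ranges)),
                   key=lambda i: free_ranges[i][0], reverse=True)
    for i in order:
        start, end = free_ranges[i]
        base = align_down(end - size, align)
        if base >= start:
            # Keep whatever is left below and above the allocation.
            pieces = [(start, base), (base + size, end)]
            free_ranges[i:i + 1] = [p for p in pieces if p[0] < p[1]]
            return base
    raise MemoryError("no free range large enough")


def boot(firmware_allocs):
    """Simulate one boot: firmware allocates first, then the reservation."""
    free = [(0x100000, 0x80000000)]   # assumed memory map: 1 MiB .. 2 GiB
    for size in firmware_allocs:
        alloc_top_down(free, size, 0x1000)
    return alloc_top_down(free, 12 * MIB, 0x1000)


first = boot([])            # nothing extra allocated by firmware
second = boot([2 * MIB])    # hypothetical device buffer on this boot only
print(hex(first), hex(second), (first - second) // MIB, "MiB lower")
```

With these made-up inputs the reservation lands at 0x7f400000 on the first boot and 0x7f200000 on the second, so the region moves even though the kernel and command line are unchanged.
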
I believe most UEFI implementations only allocate top down in the lower 32 bits. But yes, it's fragile, I hear you. Let's embrace the flaky nature of the beast then :).

Alex

Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597