Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings
From: Andy Lutomirski <luto@kernel.org>
Date: 2023-01-03 20:52:31
Also in:
linux-crypto, linux-mm, linux-patches, lkml
On Tue, Jan 3, 2023 at 11:06 AM Jason A. Donenfeld [off-list ref] wrote:
Hi Andy, Thanks for your constructive suggestions. On Tue, Jan 03, 2023 at 10:36:01AM -0800, Andy Lutomirski wrote:quoted
quoted
quoted
c) If there's not enough memory to service a page fault, it's not fatal, and no signal is sent. Instead, writes are simply lost.This just seems massively overcomplicated to me. If there isn't enough memory to fault in a page of code, we don't have some magic instruction emulator in the kernel. We either OOM or we wait for memory to show up.Before addressing the other parts of your email, I thought I'd touch on this. Quoting from the email I just wrote Ingo: | *However* - if your big objection to this patch is that the instruction | skipping is problematic, we could actually punt that part. The result | will be that userspace just retries the memory write and the fault | happens again, and eventually it succeeds. From a perspective of | vgetrandom(), that's perhaps worse -- with this v14 patchset, it'll | immediately fallback to the syscall under memory pressure -- but you | could argue that nobody really cares about performance at that point | anyway, and so just retrying the fault until it succeeds is a less | complex behavior that would be just fine. | | Let me know if you think that'd be an acceptable compromise, and I'll | roll it into v15. As a preview, it pretty much amounts to dropping 3/7 | and editing the commit message in this 2/7 patch. IOW, I think the main ideas of the patch work just fine without "point c" with the instruction skipping. Instead, waiting/retrying could potentially work. So, okay, it seems like the two of you both hate the instruction decoder stuff, so I'll plan on working that part in, in one way or another, for v15.quoted
On Tue, Jan 3, 2023 at 2:50 AM Ingo Molnar [off-list ref] wrote:quoted
quoted
The vDSO getrandom() implementation works with a buffer allocated with a new system call that has certain requirements: - It shouldn't be written to core dumps. * Easy: VM_DONTDUMP. - It should be zeroed on fork. * Easy: VM_WIPEONFORK.I have a rather different suggestion: make a special mapping. Jason, you're trying to shoehorn all kinds of bizarre behavior into the core mm, and none of that seems to me to belong to the core mm. Instead, have an actual special mapping with callbacks that does the right thing. No fancy VM flags.Oooo! I like this. Avoiding adding VM_* flags would indeed be nice. I had seen things that I thought looked in this direction with the shmem API, but when I got into the details, it looked like this was meant for something else and couldn't address most of what I wanted here. If you say this is possible, I'll look again to see if I can figure it out. Though, if you have some API name at the top of your head, you might save me some code squinting time.
Look for _install_special_mapping(). --Andy
quoted
Want to mlock it? No, don't do that -- that's absurd. Just arrange so that, if it gets evicted, it's not written out anywhere. And when it gets faulted back in it does the right thing -- see above.Presumably mlock calls are redirected to some function pointer so I can just return EINTR?
Or just don't worry about it. If someone mlocks() it, that's their problem. The point is that no one needs to.
quoted
Zero on fork? I'm sure that's manageable with a special mapping. If not, you can add a new vm operation or similar to make it work. (Kind of like how we extended special mappings to get mremap right a couple years go.) But maybe you don't want to *zero* it on fork and you want to do something more intelligent. Fine -- you control ->fault!Doing something more intelligent would be an interesting development, I guess... But, before I think about that, all mapping have flags; couldn't I *still* set VM_WIPEONFORK on the special mapping? Or does the API you have in mind not work that way? (Side note: I also want VM_DONTDUMP to work.)
You really want unmap (the pages, not the vma) on fork, not wipe on fork. It'll be VM_SHARED, and I'm not sure what VM_WIPEONFORK | VM_SHARED does.