Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Date: 2023-01-25 18:44:50
Also in:
linux-api, linux-arch, linux-mm, lkml
On Wed, 2023-01-25 at 10:27 +0100, David Hildenbrand wrote:
quoted
quoted
Roughly speaking: if we abstract it that way and get all of the "how to set it writable now?" out of core-MM, it not only is cleaner and less error prone, it might even allow other architectures that implement something comparable (e.g., using a dedicated HW bit) to actually reuse some of that work. Otherwise most of that "shstk" is really just x86 specific ... I guess the only cases we have to special case would be page pinning code where pte_write() would indicate that the PTE is writable (well, it is, just not by "ordinary CPU instruction" context directly): but you do that already, so ... :) Sorry for stumbling over that this late, I only started looking into this when you CCed me on that one patch.Sorry for not calling more attention to it earlier. Appreciate your comments. Previously versions of this series had changed some of these pte_mkwrite() calls to maybe_mkwrite(), which of course takes a vma. This way an x86 implementation could use the VM_SHADOW_STACK vma flag to decide between pte_mkwrite() and pte_mkwrite_shstk(). The feedback was that in some of these code paths "maybe" isn't really an option, it *needs* to make it writable. Even though the logic was the same, the name of the function made it look wrong. But another option could be to change pte_mkwrite() to take a vma. This would save using another software bit on x86, but instead requires a small change to each arch's pte_mkwrite().I played with that idea shortly as well, but discarded it. I was not able to convince myself that it wouldn't be required to pass in the VMA as well for things like pte_dirty(), pte_mkdirty(), pte_write(), ... which would end up fairly ugly (or even impossible in thing slike GUP-fast). For example, I wonder how we'd be handling stuff like do_numa_page() cleanly correctly, where we use pte_modify() + pte_mkwrite(), and either call might set the PTE writable and maintain dirty bit ...
pte_modify() is handled like this currently: https://lore.kernel.org/lkml/20230119212317.8324-12-rick.p.edgecombe@intel.com/ (local) There has been a couple iterations on that. The current solution is to do the Dirty->SavedDirty fixup if needed after the new prots are added. Of course pte_modify() can't know whether you are are attempting to create a shadow stack PTE with the prot you are passing in. But the callers today explicitly call pte_mkwrite() after filling in the other bits with pte_modify(). Today this patch causes the pte_mkwrite() to be skipped and another fault may be required in the mprotect() and numa cases, but if we change pte_mkwrite() to take a VMA we can just make it shadow stack to start. It might be worth mentioning, there was a suggestion in the past to try to have the shadow stack bits come out of vm_get_page_prot(), but MM code would then try to map the zero page as (shadow stack) writable when there was a normal (non-shadow stack) read access. So I had to abandon that approach and rely on explicit calls to pte_mkwrite/shstk() to make it shadow stack.
Having that said, maybe it could work with only a single saved-dirty bit and passing in the VMA for pte_mkwrite() only. pte_wrprotect() would detect "writable=0,dirty=1" and move the dirty bit to the soft-dirty bit instead, resulting in "writable=0,dirty=0,saved-dirty=1", pte_dirty() would return dirty==1||saved-dirty==1. pte_mkdirty() would set either set dirty=1 or saved-dirty=1, depending on the writable bit. pte_mkclean() would clean both bits. pte_write() would detect "writable == 1 || (writable==0 && dirty==1)" pte_mkwrite() would act according to the VMA, and in addition, merge the saved-dirty bit into the dirty bit. pte_modify() and mk_pte() .... would require more thought ...
Not sure I'm following what the mk_pte() problem would be. You mean if Write=0,Dirty=1 is manually added to the prot? Shouldn't people generally use the pte_mkwrite() helpers unless they are drawing from a prot that was already created with the helpers or vm_get_page_prot()? I think they can't manually create prot's from bits in core mm code, right? And x86 arch code already has to be aware of shadow stack. It's a bit of an assumption I guess, but I think maybe not too crazy of one?
Further, ptep_modify_prot_commit() might have to be adjusted to properly flush in all relevant cases IIRC.
Sorry, I'm not following. Can you elaborate? There is an adjustment made in pte_flags_need_flush().
quoted
x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(), but maybe it could additionally warn if the vma is not writable. It also seems more aligned with your changes to stop taking hints from PTE bits and just look at the VMA? (I'm thinking about the dropping of the dirty check in GUP and dropping pte_saved_write())The soft-shstk bit wouldn't be a hint, it would be logically changing the "type" of the PTE such that any other PTE functions can do the right thing without having to consume the VMA.
Yea, true. Thanks for your comments and ideas here, I'll give the: pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte) ...solution a try.