Re: [PATCH v4 12/25] mm/memory: Enhance insert_page_into_pte_locked() to create writable mappings
From: Alistair Popple <apopple@nvidia.com>
Date: 2025-01-06 02:08:11
Also in:
linux-arm-kernel, linux-cxl, linux-doc, linux-ext4, linux-fsdevel, linux-mm, linux-xfs, lkml, nvdimm
On Fri, Dec 20, 2024 at 08:06:48PM +0100, David Hildenbrand wrote:
On 20.12.24 20:01, David Hildenbrand wrote:quoted
On 17.12.24 06:12, Alistair Popple wrote:quoted
In preparation for using insert_page() for DAX, enhance insert_page_into_pte_locked() to handle establishing writable mappings. Recall that DAX returns VM_FAULT_NOPAGE after installing a PTE which bypasses the typical set_pte_range() in finish_fault. Signed-off-by: Alistair Popple <apopple@nvidia.com> Suggested-by: Dan Williams <redacted> --- Changes since v2: - New patch split out from "mm/memory: Add dax_insert_pfn" --- mm/memory.c | 45 +++++++++++++++++++++++++++++++++++++-------- 1 file changed, 37 insertions(+), 8 deletions(-)diff --git a/mm/memory.c b/mm/memory.c index 06bb29e..cd82952 100644 --- a/mm/memory.c +++ b/mm/memory.c@@ -2126,19 +2126,47 @@ static int validate_page_before_insert(struct vm_area_struct *vma, } static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, - unsigned long addr, struct page *page, pgprot_t prot) + unsigned long addr, struct page *page, + pgprot_t prot, bool mkwrite) { struct folio *folio = page_folio(page); + pte_t entry = ptep_get(pte); pte_t pteval; - if (!pte_none(ptep_get(pte))) - return -EBUSY; + if (!pte_none(entry)) { + if (!mkwrite) + return -EBUSY; + + /* + * For read faults on private mappings the PFN passed in may not + * match the PFN we have mapped if the mapped PFN is a writeable + * COW page. In the mkwrite case we are creating a writable PTE + * for a shared mapping and we expect the PFNs to match. If they + * don't match, we are likely racing with block allocation and + * mapping invalidation so just skip the update. + */Would it make sense to instead have here /* See insert_pfn(). */ But ...quoted
+ if (pte_pfn(entry) != page_to_pfn(page)) { + WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry))); + return -EFAULT; + } + entry = maybe_mkwrite(entry, vma); + entry = pte_mkyoung(entry); + if (ptep_set_access_flags(vma, addr, pte, entry, 1)) + update_mmu_cache(vma, addr, pte);... I am not sure if we want the above at all. Someone inserted a page, which is refcounted + mapcounted already. Now you ignore that and do like the second insertion "worked" ? No, that feels wrong, I suspect you will run into refcount+mapcount issues. If there is already something, inserting must fail IMHO. If you want to change something to upgrade write permissions, then a different interface should be used.Ah, now I realize that the early exit saves you because we won't adjust the refcount +mapcount.
Right.
I still wonder if that really belongs in here, I would prefer to not play such tricks to upgrade write permissions if possible.
As you have pointed out this was all inspired (ie. mostly copied)
from the existing insert_pfn() implementation which is used from
vmf_insert_mixed{_mkwrite}().
I agree a different interface to upgrade permissions would be nice. However
it's tricky because in general callers of these functions (eg. FS DAX) aren't
aware if the page is already mapped by a PTE/PMD. They only know a fault has
occured and the faulting permissions.
This wouldn't be impossible to fix - the mm does provide vm_ops->page_mkwrite()
for permission upgrades. The difficulty is that most filesystems that support
FS DAX (ie. ext4, XFS) don't treat a vm_ops->page_mkwrite() call any differently
from a vm_ops->fault() call due to write fault. Therefore the FS DAX code is
unaware of whether or not this is a permission upgrade or initial writeable
mapping of the page in the VMA.
A further issue in there is currently no vm_ops->huge_mkwrite() callback.
Obviously this could all be plumbed through the MM/FS layers, but that would
require a separate patch series. Given the current implementation has no issues
beyond the cosmetic I'd rather not delay this series any longer, especially as
the cosmetic defect is largely pre-existing (vmf_insert_mixed{_mkwrite}() could
have equally had a separate upgrade interface).
-- Cheers, David / dhildenb