Re: [PATCH] mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions
From: David Hildenbrand <hidden>
Date: 2021-01-12 09:45:52
Also in:
lkml
On 12.01.21 10:18, Dan Williams wrote:
On Thu, Jan 7, 2021 at 1:16 AM David Hildenbrand [off-list ref] wrote:quoted
[...]quoted
quoted
quoted
Well, I would love to have no surprises either. So far there was not actual argument why the pmem reserved space cannot be fully initialized.Yes, I'm still hoping Dan can clarify that.Complexity and effective utility (once pfn_to_online_page() is fixed) are the roadblocks in my mind. The altmap is there to allow for PMEM capacity to be used as memmap space, so there would need to be code to break that circular dependency and allocate a memmap for the metadata space from DRAM and the rest of the memmap space for the data capacity from pmem itself. That memmap-for-pmem-metadata will still represent offline pages. So once pfn_to_online_page() is fixed, what pfn-walker is going to be doing pfn_to_page() on PMEM metadata? Secondly, thereAssume I do pgmap = get_dev_pagemap(pfn, NULL); if (pgmap) return pfn_to_page(pfn); return NULL; on a random pfn because I want to inspect ZONE_DEVICE PFNs.I keep getting hung up on the motivation to do random pfn inspection? The problems we have found to date have required different solutions. The KVM bug didn't use get_dev_pagemap() to inspect the pfn because it could rely on the fact that the page already had an elevated reference count. The get_user_pages() path only looks up ZONE_DEVICE pfns when it see {pte,pmd,pud}_devmap set in the page table entry. pfn walkers have been a problem, but with pfn_to_online_page() fixed what is the remaining motivation to inspect ZONE_DEVICE pfns?
1) Let's assume we want to implement zone shrinking (remove_pfn_range_from_zone()->shrink_zone_span()) for ZONE_DEVICE at some point. A simple approach would be going via get_dev_pagemap(pfn, NULL)->pfn_to_page(pfn), checking for the zone. If that's not possible, then extending dev_pagemap (e.g., indicating the nid) might also work (unless there is another way to get the nid). 2) Let's take a look at mm/memory-failure.c:memory_failure_dev_pagemap() IIUC, we might end up doing pfn_to_page(pfn) on a pfn in the reserved altmap space, so one with an uninitialized memmap. E.g., in dax_lock_page() we access page->mapping, which might just be garbage. dax_mapping() will de-reference garbage. Most probably I am missing something here. Question is: what are the expectations regarding the memmap if get_dev_pagemap() succeeded. I'm fine documenting that "get_dev_pagemap() does not guarantee that the "struct page" returned by pfn_to_page() was initialized and can safely be used. E.g., it might be a pfn in the reserved altmap space, for which the memmap is never initialized. Accessing it might be dangerous.". Then, there has to be a check at relevant places (e.g., memory_failure_dev_pagemap()), checking somehow if the memmap content can actually be used. -- Thanks, David / dhildenb