Re: [PATCH] mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions
From: Dan Williams <hidden>
Date: 2021-01-12 09:16:06
Also in:
lkml
On Wed, Jan 6, 2021 at 1:55 AM Michal Hocko [off-list ref] wrote:
On Tue 05-01-21 20:07:18, Dan Williams wrote:quoted
While pfn_to_online_page() is able to determine pfn_valid() at subsection granularity it is not able to reliably determine if a given pfn is also online if the section is mixed with ZONE_DEVICE pfns.I would call out the problem more explicitly. E.g. something like " This means that pfn_to_online_page can lead to false positives and allow to return a struct page which is not fully initialized because it belongs to ZONE_DEVICE (PUT AN EXAMPLE OF SUCH AN UNITIALIZED PAGE HERE). That can lead to either crash on PagePoisoned assertion or a silently broken page state with debugging disabled. "
Apologies for the long pause in this conversation, I went off and wrote a test to trigger this condition so I could quote it directly. It turns out soft_offline_page(), even before fixing pfn_to_online_page(), is broken as it leaks a page reference.
I would also appreciate a more specific note about how the existing HW can trigger this. You have mentioned 64MB subsection examples in the other email. It would be great to mention it here as well.
Sure.
quoted
Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a section that mixes ZONE_DEVICE pfns with other online pfns. With SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall back to a slow-path check for ZONE_DEVICE pfns in an online section. With this implementation of pfn_to_online_page() pfn-walkers mostly only need to check section metadata to determine pfn validity. In the rare case of mixed-zone sections the pfn-walker will skip offline ZONE_DEVICE pfns as expected.The above paragraph is slightly confusing. You do not require pfn-walkers to check anything right? Section metadata is something that is and should be hidden from them. They are asking for an online page and get it or NULL. Nothing more nothing less.
Yeah, I'll drop it. I was describing what service pfn_to_online_page() performs for a pfn-walker, but it's awkwardly worded.
quoted
Other notes: Because the collision case is rare, and for simplicity, the SECTION_TAINT_ZONE_DEVICE flag is never cleared once set.I do not see a problem with that.quoted
pfn_to_online_page() was already borderline too large to be a macro / inline function, but the additional logic certainly pushed it over that threshold, and so it is moved to an out-of-line helper.Worth a separate patch. The approach is sensible. Thanks! I was worried that we do not have sufficient space for a new flag but the comment explains we have 6 bits available. I haven't double checked that for the current state of the code. The comment is quite recent and I do not remember any substantial changes in this area. Still something that is rather fragile because an unexpected alignment would be a runtime failure which is good to stop random corruptions but not ideal. Is there any way to explicitly test for this? E.g. enforce a shared section by qemu and then trigger a pfn walk?quoted
Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") Cc: Andrew Morton <akpm@linux-foundation.org> Reported-by: Michal Hocko <mhocko@suse.com> Reported-by: David Hildenbrand <redacted> Signed-off-by: Dan Williams <redacted>[...]quoted
+static int zone_id(const struct zone *zone) +{ + struct pglist_data *pgdat = zone->zone_pgdat; + + return zone - pgdat->node_zones; +}We already have zone_idx()
Noted.
quoted
+ +static void section_taint_zone_device(struct zone *zone, unsigned long pfn) +{ + struct mem_section *ms = __nr_to_section(pfn_to_section_nr(pfn)); + + if (zone_id(zone) != ZONE_DEVICE) + return; + + if (IS_ALIGNED(pfn, PAGES_PER_SECTION)) + return; + + ms->section_mem_map |= SECTION_TAINT_ZONE_DEVICE; +} + /* * Associate the pfn range with the given zone, initializing the memmaps * and resizing the pgdat/zone data to span the added pages. After this@@ -707,6 +769,15 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, resize_pgdat_range(pgdat, start_pfn, nr_pages); pgdat_resize_unlock(pgdat, &flags); + /* + * Subsection population requires care in pfn_to_online_page(). + * Set the taint to enable the slow path detection of + * ZONE_DEVICE pages in an otherwise ZONE_{NORMAL,MOVABLE} + * section. + */ + section_taint_zone_device(zone, start_pfn); + section_taint_zone_device(zone, start_pfn + nr_pages);I think it would be better to add the checks here and only set the flag in the called function. SECTION_TAINT_ZONE_DEVICE should go to where we define helpers for ther flags.
Done.