Re: [PATCH v3 2/8] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES
From: Peter Xu <peterx@redhat.com>
Date: 2024-08-04 15:06:12
Also in:
linux-mm, lkml
On Wed, Jul 31, 2024 at 02:18:26PM +0200, David Hildenbrand wrote:
On 15.07.24 21:21, Peter Xu wrote:quoted
In 2013, commit 72403b4a0fbd ("mm: numa: return the number of base pages altered by protection changes") introduced "numa_huge_pte_updates" vmstat entry, trying to capture how many huge ptes (in reality, PMD thps at that time) are marked by NUMA balancing. This patch proposes to remove it for some reasons. Firstly, the name is misleading. We can have more than one way to have a "huge pte" at least nowadays, and that's also the major goal of this patch, where it paves way for PUD handling in change protection code paths. PUDs are coming not only for dax (which has already came and yet broken..), but also for pfnmaps and hugetlb pages. The name will simply stop making sense when PUD will start to be involved in mprotect() world. It'll also make it not reasonable either if we boost the counter for both pmd/puds. In short, current accounting won't be right when PUD comes, so the scheme was only suitable at that point in time where PUD wasn't even possible. Secondly, the accounting was simply not right from the start as long as it was also affected by other call sites besides NUMA. mprotect() is one, while userfaultfd-wp also leverages change protection path to modify pgtables. If it wants to do right it needs to check the caller but it never did; at least mprotect() should be there even in 2013. It gives me the impression that nobody is seriously using this field, and it's also impossible to be serious.It's weird and the implementation is ugly. The intention really was to only consider MM_CP_PROT_NUMA, but that apparently is not the case. hugetlb/mprotect/... should have never been accounted. [...]quoted
diff --git a/mm/vmstat.c b/mm/vmstat.c index 73d791d1caad..53656227f70d 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c@@ -1313,7 +1313,6 @@ const char * const vmstat_text[] = { #ifdef CONFIG_NUMA_BALANCING "numa_pte_updates", - "numa_huge_pte_updates", "numa_hint_faults", "numa_hint_faults_local", "numa_pages_migrated",It's a user-visible update. I assume most tools should be prepared for this stat missing (just like handling !CONFIG_NUMA_BALANCING). Apparently it's documented [1][2] for some distros:
Yes, and AFAIU, [2] is a document to explain an issue relevant to numa balancing, and I'd highly doubt [2] referenced [1] here; even the order of the parameters are the same to be listed.
"The amount of transparent huge pages that were marked for NUMA hinting faults. In combination with numa_pte_updates the total address space that was marked can be calculated." And now I realize that change_prot_numa() would account these PMD updates as well in numa_pte_updates and I am confused about the SUSE documentation: "In combination with numa_pte_updates" doesn't really apply, right? At this point I don't know what's right or wrong.
Me neither, even without PUD involvement. Talking about numa_pte_updates, hugetlb_change_protection() returns "number of huge ptes", so one 2M hugetlb page is accounted once; while comparing to the generic THP (change_protection_range()) it's HPAGE_PUD_NR. It'll make more sense to me if it sticks with PAGE_SIZE. So all these counters look a bit confusing.
If we'd want to fix it instead, the right thing to do would be doing the accounting only with MM_CP_PROT_NUMA. But then, numa_pte_updates is also wrongly updated I believe :(
Right. I don't have a reason to change numa_pte_updates semantics yet so far, but here there's the problem where numa_huge_pte_updates can be ambiguous when there is even PUD involved. In general, I don't know how I should treat this counter in PUD path even if NUMA isn't involved in dax yet; it can be soon involved if we move on with using this same path for hugetlb, or when 1G thp can be possible (with Yu Zhao's TAO?). One other thing I can do is I drop this patch, ignore NUMA_HUGE_PTE_UPDATES in PUD dax processing for now. It'll work for this series, but it'll still be a problem later. I figured maybe we should simply drop it from now. Thanks,
[1] https://documentation.suse.com/de-de/sles/12-SP5/html/SLES-all/cha-tuning-numactl.html [2] https://support.oracle.com/knowledge/Oracle%20Linux%20and%20Virtualization/2749259_1.html -- Cheers, David / dhildenb
-- Peter Xu