Re: [RFC 1/2] arm64/mm: Change THP helpers to comply with generic MM semantics
From: Catalin Marinas <catalin.marinas@arm.com>
Date: 2019-07-03 17:52:57
Also in:
linux-mm, lkml
On Tue, Jul 02, 2019 at 09:07:28AM +0530, Anshuman Khandual wrote:
On 06/28/2019 03:50 PM, Catalin Marinas wrote:quoted
On Thu, Jun 27, 2019 at 06:18:15PM +0530, Anshuman Khandual wrote:quoted
pmd_present() and pmd_trans_huge() are expected to behave in the following manner during various phases of a given PMD. It is derived from a previous detailed discussion on this topic [1] and present THP documentation [2]. pmd_present(pmd): - Returns true if pmd refers to system RAM with a valid pmd_page(pmd) - Returns false if pmd does not refer to system RAM - Invalid pmd_page(pmd) pmd_trans_huge(pmd): - Returns true if pmd refers to system RAM and is a trans huge mapping
[...]
quoted
Before we actually start fixing this, I would strongly suggest that you add a boot selftest (see lib/Kconfig.debug for other similar cases) which checks the consistency of the page table macros w.r.t. the expected mm semantics. Once the mm maintainers agreed with the semantics, it will really help architecture maintainers in implementing them correctly.Sure and it will help all architectures to be in sync wrt semantics.quoted
You wouldn't need actual page tables, just things like assertions on pmd_trans_huge(pmd_mkhuge(pmd)) == true. You could go further and have checks on pmdp_invalidate(&dummy_vma, dummy_addr, &dummy_pmd) with the dummy_* variables on the stack.Hmm. I guess macros which operate directly on a page table entry will be okay but the ones which check on specific states for VMA or MM might be bit tricky. Try to emulate VMA/MM states while on stack ?. But sure, will explore adding such a test.
You can pretend that the page table is on the stack. See the _pmd variable in do_huge_pmd_wp_page_fallback() and __split_huge_zero_page_pmd(). Similarly, the vma and even the mm can be faked on the stack (see the arm64 tlb_flush()).
quoted
quoted
The problem: PMD is first invalidated with pmdp_invalidate() before it's splitting. This invalidation clears PMD_SECT_VALID as below. PMD Split -> pmdp_invalidate() -> pmd_mknotpresent -> Clears PMD_SECT_VALID Once PMD_SECT_VALID gets cleared, it results in pmd_present() return false on the PMD entry.I think that's an inconsistency in the expected semantics here. Do you mean that pmd_present(pmd_mknotpresent(pmd)) should be true? If not, do
[...]
pmd_present() and pmd_mknotpresent() are not exact inverse.
I find this very confusing (not your fault, just the semantics expected by the core code). I can see that x86 is using _PAGE_PSE to make pmd_present(pmd_mknotpresent()) == true. However, for pud that's not the case (because it's not used for transhuge). I'd rather have this renamed to pmd_mknotvalid().
In absence of a positive section mapping bit on arm64, PTE_SPECIAL is being set temporarily to remember that it was a mapped PMD which got invalidated recently but which still points to memory. Hence pmd_present() must evaluate true.
I wonder if we can encode this safely for arm64 in the bottom two bits of a pmd : 0b00 - not valid, not present 0b10 - not valid, present, huge 0b01 - valid, present, huge 0b11 - valid, table (not huge) Do we ever call pmdp_invalidate() on a table entry? I don't think we do. So a pte_mknotvalid would set bit 1 and I think swp_entry_to_pmd() would have to clear it so that pmd_present() actually returns false for a swp pmd entry.
quoted
we need to implement our own pmdp_invalidate() or change the generic one to set a "special" bit instead of just a pmd_mknotpresent?Though arm64 can subscribe __HAVE_ARCH_PMDP_INVALIDATE and implement it's own pmdp_invalidate() in order to not call pmd_mknotpresent() and instead operate on the invalid and special bits directly. But its not going to alter relevant semantics here. AFAICS it might be bit better as it saves pmd_mknotpresent() from putting in that special bit in there which it is not supposed do. IFAICS there is no compelling reason for generic pmdp_invalidate() to change either. It calls pmd_mknotpresent() which invalidates the entry through valid or present bit and platforms which have dedicated huge page bit can still test positive for pmd_present() after it's invalidation. It works for such platforms. Platform specific override is required when invalidation via pmd_mknotpresent() is not enough.
I'd really like the mknotpresent to be renamed to mknotvalid and then we can keep pmdp_invalidate unchanged (well, calling mknotvalid instead). -- Catalin _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel