Thread (26 messages) 26 messages, 3 authors, 2021-05-13

Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature

From: Steven Price <steven.price@arm.com>
Date: 2021-05-13 10:57:46
Also in: kvmarm, lkml, qemu-devel

On 12/05/2021 18:45, Catalin Marinas wrote:
On Wed, May 12, 2021 at 04:46:48PM +0100, Steven Price wrote:
quoted
On 10/05/2021 19:35, Catalin Marinas wrote:
quoted
On Fri, May 07, 2021 at 07:25:39PM +0100, Catalin Marinas wrote:
quoted
On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
quoted
On 04/05/2021 18:40, Catalin Marinas wrote:
quoted
On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
quoted
Given the changes to set_pte_at() which means that tags are restored from
swap even if !PROT_MTE, the only race I can see remaining is the creation of
new PROT_MTE mappings. As you mention an attempt to change mappings in the
VMM memory space should involve a mmu notifier call which I think serialises
this. So the remaining issue is doing this in a separate address space.

So I guess the potential problem is:

   * allocate memory MAP_SHARED but !PROT_MTE
   * fork()
   * VM causes a fault in parent address space
   * child does a mprotect(PROT_MTE)

With the last two potentially racing. Sadly I can't see a good way of
handling that.
[...]
quoted
quoted
Options:

1. Change the mte_sync_tags() code path to set the flag after clearing
    and avoid reading stale tags. We document that mprotect() on
    MAP_SHARED may lead to tag loss. Maybe we can intercept this in the
    arch code and return an error.
This is the best option I've come up with so far - but it's not a good
one! We can replace the set_bit() with a test_and_set_bit() to catch the
race after it has occurred - but I'm not sure what we can do about it
then (we've already wiped the data). Returning an error doesn't seem
particularly useful at that point, a message in dmesg is about the best
I can come up with.
What I meant about intercepting is on something like
arch_validate_flags() to prevent VM_SHARED and VM_MTE together but only
for mprotect(), not mmap(). However, arch_validate_flags() is currently
called on both mmap() and mprotect() paths.
I think even if we were to restrict mprotect() there would be corner
cases around swapping in. For example if a page mapped VM_SHARED|VM_MTE
is faulted simultaneously in both processes then we have the same situation:

 * with test_and_set_bit() one process could potentially see the tags
before they have been restored - i.e. a data leak.

 * with separated test and set then one process could write to the tags
before the second restore has completed causing a lost update.

Obviously completely banning VM_SHARED|VM_MTE might work, but I don't
think that's a good idea.
We can't do much in set_pte_at() to prevent the race with only a single
bit.
quoted
quoted
2. Figure out some other locking in the core code. However, if
    mprotect() in one process can race with a handle_pte_fault() in
    another, on the same shared mapping, it's not trivial.
    filemap_map_pages() would take the page lock before calling
    do_set_pte(), so mprotect() would need the same page lock.
I can't see how this is going to work without harming the performance of
non-MTE work. Ultimately we're trying to add some sort of locking for
two (mostly) unrelated processes doing page table operations, which will
hurt scalability.
Another option is to have an arch callback to force re-faulting on the
pte. That means we don't populate it back after the invalidation in the
change_protection() path. We could do this only if the new pte is tagged
and the page doesn't have PG_mte_tagged. The faulting path takes the
page lock IIUC.
As above - I don't think this race is just on the change_protection() path.
Well, at least for stage 1, I haven't thought much about stage 2.
quoted
quoted
3. Use another PG_arch_3 bit as a lock to spin on in the arch code (i.e.
    set it around the other PG_arch_* bit setting).
This is certainly tempting, although sadly the existing
wait_on_page_bit() is sleeping - so this would either be a literal spin,
or we'd need to implement a new non-sleeping wait mechanism.
Yeah, it would have to be a custom spinning mechanism, something like:

	/* lock the page */
	while (test_and_set_bit(PG_arch_3, &page->flags))
		smp_cond_load_relaxed(&page->flags, !(VAL & PG_arch_3));
	...
	/* unlock the page */
	clear_bit(PG_arch_3, &page->flags);
Presumably we'd also need to ensure interrupts are disabled to ensure
we're not pre-empted in the middle and potentially deadlock. It's
doable, but I'd prefer not to invent a new lock type if possible.
quoted
4. Sledgehammer locking in mte_sync_page_tags(), add a spinlock only for
the MTE case where we have to sync tags (see below). What the actual
performance impact of this is I've no idea. It could certainly be bad
if there are a lot of pages with MTE enabled, which sadly is exactly
the case if KVM is used with MTE :(

--->8----
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 0d320c060ebe..389ad40256f6 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -25,23 +25,33 @@
 u64 gcr_kernel_excl __ro_after_init;
 static bool report_fault_once = true;
+static spinlock_t tag_sync_lock;
 static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
 			       bool pte_is_tagged)
 {
 	pte_t old_pte = READ_ONCE(*ptep);
+	if (!is_swap_pte(old_pte) && !pte_is_tagged)
+		return;
+
+	spin_lock_irqsave(&tag_sync_lock, flags);
+
+	/* Recheck with the lock held */
+	if (test_bit(PG_mte_tagged, &page->flags))
+		goto out;
Can we skip the lock if the page already has the PG_mte_tagged set?
That's assuming that we set the flag only after clearing the tags. The
normal case where mprotect() is called on a page already mapped with
PROT_MTE would not be affected.
It was missing from the diff context (sorry, should have double checked
that), but I was keeping the check in mte_sync_tags():

  void mte_sync_tags(pte_t *ptep, pte_t pte)
  {
	struct page *page = pte_page(pte);
	long i, nr_pages = compound_nr(page);
	bool check_swap = nr_pages == 1;
	bool pte_is_tagged = pte_tagged(pte);
	unsigned long flags;

	/* Early out if there's nothing to do */
	if (!check_swap && !pte_is_tagged)
		return;

	/* if PG_mte_tagged is set, tags have already been initialised */
	for (i = 0; i < nr_pages; i++, page++) {
		if (!test_bit(PG_mte_tagged, &page->flags))
			mte_sync_page_tags(page, ptep, check_swap,
					   pte_is_tagged);
	}
  }

So the hit is only taken if !PG_mte_tagged - hence the "recheck" comment
in mte_sync_page_tags() once the lock is held. I guess if I'm going this
route it would make sense to refactor this to be a bit clearer.

Steve

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help