Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page

From: David Hildenbrand <hidden>
Date: 2022-08-26 14:47:45
Also in: linux-mm, lkml, stable

On 26.08.22 16:32, Peter Xu wrote:

On Fri, Aug 26, 2022 at 11:02:58AM +1000, Alistair Popple wrote:

quoted

Peter Xu [off-list ref] writes:

quoted

On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:

quoted

Peter Xu [off-list ref] writes:

quoted

On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:

quoted

migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
installs migration entries directly if it can lock the migrating page.
When removing a dirty pte the dirty bit is supposed to be carried over
to the underlying page to prevent it being lost.

Currently migrate_vma_*() can only be used for private anonymous
mappings. That means loss of the dirty bit usually doesn't result in
data loss because these pages are typically not file-backed. However
pages may be backed by swap storage which can result in data loss if an
attempt is made to migrate a dirty page that doesn't yet have the
PageDirty flag set.

In this case migration will fail due to unexpected references but the
dirty pte bit will be lost. If the page is subsequently reclaimed data
won't be written back to swap storage as it is considered uptodate,
resulting in data loss if the page is subsequently accessed.

Prevent this by copying the dirty bit to the page when removing the pte
to match what try_to_migrate_one() does.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reported-by: Huang Ying <redacted>
Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
Cc: stable@vger.kernel.org

---

Changes for v3:

 - Defer TLB flushing
 - Split a TLB flushing fix into a separate change.

Changes for v2:

 - Fixed up Reported-by tag.
 - Added Peter's Acked-by.
 - Atomically read and clear the pte to prevent the dirty bit getting
   set after reading it.
 - Added fixes tag
---
 mm/migrate_device.c |  9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6a5ef9f..51d9afa 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c

@@ -7,6 +7,7 @@
 #include <linux/export.h>
 #include <linux/memremap.h>
 #include <linux/migrate.h>
+#include <linux/mm.h>
 #include <linux/mm_inline.h>
 #include <linux/mmu_notifier.h>
 #include <linux/oom.h>

@@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
 			if (anon_exclusive) {
 				flush_cache_page(vma, addr, pte_pfn(*ptep));
-				ptep_clear_flush(vma, addr, ptep);
+				pte = ptep_clear_flush(vma, addr, ptep);

 				if (page_try_share_anon_rmap(page)) {
 					set_pte_at(mm, addr, ptep, pte);

@@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 					goto next;
 				}
 			} else {
-				ptep_get_and_clear(mm, addr, ptep);
+				pte = ptep_get_and_clear(mm, addr, ptep);
 			}

I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are
moved above the condition check so they're called unconditionally.  Could
you explain the rational on why it's changed back (since I think v2 was the
correct approach)?

Mainly because I agree with your original comments, that it would be
better to keep the batching of TLB flushing if possible. After the
discussion I don't think there is any issues with HW pte dirty bits
here. There are already other cases where HW needs to get that right
anyway (eg. zap_pte_range).

Yes tlb batching was kept, thanks for doing that way.  Though if only apply
patch 1 we'll have both ptep_clear_flush() and batched flush which seems to
be redundant.

quoted

The other question is if we want to split the patch, would it be better to
move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?

Isn't that already the case? Patch 1 moves the TLB flush before the PTL
as suggested, patch 2 atomically copies the dirty bit without changing
any TLB flushing.

IMHO it's cleaner to have patch 1 fix batch flush, replace
ptep_clear_flush() with ptep_get_and_clear() and update pte properly.

Which ptep_clear_flush() are you referring to? This one?

			if (anon_exclusive) {
				flush_cache_page(vma, addr, pte_pfn(*ptep));
				ptep_clear_flush(vma, addr, ptep);

Correct.

quoted

My understanding is that we need to do a flush for anon_exclusive.

To me anon exclusive only shows this mm exclusively owns this page. I
didn't quickly figure out why that requires different handling on tlb
flushs.  Did I perhaps miss something?

GUP-fast is the magic bit, we have to make sure that we won't see new
GUP pins, thus the TLB flush.

include/linux/mm.h:gup_must_unshare() contains documentation.

Without GUP-fast, some things would be significantly easier to handle.


-- 
Thanks,

David / dhildenb

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help