Re: [PATCH/RFC] mm: add and use batched version of __tlb_remove_table()

From: Dave Hansen <hidden>
Date: 2021-12-17 18:26:44
Also in: linux-arch, linux-mm, linux-s390, lkml, sparclinux

On 12/17/21 12:19 AM, Nikita Yushchenko wrote:

When batched page table freeing via struct mmu_table_batch is used, the
final freeing in __tlb_remove_table_free() executes a loop, calling
arch hook __tlb_remove_table() to free each table individually.

Shift that loop down to archs. This allows archs to optimize it, by
freeing multiple tables in a single release_pages() call. This is
faster than individual put_page() calls, especially with memcg
accounting enabled.

Could we quantify "faster"?  There's a non-trivial amount of code being
added here and it would be nice to back it up with some cold-hard numbers.


--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c

@@ -95,11 +95,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
 
 static void __tlb_remove_table_free(struct mmu_table_batch *batch)
 {
-	int i;
-
-	for (i = 0; i < batch->nr; i++)
-		__tlb_remove_table(batch->tables[i]);
-
+	__tlb_remove_tables(batch->tables, batch->nr);
 	free_page((unsigned long)batch);
 }

This leaves a single call-site for __tlb_remove_table():

static void tlb_remove_table_one(void *table)
{
        tlb_remove_table_sync_one();
        __tlb_remove_table(table);
}

Is that worth it, or could it just be:

	__tlb_remove_tables(&table, 1);

?
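A rough userspace sketch of that idea, in case it helps the discussion. The bodies of tlb_remove_table_sync_one() and __tlb_remove_tables() below are counting stand-ins invented for illustration, not the kernel implementations; only the shape of tlb_remove_table_one() is the point.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for illustration only, NOT the kernel definitions. */
static int sync_calls;		/* how often the sync stand-in ran */
static int freed_tables;	/* how many tables reached the free path */

/* Hypothetical stand-in for tlb_remove_table_sync_one(). */
static void tlb_remove_table_sync_one(void)
{
	sync_calls++;
}

/* Hypothetical stand-in for the batched arch hook proposed in the patch. */
static void __tlb_remove_tables(void **tables, int nr)
{
	int i;

	assert(nr >= 0);
	for (i = 0; i < nr; i++)
		if (tables[i] != NULL)
			freed_tables++;
}

/* The single-table path expressed as a batch of one. */
static void tlb_remove_table_one(void *table)
{
	tlb_remove_table_sync_one();
	__tlb_remove_tables(&table, 1);
}
```

With this shape, the batched hook is the only arch entry point and the non-batched __tlb_remove_table() has no remaining callers.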

-void free_pages_and_swap_cache(struct page **pages, int nr)
+static void __free_pages_and_swap_cache(struct page **pages, int nr,
+		bool do_lru)
 {
-	struct page **pagep = pages;
 	int i;
 
-	lru_add_drain();
+	if (do_lru)
+		lru_add_drain();
 	for (i = 0; i < nr; i++)
-		free_swap_cache(pagep[i]);
-	release_pages(pagep, nr);
+		free_swap_cache(pages[i]);
+	release_pages(pages, nr);
+}
+
+void free_pages_and_swap_cache(struct page **pages, int nr)
+{
+	__free_pages_and_swap_cache(pages, nr, true);
+}
+
+void free_pages_and_swap_cache_nolru(struct page **pages, int nr)
+{
+	__free_pages_and_swap_cache(pages, nr, false);
 }

This went unmentioned in the changelog.  But, it seems like there's a
specific optimization here.  In the existing code,
free_pages_and_swap_cache() is wasteful if no page in pages[] is on the
LRU.  It doesn't need the lru_add_drain().

Any code that knows it is freeing all non-LRU pages can call
free_pages_and_swap_cache_nolru() which should perform better than
free_pages_and_swap_cache().

Should we add this to the for loop in __free_pages_and_swap_cache()?

	for (i = 0; i < nr; i++) {
		if (!do_lru)
			VM_WARN_ON_ONCE_PAGE(PageLRU(pages[i]),
					     pages[i]);
		free_swap_cache(...);
	}
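To make the intent of that check concrete, here is a minimal userspace model. Everything in it is an assumption made for illustration: struct page is a two-field toy, lru_add_drain(), free_swap_cache() and release_pages() are counting stubs, and the warning is modeled as a counter that increments for every offending page instead of warning once like VM_WARN_ON_ONCE_PAGE().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page with only the state this sketch needs (hypothetical). */
struct page {
	bool on_lru;
	bool in_swap_cache;
};

static int lru_drains;		/* times the drain stub ran */
static int lru_warnings;	/* times the debug check fired */
static int pages_released;	/* pages handed to the release stub */

static void lru_add_drain(void)
{
	lru_drains++;
}

static void free_swap_cache(struct page *page)
{
	page->in_swap_cache = false;
}

static void release_pages(struct page **pages, int nr)
{
	(void)pages;
	pages_released += nr;
}

static void __free_pages_and_swap_cache(struct page **pages, int nr,
					bool do_lru)
{
	int i;

	assert(nr >= 0);
	if (do_lru)
		lru_add_drain();
	for (i = 0; i < nr; i++) {
		/* Caller claimed "no LRU pages"; flag any page that breaks that. */
		if (!do_lru && pages[i]->on_lru)
			lru_warnings++;
		free_swap_cache(pages[i]);
	}
	release_pages(pages, nr);
}
```

The check costs nothing when do_lru is true and catches callers of the _nolru variant that pass an LRU page by mistake.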

But, even more than that, do all the architectures even need the
free_swap_cache()?  PageSwapCache() will always be false for page table
pages on x86, which makes the loop kinda silly.  x86 could, for instance,
just do:

static inline void __tlb_remove_tables(void **tables, int nr)
{
	release_pages((struct page **)tables, nr);
}

I _think_ this will work everywhere that has whole pages as page tables.
Taking that one step further, what if we only had one generic:

static inline void tlb_remove_tables(void **tables, int nr)
{
#ifdef ARCH_PAGE_TABLES_ARE_FULL_PAGE
	release_pages((struct page **)tables, nr);
#else
	arch_tlb_remove_tables(tables, nr);
#endif
}

Architectures that set ARCH_PAGE_TABLES_ARE_FULL_PAGE (or whatever)
don't need to implement __tlb_remove_table() at all *and* can do
release_pages() directly.

This avoids all the confusion with the swap cache and LRU naming.
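
For illustration, a buildable userspace sketch of that generic helper, passing nr through to the arch fallback. ARCH_PAGE_TABLES_ARE_FULL_PAGE and arch_tlb_remove_tables() are the placeholder names from the suggestion above, not existing kernel symbols, and release_pages() here is a counting stub standing in for the real function.

```c
#include <assert.h>

/* Placeholder config symbol from the suggestion; defined so the sketch builds. */
#define ARCH_PAGE_TABLES_ARE_FULL_PAGE 1

struct page;	/* opaque in this sketch */

static int generic_released;	/* pages freed through the generic path */

/* Counting stub standing in for release_pages(). */
static void release_pages(struct page **pages, int nr)
{
	(void)pages;
	generic_released += nr;
}

#ifndef ARCH_PAGE_TABLES_ARE_FULL_PAGE
static int arch_released;	/* tables freed through the arch hook */

/* Hypothetical arch hook for page tables that are not whole pages. */
static void arch_tlb_remove_tables(void **tables, int nr)
{
	(void)tables;
	arch_released += nr;
}
#endif

static inline void tlb_remove_tables(void **tables, int nr)
{
	assert(nr >= 0);
#ifdef ARCH_PAGE_TABLES_ARE_FULL_PAGE
	/* Whole-page tables: free them directly as a batch. */
	release_pages((struct page **)tables, nr);
#else
	arch_tlb_remove_tables(tables, nr);
#endif
}
```

Only architectures that leave the symbol undefined would need to provide the arch hook.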
