Re: [PATCH] memcg: charge before adding to swapcache on swapin

From: Johannes Weiner <hidden>
Date: 2021-02-20 00:36:12
Also in: linux-mm, lkml

On Fri, Feb 19, 2021 at 02:44:05PM -0800, Shakeel Butt wrote:

Currently the kernel adds the page, allocated for swapin, to the
swapcache before charging the page. This is fine but now we want a
per-memcg swapcache stat which is essential for folks who want to
transparently migrate from cgroup v1's memsw to cgroup v2's memory and
swap counters.

To correctly maintain the per-memcg swapcache stat, one option which
this patch has adopted is to charge the page before adding it to
swapcache. One challenge in this option is the failure case of
add_to_swap_cache() on which we need to undo the mem_cgroup_charge().
Specifically, undoing mem_cgroup_uncharge_swap() is not simple.

This patch circumvents this specific issue by removing the failure path
of add_to_swap_cache() by providing __GFP_NOFAIL. Please note that in
this specific situation ENOMEM was the only possible failure of
add_to_swap_cache(), which is removed by using __GFP_NOFAIL.

Another option was to use __mod_memcg_lruvec_state(NR_SWAPCACHE) in
mem_cgroup_charge() but then we need to take care of the do_swap_page()
case where synchronous swap devices bypass the swapcache. The
do_swap_page() already does hackery to set and reset PageSwapCache bit
to make mem_cgroup_charge() execute the swap accounting code and then we
would need to add an additional parameter to tell it not to touch the
NR_SWAPCACHE stat, as that code path bypasses the swapcache.

This patch adds a memcg charging API explicitly for swapin pages and
cleans up do_swap_page() to not set and reset PageSwapCache bit.

Signed-off-by: Shakeel Butt <redacted>

The patch makes sense to me. While it extends the charge interface, I
actually quite like that it charges the page earlier - before putting
it into wider circulation. It's a step in the right direction.

But IMO the semantics of mem_cgroup_charge_swapin_page() are a bit too
fickle: the __GFP_NOFAIL in add_to_swap_cache() works around it, but
having a must-not-fail-after-this line makes the code tricky to work
on and error prone.

It would be nicer to do a proper transaction sequence.


@@ -497,16 +497,15 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	__SetPageLocked(page);
 	__SetPageSwapBacked(page);
 
-	/* May fail (-ENOMEM) if XArray node allocation failed. */
-	if (add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow)) {
-		put_swap_page(page, entry);
+	if (mem_cgroup_charge_swapin_page(page, NULL, gfp_mask, entry))
 		goto fail_unlock;
-	}
 
-	if (mem_cgroup_charge(page, NULL, gfp_mask)) {
-		delete_from_swap_cache(page);
-		goto fail_unlock;
-	}
+	/*
+	 * Use __GFP_NOFAIL to not worry about undoing the changes done by
+	 * mem_cgroup_charge_swapin_page() on failure of add_to_swap_cache().
+	 */
+	add_to_swap_cache(page, entry,
+			  (gfp_mask|__GFP_NOFAIL) & GFP_RECLAIM_MASK, &shadow);

How about:

	mem_cgroup_charge_swapin_page()
	add_to_swap_cache()
	mem_cgroup_finish_swapin_page()

where finish_swapin_page() only uncharges the swap entry (on cgroup1)
once the swap->memory transition is complete?
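
As a rough illustration of that ordering, here is a minimal userspace C
model, not kernel code. Every name in it (memcg_model, page_model, the
model_* functions, and the limit and simulate_enomem parameters) is
hypothetical and only mirrors the three steps above: charge memory,
insert into the swapcache, then drop the swap entry charge once the
insertion has succeeded. The point is that a failed insertion only has
to undo the memory charge, so no must-not-fail step is needed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not the kernel's types. */
struct memcg_model {
	long mem_charged;	/* pages charged to memory */
	long swap_charged;	/* swap entries still charged (cgroup1 style) */
	long nr_swapcache;	/* per-memcg swapcache stat */
};

struct page_model {
	struct memcg_model *memcg;	/* NULL while the page is uncharged */
	bool in_swapcache;
};

/* Step 1: charge memory only; the swap entry charge is left alone. */
static int model_charge_swapin_page(struct page_model *page,
				    struct memcg_model *memcg, long limit)
{
	if (memcg->mem_charged >= limit)
		return -ENOMEM;
	memcg->mem_charged++;
	page->memcg = memcg;
	return 0;
}

/* Undo of step 1: trivial, because the swap charge was never touched. */
static void model_uncharge_page(struct page_model *page)
{
	assert(page->memcg != NULL);
	page->memcg->mem_charged--;
	page->memcg = NULL;
}

/* Step 2: insertion is allowed to fail (simulated here by a flag). */
static int model_add_to_swap_cache(struct page_model *page,
				   bool simulate_enomem)
{
	if (simulate_enomem)
		return -ENOMEM;
	page->in_swapcache = true;
	page->memcg->nr_swapcache++;
	return 0;
}

/* Step 3: commit. Only now is the swap entry charge dropped. */
static void model_finish_swapin_page(struct page_model *page)
{
	assert(page->memcg != NULL && page->in_swapcache);
	page->memcg->swap_charged--;
}

/* The whole transaction, with an ordinary error path. */
static int model_swapin(struct page_model *page, struct memcg_model *memcg,
			long limit, bool simulate_enomem)
{
	int err;

	err = model_charge_swapin_page(page, memcg, limit);
	if (err)
		return err;

	err = model_add_to_swap_cache(page, simulate_enomem);
	if (err) {
		model_uncharge_page(page);
		return err;
	}

	model_finish_swapin_page(page);
	return 0;
}
```

In this model a failed model_add_to_swap_cache() leaves all three
counters exactly as they were before the call, which is the property
the transaction sequence is meant to provide.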

Otherwise the patch looks good to me.
