Re: [RFC PATCH V1] mm: Disable demotion from proactive reclaim

From: Huang, Ying <hidden>
Date: 2022-11-30 03:56:12
Also in: cgroups, lkml

Johannes Weiner [off-list ref] writes:

Hello Ying,

On Thu, Nov 24, 2022 at 01:51:20PM +0800, Huang, Ying wrote:

Johannes Weiner [off-list ref] writes:

The fallback to reclaim actually strikes me as wrong.

Think of reclaim as 'demoting' the pages to the storage tier. If we
have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
CXL and from CXL to storage. If we reclaim a page from RAM, it means
we 'demote' it directly from RAM to storage, bypassing potentially a
huge amount of pages colder than it in CXL. That doesn't seem right.

If demotion fails, IMO it shouldn't satisfy the reclaim request by
breaking the layering. Rather it should deflect that pressure to the
lower layers to make room. This makes sure we maintain an aging
pipeline that honors the memory tier hierarchy.
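The layering rule described above can be sketched as a toy userspace model. This is only an illustration under stated assumptions: struct tier and demote_one() are hypothetical names, not kernel code, and a "page" is reduced to a counter.

```c
#include <stddef.h>

/* Hypothetical model of one memory tier: tiers[0] is RAM, the last entry is
 * the lowest memory tier (e.g. CXL). Storage sits below the last tier. */
struct tier {
	unsigned long used;
	unsigned long capacity;
};

/*
 * Demote one page out of tiers[from] without breaking the layering.
 * If the next tier is full, pressure is deflected downward first, so the
 * page that leaves memory always comes from the lowest tier.
 * Returns the number of pages written out to storage (0 or 1).
 */
static unsigned long demote_one(struct tier *tiers, size_t ntiers, size_t from)
{
	unsigned long evicted = 0;

	if (tiers[from].used == 0)
		return 0;

	/* Lowest tier: "demoting" here means reclaiming to storage. */
	if (from + 1 == ntiers) {
		tiers[from].used--;
		return 1;
	}

	/* Next tier is full: make room there before moving our page. */
	if (tiers[from + 1].used >= tiers[from + 1].capacity)
		evicted = demote_one(tiers, ntiers, from + 1);

	/* Still no room below: give up rather than bypass the lower tier. */
	if (tiers[from + 1].used >= tiers[from + 1].capacity)
		return evicted;

	tiers[from].used--;
	tiers[from + 1].used++;
	return evicted;
}
```

With RAM and CXL both full, demoting one page from RAM first pushes one CXL page to storage and then moves the RAM page into CXL; the RAM page never goes to storage directly.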

Yes.  I think that we should avoid falling back to reclaim as much as
possible too.  Now, when we allocate memory for demotion
(alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
kswapd reclaim on the lower tier node to free some memory and avoid
falling back to reclaim on the current (higher tier) node.  This may not
be good enough; for example, the following patch from Hasan may help by
waking up kswapd earlier.

https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/

Do you know what the next step is for this patch?
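To make the effect of __GFP_KSWAPD_RECLAIM on the demotion allocation concrete, here is a small userspace model.  It assumes the mask is built by clearing both reclaim bits and then re-adding the kswapd one; the MODEL_* names and their bit values are made up for illustration and are not the kernel's definitions.

```c
#include <stdbool.h>

/* Illustrative bit values only; the real flags live in the kernel's gfp headers. */
#define MODEL_GFP_DIRECT_RECLAIM 0x1u
#define MODEL_GFP_KSWAPD_RECLAIM 0x2u
#define MODEL_GFP_RECLAIM (MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_KSWAPD_RECLAIM)
#define MODEL_GFP_OTHER 0x4u /* stands in for every non-reclaim bit */

/* Assumed construction: never reclaim directly, but do wake kswapd. */
static unsigned int demotion_gfp_mask(unsigned int base)
{
	return (base & ~MODEL_GFP_RECLAIM) | MODEL_GFP_KSWAPD_RECLAIM;
}

/* True if an allocation with this mask may wake kswapd on the target node. */
static bool wakes_kswapd(unsigned int gfp)
{
	return (gfp & MODEL_GFP_KSWAPD_RECLAIM) != 0;
}

/* True if an allocation with this mask may block in direct reclaim. */
static bool may_direct_reclaim(unsigned int gfp)
{
	return (gfp & MODEL_GFP_DIRECT_RECLAIM) != 0;
}
```

Under this assumption, a failed demotion allocation returns quickly, while kswapd on the lower tier node is asked to free memory in the background.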

Should we do even more?

From another point of view, I still think that we can use falling back
to reclaim as the last resort to avoid OOM in some special situations,
for example, when most pages in the lowest tier node are mlock()ed or
too hot to be reclaimed.

If they're hotter than reclaim candidates on the toptier, shouldn't
they get promoted instead and make room that way? We may have to tweak
the watermark logic a bit to facilitate that (allow promotions where
regular allocations already fail?). But this sort of resorting would
be preferable to age inversions.

Now it's legal to enable demotion and disable promotion.  Yes, this is
a wrong configuration in general.  But should we trigger OOM for these
users?

And now promotion only works for the default NUMA policy (and MPOL_BIND
to both promotion source and target nodes with MPOL_F_NUMA_BALANCING).
If we use some other NUMA policy, the pages cannot be promoted either.

The mlock scenario sounds possible. In that case, it wouldn't be an
aging inversion, since there is nothing colder on the CXL node.

Maybe a bypass check should explicitly consult the demotion target
watermarks against its evictable pages (similar to the file_is_tiny
check in prepare_scan_count)?

Yes.  This sounds doable.
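A rough sketch of what such a bypass check could look like, written as a userspace model.  struct node_snapshot and demotion_target_cannot_make_room() are hypothetical names; the comparison only mirrors the shape of the file_is_tiny test (evictable plus free pages against the high watermark) and is an assumption, not a proposed implementation.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the demotion target node; not a kernel structure. */
struct node_snapshot {
	unsigned long free_pages;
	unsigned long evictable_pages;  /* LRU pages that are not mlocked */
	unsigned long high_wmark_pages; /* sum of the node's high watermarks */
};

/*
 * True if the target could not reach its high watermark even after
 * reclaiming every evictable page.  In that case nothing colder is left
 * below, so reclaiming from the upper tier is not an aging inversion.
 */
static bool demotion_target_cannot_make_room(const struct node_snapshot *target)
{
	return target->free_pages + target->evictable_pages <=
	       target->high_wmark_pages;
}
```

In the mlock scenario the target has almost no evictable pages, so the check passes and the bypass is allowed; a target with plenty of evictable pages fails the check and pressure stays on the lower tier.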

Because in any other scenario, if there is a bug in the promo/demo
coordination, I think we'd rather have the OOM than deal with age
inversions causing intermittent performance issues that are incredibly
hard to track down.

Previously, I thought that people would always prefer a performance
regression to OOM.  Apparently, I was wrong.

Anyway, I think that we first need to reduce the possibility of OOM or
of falling back to reclaim as much as possible.  Do you agree?

One possibility: can we fall back to reclaim only if sc->priority is
small enough (even 0)?
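As a sketch of the idea, in a userspace model: struct scan_control_model and may_fall_back_to_reclaim() are hypothetical, and only the priority field and DEF_PRIORITY correspond to real kernel names.  The threshold is a parameter because the right value is an open question.

```c
#include <stdbool.h>

#define DEF_PRIORITY 12 /* reclaim starts here and counts down to 0 under pressure */

/* Hypothetical stand-in for the one scan_control field used here. */
struct scan_control_model {
	int priority;
};

/*
 * Allow reclaim from the current (higher tier) node only when demotion
 * has failed and the scan priority has dropped to the threshold or
 * below, that is, only as a last resort before OOM.
 */
static bool may_fall_back_to_reclaim(const struct scan_control_model *sc,
				     bool demotion_failed,
				     int priority_threshold)
{
	return demotion_failed && sc->priority <= priority_threshold;
}
```

With a threshold of 0, a failed demotion at DEF_PRIORITY does not fall back; only the final pass at priority 0 does.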

Best Regards,
Huang, Ying
