Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.

From: Michal Hocko <mhocko@suse.com>
Date: 2021-09-15 12:06:54
Also in: linux-ext4, linux-fsdevel, linux-nfs, linux-xfs, lkml

On Wed 15-09-21 07:48:11, Neil Brown wrote:

On Wed, 15 Sep 2021, Mel Gorman wrote:

quoted

On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:

quoted

Indefinite loops waiting for memory allocation are discouraged by
documentation in gfp.h which says the use of __GFP_NOFAIL that it

 is definitely preferable to use the flag rather than opencode endless
 loop around allocator.

Such loops that use congestion_wait() are particularly unwise as
congestion_wait() is indistinguishable from
schedule_timeout_uninterruptible() in practice - and should be
deprecated.

So this patch changes the two loops in ext4_ext_truncate() to use
__GFP_NOFAIL instead of looping.

As the allocation is multiple layers deeper in the call stack, this
requires passing the EXT4_EX_NOFAIL flag down and handling it in various
places.

Of particular interest is the ext4_journal_start family of calls which
can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'.  This could be seen
as a blurring of types.  However 'type' is 8 bits, and EXT4_EX_NOFAIL is
a high bit, so it is safe in practice.

jbd2__journal_start() is enhanced so that the gfp_t flags passed are
used for *all* allocations.

Signed-off-by: NeilBrown <redacted>

I'm not a fan. GFP_NOFAIL allows access to emergency reserves increasing
the risk of a livelock if memory is completely depleted where as some
callers can afford to wait.

Maybe we should wind back and focus on the documentation patches.
As quoted above, mm.h says:

quoted

 is definitely preferable to use the flag rather than opencode endless
 loop around allocator.

but you seem to be saying that is wrong.  I'd certainly like to get the
documentation right before changing any code.

Why does __GFP_NOFAIL access the reserves? Why not require that the
relevant "Try harder" flag (__GFP_ATOMIC or __GFP_MEMALLOC) be included
with __GFP_NOFAIL if that is justified?

Does 5020e285856c ("mm, oom: give __GFP_NOFAIL allocations access to
memory reserves") help?

I would be worried to make the semantic even more complex than already
is. Access to memory reserves is an implementation detail that the page
allocator does currently. Callers shouldn't really be worried about
that. I do not ever remember any actual NOFAIL triggered memory
exhaustion. I have seen that to happen for unrestricted access to memory
reserves by OOM victim though. Hence cd04ae1e2dc8 ("mm, oom: do not rely
on TIF_MEMDIE for memory reserves access"). We can consider something
similar if NOFAIL allocation really tend to show a similar problem. We
do not want callers to care about OOM sitauations for this kind of
requests.

__GFP_NOFAIL | __GFP_HIGH is certainly something that is a valid usage
but I wouldn't base OOM behavior based on that.

There are over 100 __GFP_NOFAIL allocation sites.  I don't feel like
reviewing them all and seeing if any really need a try-harder flag.
Can we rename __GFP_NOFAIL to __GFP_NEVERFAIL and then
#define __GFP_NOFAIL (__GFP_NEVERFAIL | __GFP_ATOMIC)
and encourage the use of __GFP_NEVERFAIL in future?

Doesn't this add even more complexity?

When __GFP_NOFAIL loops, it calls congestion_wait() internally.  That
certainly needs to be fixed and the ideas you present below are
certainly worth considering when trying to understand how to address
that.  I'd rather fix it once there in page_alloc.c rather then export a
waiting API like congestion_wait().  That would provide more
flexibility.  e.g.  a newly freed page could be handed directly back to
the waiter.

Completely agreed here. We really do not want people to open code NOFAIL
unless they can do something really subsystem specific that would help
to make a forward progress.

-- 
Michal Hocko
SUSE Labs

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help