Re: [PATCH 0/2 v2] remove PF_MEMALLOC_NORECLAIM

From: Michal Hocko <mhocko@suse.com>
Date: 2024-09-05 11:26:52
Also in: linux-bcachefs, linux-fsdevel, linux-mm, lkml

On Wed 04-09-24 14:03:13, Kent Overstreet wrote:

On Wed, Sep 04, 2024 at 06:46:00PM GMT, Michal Hocko wrote:

quoted

On Wed 04-09-24 12:05:56, Kent Overstreet wrote:

quoted

On Wed, Sep 04, 2024 at 09:14:29AM GMT, Michal Hocko wrote:

quoted

On Tue 03-09-24 19:53:41, Kent Overstreet wrote:
[...]

quoted

However, if we agreed that GFP_NOFAIL meant "only fail if it is not
possible to satisfy this allocation" (and I have been arguing that that
is the only sane meaning) - then that could lead to a lot of error paths
getting simpler.

Because there are a lot of places where there's essentially no good
reason to bubble up an -ENOMEM to userspace; if we're actually out of
memory the current allocation is just one out of many and not
particularly special, better to let the oom killer handle it...

This is exactly GFP_KERNEL semantic for low order allocations or
kvmalloc for that matter. They simply never fail unless couple of corner
cases - e.g. the allocating task is an oom victim and all of the oom
memory reserves have been consumed. This is where we call "not possible
to allocate".

*nod*

Which does beg the question of why GFP_NOFAIL exists.

Exactly for the reason that even rare failure is not acceptable and
there is no way to handle it other than keep retrying. Typical code was 
	while (!(ptr = kmalloc()))
		;

But is it _rare_ failure, or _no_ failure?

You seem to be saying (and I just reviewed the code, it looks like
you're right) that there is essentially no difference in behaviour
between GFP_KERNEL and GFP_NOFAIL.

The fundamental difference is that (appart from unsupported allocation
mode/size) the latter never returns NULL and you can rely on that fact.
Our docummentation says:
 * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures. The allocation could block
 * indefinitely but will never return with failure. Testing for
 * failure is pointless.

So given that - why the wart?

I think we might be able to chalk it up to history; I'd have to go
spunking through the history (or ask Dave or Ted, maybe they'll chime
in), but I suspect GFP_KERNEL didn't provide such strong guarantees when
the allocation loops & GFP_NOFAIL were introduced.

Sure, go ahead. If you manage to remove all existing users of
__GFP_NOFAIL (without replacing them with retry loops in the caller)
then I would be more than happy to remove __GFP_NOFAIL in the allocator. 

[...]

quoted

But the point is there are some which _do_ need this. We have discussed
that in other email thread where you have heard why XFS and EXT4 does
that and why they are not going to change that model.

No, I agree that they need the strong guarantees.

But if there's an actual bug, returning an error is better than killing
the task. Killing the task is really bad; these allocations are deep in
contexts where locks and refcounts are held, and the system will just
grind to a halt.

Not sure what you mean by these allocations but I am not aware that any
of the existing user would be really buggy. Also as I've said elsewhere,
there is simply no good way to handle a buggy caller. Killing the buggy
context has some downsides, returning NULL has others. I have argued
that the former has better predictable behavior than potentially silent
failure. We can clearly disagree on this but I also do not see why that
is relevant to the original discussion because my argument against
PF_MEMALLOC_NORECLAIM was focused on correct GPF_NOFAIL nested context
that would get an unexpected failure mode. No matter what kind of
failure mode that is it would be unexpected for those users.

quoted

But as a matter of policy going forward, yes we should be saying that
even GFP_NOFAIL allocations should be checking for -ENOMEM.

I argue that such NOFAIL semantic has no well defined semantic and legit
users are forced to do
	while (!(ptr = kmalloc(GFP_NOFAIL))) ;
or
	BUG_ON(!(ptr = kmalloc(GFP_NOFAIL)));

So it has no real reason to exist.

I'm arguing that it does, provided when it returns NULL is defined to
be:
 - invalid allocation context
 - a size that is so big that it will never be possible to satisfy.

Those are not really important situations because you are arguing about
a buggy code that needs fixing. As said above we can argue how to deal
with those users to get a predictable behavior but as the matter of
fact, correct users can expect never seeing the failure so handling
failure might be a) impossible and b) unfeasible (i.e. you are adding a
dead code that is never tested).

[...]

For large allocations in bcachefs: in journal replay we read all the
keys in the journal, and then we create a big flat array with references
to all of those keys to sort and dedup them.

We haven't hit the INT_MAX size limit there yet, but filesystem sizes
being what they are, we will soon. I've heard of users with 150 TB
filesystems, and once the fsck scalability issues are sorted we'll be
aiming for petabytes. Dirty keys in the journal scales more with system
memory, but I'm leasing machines right now with a quarter terabyte of
ram.

I thought you were arguing about bcachefs handling failure mode so
presumably you do not need to use __GFP_NOFAIL for those.

I am sorry but I am getting lost in these arguments.
-- 
Michal Hocko
SUSE Labs

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help