Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL

From: NeilBrown <hidden>
Date: 2021-10-22 00:10:00
Also in: linux-ext4, linux-fsdevel, linux-mm, linux-nfs, linux-xfs, lkml

On Wed, 20 Oct 2021, Michal Hocko wrote:

On Tue 19-10-21 15:32:27, Neil Brown wrote:

[clip looks of discussion where we are largely in agreement - happy to
 see that!]

quoted

Presumably there is a real risk of deadlock if we just remove the
memory-reserves boosts of __GFP_NOFAIL.  Maybe it would be safe to
replace all current users of __GFP_NOFAIL with __GFP_NOFAIL|__GFP_HIGH,
and then remove the __GFP_HIGH where analysis suggests there is no risk
of deadlocks.

I would much rather not bind those together and go other way around. If
somebody can actually hit deadlocks (those are quite easy to spot as
they do not go away) then we can talk about how to deal with them.
Memory reserves can help only > < this much.

I recall maybe 10 years ago Linus saying that he preferred simplicity to
mathematical provability for handling memory deadlocks (or something
like that).  I lean towards provability myself, but I do see the other
perspective.
We have mempools and they can provide strong guarantees (though they are
often over-allocated I think).  But they can be a bit clumsy.  I believe
that DaveM is strong against anything like that in the network layer, so
we strongly depend on GFP_MEMALLOC functionality for swap-over-NFS.  I'm
sure it is important elsewhere too.

Of course __GFP_HIGH and __GFP_ATOMIC provide an intermediate priority
level - more likely to fail than __GFP_MEMALLOC.  I suspect they should
not be seen as avoiding deadlock, only as improving service.  So using
them when we cannot wait might make sense, but there are probably other
circumstances.

quoted

Why is __GFP_NOFAIL better?

Because the allocator can do something if it knows that the allocation
cannot fail. E.g. give such an allocation a higher priority over those
that are allowed to fail. This is not limited to memory reserves,
although this is the only measure that is implemented currently IIRC.
On the other hand if there is something interesting the caller can do
directly - e.g. do internal object management like mempool does - then
it is better to retry at that level.

It *can* do something, but I don't think it *should* do something - not
if that could have a negative impact on other threads.  Just because I
cannot fail, that doesn't mean someone else should fail to help me.
Maybe I should just wait longer.

quoted

 * Using this flag for costly allocations is _highly_ discouraged.

This is unhelpful.  Saying something is "discouraged" carries an implied
threat.  This is open source and threats need to be open.
Why is it discouraged? IF it is not forbidden, then it is clearly
permitted.  Maybe there are costs  - so a clear statement of those costs
would be appropriate.
Also, what is a suitable alternative?

Current code will trigger a WARN_ON, so it is effectively forbidden.
Maybe we should document that __GFP_NOFAIL is forbidden for orders above
1, and that vmalloc() should be used instead (thanks for proposing that
patch!).

I think we want to recommend kvmalloc as an alternative once vmalloc is
NOFAIL aware.

I will skip over some of the specific regarding SLAB and NOFS usage if
you do not mind and focus on points that have direct documentation
consequences. Also I do not feel qualified commenting on neither SLAB
nor FS internals.

[...]

quoted

There is a lot of stuff there.... the bits that are important to me are:

 - why is __GFP_NOFAIL preferred? It is a valuable convenience, but I
   don't see that it is necessary

I think it is preferred for one and a half reasons. It tells allocator
that this allocation cannot really fail and the caller doesn't have a
very good/clever retry policy (e.g. like mempools mentioned above). The
half reason would be for tracking purposes (git grep __GFP_NOFAIL) is
easier than trying to catch all sorts of while loops over allocation
which do not do anything really interesting.

I think the one reason is misguided, as described above.
I think the half reason is good, and that we should introduce
   memalloc_retry_wait()
and encourage developers to use that for any memalloc retry loop.
__GFP_NOFAIL would then be a convenience flag which causes the allocator
(slab or alloc_page or whatever) to call memalloc_retry_wait() and do
the loop internally.
What exactly memalloc_retry_wait() does (if anything) can be decided
separately and changed as needed.

quoted

 - is it reasonable to use __GFP_HIGH when looping if there is a risk of
   deadlock?

As I've said above. Memory reserves are a finite resource and as such
they cannot fundamentally solve deadlocks. They can help prioritize
though.

To be fair, they can solve 1 level of deadlock.  i.e. if you only need
to make one allocation to guarantee progress, then allocating from
reserves can help.  If you might need to make a second allocation
without freeing the first - then a single reserve pool can provide
guarantees (which is why we use mempool is layered block devices - md
over dm over loop of scsi).

quoted

 - Will __GFP_DIRECT_RECLAIM always result in a delay before failure? In
   that case it should be safe to loop around allocations using
   __GFP_DIRECT_RECLAIM without needing congestion_wait() (so it can
   just be removed.

This is a good question and I do not think we have that documented
anywhere. We do cond_resched() for sure. I do not think we guarantee a
sleeping point in general. Maybe we should, I am not really sure.

If we add memalloc_retry_wait(), it wouldn't matter.  We would only need
to ensure that memalloc_retry_wait() waited if page_alloc didn't.


I think we should:
 - introduce memalloc_retry_wait() and use it for all malloc retry loops
   including __GFP_NOFAIL
 - drop all the priority boosts added for __GFP_NOFAIL
 - drop __GFP_ATOMIC and change all the code that tests for __GFP_ATOMIC
   to instead test for __GFP_HIGH.  __GFP_ATOMIC is NEVER used without
   __GFP_HIGH.  This give a slight boost to several sites that use
   __GFP_HIGH explicitly.
 - choose a consistent order threshold for disallowing __GFP_NOFAIL
   (rmqueue uses "order > 1", __alloc_pages_slowpath uses
    "order > PAGE_ALLOC_COSTLY_ORDER"), test it once - early - and
   document kvmalloc as an alternative.  Code can also loop if there
   is an alternative strategy for freeing up memory.

Thanks,
NeilBrown


Thanks,
NeilBrown

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help