Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL
From: NeilBrown <hidden>
Date: 2021-10-22 00:10:00
Also in:
linux-ext4, linux-fsdevel, linux-mm, linux-nfs, linux-xfs, lkml
On Wed, 20 Oct 2021, Michal Hocko wrote:
On Tue 19-10-21 15:32:27, Neil Brown wrote:
[clip looks of discussion where we are largely in agreement - happy to see that!]
quoted
Presumably there is a real risk of deadlock if we just remove the memory-reserves boosts of __GFP_NOFAIL. Maybe it would be safe to replace all current users of __GFP_NOFAIL with __GFP_NOFAIL|__GFP_HIGH, and then remove the __GFP_HIGH where analysis suggests there is no risk of deadlocks.I would much rather not bind those together and go other way around. If somebody can actually hit deadlocks (those are quite easy to spot as they do not go away) then we can talk about how to deal with them. Memory reserves can help only > < this much.
I recall maybe 10 years ago Linus saying that he preferred simplicity to mathematical provability for handling memory deadlocks (or something like that). I lean towards provability myself, but I do see the other perspective. We have mempools and they can provide strong guarantees (though they are often over-allocated I think). But they can be a bit clumsy. I believe that DaveM is strong against anything like that in the network layer, so we strongly depend on GFP_MEMALLOC functionality for swap-over-NFS. I'm sure it is important elsewhere too. Of course __GFP_HIGH and __GFP_ATOMIC provide an intermediate priority level - more likely to fail than __GFP_MEMALLOC. I suspect they should not be seen as avoiding deadlock, only as improving service. So using them when we cannot wait might make sense, but there are probably other circumstances.
quoted
Why is __GFP_NOFAIL better?Because the allocator can do something if it knows that the allocation cannot fail. E.g. give such an allocation a higher priority over those that are allowed to fail. This is not limited to memory reserves, although this is the only measure that is implemented currently IIRC. On the other hand if there is something interesting the caller can do directly - e.g. do internal object management like mempool does - then it is better to retry at that level.
It *can* do something, but I don't think it *should* do something - not if that could have a negative impact on other threads. Just because I cannot fail, that doesn't mean someone else should fail to help me. Maybe I should just wait longer.
quoted
quoted
* Using this flag for costly allocations is _highly_ discouraged.This is unhelpful. Saying something is "discouraged" carries an implied threat. This is open source and threats need to be open. Why is it discouraged? IF it is not forbidden, then it is clearly permitted. Maybe there are costs - so a clear statement of those costs would be appropriate. Also, what is a suitable alternative? Current code will trigger a WARN_ON, so it is effectively forbidden. Maybe we should document that __GFP_NOFAIL is forbidden for orders above 1, and that vmalloc() should be used instead (thanks for proposing that patch!).I think we want to recommend kvmalloc as an alternative once vmalloc is NOFAIL aware. I will skip over some of the specific regarding SLAB and NOFS usage if you do not mind and focus on points that have direct documentation consequences. Also I do not feel qualified commenting on neither SLAB nor FS internals. [...]quoted
There is a lot of stuff there.... the bits that are important to me are: - why is __GFP_NOFAIL preferred? It is a valuable convenience, but I don't see that it is necessaryI think it is preferred for one and a half reasons. It tells allocator that this allocation cannot really fail and the caller doesn't have a very good/clever retry policy (e.g. like mempools mentioned above). The half reason would be for tracking purposes (git grep __GFP_NOFAIL) is easier than trying to catch all sorts of while loops over allocation which do not do anything really interesting.
I think the one reason is misguided, as described above. I think the half reason is good, and that we should introduce memalloc_retry_wait() and encourage developers to use that for any memalloc retry loop. __GFP_NOFAIL would then be a convenience flag which causes the allocator (slab or alloc_page or whatever) to call memalloc_retry_wait() and do the loop internally. What exactly memalloc_retry_wait() does (if anything) can be decided separately and changed as needed.
quoted
- is it reasonable to use __GFP_HIGH when looping if there is a risk of deadlock?As I've said above. Memory reserves are a finite resource and as such they cannot fundamentally solve deadlocks. They can help prioritize though.
To be fair, they can solve 1 level of deadlock. i.e. if you only need to make one allocation to guarantee progress, then allocating from reserves can help. If you might need to make a second allocation without freeing the first - then a single reserve pool can provide guarantees (which is why we use mempool is layered block devices - md over dm over loop of scsi).
quoted
- Will __GFP_DIRECT_RECLAIM always result in a delay before failure? In that case it should be safe to loop around allocations using __GFP_DIRECT_RECLAIM without needing congestion_wait() (so it can just be removed.This is a good question and I do not think we have that documented anywhere. We do cond_resched() for sure. I do not think we guarantee a sleeping point in general. Maybe we should, I am not really sure.
If we add memalloc_retry_wait(), it wouldn't matter. We would only need
to ensure that memalloc_retry_wait() waited if page_alloc didn't.
I think we should:
- introduce memalloc_retry_wait() and use it for all malloc retry loops
including __GFP_NOFAIL
- drop all the priority boosts added for __GFP_NOFAIL
- drop __GFP_ATOMIC and change all the code that tests for __GFP_ATOMIC
to instead test for __GFP_HIGH. __GFP_ATOMIC is NEVER used without
__GFP_HIGH. This give a slight boost to several sites that use
__GFP_HIGH explicitly.
- choose a consistent order threshold for disallowing __GFP_NOFAIL
(rmqueue uses "order > 1", __alloc_pages_slowpath uses
"order > PAGE_ALLOC_COSTLY_ORDER"), test it once - early - and
document kvmalloc as an alternative. Code can also loop if there
is an alternative strategy for freeing up memory.
Thanks,
NeilBrown
Thanks,
NeilBrown