Re: [PATCH 0/2 v2] remove PF_MEMALLOC_NORECLAIM
From: Michal Hocko <mhocko@suse.com>
Date: 2024-09-03 07:06:19
Also in:
linux-bcachefs, linux-fsdevel, linux-mm, lkml
On Mon 02-09-24 18:32:33, Kent Overstreet wrote:
On Mon, Sep 02, 2024 at 02:52:52PM GMT, Andrew Morton wrote:
On Mon, 2 Sep 2024 05:53:59 -0400 Kent Overstreet [off-list ref] wrote:
On Mon, Sep 02, 2024 at 11:51:48AM GMT, Michal Hocko wrote:
The previous version has been posted in [1]. Based on the review feedback I have sent v2 of the patches in the same thread, but it seems that the review has mostly settled on these patches. There is still an open discussion on whether having a NORECLAIM allocator semantic (compared to atomic) is worthwhile, and on how to deal with broken GFP_NOFAIL users, but those are not really relevant to this particular patchset as it 1) doesn't aim to implement either of the two and 2) aims at stopping the spread of PF_MEMALLOC_NORECLAIM use while the flag doesn't have a properly defined semantic, now that it is not widely used, rather than later when it would be much harder to fix.

I have collected Reviewed-bys and am reposting here. These patches touch bcachefs, VFS and core MM, so I am not sure which tree to merge this through, but I guess going through Andrew makes the most sense.

Changes since v1:
- compile fixes
- rather than dropping PF_MEMALLOC_NORECLAIM alone, reverted eab0af905bfc ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN") as suggested by Matthew.

To reiterate:

It would be helpful to summarize your concerns. What runtime impact do you expect this change will have upon bcachefs?

For bcachefs: I try really hard to minimize tail latency and make performance robust in extreme scenarios - thrashing. A large part of that is that btree locks must be held for no longer than necessary. We definitely don't want to recurse into other parts of the kernel, taking other locks (i.e. in memory reclaim) while holding btree locks; that's a great way to stack up (and potentially multiply) latencies.
OK, these two patches do not fail to do that. The only existing user is turned into GFP_NOWAIT so the final code works the same way. Right?
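The equivalence claimed above can be pictured with a small userspace model. This is only a sketch under stated assumptions: every `MODEL_*` and `model_*` name is hypothetical, the bit values are made up, and the real GFP flags carry more bits than shown here. It only illustrates the idea that a per-task "no reclaim" flag and an explicit non-sleeping gfp mask both end up forbidding direct reclaim.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; NOT the kernel's definitions or values. */
#define MODEL_GFP_DIRECT_RECLAIM 0x1u /* caller may enter direct reclaim */
#define MODEL_GFP_KSWAPD_RECLAIM 0x2u /* caller may wake background reclaim */
#define MODEL_GFP_KERNEL (MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_KSWAPD_RECLAIM)
#define MODEL_GFP_NOWAIT MODEL_GFP_KSWAPD_RECLAIM
#define MODEL_PF_NORECLAIM 0x1u       /* per-task scope flag (assumed name) */

/* Stands in for the current task's flags word. */
static unsigned int model_task_flags;

/* Apply the task-scope restriction to a caller-supplied mask. */
static unsigned int model_effective_gfp(unsigned int gfp)
{
    if (model_task_flags & MODEL_PF_NORECLAIM)
        gfp &= ~MODEL_GFP_DIRECT_RECLAIM;
    return gfp;
}

/* True if an allocation with this mask could recurse into direct reclaim. */
static bool model_may_direct_reclaim(unsigned int gfp)
{
    return (model_effective_gfp(gfp) & MODEL_GFP_DIRECT_RECLAIM) != 0;
}

/* Walk through the three cases discussed in the thread. */
static void model_self_check(void)
{
    /* Plain sleeping allocation: direct reclaim is allowed. */
    model_task_flags = 0;
    assert(model_may_direct_reclaim(MODEL_GFP_KERNEL));

    /* Task-scope flag set: the same mask no longer allows it. */
    model_task_flags |= MODEL_PF_NORECLAIM;
    assert(!model_may_direct_reclaim(MODEL_GFP_KERNEL));

    /* Flag cleared, explicit non-sleeping mask: same outcome. */
    model_task_flags &= ~MODEL_PF_NORECLAIM;
    assert(!model_may_direct_reclaim(MODEL_GFP_NOWAIT));
}
```

The model deliberately leaves out the case raised in the reply below: a callee that hardcodes its own mask ignores whatever the caller passes, which is where a task-scope flag and an explicit argument stop being interchangeable.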
But gfp flags don't work with vmalloc allocations (and that's unlikely to change), and we require vmalloc fallbacks for e.g. btree node allocation. That's the big reason we want PF_MEMALLOC_NORECLAIM.
Have you even tried to reach out to vmalloc maintainers and asked for GFP_NOWAIT support for vmalloc? Because I do not remember that. Sure, kernel page tables have a hardcoded GFP_KERNEL context, which slightly complicates that, but that doesn't really mean the only potential solution is to use a per-task flag to override that. Just off the top of my head, we can consider pre-allocating virtual address space for non-sleeping allocations. Maybe there are other options that only people deeply familiar with the vmalloc internals can see.

This requires discussion, not pushing a very particular solution through.
-- 
Michal Hocko
SUSE Labs