Re: OOM detection regressions since 4.7

From: Michal Hocko <mhocko@kernel.org>
Date: 2016-08-24 07:05:09
Also in: lkml

On Wed 24-08-16 14:01:57, Joonsoo Kim wrote:

Looks like my mail client eat my reply so I resend.

On Tue, Aug 23, 2016 at 09:33:18AM +0200, Michal Hocko wrote:

quoted

On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
[...]

quoted

Hello, Michal.

I agree with partial revert but revert should be a different form.
Below change try to reuse should_compact_retry() version for
!CONFIG_COMPACTION but it turned out that it also causes regression in
Markus report [1].

I would argue that CONFIG_COMPACTION=n behaves so arbitrary for high
order workloads that calling any change in that behavior a regression
is little bit exaggerated. Disabling compaction should have a very
strong reason. I haven't heard any so far. I am even wondering whether
there is a legitimate reason for that these days.

quoted

Theoretical reason for this regression is that it would stop retry
even if there are enough lru pages. It only checks if freepage
excesses min watermark or not for retry decision. To prevent
pre-mature OOM killer, we need to keep allocation loop when there are
enough lru pages. So, logic should be something like that.

should_compact_retry()
{
        for_each_zone_zonelist_nodemask {
                available = zone_reclaimable_pages(zone);
                available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
                if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone),
                        ac_classzone_idx(ac), alloc_flags, available))
                        return true;

        }
}

I suggested it before and current situation looks like it is indeed
needed.

this just opens doors for an unbounded reclaim/threshing becacause
you can reclaim as much as you like and there is no guarantee of a
forward progress. The reason why !COMPACTION should_compact_retry only
checks for the min_wmark without the reclaimable bias is that this will
guarantee a retry if we are failing due to high order wmark check rather
than a lack of memory. This condition is guaranteed to converge and the
probability of the unbounded reclaim is much more reduced.

In case of a lack of memory with a lot of reclaimable lru pages, why 
do we stop reclaim/compaction?

With your partial reverting patch, allocation logic would be like as
following.

Assume following situation:
o a lot of reclaimable lru pages
o no order-2 freepage
o not enough order-0 freepage for min watermark
o order-2 allocation

1. order-2 allocation failed due to min watermark
2. go to reclaim/compaction
3. reclaim some pages (maybe SWAP_CLUSTER_MAX (32) pages) but still
min watermark isn't met for order-0
4. compaction is skipped due to not enough freepage
5. should_reclaim_retry() returns false because min watermark for
order-2 page isn't met
6. should_compact_retry() returns false because min watermark for
order-0 page isn't met
6. allocation is failed without any retry and OOM is invoked.

If the direct reclaim is not able to get us over min wmark for order-0
then we would be likely to hit the oom even for order-0 requests.

Is it what you want?

And, please elaborate more on how your logic guarantee to converge.
After order-0 freepage exceed min watermark, there is no way to stop
reclaim/threshing. Number of freepage just increase monotonically and
retry cannot be stopped until order-2 allocation succeed. Am I missing
something?

My statement was imprecise at best. You are right that there is no
guarantee to fullfil order-2 request. What I meant to say is that we
should converge when we are getting out of memory (aka even order-0
would have hard time to succeed). should_reclaim_retry does that by
the back off scaling of the reclaimable pages. should_compact_retry
would have to do the same thing which would effectively turn it into
should_reclaim_retry.

quoted

And, I still think that your OOM detection rework has some flaws.

1) It doesn't consider freeable objects that can be freed by shrink_slab().
There are many subsystems that cache many objects and they will be
freed by shrink_slab() interface. But, you don't account them when
making the OOM decision.

I fully rely on the reclaim and compaction feedback. And that is the
place where we should strive for improvements. So if we are growing way
too many slab objects we should take care about that in the slab reclaim
which is tightly coupled with the LRU reclaim rather than up the layer
in the page allocator.

No. slab shrink logic which is tightly coupled with the LRU reclaim
totally makes sense.

Once the number of slab object is much larger than LRU pages (what we
have seen in some oom reports) then the way how they are coupled just
stops making a sense because the current approach no longer scales.  We
might not have cared before because we used to retry blindly.  At least
that is my understanding.

I am sorry to skip large parts of your email but I believe those things
have been discussed and we would just repeat here. I full understand
there are some disagreements between our views but I still maintain that
as long as we can handle not-so-crazy workloads I prefer determinism
over a blind retrying. It is to be expected that there will be some
regressions. It would be just too ideal to change one heuristic by
another and expect nobody will notice. But as long as we are able to fix
those issues without adding hacks on top of hacks then I think it is
worth pursuing this path. And so far the compaction changes which helped
to cover recent regressions are not hacks but rather long term way how
to change it from best effort to reliability behavior. As I've said
before, if this proves to be insufficient then I will definitely not
insist on the current approach and replace the compaction feedback by
something else. I do not have much idea by what because, yet again, this
is a heuristic and there is clearly not right thing to do (tm).
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help