Re: Regression in mobility grouping?

From: Johannes Weiner <hannes@cmpxchg.org>
Date: 2016-09-28 15:39:44
Also in: lkml

Hi Vlastimil,

On Wed, Sep 28, 2016 at 11:00:15AM +0200, Vlastimil Babka wrote:

On 09/28/2016 03:41 AM, Johannes Weiner wrote:

quoted

Hi guys,

we noticed what looks like a regression in page mobility grouping
during an upgrade from 3.10 to 4.0. Identical machines, workloads, and
uptime, but /proc/pagetypeinfo on 3.10 looks like this:

Number of blocks type     Unmovable  Reclaimable      Movable      Reserve      Isolate 
Node 1, zone   Normal          815          433        31518            2            0 

and on 4.0 like this:

Number of blocks type     Unmovable  Reclaimable      Movable      Reserve          CMA      Isolate 
Node 1, zone   Normal         3880         3530        25356            2            0            0

It's worth to keep in mind that this doesn't reflect where the actual
unmovable pages reside. It might be that in 3.10 they are spread within
the movable pages. IIRC enabling page_owner (not sure if in 4.0, there
were some later fixes I think) can augment pagetypeinfo with at least
some statistics of polluted pageblocks.

Thanks, I'll look at the mixed block counts. I failed to make clear,
we saw that issue in the switch from 3.10 to 4.0, and I mentioned
those two kernels as last known good / first known bad. But later
kernels - we tried with 4.6 - look the same. This appears to be a
regression in (higher-order) allocation service quality somewhere
after 3.10 that persists into current kernels.

Does e.g. /proc/meminfo suggest how much unmovable/reclaimable memory
there should be allocated and if it would fill the respective
pageblocks, or if they are poorly utilized?

They are very poorly utilized. On a machine with 90% anon/cache pages
alone we saw 50% of the page blocks unmovable.

quoted

4.0 is either polluting pageblocks more aggressively at allocation, or
is not able to make pageblocks movable again when the reclaimable and
unmovable allocations are released. Invoking compaction manually
(/proc/sys/vm/compact_memory) is not bringing them back, either.

The problem we are debugging is that these machines have a very high
rate of order-3 allocations (fdtable during fork, network rx), and
after the upgrade allocstalls have increased dramatically. I'm not
entirely sure this is the same issue, since even order-0 allocations
are struggling, but the mobility grouping in itself looks problematic.

I'm still going through the changes relevant to mobility grouping in
that timeframe, but if this rings a bell for anyone, it would help. I
hate blaming random patches, but these caught my eye:

9c0415e mm: more aggressive page stealing for UNMOVABLE allocations
3a1086f mm: always steal split buddies in fallback allocations
99592d5 mm: when stealing freepages, also take pages created by splitting buddy page

Check also the changelogs for mentions of earlier commits, e.g. 99592d5
should be restoring behavior that changed in 3.12-3.13 and you are
upgrading from 3.10.

Good point.

quoted

The changelog states that by aggressively stealing split buddy pages
during a fallback allocation we avoid subsequent stealing. But since
there are generally more movable/reclaimable pages available, and so
less falling back and stealing freepages on behalf of movable, won't
this mean that we could expect exactly that result - growing numbers
of unmovable blocks, while rarely stealing them back in movable alloc
fallbacks? And the expansion of !MOVABLE blocks would over time make
compaction less and less effective too, seeing as it doesn't consider
anything !MOVABLE suitable migration targets?

Yeah this is an issue with compaction that was brought up recently and I
want to tackle next.

Agreed, it would be nice if compaction could reclaim unmovable and
reclaimable blocks whose polluting allocations have since been freed.

But there is a limit to how lazy mobility grouping can be and still
expect compaction to fix it up. If 50% of the page blocks are marked
unmovable, we don't pack incoming polluting allocations. When spread
out the right way, even just a few of those can have a devastating
impact on overall compactability.

So regardless of future compaction improvements, we need to get
anti-frag accuracy in the allocator closer to 3.10 levels again.

quoted

Attached are the full /proc/pagetypeinfo and /proc/buddyinfo from both
kernels on machines with similar uptimes and directly after invoking
compaction. As you can see, the buddy lists are much more fragmented
on 4.0, with unmovable/reclaimable allocations polluting more blocks.

Any thoughts on this would be greatly appreciated. I can test patches.

I guess testing revert of 9c0415e could give us some idea. Commit
3a1086f shouldn't result in pageblock marking differences and as I said
above, 99592d5 should be just restoring to what 3.10 did.

I can give this a shot, but note that this commit makes only unmovable
stealing more aggressive. We see reclaimable blocks up as well.

The workload is fairly variable, so it'll take about a day to smooth
out a meaningful average.

Thanks for your insights, Vlastimil!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help