Re: [PATCH -mm v15 00/13] kmemcg shrinkers

From: Vladimir Davydov <hidden>
Date: 2014-02-13 17:33:38
Also in: lkml

On 02/13/2014 02:01 AM, Johannes Weiner wrote:

On Wed, Feb 12, 2014 at 10:05:43PM +0400, Vladimir Davydov wrote:

quoted

On 02/12/2014 12:19 AM, Johannes Weiner wrote:

quoted

On Tue, Feb 11, 2014 at 07:15:26PM +0400, Vladimir Davydov wrote:

quoted

Hi Michal, Johannes, David,

Could you please take a look at this if you have time? Without your
review, it'll never get committed.

There is simply no review bandwidth for new features as long as we are
fixing fundamental bugs in memcg.

quoted

On 02/05/2014 10:39 PM, Vladimir Davydov wrote:

quoted

Hi,

This is the 15th iteration of Glauber Costa's patch-set implementing slab
shrinking on memcg pressure. The main idea is to make the list_lru structure
used by most FS shrinkers per-memcg. When adding or removing an element from a
list_lru, we use the page information to figure out which memcg it belongs to
and relay it to the appropriate list. This allows scanning kmem objects
accounted to different memcgs independently.

Please note that this patch-set implements slab shrinking only when we hit the
user memory limit so that kmem allocations will still fail if we are below the
user memory limit, but close to the kmem limit. I am going to fix this in a
separate patch-set, but currently it is only worthwhile setting the kmem limit
to be greater than the user mem limit just to enable per-memcg slab accounting
and reclaim.

The patch-set is based on top of v3.14-rc1-mmots-2014-02-04-16-48 (there are
some vmscan cleanups that I need committed there) and organized as follows:
 - patches 1-4 introduce some minor changes to memcg needed for this set;
 - patches 5-7 prepare fs for per-memcg list_lru;
 - patch 8 implement kmemcg reclaim core;
 - patch 9 make list_lru per-memcg and patch 10 marks sb shrinker memcg-aware;
 - patch 10 is trivial - it issues shrinkers on memcg destruction;
 - patches 12 and 13 introduce shrinking of dead kmem caches to facilitate
   memcg destruction.

In the context of the ongoing discussions about charge reparenting I
was curious how you deal with charges becoming unreclaimable after a
memcg has been offlined.

Patch #11 drops all charged objects at offlining by just invoking
shrink_slab() in a loop until "only a few" (10) objects are remaining.
How long is this going to take?  And why is it okay to destroy these
caches when somebody else might still be using them?

IMHO, on container destruction we have to drop as many objects accounted
to this container as we can, because otherwise any container will be
able to get access to any number of unaccounted objects by fetching them
and then rebooting.

They're accounted to and subject to the limit of the parent.  I don't
see how this is different than page cache.

quoted

That still leaves you with the free objects that slab caches retain
for allocation efficiency, so now you put all dead memcgs in the
system on a global list, and on a vmpressure event on root_mem_cgroup
you walk the global list and drain the freelist of all remaining
caches.

This is a lot of complexity and scalability problems for less than
desirable behavior.

Please think about how we can properly reparent kmemcg charges during
memcg teardown.  That would simplify your code immensely and help
clean up this unholy mess of css pinning.

Slab caches are already collected in the memcg and on destruction
could be reassigned to the parent.  Kmemcg uncharge from slab freeing
would have to be updated to use the memcg from the cache, not from the
individual page, but I don't see why this wouldn't work right now.

I don't think I understand what you mean by reassigning slab caches to
the parent.

If you mean moving all pages (slabs) from the cache of the memcg being
destroyed to the corresponding root cache (or the parent memcg's cache)
and then destroying the memcg's cache, I don't think this is feasible,
because slub free's fast path is lockless, so AFAIU we can't remove a
partial slab from a cache w/o risking to race with kmem_cache_free.

If you mean clearing all pointers from the memcg's cache to the memcg
(changing them to the parent or root memcg), then AFAIU this won't solve
the problem with "dangling" caches - we will still have to shrink them
on vmpressure. So although this would allow us to put the reference to
the memcg from kmem caches on memcg's death, it wouldn't simplify the
code at all, in fact, it would even make it more complicated, because we
would have to handle various corner cases like reparenting vs
list_lru_{add,remove}.

I think we have different concepts of what's complicated.  There is an
existing model of what to do with left-over cache memory when a cgroup
is destroyed, which is reparenting.  The rough steps will be the same,
the object lifetime will be the same, the css refcounting will be the
same, the user-visible behavior will be the same.  Any complexity from
charge vs. reparent races will be contained to a few lines of code.

Weird refcounting tricks during offline, trashing kmem caches instead
of moving them to the parent like other memory, a global list of dead
memcgs and sudden freelist thrashing on a vmpressure event, that's what
adds complexity and what makes this code unpredictable, fragile, and
insanely hard to work with.  It's not acceptable.

By reparenting I meant reassigning the memcg cache parameter pointer
from the slab cache such that it points to the parent.  This should be
an atomic operation.  All css lookups already require RCU (I think slab
does not follow this yet because we guarantee that css reference, but
it should be changed).  So switch the cache param pointer, insert an
RCU graceperiod to wait for all the ongoing charges and uncharges until
nobody sees the memcg anymore, then safely reparent all the remaining
memcg objects to the parent.  Maybe individually, maybe we can just
splice the lists to the parent's list_lru lists.

But what should we do with objects that do not reside on any list_lru?
How can we reparent them?

Let's forget about per-memcg list_lru for a while, because
implementation of kmem reparenting requires rework of the existing code
handling per-memcg kmem caches. I would really appreciate if you could
share your vision on how we should get rid of per-memcg caches on memcg
destruction. Currently they are left hanging around until all memcg's
objects are freed. Should we try to move individual slab pages to the
parent's cache and destroy the memcg's cache? Or perhaps you mean that
all accounted kmem objects should be tracked in memcg by some means
(list_lru in case dcache/icache)?

Please excuse me if I annoy you, but before trying to change anything
I'd like to ensure I understand you right.

Thank you.

I'm not sure I understand how the dangling cache problem pertains to
this, isn't this an entirely separate issue?

quoted

Charged thread stack pages could be reassigned when the task itself is
migrated out of a cgroup.

Thread info pages are only a part of the problem. If a process kmalloc's
an object of size >= KMALLOC_MAX_CACHE_SIZE, it will be given a compound
page accounted to kmemcg, and we won't be able to find this page given
the memcg it is accounted to (except for walking the whole page range).
Thus we will have to organize those pages in per-memcg lists, won't we?
Again, even more complexity.

Why do we track them in the first place?  We don't track any random
page allocation, so we shouldn't track kmalloc() that falls back to the
page allocator.  In fact we shouldn't track any random slab allocation.

The types of allocations we primarily want to track are the ones that
directly scale with user behavior.  This is a finite set, which on a
global level is covered mostly by ulimits.  After all, an unprivileged
user interfering with other users is not a new problem and existed long
before memcg.

It was a mistake to provide __GFP_KMEMCG and allow charging any random
allocation, without giving memcg the means to actually manage that
memory.  I don't see that such flexibility even needed, and it clearly
hurts us now.  It was a choice we made to keep things simple in the
beginning, before we knew how all this is going to turn out.  We should
rectify this mistake before building on top of it.

Here is the much bigger issue behind this:

Memcg should be a thin layer of accounting and limiting between the VM
and cgroup core code, but look at the line count.  It's more code than
all of the page reclaim logic combined, including the page replacement
algorithm, rmap, LRU list handling - all of which already include a
good deal of memcg specifics.  It's the same size as the scheduler
core, which includes the entire cpu cgroup controller.  And all this
other code is of better quality and has more eyes on it than memcg.

Please demonstrate that you try to see the bigger picture behind memcg
and make an effort to keep things simple beyond the code you introduce
and the niche you care about, otherwise I'm not willing to take any
patches from you that don't straight-up delete stuff.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help