Thread (161 messages) 161 messages, 7 authors, 2013-07-25

Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked

From: Michal Hocko <hidden>
Date: 2012-11-29 14:05:55
Also in: cgroups, lkml

On Wed 28-11-12 12:20:44, Hugh Dickins wrote:
[...]
Sorry, Michal, you've laboured hard on this: but I dislike it so much
that I'm here overcoming my dread of entering an OOM-killer discussion,
and the resultant deluge of unwelcome CCs for eternity afterwards.

I had been relying on Johannes to repeat his "This issue has been
around for a while so frankly I don't think it's urgent enough to
rush things", but it looks like I have to be the one to repeat it.
Well, the idea was to use this only as a temporal fix and come up with a
better solution without any hurry.
Your analysis of azurIt's traces may well be correct, and this patch
may indeed ameliorate the situation, and it's fine as something for
azurIt to try and report on and keep in his tree; but I hope that
it does not go upstream and to stable.

Why do I dislike it so much?  I suppose because it's both too general
and too limited at the same time.

Too general in that it changes the behaviour on OOM for a large set
of memcg charges, all those that go through add_to_page_cache_locked(),
when only a subset of those have the i_mutex issue.
This is a fair point but the real fix which we were discussing with
Johannes would be even more risky for stable.
If you're going to be that general, why not go further?  Leave the
mem_cgroup_cache_charge() interface as is, make it not-OOM internally,
no need for SGP_WRITE,SGP_FALLOC distinctions in mm/shmem.c.  No other
filesystem gets the benefit of those distinctions: isn't it better to
keep it simple?  (And I can see a partial truncation case where shmem
uses SGP_READ under i_mutex; and the change to shmem_unuse behaviour
is a non-issue, since swapoff invites itself to be killed anyway.)

Too limited in that i_mutex is just the held resource which azurIt's
traces have led you to, but it's a general problem that the OOM-killed
task might be waiting for a resource that the OOM-killing task holds.

I suspect that if we try hard enough (I admit I have not), we can find
an example of such a potential deadlock for almost every memcg charge
site.  mmap_sem? not as easy to invent a case with that as I thought,
since it needs a down_write, and the typical page allocations happen
with down_read, and I can't think of a process which does down_write
on another's mm.

But i_mutex is always good, once you remember the case of write to
file from userspace page which got paged out, so the fault path has
to read it back in, while i_mutex is still held at the outer level.
An unusual case?  Well, normally yes, but we're considering
out-of-memory conditions, which may converge upon cases like this.

Wouldn't it be nice if I could be constructive?  But I'm sceptical
even of Johannes's faith in what the global OOM killer would do:
how does __alloc_pages_slowpath() get out of its "goto restart"
loop, excepting the trivial case when the killer is the killed?
I am not sure I am following you here but the Johannes's idea was to
break out of the charge after a signal has been sent and the charge
still fails and either retry the fault or fail the allocation. I think
this should work but I am afraid that this needs some tuning (number of
retries f.e.) to prevent from too aggressive OOM or too many failurs.

Do we have any other possibilities to solve this issue? Or do you think
we should ignore the problem just because nobody complained for such a
long time?
Dunno, I think we should fix this with something less risky for now and
come up with a real fix after it sees sufficient testing.
I wonder why this issue has hit azurIt and no other reporter?
No swap plays a part in it, but that's not so unusual.

Yours glOOMily,
Hugh
[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help