Thread (13 messages) 13 messages, 4 authors, 2021-03-26

Re: [PATCH] mm: page_alloc: fix memcg accounting leak in speculative cache lookup

From: Hugh Dickins <hidden>
Date: 2021-03-26 04:06:06
Also in: linux-mm, lkml

On Fri, 26 Mar 2021, Matthew Wilcox wrote:
quoted hunk ↗ jump to hunk
On Thu, Mar 25, 2021 at 06:55:42PM -0700, Hugh Dickins wrote:
quoted
The first reason occurred to me this morning.  I thought I had been
clever to spot the PageHead race which you fix here.  But now I just feel
very stupid not to have spotted the very similar memcg_data race.  The
speculative racer may call mem_cgroup_uncharge() from __put_single_page(),
and the new call to split_page_memcg() do nothing because page_memcg(head)
is already NULL.

And is it even safe there, to sprinkle memcg_data through all of those
order-0 subpages, when free_the_page() is about to be applied to a
series of descending orders?  I could easily be wrong, but I think
free_pages_prepare()'s check_free_page() will find that is not
page_expected_state().
So back to something more like my original patch then?
+++ b/mm/page_alloc.c
@@ -5081,9 +5081,15 @@ void __free_pages(struct page *page, unsigned int order)
 {
        if (put_page_testzero(page))
                free_the_page(page, order);
-	else if (!PageHead(page))
-               while (order-- > 0)
-                       free_the_page(page + (1 << order), order);
+       else if (!PageHead(page)) {
+               while (order-- > 0) {
+                       struct page *tail = page + (1 << order);
+#ifdef CONFIG_MEMCG
+                       tail->memcg_data = page->memcg_data;
+#endif
+                       free_the_page(tail, order);
+               }
+       }
 }
 EXPORT_SYMBOL(__free_pages);
We can cache page->memcg_data before calling put_page_testzero(),
just like we cache the Head flag in Johannes' patch.
If I still believed in e320d3012d25, yes, that would look right
(but I don't have much faith in my judgement after all this).

I'd fallen in love with split_page_memcg() when you posted that
one, and was put off by your #ifdef, so got my priorities wrong
and went for the split_page_memcg().
quoted
But, after all that, I'm now thinking that Matthew's original
e320d3012d25 ("mm/page_alloc.c: fix freeing non-compound pages")
is safer reverted.  The put_page_testzero() in __free_pages() was
not introduced for speculative pagecache: it was there in 2.4.0,
and atomic_dec_and_test() in 2.2, I don't have older trees to hand.
I think you're confused in that last assertion.  According to
linux-fullhistory, the first introduction of __free_pages was 2.3.29pre3
(September 1999), where it did indeed use put_page_testzero:
Not confused, just pontificating from a misleading subset of the data.
I knew there's an even-more-history-than-tglx git tree somewhere, but
what I usually look back to is 2.4 trees, plus a 2.2.26 tree - but of
course that's a late 2.2, from 2004, around the same time as 2.6.3.
That tree shows a __free_pages() using atomic_dec_and_test().

But we digress...
+extern inline void __free_pages(struct page *page, unsigned long order)
+{
+       if (!put_page_testzero(page))
+               return;
+       __free_pages_ok(page, order);
+}

Before that, we had only free_pages() and __free_page().
quoted
So, it has "always" been accepted that multiple references to a
high-order non-compound page can be given out and released: maybe
they were all released with __free_pages() of the right order, or
maybe only the last had to get that right; but as __free_pages()
stands today, all but the last caller frees all but the first
subpage.  A very rare leak seems much safer.

I don't have the answer (find somewhere in struct page to squirrel
away the order, even when it's a non-compound page?), and I think
each of us would much rather be thinking about other things at the
moment.  But for now it looks to me like NAK to this patch, and
revert of e320d3012d25.
We did discuss that possibility prior to the introduction of
e320d3012d25.  Here's one such:
https://lore.kernel.org/linux-mm/20200922031215.GZ32101-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org/T/#m0b08c0c3430e09e20fa6648877dc42b04b18e6f3
Thanks for the link. And I'll willingly grant that your experience is
vast compared to mine. But "Drivers don't do that, in my experience"
is not a convincing reason to invalidate a way of working that the
code has gone out of its way to allow for, for over twenty years.

But you make a good point on the "Bad page" reports that would now
be generated: maybe that will change my mind later on.

Hugh
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help