Re: [PATCH] mm: page_alloc: fix memcg accounting leak in speculative cache lookup
From: Matthew Wilcox <hidden>
Date: 2021-03-26 02:53:44
Also in:
linux-mm, lkml
On Thu, Mar 25, 2021 at 06:55:42PM -0700, Hugh Dickins wrote:
The first reason occurred to me this morning. I thought I had been clever to spot the PageHead race which you fix here. But now I just feel very stupid not to have spotted the very similar memcg_data race. The speculative racer may call mem_cgroup_uncharge() from __put_single_page(), and the new call to split_page_memcg() do nothing because page_memcg(head) is already NULL. And is it even safe there, to sprinkle memcg_data through all of those order-0 subpages, when free_the_page() is about to be applied to a series of descending orders? I could easily be wrong, but I think free_pages_prepare()'s check_free_page() will find that is not page_expected_state().
So back to something more like my original patch then?
+++ b/mm/page_alloc.c@@ -5081,9 +5081,15 @@ void __free_pages(struct page *page, unsigned int order) { if (put_page_testzero(page)) free_the_page(page, order); - else if (!PageHead(page)) - while (order-- > 0) - free_the_page(page + (1 << order), order); + else if (!PageHead(page)) { + while (order-- > 0) { + struct page *tail = page + (1 << order); +#ifdef CONFIG_MEMCG + tail->memcg_data = page->memcg_data; +#endif + free_the_page(tail, order); + } + } } EXPORT_SYMBOL(__free_pages);
We can cache page->memcg_data before calling put_page_testzero(), just like we cache the Head flag in Johannes' patch.
But, after all that, I'm now thinking that Matthew's original
e320d3012d25 ("mm/page_alloc.c: fix freeing non-compound pages")
is safer reverted. The put_page_testzero() in __free_pages() was
not introduced for speculative pagecache: it was there in 2.4.0,
and atomic_dec_and_test() in 2.2, I don't have older trees to hand.
I think you're confused in that last assertion. According to
linux-fullhistory, the first introduction of __free_pages was 2.3.29pre3
(September 1999), where it did indeed use put_page_testzero:
+extern inline void __free_pages(struct page *page, unsigned long order)
+{
+ if (!put_page_testzero(page))
+ return;
+ __free_pages_ok(page, order);
+}
Before that, we had only free_pages() and __free_page().
So, it has "always" been accepted that multiple references to a high-order non-compound page can be given out and released: maybe they were all released with __free_pages() of the right order, or maybe only the last had to get that right; but as __free_pages() stands today, all but the last caller frees all but the first subpage. A very rare leak seems much safer. I don't have the answer (find somewhere in struct page to squirrel away the order, even when it's a non-compound page?), and I think each of us would much rather be thinking about other things at the moment. But for now it looks to me like NAK to this patch, and revert of e320d3012d25.
We did discuss that possibility prior to the introduction of e320d3012d25. Here's one such: https://lore.kernel.org/linux-mm/20200922031215.GZ32101-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org/T/#m0b08c0c3430e09e20fa6648877dc42b04b18e6f3