Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy
From: Johannes Weiner <hidden>
Date: 2021-03-23 17:02:51
Also in:
linux-mm, lkml, netdev
On Mon, Mar 22, 2021 at 02:35:11PM -0700, Arjun Roy wrote:
To make sure we're on the same page, then, here's a tentative mechanism - I'd rather get buy in before spending too much time on something that wouldn't pass muster afterwards. A) An opt-in mechanism, that a driver needs to explicitly support, in order to get properly accounted receive zerocopy.
Yep, opt-in makes sense. That allows piece-by-piece conversion and avoids us having to have a flag day.
B) Failure to opt-in (e.g. unchanged old driver) can either lead to unaccounted zerocopy (ie. existing behaviour) or, optionally, effectively disabled zerocopy (ie. any call to zerocopy will return something like EINVAL) (perhaps controlled via some sysctl, which either lets zerocopy through or not with/without accounting).
I'd suggest letting it fail gracefully (i.e. no -EINVAL) to not disturb existing/working setups during the transition period. But the exact policy is easy to change later on if we change our minds on it.
The proposed mechanism would involve: 1) Some way of marking a page as being allocated by a driver that has decided to opt into this mechanism. Say, a page flag, or a memcg flag.
Right. I would stress it should not be a memcg flag or any direct channel from the network to memcg, as this would limit its usefulness while having the same maintenance overhead. It should make the network page a first class MM citizen - like an LRU page or a slab page - which can be accounted and introspected as such, including from the memcg side. So definitely a page flag.
2) A callback provided by the driver, that takes a struct page*, and returns a boolean. The value of the boolean being true indicates that any and all refs on the page are held by the driver. False means there exists at least one reference that is not held by the driver.
I was thinking the PageNetwork flag would cover this, but maybe I'm missing something?
3) A branch in put_page() that, for pages marked thus, will consult the driver callback and if it returns true, will uncharge the memcg for the page.
The way I picture it, put_page() (and release_pages) should do this:
void __put_page(struct page *page)
{
if (is_zone_device_page(page)) {
put_dev_pagemap(page->pgmap);
/*
* The page belongs to the device that created pgmap. Do
* not return it to page allocator.
*/
return;
}
+
+ if (PageNetwork(page)) {
+ put_page_network(page);
+ /* Page belongs to the network stack, not the page allocator */
+ return;
+ }
if (unlikely(PageCompound(page)))
__put_compound_page(page);
else
__put_single_page(page);
}
where put_page_network() is the network-side callback that uncharges
the page.
(..and later can be extended to do all kinds of things when informed
that the page has been freed: update statistics (mod_page_state), put
it on a private network freelist, or ClearPageNetwork() and give it
back to the page allocator etc.
But for starters it can set_page_count(page, 1) after the uncharge to
retain the current silent recycling behavior.)
The anonymous struct you defined above is part of a union that I think normally is one qword in length (well, could be more depending on the typedefs I saw there) and I think that can be co-opted to provide the driver callback - though, it might require growing the struct by one more qword since there may be drivers like mlx5 that are already using the field already in there for dma_addr.
The page cache / anonymous struct it's shared with is 5 words (double linked list pointers, mapping, index, private), and the network struct is currently one word, so you can add 4 words to a PageNetwork() page without increasing the size of struct page. That should be plenty of space to store auxiliary data for drivers, right?
Anyways, the callback could then be used by the driver to handle the other accounting quirks you mentioned, without needing to scan the full pool.
Right.
Of course there are corner cases and such to properly account for, but I just wanted to provide a really rough sketch to see if this (assuming it were properly implemented) was what you had in mind. If so I can put together a v3 patch.
Yeah, makes perfect sense. We can keep iterating like this any time you feel you accumulate too many open questions. Not just for MM but also for the networking folks - although I suspect that the first step would be mostly about the MM infrastructure, and I'm not sure how much they care about the internals there ;)
Per my response to Andrew earlier, this would make it even more confusing whether this is to be applied against net-next or mm trees. But that's a bridge to cross when we get to it.
The mm tree includes -next, so it should be a safe development target for the time being. I would then decide it based on how many changes your patch interacts with on either side. Changes to struct page and the put path are not very frequent, so I suspect it'll be easy to rebase to net-next and route everything through there. And if there are heavy changes on both sides, the -mm tree is the better route anyway. Does that sound reasonable?