Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy

From: Johannes Weiner <hidden>
Date: 2021-03-23 17:02:51
Also in: linux-mm, lkml, netdev

On Mon, Mar 22, 2021 at 02:35:11PM -0700, Arjun Roy wrote:

To make sure we're on the same page, then, here's a tentative
mechanism - I'd rather get buy in before spending too much time on
something that wouldn't pass muster afterwards.

A) An opt-in mechanism, that a driver needs to explicitly support, in
order to get properly accounted receive zerocopy.

Yep, opt-in makes sense. That allows piece-by-piece conversion and
avoids us having to have a flag day.

B) Failure to opt-in (e.g. unchanged old driver) can either lead to
unaccounted zerocopy (ie. existing behaviour) or, optionally,
effectively disabled zerocopy (ie. any call to zerocopy will return
something like EINVAL) (perhaps controlled via some sysctl, which
either lets zerocopy through or not with/without accounting).

I'd suggest letting it fail gracefully (i.e. no -EINVAL) to not
disturb existing/working setups during the transition period. But the
exact policy is easy to change later on if we change our minds on it.

The proposed mechanism would involve:
1) Some way of marking a page as being allocated by a driver that has
decided to opt into this mechanism. Say, a page flag, or a memcg flag.

Right. I would stress it should not be a memcg flag or any direct
channel from the network to memcg, as this would limit its usefulness
while having the same maintenance overhead.

It should make the network page a first class MM citizen - like an LRU
page or a slab page - which can be accounted and introspected as such,
including from the memcg side.

So definitely a page flag.

2) A callback provided by the driver, that takes a struct page*, and
returns a boolean. The value of the boolean being true indicates that
any and all refs on the page are held by the driver. False means there
exists at least one reference that is not held by the driver.

I was thinking the PageNetwork flag would cover this, but maybe I'm
missing something?

3) A branch in put_page() that, for pages marked thus, will consult
the driver callback and if it returns true, will uncharge the memcg
for the page.

The way I picture it, put_page() (and release_pages) should do this:

void __put_page(struct page *page)
{
        if (is_zone_device_page(page)) {
                put_dev_pagemap(page->pgmap);

                /*
                 * The page belongs to the device that created pgmap. Do
                 * not return it to page allocator.
                 */
                return;
        }
+
+	if (PageNetwork(page)) {
+		put_page_network(page);
+		/* Page belongs to the network stack, not the page allocator */
+		return;
+	}

        if (unlikely(PageCompound(page)))
                __put_compound_page(page);
        else
                __put_single_page(page);
}

where put_page_network() is the network-side callback that uncharges
the page.

(..and later can be extended to do all kinds of things when informed
that the page has been freed: update statistics (mod_page_state), put
it on a private network freelist, or ClearPageNetwork() and give it
back to the page allocator etc.

But for starters it can set_page_count(page, 1) after the uncharge to
retain the current silent recycling behavior.)

The anonymous struct you defined above is part of a union that I think
normally is one qword in length (well, could be more depending on the
typedefs I saw there) and I think that can be co-opted to provide the
driver callback - though, it might require growing the struct by one
more qword since there may be drivers like mlx5 that are already using
the field already in there  for dma_addr.

The page cache / anonymous struct it's shared with is 5 words (double
linked list pointers, mapping, index, private), and the network struct
is currently one word, so you can add 4 words to a PageNetwork() page
without increasing the size of struct page. That should be plenty of
space to store auxiliary data for drivers, right?

Anyways, the callback could then be used by the driver to handle the
other accounting quirks you mentioned, without needing to scan the
full pool.

Right.

Of course there are corner cases and such to properly account for, but
I just wanted to provide a really rough sketch to see if this
(assuming it were properly implemented) was what you had in mind. If
so I can put together a v3 patch.

Yeah, makes perfect sense. We can keep iterating like this any time
you feel you accumulate too many open questions. Not just for MM but
also for the networking folks - although I suspect that the first step
would be mostly about the MM infrastructure, and I'm not sure how much
they care about the internals there ;)

Per my response to Andrew earlier, this would make it even more
confusing whether this is to be applied against net-next or mm trees.
But that's a bridge to cross when we get to it.

The mm tree includes -next, so it should be a safe development target
for the time being.

I would then decide it based on how many changes your patch interacts
with on either side. Changes to struct page and the put path are not
very frequent, so I suspect it'll be easy to rebase to net-next and
route everything through there. And if there are heavy changes on both
sides, the -mm tree is the better route anyway.

Does that sound reasonable?

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help