Re: [PATCH net-next v2 1/3] page_pool: unify frag page and non-frag page handling

From: Yunsheng Lin <hidden>
Date: 2023-06-03 13:02:23
Also in: lkml

On 2023/6/3 0:37, Alexander H Duyck wrote:
...

quoted

Please let me know if the above makes sense, or if misunderstood your
concern here.

So my main concern is that what this is doing is masking things so that
the veth and virtio_net drivers can essentially lie about the truesize
of the memory they are using in order to improve their performance by
misleading the socket layer about how much memory it is actually
holding onto.

We have historically had an issue with reporting the truesize of
fragments, but generally the underestimation was kept to something less
than 100% because pages were generally split by at least half. Where it
would get messy is if a misbehaviing socket held onto packets for an
exceedingly long time.

What this patch set is doing is enabling explicit lying about the
truesize, and it compounds that by allowing for mixing small
allocations w/ large ones.

quoted

The problem is there are some architectures where we just cannot
support having pp_frag_count due to the DMA size. So it makes sense to
leave those with just basic page pool instead of trying to fake that it
is a fragmented page.

It kind of depend on how you veiw it, this patch view it as only supporting
one frag when we can't support having pp_frag_count, so I would not call it
faking.

So the big thing that make it "faking" is the truesize underestimation
that will occur with these frames.

Let's discuss truesize issue in patch 2 instead of here.
Personally, I still believe that if the driver can compute the
truesize correctly by manipulating the page->pp_frag_count and
frag offset directly, the page pool can do that too.

quoted
quoted
quoted
---

...

quoted

What is the point of this line? It doesn't make much sense to me. Are
you just trying to force an optiimization? You would be better off just
taking the BUILD_BUG_ON contents and feeding them into an if statement
below since the statement will compile out anyway.

if the "if statement" you said refers to the below, then yes.

quoted

+		if (!__builtin_constant_p(nr))
+			atomic_long_set(&page->pp_frag_count, 1);

But it is a *BUILD*_BUG_ON(), isn't it compiled out anywhere we put it?

Will move it down anyway to avoid confusion.

Actually now that I look at this more it is even more confusing. The
whole point of this function was that we were supposed to be getting
pp_frag_count to 0. However you are setting it to 1.

This is seriously flawed. If we are going to treat non-fragmented pages
as mono-frags then that is what we should do. We should be pulling this
acounting into all of the page pool freeing paths, not trying to force
the value up to 1 for the non-fragmented case.

I am not sure I understand what do you mean by 'non-fragmented ',
'mono-frags', 'page pool freeing paths' and 'non-fragmented case'
here. maybe describe it more detailed with something like the
pseudocode?

quoted

It seems like what you would want here is:
	BUG_ON(!PAGE_POOL_DMA_USE_PP_FRAG_COUNT);

Otherwise you are potentially writing to a variable that shouldn't
exist.

Not if the driver use the page_pool_alloc_frag() API instead of manipulating
the page->pp_frag_count directly using the page_pool_defrag_page() like mlx5.
The mlx5 call the page_pool_create() with with PP_FLAG_PAGE_FRAG set, and
it does not seems to have a failback for PAGE_POOL_DMA_USE_PP_FRAG_COUNT
case, and we may need to keep PP_FLAG_PAGE_FRAG for it. That's why we need
to keep the driver from implementation detail(pp_frag_count handling specifically)
of the frag support unless we have a very good reason.

Getting the truesize is that "very good reason". The fact is the
drivers were doing this long before page pool came around. Trying to
pull that away from them is the wrong way to go in my opinion.

If the truesize is really the concern here, I think it make more
sense to enforce it in the page pool instead of each driver doing
their trick, so I also think we can do better here to handle
pp_frag_count in the page pool instead of driver handling it, so
let's continue the truesize disscussion in patch 2 to see if we
can come up with something better there.

quoted

 	/* If nr == pp_frag_count then we have cleared all remaining
 	 * references to the page. No need to actually overwrite it, instead
 	 * we can leave this to be overwritten by the calling function.

@@ -311,19 +321,36 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
 	 * especially when dealing with a page that may be partitioned
 	 * into only 2 or 3 pieces.
 	 */
-	if (atomic_long_read(&page->pp_frag_count) == nr)
+	if (atomic_long_read(&page->pp_frag_count) == nr) {
+		/* As we have ensured nr is always one for constant case
+		 * using the BUILD_BUG_ON() as above, only need to handle
+		 * the non-constant case here for frag count draining.
+		 */
+		if (!__builtin_constant_p(nr))
+			atomic_long_set(&page->pp_frag_count, 1);
+
 		return 0;
+	}
 

The optimization here was already the comparison since we didn't have
to do anything if pp_frag_count == nr. The whole point of pp_frag_count
going to 0 is that is considered non-fragmented in that case and ready
to be freed. By resetting it to 1 you are implying that there is still
one *other* user that is holding a fragment so the page cannot be
freed.

We weren't bothering with writing the value since the page is in the
free path and this value is going to be unused until the page is
reallocated anyway.

I am not sure what you meant above.
But I will describe what is this patch trying to do again:
When PP_FLAG_PAGE_FRAG is set and that flag is per page pool, not per
page, so page_pool_alloc_pages() is not allowed to be called as the
page->pp_frag_count is not setup correctly for the case.

So in order to allow calling page_pool_alloc_pages(), as best as I
can think of, either we need a per page flag/bit to decide whether
to do something like dec_and_test for page->pp_frag_count in
page_pool_is_last_frag(), or we unify the page->pp_frag_count handling
in page_pool_is_last_frag() so that we don't need a per page flag/bit.

This patch utilizes the optimization you mentioned above to unify the
page->pp_frag_count handling.

quoted

 	ret = atomic_long_sub_return(nr, &page->pp_frag_count);
 	WARN_ON(ret < 0);
+
+	/* Reset frag count back to 1, this should be the rare case when
+	 * two users call page_pool_defrag_page() currently.
+	 */
+	if (!ret)
+		atomic_long_set(&page->pp_frag_count, 1);
+
 	return ret;
 }

...

quoted

As above, it is about unifying handling for frag and non-frag page in
page_pool_is_last_frag(). please let me know if there is any better way
to do it without adding statements here.

I get what you are trying to get at but I feel like the implementation
is going to cause more problems than it helps. The problem is it is
going to hurt base page pool performance and it just makes the
fragmented pages that much more confusing to deal with.

For base page pool performance, as I mentioned before:
It remove PP_FLAG_PAGE_FRAG checking and only add the cost of
page_pool_fragment_page() in page_pool_set_pp_info(), which I
think it is negligible as we are already dirtying the same cache
line in page_pool_set_pp_info().

For the confusing, sometimes it is about personal taste, so I am
not going to argue with it:) But it would be good to provide a
non-confusing way to do that with minimal overhead. I feel like
you have provided it in the begin, but I am not able to understand
it yet.

My advice as a first step would be to look at first solving how to
enable the PP_FLAG_PAGE_FRAG mode when you have
PAGE_POOL_DMA_USE_PP_FRAG_COUNT as true. That should be creating mono-
frags as we are calling them, and we should have a way to get the
truesize for those so we know when we are consuming significant amount
of memory.

Does the way to get the truesize in the below RFC make sense to you?
https://patchwork.kernel.org/project/netdevbpf/patch/20230516124801.2465-4-linyunsheng@huawei.com/

Once that is solved then we can look at what it would take to apply
mono-frags to the standard page pool case. Ideally we would need to
find a way to do it with minimal overhead.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help