Re: [PATCH net-next v2 1/3] page_pool: unify frag page and non-frag page handling

From: Alexander Duyck <hidden>
Date: 2023-06-07 15:06:28
Also in: lkml

On Wed, Jun 7, 2023 at 5:46 AM Yunsheng Lin [off-list ref] wrote:

On 2023/6/6 23:33, Alexander Duyck wrote:

quoted

Do you mean doing something like below? isn't it dirtying the cache line
of 'struct page' whenever a page is recycled, which means we may not be
able to the maintain current performance for non-fragmented or mono-frag
case?

--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c

@@ -583,6 +583,10 @@ static __always_inline struct page *
 __page_pool_put_page(struct page_pool *pool, struct page *page,
                     unsigned int dma_sync_size, bool allow_direct)
 {
+
+       if (!page_pool_defrag_page(page, 1))
+               return NULL;
+

Yes, that is pretty much it. This would be your standard case page
pool put path. Basically it allows us to start getting rid of a bunch
of noise in the fragmented path.

quoted

        /* This allocator is optimized for the XDP mode that uses
         * one-frame-per-page, but have fallbacks that act like the
         * regular page allocator APIs.

@@ -594,6 +598,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
         */
        if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
                /* Read barrier done in page_ref_count / READ_ONCE */
+               page_pool_fragment_page(page, 1);

I wouldn't bother resetting this to 1 until after you have recycled it
and pulled it back out again as an allocation. Basically when the
pages are sitting in the pool the frag_count should be 0. That way it
makes it easier to track and is similar to how the memory allocator
actually deals with the page reference count. Basically if the page is
sitting in the pool the frag_count is 0, once it comes out it should
be 1 or more indicating it is in use.

Let's be more specific about what we want to do here：

For a specific page without splitting, the journey that it will go
through is as below before this patch:
1. It is allocated from the page allocator.
2. It is initialized in page_pool_set_pp_info().
3. It is passed to driver by page pool.
4. It is passed to stack by the driver.
5. It is passed back to the page pool by the stack, depending on the
   page_ref_count() checking:
   5.1 page_ref_count() being one, the page is now owned by the page
       pool, and may be passed to the driver by going to step 3.
   5.2 page_ref_count() not being one, the page is released by page
       pool doing resoure cleaning like dma mapping and put_page().

So a page may go through step 3 ~ 5.1 many times without dirtying
the cache line of  'struct page' as my understanding.


If I follow your suggestion here, It seems for a specific page without
splitting, it may go through:
1. It is allocated from the page allocator.
2. It is initialized in page_pool_set_pp_info().
3. It's pp_frag_count is set to one.
4. It is passed to driver by page pool.
5. It is passed to stack by the driver.
6. It is passed back to the page pool by the stack, depending on the
   pp_frag_count and page_ref_count() checking:
   6.1 pp_frag_count and page_ref_count() being one, the page is now
       owned by the page pool, and may be passed to the driver by
       going to step 3.
   6.2 otherwise the page is released by page pool doing resoure
       cleaning like dma mapping and put_page().

Aren't we dirtying the cache line of  'struct page' everytime the page
is recycled? Or did I miss something obvious here?

What you are stating makes sense. So we would probably want to keep
the pp_frag_count at 1 when it is sitting in the pool.

For my implementation, for a specific page without splitting, it may
go through:
1. It is allocated from the page allocator.
2. It is initialized in page_pool_set_pp_info() and it's pp_frag_count
   set to one.
3. It is passed to driver by page pool.
4. It is passed to stack by the driver.
5. It is passed back to the page pool by the stack, depending on the
   page_ref_count() checking:
   5.1 pp_frag_count and page_ref_count() being one, the page is now
       owned by the page pool, and as the optimization by not updating
       page->pp_frag_count in page_pool_defrag_page() for the last
       frag user, it can be passed to the driver by going to step 3
       without resetting the pp_frag_count to 1, which may dirty
       the cache line of  'struct page'.
   5.2 otherwise the page is released by page pool doing resoure
       cleaning like dma mapping and put_page().

Does it make any sense, or it doesn't really matter we are dirtying
the cache line of  'struct page' whenever a page without splitted is
recycled?

No, that makes sense. No point in dirtying a cache line if we don't have to.

quoted

Does 'defragging a fragmented page' mean doing decrementing pp_frag_count?
"freeing it" mean calling put_page()? What does 'combined' really means
here?

The change is that the code would do the subtraction and if it hit 0
it was freeing the page. That is the one piece that gets more
complicated because we really should be hitting 1. So we may be adding
a few more operations to that case.

quoted

quoted

I am not sure I understand it. Does 'gets more complicated' means doing
some optimization like page_pool_defrag_page() does?
https://elixir.bootlin.com/linux/v6.4-rc5/source/include/net/page_pool.h#L314

Our standard case is to decrement by 1. We will need to for the code
that is doing your step 5.1 to handle a case where we are removing
multiple frag references. That is what I was getting at with the "more
complicated" comment. Basically if we push it to 0 then we either have
to free or recycle the page by resetting the fragments.

quoted

With that you could then let drivers like the Mellanox one handle its
own fragmenting knowing it has to return things to a mono-frag in
order for it to be working correctly.

I still really don't how it will be better for mlx5 to handle its
own fragmenting yet?

+cc Dragos & Saeed to share some info here, so that we can see
if page pool learn from it.

It has more to do with the fact that the driver knows what it is going
to do beforehand. In many cases it can look at the page and know that
it isn't going to reuse it again so it can just report the truesize
being the length from the current pointer to the end of the page.

You can think of it as the performance advantage of a purpose built
ASIC versus a general purpose CPU. The fact is we are able to cut out
much of the unnecessary overhead if we know exactly how we are going
to use the memory in the driver versus having to guess at it in the
page pool API.

In general, I would agree with that.
But for the specific case with mlx5, I am not sure about that, that's
why I am curious about what is the exact reason about it doing the
complicated frag_count handing in the driver instead of improving
the page pool to support it's usecase, if it is about the last frag
truesize problem here, we can do something like virtio_net do in the
page pool too.

I suspect it has to do with their hardware doing the fragmentation of
the page. As I recall some of the Mellanox parts support Rx packing so
it is likely that their hardware is packing multiple frames into a
single page and so they are marking it pre-fragmented, and then when
the hardware completes the DMA they go through and record the offsets
for the individual fragments and pass them up the stack.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help