Re: [PATCH net-next v2 1/3] page_pool: unify frag page and non-frag page handling

From: Yunsheng Lin <hidden>
Date: 2023-06-06 12:42:02
Also in: lkml

On 2023/6/5 22:58, Alexander Duyck wrote:

...

quoted

I am not sure I understand what do you mean by 'non-fragmented ',
'mono-frags', 'page pool freeing paths' and 'non-fragmented case'
here. maybe describe it more detailed with something like the
pseudocode?

What you are attempting to generate are "mono-frags" where a page pool
page has a frag count of 1. I refer to "non-fragmented pages" as the
legacy page pool page without pp_frags set.

The "page-pool freeing paths" are the ones outside of the fragmented
bits here. Basically __page_pool_put_page and the like. What you
should be doing is pushing the reference counting code down deeper
into the page pool logic. Currently it is more of a surface setup.

The whole point I am getting at with this is that we should see the
number of layers reduced for the fragmented pages, and by converting
the non-fragmented pages to mono-frags we should see that maintain its
current performance and total number of layers instead of having more
layers added to it.

Do you mean reducing the number of layers for the fragmented pages by
moving the page->pp_frag_count handling from page_pool_defrag_page()
to __page_pool_put_page() where page->_refcount is checked?

Or merge page->pp_frag_count into page->_refcount so that we don't
need page->pp_frag_count anymore?

As my understanding, when a page from page pool is passed to the stack
to be processed, the stack may hold onto that page by taking
page->_refcount too, which means page pool has no control over who will
hold onto and when that taken will be released, that is why page pool
do the "page_ref_count(page) == 1" checking in __page_pool_put_page(),
if it is not true, the page pool can't recycle the page, so pp_frag_count
and _refcount have different meaning and serve different purpose, merging
them doesn't work, and moving them to one place doesn't make much sense
too?

Or is there other obvious consideration that I missed?

quoted

I am not sure what you meant above.
But I will describe what is this patch trying to do again:
When PP_FLAG_PAGE_FRAG is set and that flag is per page pool, not per
page, so page_pool_alloc_pages() is not allowed to be called as the
page->pp_frag_count is not setup correctly for the case.

So in order to allow calling page_pool_alloc_pages(), as best as I
can think of, either we need a per page flag/bit to decide whether
to do something like dec_and_test for page->pp_frag_count in
page_pool_is_last_frag(), or we unify the page->pp_frag_count handling
in page_pool_is_last_frag() so that we don't need a per page flag/bit.

This patch utilizes the optimization you mentioned above to unify the
page->pp_frag_count handling.

Basically what should be happening if all page-pool pages are to be
considered "fragmented" is that we should be folding this into the
freeing logic. What we now have a 2 stage setup where we are dropping
the count to 0, then rebounding it and setting it back to 1. If we are
going to have all page pool pages fragmented then the freeing path for
page pool pages should just be handling frag count directly instead of
hacking on it here and ignoring it in the actual freeing paths.

Do you mean doing something like below? isn't it dirtying the cache line
of 'struct page' whenever a page is recycled, which means we may not be
able to the maintain current performance for non-fragmented or mono-frag
case?

--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c

@@ -583,6 +583,10 @@ static __always_inline struct page *
 __page_pool_put_page(struct page_pool *pool, struct page *page,
                     unsigned int dma_sync_size, bool allow_direct)
 {
+
+       if (!page_pool_defrag_page(page, 1))
+               return NULL;
+
        /* This allocator is optimized for the XDP mode that uses
         * one-frame-per-page, but have fallbacks that act like the
         * regular page allocator APIs.

@@ -594,6 +598,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
         */
        if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
                /* Read barrier done in page_ref_count / READ_ONCE */
+               page_pool_fragment_page(page, 1);

                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
                        page_pool_dma_sync_for_device(pool, page,

quoted

   ret = atomic_long_sub_return(nr, &page->pp_frag_count);
   WARN_ON(ret < 0);
+
+  /* Reset frag count back to 1, this should be the rare case when
+   * two users call page_pool_defrag_page() currently.
+   */
+  if (!ret)
+          atomic_long_set(&page->pp_frag_count, 1);
+
   return ret;
 }

...

quoted

As above, it is about unifying handling for frag and non-frag page in
page_pool_is_last_frag(). please let me know if there is any better way
to do it without adding statements here.

I get what you are trying to get at but I feel like the implementation
is going to cause more problems than it helps. The problem is it is
going to hurt base page pool performance and it just makes the
fragmented pages that much more confusing to deal with.

For base page pool performance, as I mentioned before:
It remove PP_FLAG_PAGE_FRAG checking and only add the cost of
page_pool_fragment_page() in page_pool_set_pp_info(), which I
think it is negligible as we are already dirtying the same cache
line in page_pool_set_pp_info().

I have no problem with getting rid of the flag.

quoted

For the confusing, sometimes it is about personal taste, so I am
not going to argue with it:) But it would be good to provide a
non-confusing way to do that with minimal overhead. I feel like
you have provided it in the begin, but I am not able to understand
it yet.

The problem here is that instead of treating all page pool pages as
fragmented, what the patch set has done is added a shim layer so that
you are layering fragmentation on top of page pool pages which was
already the case.

That is why I have suggested make page pool pages a "mono-frag" as
your first patch. Basically it is going to force you to have to set
the pp_frag value for these pages, and verify it is 1 when you are
freeing it.

It seems it is bascially what this patch do with minimal
overhead to the previous users.

Let me try again with what this patch mainly do:

Currently when page_pool_create() is called with
PP_FLAG_PAGE_FRAG flag, page_pool_alloc_pages() is only
allowed to be called under the below constraints:
1. page_pool_fragment_page() need to be called to setup
   page->pp_frag_count immediately.
2. page_pool_defrag_page() often need to be called to
   drain the page->pp_frag_count when there is no more
   user will be holding on to that page.

Those constraints exist in order to support a page to
be splitted into multi frags.

And those constraints have some overhead because of the
cache line dirtying/bouncing and atomic update.

Those constraints are unavoidable for case when we need
a page to be splitted into more than one frag, but there
is also case that we want to avoid the above constraints
and their overhead when a page can't be splitted as it
can only hold a big frag as requested by user, depending
on different use cases:
use case 1: allocate page without page splitting.
use case 2: allocate page with page splitting.
use case 3: allocate page with or without page splitting
            depending on the frag size.

Currently page pool only provide page_pool_alloc_pages()
and page_pool_alloc_frag() API to enable the above 1 & 2
separately, so we can not use a combination of 1 & 2 to
enable 3, it is not possible yet because of the per
page_pool flag PP_FLAG_PAGE_FRAG.

So in order to allow allocating unsplitted page without
the overhead of splitted page while still allow allocating
splitted page, we need to remove the per page_pool flag
in page_pool_is_last_frag(), as best as I can think of, it
seems there are two methods as below:
1. Add per page flag/bit to indicate a page is splitted or
   not, which means we might need to update that flag/bit
   everytime the page is recycled, dirtying the cache line
   of 'struct page' for use case 1.
2. Unify the page->pp_frag_count handling for both splitted
   and unsplitted page by assuming all pages in the page
   pool is splitted into a big frag initially.

Because we want to support the above use case 3 with minimal
overhead, especially not adding any noticable overhead for
use case 1, and we are already doing an optimization by not
updating pp_frag_count in page_pool_defrag_page() for the
last frag user, this patch chooses to unify the pp_frag_count
handling to support the above use case 3.

Let me know if it is making any sense here.

Then you are going to have to modify the fragmented cases to make use
of lower level calls because now instead of us defragging a fragmented
page, and then freeing it the two operations essentially have to be
combined into one operation.

Does 'defragging a fragmented page' mean doing decrementing pp_frag_count?
"freeing it" mean calling put_page()? What does 'combined' really means
here?

quoted

My advice as a first step would be to look at first solving how to
enable the PP_FLAG_PAGE_FRAG mode when you have
PAGE_POOL_DMA_USE_PP_FRAG_COUNT as true. That should be creating mono-
frags as we are calling them, and we should have a way to get the
truesize for those so we know when we are consuming significant amount
of memory.

Does the way to get the truesize in the below RFC make sense to you?
https://patchwork.kernel.org/project/netdevbpf/patch/20230516124801.2465-4-linyunsheng@huawei.com/

It doesn't add any value. All you are doing is passing the "size"
value as "truesize". The whole point of the "truesize" would be to
report the actual size. So a step in that direction would be to bump
truesize to include the remainder that isn't used when you decide it
is time to allocate a new page. The problem is that it requires some
fore-knowledge of what the next requested size is going to be. That is
why it is better to just have the drivers manage this since they know
what size they typically request and when they are going to close
pages.

Like I said, if you are wanting to go down this path you are better
off starting with page pool and making all regular page pool pages
into mono-frags. Then from there we can start building things out.

'mono-frag' means page with pp_frag_count being one. If yes, then I
feel like we have the same goal here, but may have different opinion
on how to implement it.

With that you could then let drivers like the Mellanox one handle its
own fragmenting knowing it has to return things to a mono-frag in
order for it to be working correctly.

I still really don't how it will be better for mlx5 to handle its
own fragmenting yet?

+cc Dragos & Saeed to share some info here, so that we can see
if page pool learn from it.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help