Re: [PATCH net-next v4 4/5] page_pool: remove PP_FLAG_PAGE_FRAG flag | linux-arm-kernel


On Fri, Jun 16, 2023 at 5:21 AM Yunsheng Lin wrote:
On 2023/6/16 2:26, Alexander Duyck wrote:
On Thu, Jun 15, 2023 at 9:51 AM Jakub Kicinski wrote:
On Thu, 15 Jun 2023 15:17:39 +0800 Yunsheng Lin wrote:
Does hns3_page_order() set a good example for the users?

static inline unsigned int hns3_page_order(struct hns3_enet_ring *ring)
{
#if (PAGE_SIZE < 8192)
    if (ring->buf_size > (PAGE_SIZE / 2))
        return 1;
#endif
    return 0;
}

Why allocate order 1 pages for buffers which would fit in a single page?
I feel like this sort of heuristic should be built into the API itself.
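
To make the question concrete, here is a minimal userspace sketch of the
heuristic quoted above. It is not driver code: model_page_order() and
min_order_for_buf() are hypothetical names, and the page size is passed as a
parameter (an assumption for illustration) where the driver uses the
compile-time PAGE_SIZE.

```c
#include <stddef.h>

/* Userspace model of the hns3_page_order() heuristic quoted above.
 * page_size is a runtime parameter here purely for illustration. */
static unsigned int model_page_order(size_t page_size, size_t buf_size)
{
    if (page_size < 8192 && buf_size > page_size / 2)
        return 1;
    return 0;
}

/* Smallest allocation order whose size can hold one buffer. */
static unsigned int min_order_for_buf(size_t page_size, size_t buf_size)
{
    unsigned int order = 0;

    while ((page_size << order) < buf_size)
        order++;
    return order;
}
```

With a 4K page and a 4096-byte buffer the model returns order 1, while
min_order_for_buf() returns 0 for the same inputs, which is the mismatch the
question points at.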
hns3 only supports a fixed buf size per desc of 512 bytes, 1024 bytes, 2048 bytes
or 4096 bytes, see hns3_buf_size2type(). I think the order 1 pages are for a buf
size of 4096 bytes with a system page size of 4K, as the hns3 driver still supports
the per-desc ping-pong way of page splitting when page_pool_enabled is false.

With page pool enabled, you are right that order 0 pages are enough, and I am not
sure about the exact reason we use the same order as the ping-pong way of page
splitting now.
The 2048-byte buf size seems to be the default one, and I have not heard of anyone
changing it. Also, it calculates the pool_size using something like the below, so
the memory usage is almost the same for order 0 and order 1:

.pool_size = ring->desc_num * hns3_buf_size(ring) /
              (PAGE_SIZE << hns3_page_order(ring)),

I am not sure it is worth changing, maybe just change it to set a good example for
the users :) Anyway, I need to discuss this with other colleagues internally and do
some testing before making the change.
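
As a rough check of the pool_size arithmetic, the following userspace sketch
evaluates the same expression for both orders. The function names and the
descriptor count used in the note below are assumptions for illustration, not
values taken from the driver.

```c
#include <stddef.h>

/* Same expression as the quoted .pool_size initializer, with the ring
 * parameters passed in explicitly. */
static size_t model_pool_size(size_t desc_num, size_t buf_size,
                              size_t page_size, unsigned int order)
{
    return desc_num * buf_size / (page_size << order);
}

/* Bytes covered by a full pool: entries times the size of each entry. */
static size_t model_pool_bytes(size_t desc_num, size_t buf_size,
                               size_t page_size, unsigned int order)
{
    return model_pool_size(desc_num, buf_size, page_size, order) *
           (page_size << order);
}
```

For a hypothetical ring of 1024 descriptors with 4096-byte buffers on 4K
pages, order 0 gives 1024 entries and order 1 gives 512, and both cover the
same 4 MiB. Integer division can make the two differ slightly for other
inputs, which is consistent with "almost the same".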
Right, I think this may be a leftover from the page flipping mode of
operation. But AFAIU we should leave the recycling fully to the page
pool now. If we make any improvements try to make them at the page pool
level.
I checked, the per-desc buf with 4096 bytes for hns3 does not seem to
be used, mainly because of the larger memory usage you mentioned below.

I like your patches as they isolate the drivers from having to make the
fragmentation decisions based on the system page size (4k vs 64k but
we're hearing more and more about ARM w/ 16k pages). For that use case
this is great.
Yes, that is my point. For the hw case, the page splitting in page pool is
mainly to enable multiple descs to use the same page, as I understand it.

What we don't want is drivers to start requesting larger page sizes
because it looks good in iperf on a freshly booted, idle system :(
Actually that would be a really good direction for this patch set to
look at going into. Rather than having us always allocate a "page" it
would make sense for most drivers to allocate a 4K fragment or the
like in the case that the base page size is larger than 4K. That might
be a good use case to justify doing away with the standard page pool
page and look at making them all fragmented.
I am not sure if I understand the above. Isn't the frag API able to
support allocating a 4K fragment when the base page size is larger than
4K, before or after this patch? What more do we need to do?
I'm not talking about the frag API. I am talking about the
non-fragmented case. Right now standard page_pool will allocate an
order 0 page. So if a driver is using just pages expecting 4K pages
that isn't true on these ARM or PowerPC systems where the page size is
larger than 4K.

For a bit of historical reference on igb/ixgbe they had a known issue
where they would potentially run a system out of memory when page size
was larger than 4K. I had originally implemented things with just the
refcounting hack and at the time it worked great on systems with 4K
pages. However on a PowerPC it would trigger OOM errors because they
could run with 64K pages. To fix that I started adding all the
PAGE_SIZE checks in the driver and moved over to a striping model for
those that would free the page when it reached the end in order to
force it to free the page and make better use of the available memory.
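
The memory pressure described above can be approximated with a small
userspace model. This is only a sketch under stated assumptions: the function
names and the descriptor count are hypothetical, and it does not reproduce
the igb/ixgbe code. It compares holding one whole page per descriptor with
carving fixed-size fragments out of each page.

```c
#include <stddef.h>

/* Bytes pinned when every descriptor holds its own order 0 page. */
static size_t bytes_page_per_desc(size_t desc_num, size_t page_size)
{
    return desc_num * page_size;
}

/* Pages needed when each page is carved into frag_size fragments.
 * Returns 0 if a fragment does not fit in one page, which this
 * model does not handle. */
static size_t pages_needed_with_frags(size_t desc_num, size_t page_size,
                                      size_t frag_size)
{
    size_t frags_per_page = page_size / frag_size;

    if (frags_per_page == 0)
        return 0;
    /* Round up so a partially used page is still counted. */
    return (desc_num + frags_per_page - 1) / frags_per_page;
}
```

For a hypothetical ring of 512 descriptors, one page per descriptor pins
2 MiB with 4K pages but 32 MiB with 64K pages. Carving 4K fragments out of
64K pages needs only 32 pages, which is 2 MiB again.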

In the case of the standard page size being 4K a standard page would
just have to take on the CPU overhead of the atomic_set and
atomic_read for pp_ref_count (new name) which should be minimal as on
most sane systems those just end up being a memory write and read.
If I understand you correctly, I think what you are trying to do
may break some of Jesper's benchmarking [1] :)

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
So? If it breaks an out-of-tree benchmark the benchmark can always be
fixed. The point is enabling a use case that can add value across the
board instead of trying to force the community to support a niche use
case.

Ideally we should get away from using the pages directly for most
cases in page pool. In my mind the page pool should start operating
more like __get_free_pages where what you get is a virtual address
instead of the actual page. That way we could start abstracting it
away and eventually get to something more like a true page_pool api
instead of what feels like a set of add-ons for the page allocator.
Although at the end of the day this still feels more like we are just
reimplementing slab so it is hard for me to say this is necessarily
the best solution either.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
