Re: [PATCH RFC] mm+net: allow to set kmem_cache create flag for SLAB_NEVER_MERGE

From: Jesper Dangaard Brouer <hidden>
Date: 2023-01-19 18:09:07
Also in: linux-mm


On 18/01/2023 06.17, Matthew Wilcox wrote:

On Tue, Jan 17, 2023 at 03:54:34PM +0100, Christoph Lameter wrote:

On Tue, 17 Jan 2023, Jesper Dangaard Brouer wrote:

When running different network performance microbenchmarks, I started
to notice that performance was reduced (slightly) when machines had
longer uptimes. I believe the cause was that 'skbuff_head_cache' got
aliased/merged into the general slub cache for 256-byte objects (with
my kernel config, without CONFIG_HARDENED_USERCOPY).
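
One way to see which caches have been aliased together is to look at
how the entries under /sys/kernel/slab resolve. The following is only a
rough sketch: it assumes SLUB exposes merged caches as symlinks that
point at a shared directory, which can differ between kernel versions
and configs. The helper name merged_groups() is made up for this
example.

```python
import os
from collections import defaultdict


def merged_groups(slab_root="/sys/kernel/slab"):
    """Group cache names that resolve to the same sysfs directory.

    Assumption: merged (aliased) caches show up as symlinks pointing at
    one shared directory. Returns {shared_dir_name: [cache names]} for
    every directory that has more than one symlink pointing at it.
    """
    groups = defaultdict(list)
    for name in sorted(os.listdir(slab_root)):
        path = os.path.join(slab_root, name)
        if os.path.islink(path):
            groups[os.path.realpath(path)].append(name)
    return {
        os.path.basename(target): names
        for target, names in groups.items()
        if len(names) > 1
    }


if __name__ == "__main__" and os.path.isdir("/sys/kernel/slab"):
    # Print only the groups that contain the cache discussed above.
    for target, names in merged_groups().items():
        if "skbuff_head_cache" in names:
            print(target, "->", ", ".join(names))
```

If 'skbuff_head_cache' shows up in a group together with other caches,
it is sharing slabs with them.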

Well that is a common effect that we see in multiple subsystems. This is
due to general memory fragmentation. Depending on the prior load the
performance could actually be better after some runtime if the caches are
populated avoiding the page allocator etc.

The page allocator isn't _that_ expensive.  I could see updating several
slabs being more expensive than allocating a new page.

For 10Gbit/s wirespeed small frames I have 201 cycles as budget.
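
For reference, a budget of roughly 201 cycles can be reproduced with
the arithmetic below. The inputs are assumptions not stated in this
mail: 64-byte minimum-size Ethernet frames, 20 bytes of per-frame wire
overhead (preamble plus inter-frame gap), and a 3 GHz CPU.

```python
def packet_budget(link_bps=10e9, frame_bytes=64, overhead_bytes=20, cpu_hz=3e9):
    """Return (packets/sec, ns per packet, CPU cycles per packet).

    All defaults are assumptions: minimum-size frames, 20 bytes of
    preamble + inter-frame gap on the wire, and a 3 GHz clock.
    """
    wire_bits = (frame_bytes + overhead_bytes) * 8  # bits each frame occupies on the wire
    pps = link_bps / wire_bits                      # packets per second at line rate
    ns_per_packet = 1e9 / pps                       # time available per packet
    cycles_per_packet = cpu_hz / pps                # cycle budget per packet
    return pps, ns_per_packet, cycles_per_packet


pps, ns, cycles = packet_budget()
print(f"{pps / 1e6:.2f} Mpps, {ns:.1f} ns, {cycles:.1f} cycles per packet")
```

With these assumptions the result is about 14.88 Mpps, 67.2 ns and
201.6 cycles per packet.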

I prefer to measure things, so let's see what a page alloc costs, and
also relate that to the cost per 4096 bytes.

  alloc_pages order:0(4096B/x1)    246 cycles per-4096B 246 cycles
  alloc_pages order:1(8192B/x2)    300 cycles per-4096B 150 cycles
  alloc_pages order:2(16384B/x4)   328 cycles per-4096B 82 cycles
  alloc_pages order:3(32768B/x8)   357 cycles per-4096B 44 cycles
  alloc_pages order:4(65536B/x16)  516 cycles per-4096B 32 cycles
  alloc_pages order:5(131072B/x32) 801 cycles per-4096B 25 cycles
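
The per-4096B column is just the measured cost divided by the number of
pages in the allocation (2^order). A small sketch that recomputes it
from the numbers above, assuming the column was truncated to whole
cycles:

```python
PAGE_SIZE = 4096

# Measured alloc_pages cost in cycles, keyed by order (from the table above).
measured_cycles = {0: 246, 1: 300, 2: 328, 3: 357, 4: 516, 5: 801}


def cycles_per_page(order, cycles):
    """Cost per 4096-byte page for one order-N allocation (truncated)."""
    pages = 1 << order  # an order-N allocation covers 2^N pages
    return cycles // pages


for order, cycles in measured_cycles.items():
    size = PAGE_SIZE << order
    print(f"order:{order} ({size}B) {cycles} cycles, "
          f"per-4096B {cycles_per_page(order, cycles)} cycles")
```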

I looked back at my MM-presentations [2016][2017], and noticed that in
[2017] I reported that Mel had improved the order-0 page cost to 143
cycles in kernel 4.11-rc1.  According to the above measurements, the
kernel has regressed in performance since then.


[2016] https://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
[2017] https://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf

The merging could actually be beneficial since there may be more partial
slabs to allocate from and thus avoiding expensive calls to the page
allocator.

What might be more effective is allocating larger order slabs.  I see
that kmalloc-256 allocates a pair of pages and manages 32 objects within
that pair.  It should perform better in Jesper's scenario if it allocated
4 pages and managed 64 objects per slab.

Simplest way to test that should be booting a kernel with
'slub_min_order=2'.  Does that help matters at all, Jesper?  You could
also try slub_min_order=3.  Going above that starts to get a bit sketchy.
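
As a rough illustration of the numbers above: the object count per slab
scales with the slab order. This sketch assumes 4096-byte pages and
ignores any per-object metadata or debug padding, so a real SLUB cache
may hold fewer objects than this.

```python
def objects_per_slab(order, object_size=256, page_size=4096):
    """Upper bound on objects in one slab of 2^order pages.

    Assumption: no per-object metadata, red zones or alignment padding.
    """
    slab_bytes = page_size << order
    return slab_bytes // object_size


for order in (1, 2, 3):
    pages = 1 << order
    print(f"order {order}: {pages} pages, "
          f"{objects_per_slab(order)} objects of 256 bytes")
```

Order 1 corresponds to the pair of pages with 32 objects mentioned
above, and order 2 to 4 pages with 64 objects.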

I have tried this slub_min_order trick before, and it did help.  I've
not tested it recently.

--Jesper
