Re: [PATCH RFC] mm+net: allow to set kmem_cache create flag for SLAB_NEVER_MERGE
From: Jesper Dangaard Brouer <hidden>
Date: 2023-01-19 18:09:07
Also in:
linux-mm
On 18/01/2023 06.17, Matthew Wilcox wrote:
> On Tue, Jan 17, 2023 at 03:54:34PM +0100, Christoph Lameter wrote:
> > On Tue, 17 Jan 2023, Jesper Dangaard Brouer wrote:
> > > When running different network performance microbenchmarks, I started to notice that performance was reduced (slightly) when machines had longer uptimes. I believe the cause was 'skbuff_head_cache' got aliased/merged into the general slub for 256 bytes sized objects (with my kernel config, without CONFIG_HARDENED_USERCOPY).
> >
> > Well that is a common effect that we see in multiple subsystems. This is due to general memory fragmentation. Depending on the prior load the performance could actually be better after some runtime if the caches are populated avoiding the page allocator etc.
>
> The page allocator isn't _that_ expensive. I could see updating several slabs being more expensive than allocating a new page.
For 10Gbit/s wirespeed with small frames I have a budget of 201 cycles per packet.

I prefer to measure things, so let's see what a page alloc costs, and also relate that to the cost per 4096 bytes.

 alloc_pages order:0(4096B/x1)     246 cycles  per-4096B 246 cycles
 alloc_pages order:1(8192B/x2)     300 cycles  per-4096B 150 cycles
 alloc_pages order:2(16384B/x4)    328 cycles  per-4096B  82 cycles
 alloc_pages order:3(32768B/x8)    357 cycles  per-4096B  44 cycles
 alloc_pages order:4(65536B/x16)   516 cycles  per-4096B  32 cycles
 alloc_pages order:5(131072B/x32)  801 cycles  per-4096B  25 cycles

I looked back at my MM presentations [2016][2017], and noticed that in [2017] I reported that Mel had improved the order-0 page cost to 143 cycles in kernel 4.11-rc1. According to the above measurements, the kernel has regressed in performance.

[2016] https://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
[2017] https://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
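
The arithmetic behind these numbers can be reproduced with a small sketch. The cycle counts are copied from the measurements above; the 84-byte on-wire size of a minimum Ethernet frame and the 3 GHz CPU clock are assumptions that are not stated in this mail, chosen because they reproduce the 201-cycle figure. The function names are made up for illustration.

```python
# Measured alloc_pages cost in cycles, keyed by page order (from the table above).
ALLOC_CYCLES = {0: 246, 1: 300, 2: 328, 3: 357, 4: 516, 5: 801}

# Assumptions, not stated in the mail:
WIRE_BYTES = 84              # 64B minimum frame + 20B preamble/inter-frame gap
LINK_BPS = 10_000_000_000    # 10 Gbit/s
CPU_HZ = 3_000_000_000       # 3 GHz CPU clock


def cycles_per_packet(wire_bytes=WIRE_BYTES, link_bps=LINK_BPS, cpu_hz=CPU_HZ):
    """CPU cycles available per packet at wirespeed (rounded down)."""
    return wire_bytes * 8 * cpu_hz // link_bps


def cycles_per_4k(order, cycles):
    """Cost of one allocation spread over the 2**order pages it returns."""
    return cycles // (1 << order)


if __name__ == "__main__":
    print("budget per packet:", cycles_per_packet(), "cycles")
    for order, cycles in sorted(ALLOC_CYCLES.items()):
        print(f"order:{order} {cycles} cycles, "
              f"per-4096B {cycles_per_4k(order, cycles)} cycles")
```

Under these assumptions a single order-0 allocation (246 cycles) already exceeds the per-packet budget, which is why amortizing the cost over higher-order allocations matters here.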
> > The merging could actually be beneficial since there may be more partial slabs to allocate from and thus avoiding expensive calls to the page allocator.
>
> What might be more effective is allocating larger order slabs. I see that kmalloc-256 allocates a pair of pages and manages 32 objects within that pair. It should perform better in Jesper's scenario if it allocated 4 pages and managed 64 objects per slab.
>
> Simplest way to test that should be booting a kernel with 'slub_min_order=2'. Does that help matters at all, Jesper? You could also try slub_min_order=3. Going above that starts to get a bit sketchy.
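
The object counts in the suggestion above follow from simple division. As a rough sketch, assuming 4096-byte pages and ignoring any per-object metadata or debug padding that a real slab may carry (the helper name is hypothetical):

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def objects_per_slab(order, object_size=256, page_size=PAGE_SIZE):
    """Objects that fit in one slab of 2**order pages, ignoring metadata."""
    return (page_size << order) // object_size


if __name__ == "__main__":
    # order 1 is the "pair of pages" case; 2 and 3 match slub_min_order=2 and 3.
    for order in (1, 2, 3):
        print(f"order {order}: {1 << order} pages, "
              f"{objects_per_slab(order)} objects of 256 bytes")
```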
I have tried this slub_min_order trick before, and it did help. I have not tested it recently.

--Jesper