Re: [dpdk-dev] [RFC] mempool: implement index-based per core cache
From: Morten Brørup <hidden>
Date: 2021-11-08 16:03:59
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Honnappa Nagarahalli
Sent: Monday, 8 November 2021 16.46

<snip>

Current mempool per core cache implementation is based on pointers. For most architectures, each pointer consumes 64b. Replace it with an index-based implementation, wherein each buffer is addressed by (pool address + index).

I like Dharmik's suggestion very much. CPU cache is a critical and limited resource. DPDK has a tendency of using pointers where indexes could be used instead. I suppose pointers provide the additional flexibility of mixing entries from different memory pools, e.g. multiple mbuf pools.

Agreed, thank you!

I don't think it is going to work: On 64-bit systems the difference between the pool address and its elem address could be bigger than 4GB.

Are you talking about a case where the memory pool size is more than 4GB?

That is one possible scenario.

That could be solved by making the index an element index instead of a pointer offset: address = (pool address + index * element size).

Or instead of scaling the index with the element size, which is only known at runtime, the index could be more efficiently scaled by a compile time constant such as RTE_MEMPOOL_ALIGN (= RTE_CACHE_LINE_SIZE). With a cache line size of 64 byte, that would allow indexing into mempools up to 256 GB in size.

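A minimal sketch of the cache-line-scaled index described above, assuming a 32-bit index and a 64-byte cache line. The helper names (obj_to_index, index_to_obj, max_addressable_bytes) and the CACHE_LINE_SIZE macro are hypothetical and not part of the DPDK mempool API; the point is only to show the address arithmetic and where the 256 GB figure comes from (2^32 indexes * 64 bytes).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as in the example in the text. */
#define CACHE_LINE_SIZE 64u

/* Hypothetical helper: convert an object address into a 32-bit index
 * counted in cache lines from the start of the pool memory. */
static inline uint32_t
obj_to_index(const void *pool_base, const void *obj)
{
	uintptr_t off = (uintptr_t)obj - (uintptr_t)pool_base;

	/* Only works if every object starts on a cache-line boundary
	 * relative to the pool base and fits in the 32-bit index range. */
	assert(off % CACHE_LINE_SIZE == 0);
	assert(off / CACHE_LINE_SIZE <= UINT32_MAX);
	return (uint32_t)(off / CACHE_LINE_SIZE);
}

/* Hypothetical helper: address = pool address + index * cache line size. */
static inline void *
index_to_obj(void *pool_base, uint32_t idx)
{
	return (char *)pool_base + (uintptr_t)idx * CACHE_LINE_SIZE;
}

/* Number of bytes reachable with a 32-bit index at this granularity. */
static inline uint64_t
max_addressable_bytes(void)
{
	return ((uint64_t)UINT32_MAX + 1) * CACHE_LINE_SIZE;
}
```

Because the multiplier is a power-of-two constant, a compiler can turn the multiply and divide into shifts, which is the efficiency argument made above for a compile time constant over a runtime element size.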
Looking at this snippet [1] from rte_mempool_op_populate_helper(), there is an ‘offset’ added to keep objects from crossing page boundaries. If my understanding is correct, using the index of the element instead of a pointer offset will pose a challenge for some of the corner cases.

[1]
    for (i = 0; i < max_objs; i++) {
            /* avoid objects to cross page boundaries */
            if (check_obj_bounds(va + off, pg_sz, total_elt_sz) < 0) {
                    off += RTE_PTR_ALIGN_CEIL(va + off, pg_sz) - (va + off);
                    if (flags & RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ)
                            off += total_elt_sz -
                                    (((uintptr_t)(va + off - 1) %
                                            total_elt_sz) + 1);
            }

OK. Alternatively to scaling the index with a cache line size, you can scale it with sizeof(uintptr_t) to be able to address 32 or 16 GB mempools on respectively 64 bit and 32 bit architectures. Both x86 and ARM CPUs have instructions to access memory with an added offset multiplied by 4 or 8. So that should be high performance.

Yes, agreed this can be done. Cache line size can also be used when ‘MEMPOOL_F_NO_CACHE_ALIGN’ is not enabled. On a side note, I wanted to better understand the need for having the 'MEMPOOL_F_NO_CACHE_ALIGN' option.

The description of this field is misleading, and should be corrected. The correct description would be: Don't need to align objs on cache lines. It is useful for mempools containing very small objects, to conserve memory.

I think we can assume that mbuf pools are created with the 'MEMPOOL_F_NO_CACHE_ALIGN' flag set. With this we can use offset calculated with cache line size as the unit.

You mean MEMPOOL_F_NO_CACHE_ALIGN flag not set. ;-)

Yes 😊

I agree. And since the flag is a hint only, it can be ignored if the mempool library is scaling the index with the cache line size.

I do not think we should ignore the flag for reason you mention below.

However, a mempool may contain other objects than mbufs, and those objects may be small, so ignoring the MEMPOOL_F_NO_CACHE_ALIGN flag may cost a lot of memory for such mempools.

We could use different methods. If MEMPOOL_F_NO_CACHE_ALIGN is set, use the unit as 'sizeof(uintptr_t)', if not set use cache line size as the unit.

That would require that the indexing multiplier is a runtime parameter instead of a compile time parameter. So it would have a performance penalty. The indexing multiplier could be compile time configurable, so it is a tradeoff between granularity and maximum mempool size.

I meant compile time configurable. i.e.

    #ifdef MEMPOOL_F_NO_CACHE_ALIGN
    <use sizeof(uintptr_t) as the multiplier>
    #else
    <use cache line size as the multiplier> /* This should provide enough memory for packet buffers */
    #endif

Please note that MEMPOOL_F_NO_CACHE_ALIGN is a runtime flag passed when creating a mempool, not a compile time option.
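
To illustrate what a runtime flag implies for the indexing multiplier, here is a rough sketch, not a proposal for the actual implementation. Everything in it is hypothetical: struct idx_pool, the idx_pool_* helpers, and the POOL_F_NO_CACHE_ALIGN stand-in with its value are assumptions made for the example and are not taken from the DPDK headers. The shift is chosen once at pool creation and stored in the pool, so each conversion has to load it instead of using an immediate constant.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins: these names and values are assumptions for
 * illustration and are not taken from the DPDK headers. */
#define POOL_F_NO_CACHE_ALIGN 0x1u
#define CACHE_LINE_SHIFT 6u                               /* log2(64) */
#define UINTPTR_SHIFT (sizeof(uintptr_t) == 8 ? 3u : 2u)  /* log2(sizeof(uintptr_t)) */

struct idx_pool {
	char *base;          /* start of the pool's object memory */
	uint32_t idx_shift;  /* index granularity, fixed at creation time */
};

/* The flag is only known when the pool is created, so the unit is
 * selected here, at runtime, rather than by the preprocessor. */
static inline void
idx_pool_init(struct idx_pool *p, void *base, unsigned int flags)
{
	p->base = base;
	p->idx_shift = (flags & POOL_F_NO_CACHE_ALIGN) ?
			UINTPTR_SHIFT : CACHE_LINE_SHIFT;
}

static inline uint32_t
idx_pool_index(const struct idx_pool *p, const void *obj)
{
	uintptr_t off = (uintptr_t)((const char *)obj - p->base);

	/* The object must be aligned to the pool's index unit. */
	assert((off & (((uintptr_t)1 << p->idx_shift) - 1)) == 0);
	return (uint32_t)(off >> p->idx_shift);
}

static inline void *
idx_pool_obj(const struct idx_pool *p, uint32_t idx)
{
	return p->base + ((uintptr_t)idx << p->idx_shift);
}
```

The extra load of idx_shift on every get and put is the kind of cost referred to above as the performance penalty of a runtime multiplier; whether it is measurable would have to be benchmarked.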