Re: [dpdk-dev] [RFC] mempool: implement index-based per core cache

From: Morten Brørup <hidden>
Date: 2021-11-08 16:03:59

From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Honnappa Nagarahalli
Sent: Monday, 8 November 2021 16.46

<snip>
Current mempool per core cache implementation is based on pointer.
For most architectures, each pointer consumes 64 bits. Replace it with
index-based implementation, wherein each buffer is addressed by
(pool address + index).

I like Dharmik's suggestion very much. CPU cache is a critical and
limited resource.

DPDK has a tendency of using pointers where indexes could be used
instead. I suppose pointers provide the additional flexibility of
mixing entries from different memory pools, e.g. multiple mbuf pools.

Agreed, thank you!

I don't think it is going to work:
On 64-bit systems the difference between the pool address and its elem
address could be bigger than 4GB.

Are you talking about a case where the memory pool size is more than
4GB?

That is one possible scenario.

That could be solved by making the index an element index instead of a
pointer offset: address = (pool address + index * element size).

Or instead of scaling the index with the element size, which is only
known at runtime, the index could be more efficiently scaled by a
compile time constant such as RTE_MEMPOOL_ALIGN (=
RTE_CACHE_LINE_SIZE). With a cache line size of 64 bytes, that would
allow indexing into mempools up to 256 GB in size.
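To make the scaling idea above concrete, here is a minimal C sketch of converting between object pointers and 32-bit indexes when the index is scaled by a compile time constant. The names (ex_obj_to_idx, ex_idx_to_obj, ex_idx_max_span, EX_IDX_UNIT_SHIFT) are made up for illustration and are not part of the mempool API, and a 64-byte cache line is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, so one index unit is 1 << 6 bytes. */
#define EX_IDX_UNIT_SHIFT 6u
#define EX_IDX_UNIT       (1u << EX_IDX_UNIT_SHIFT)

/* Hypothetical helper: object address -> 32-bit index relative to the pool base. */
static inline uint32_t
ex_obj_to_idx(const void *pool_base, const void *obj)
{
	uintptr_t off = (uintptr_t)obj - (uintptr_t)pool_base;

	/* The object must start on a unit boundary, or the index loses bits. */
	assert((off & (EX_IDX_UNIT - 1)) == 0);
	/* The scaled offset must fit in 32 bits. */
	assert((off >> EX_IDX_UNIT_SHIFT) <= UINT32_MAX);
	return (uint32_t)(off >> EX_IDX_UNIT_SHIFT);
}

/* Hypothetical helper: 32-bit index -> object address. */
static inline void *
ex_idx_to_obj(void *pool_base, uint32_t idx)
{
	return (char *)pool_base + ((uintptr_t)idx << EX_IDX_UNIT_SHIFT);
}

/* Size of the range a 32-bit index can cover: 2^32 units of 64 bytes. */
static inline uint64_t
ex_idx_max_span(void)
{
	return (UINT64_C(1) << 32) << EX_IDX_UNIT_SHIFT;
}
```

With a shift of 6, index 2 maps to pool base + 128 bytes, and ex_idx_max_span() works out to 2^32 * 64 bytes, which is the 256 GB figure mentioned above. Because the shift is a constant, the conversion is a single shift and add.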
Looking at this snippet [1] from rte_mempool_op_populate_helper(),
there is an 'offset' added to keep objects from crossing page
boundaries. If my understanding is correct, using the index of the
element instead of a pointer offset will pose a challenge for some of
the corner cases.

[1]
       for (i = 0; i < max_objs; i++) {
               /* avoid objects to cross page boundaries */
               if (check_obj_bounds(va + off, pg_sz, total_elt_sz) < 0) {
                       off += RTE_PTR_ALIGN_CEIL(va + off, pg_sz) - (va + off);
                       if (flags & RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ)
                               off += total_elt_sz -
                                       (((uintptr_t)(va + off - 1) %
                                               total_elt_sz) + 1);
               }
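The corner case raised here can be illustrated with a simplified model of the resulting layout. This is only a sketch, not the DPDK code: it assumes the pool base is page aligned, the element is no larger than a page, and RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ is not set. The function name ex_model_obj_offset is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified model: byte offset of object i when objects are packed
 * back to back but never allowed to straddle a page boundary.
 * Assumes a page-aligned base and 0 < elt_sz <= pg_sz.
 */
static inline size_t
ex_model_obj_offset(size_t i, size_t pg_sz, size_t elt_sz)
{
	size_t per_page;

	assert(elt_sz != 0 && elt_sz <= pg_sz);
	per_page = pg_sz / elt_sz;	/* whole objects that fit in one page */

	/* Skip whole pages, then step by element size inside the page. */
	return (i / per_page) * pg_sz + (i % per_page) * elt_sz;
}
```

With 4096-byte pages and 1536-byte elements, object 2 lands at offset 4096 rather than 2 * 1536 = 3072, so (pool address + index * element size) would point at the wrong place. In this model the offset is still a multiple of 64 as long as the element size is, so an index scaled by a fixed unit is not affected by the padding.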
OK. As an alternative to scaling the index with a cache line size, you
can scale it with sizeof(uintptr_t) to be able to address 32 GB or
16 GB mempools on 64 bit and 32 bit architectures respectively. Both
x86 and ARM CPUs have instructions to access memory with an added
offset multiplied by 4 or 8. So that should be high performance.
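As a rough illustration of the sizeof(uintptr_t) variant, the sketch below scales a 32-bit index by the pointer size. The helper names are hypothetical and nothing here is taken from the actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: index scaled by sizeof(uintptr_t), i.e. 8 or 4 bytes. */
static inline void *
ex_idx_to_obj_ptrscaled(void *pool_base, uint32_t idx)
{
	assert(pool_base != NULL);
	/* The multiplier is a compile time constant, so this can typically be
	 * folded into a scaled-index addressing mode. */
	return (char *)pool_base + (size_t)idx * sizeof(uintptr_t);
}

/* Size of the range a 32-bit index can cover with this multiplier. */
static inline uint64_t
ex_ptrscaled_max_span(void)
{
	return (UINT64_C(1) << 32) * sizeof(uintptr_t);
}
```

On a 64 bit build this gives 2^32 * 8 bytes = 32 GB, and on a 32 bit build 2^32 * 4 bytes = 16 GB, matching the figures above. The finer granularity comes at the cost of a smaller maximum mempool size compared with cache line scaling.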

Yes, agreed this can be done.
Cache line size can also be used when 'MEMPOOL_F_NO_CACHE_ALIGN' is
not enabled.
On a side note, I wanted to better understand the need for having the
'MEMPOOL_F_NO_CACHE_ALIGN' option.

The description of this field is misleading, and should be corrected.
The correct description would be: Don't need to align objs on cache
lines.
It is useful for mempools containing very small objects, to conserve
memory.

I think we can assume that mbuf pools are created with the
'MEMPOOL_F_NO_CACHE_ALIGN' flag set. With this we can use offset
calculated with cache line size as the unit.

You mean MEMPOOL_F_NO_CACHE_ALIGN flag not set. ;-)

Yes 😊

I agree. And since the flag is a hint only, it can be ignored if the
mempool library is scaling the index with the cache line size.

I do not think we should ignore the flag for the reason you mention
below.

However, a mempool may contain other objects than mbufs, and those
objects may be small, so ignoring the MEMPOOL_F_NO_CACHE_ALIGN flag
may cost a lot of memory for such mempools.

We could use different methods. If MEMPOOL_F_NO_CACHE_ALIGN is set,
use the unit as 'sizeof(uintptr_t)', if not set use cache line size as
the unit.

That would require that the indexing multiplier is a runtime parameter
instead of a compile time parameter. So it would have a performance
penalty.

The indexing multiplier could be compile time configurable, so it is
a tradeoff between granularity and maximum mempool size.
I meant compile time configurable. i.e.

#ifdef MEMPOOL_F_NO_CACHE_ALIGN
<use sizeof(uintptr_t) as the multiplier>
#else
<use cache line size as the multiplier> /* This should provide enough
memory for packet buffers */
#endif

Please note that MEMPOOL_F_NO_CACHE_ALIGN is a runtime flag passed when creating a mempool, not a compile time option.
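Since the flag is only known when a mempool is created, choosing the multiplier from it means storing the choice per pool and reading it on every conversion. The sketch below shows what that runtime variant might look like. It is an assumption for illustration only: struct ex_pool, ex_pool_init, ex_idx_to_obj_rt and the EX_ macros are hypothetical, and EX_F_NO_CACHE_ALIGN is a local stand-in for MEMPOOL_F_NO_CACHE_ALIGN with an illustrative value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_F_NO_CACHE_ALIGN 0x0002u	/* stand-in flag, illustrative value */
#define EX_CACHE_LINE_SHIFT 6u		/* assumes 64-byte cache lines */

/* Hypothetical pool descriptor carrying a per-pool index shift. */
struct ex_pool {
	char *base;
	uint32_t idx_shift;
};

/* Pick the index unit once, at pool creation, from the runtime flag. */
static inline void
ex_pool_init(struct ex_pool *p, void *base, unsigned int flags)
{
	assert(p != NULL && base != NULL);
	p->base = base;
	if (flags & EX_F_NO_CACHE_ALIGN)
		p->idx_shift = (sizeof(uintptr_t) == 8) ? 3u : 2u;
	else
		p->idx_shift = EX_CACHE_LINE_SHIFT;
}

/* The shift is loaded from the pool, not known to the compiler. */
static inline void *
ex_idx_to_obj_rt(const struct ex_pool *p, uint32_t idx)
{
	return p->base + ((uintptr_t)idx << p->idx_shift);
}
```

Compared with a compile time constant, every conversion here needs the extra load of idx_shift and a variable shift, which is the kind of runtime cost discussed earlier in the thread.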
