Re: Efficient memcpy()/memmove() for G2/G3 cores...

From: Paul Mackerras <hidden>
Date: 2008-09-04 02:04:58

prodyut hazarika writes:

glibc memxxx for powerpc are horribly inefficient. For optimal performance,
we should use the dcbt instruction to establish the source address in cache, and
dcbz to establish the destination address in cache. We should do
dcbt and dcbz such that the touches happen a line ahead of the actual copy.

The problem which I see is that dcbt and dcbz instructions don't work on
non-cacheable memory (obviously!). But memxxx functions are used for both
cached and non-cached memory. Thus this optimized memcpy should be smart enough
to figure out that both source and destination addresses fall in
cacheable space, and only then
use the optimized dcbt/dcbz instructions.

I would be careful about adding overhead to memcpy.  I found that in
the kernel, almost all calls to memcpy are for less than 128 bytes (1
cache line on most 64-bit machines).  So, adding a lot of code to
detect cacheability and do prefetching is just going to slow down the
common case, which is short copies.  I don't have statistics for glibc
but I wouldn't be surprised if most copies were short there also.
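
For anyone who wants to gather the same kind of statistics in their own code, here is a minimal sketch of a counting wrapper. The names counted_memcpy, size_bucket and copy_hist are hypothetical and exist only for this illustration; this is not how the kernel numbers above were collected.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical instrumentation, for illustration only. */
#define NBUCKETS 8

/* copy_hist[b] counts the calls whose length fell into bucket b. */
unsigned long copy_hist[NBUCKETS];

/*
 * Map a length to a power-of-two bucket:
 *   0: 0..15   1: 16..31   2: 32..63   3: 64..127
 *   4: 128..255   5: 256..511   6: 512..1023   7: 1024 and up
 */
unsigned size_bucket(size_t n)
{
    unsigned b = 0;

    n >>= 4;
    while (n != 0 && b < NBUCKETS - 1) {
        n >>= 1;
        b++;
    }
    return b;
}

/* Record the length, then hand the copy to the real memcpy. */
void *counted_memcpy(void *dst, const void *src, size_t n)
{
    copy_hist[size_bucket(n)]++;
    return memcpy(dst, src, n);
}
```

Buckets 0 to 3 together cover copies shorter than 128 bytes, so comparing their sum with the total shows how much of a workload is short copies.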

The other thing that I have found is that code that is optimal for
cache-cold copies is usually significantly slower than optimal when
the data is cache-hot, because the cache management instructions
consume cycles and don't help in the cache-hot case.

In other words, I don't think we should be tuning the glibc memcpy
based on tests of how fast it copies multiple megabytes.

Still, for 6xx/e300 cores, we probably do want to use dcbt/dcbz for
larger copies.  We don't want to use dcbt/dcbz on the larger 64-bit
processors (POWER4/5/6) because the hardware prefetching and
write-combining mean that dcbt/dcbz don't help and just slow things
down.
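
To make the shape of that trade-off concrete, here is a rough sketch of a copy routine that keeps the short path free of cache management and only prefetches for larger copies. It is an illustration under stated assumptions, not the glibc or kernel implementation: sketch_memcpy, CACHE_LINE and BIG_COPY are hypothetical names, the 32-byte line size and 256-byte threshold are guesses that would need measuring, and __builtin_prefetch is the GCC/Clang builtin (which is typically compiled to dcbt on PowerPC). dcbz is left out on purpose, since it is only safe on cacheable, line-aligned destinations and the sketch does no cacheability check.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_LINE 32   /* assumption: 32-byte cache lines (6xx/e300 class) */
#define BIG_COPY   256  /* hypothetical threshold, would need tuning */

void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Common case: a short copy pays no cache management overhead. */
    if (n < BIG_COPY) {
        while (n--)
            *d++ = *s++;
        return dst;
    }

    /*
     * Large copy: touch the source one line ahead of the line being
     * copied. A prefetch is only a hint and does not fault. A real
     * 6xx/e300 version might also dcbz the destination line here,
     * but only after establishing that it is cacheable and aligned.
     */
    while (n >= CACHE_LINE) {
        __builtin_prefetch(s + CACHE_LINE, 0, 0);
        memcpy(d, s, CACHE_LINE);
        d += CACHE_LINE;
        s += CACHE_LINE;
        n -= CACHE_LINE;
    }

    /* Copy the tail that is shorter than one cache line. */
    while (n--)
        *d++ = *s++;
    return dst;
}
```

On POWER4/5/6 the same routine would simply skip the prefetch branch, for the reasons given above.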

Paul.
