Re: [Cbe-oss-dev] [RFC 1/3] powerpc: __copy_tofrom_user tweaked for Cell

From: Paul Mackerras <hidden>
Date: 2008-06-21 04:30:02

Arnd Bergmann writes:

On Friday 20 June 2008, Paul Mackerras wrote:

Transferring data over loopback is possibly an exception to that.
However, it's very rare to transfer large amounts of data over
loopback, unless you're running a benchmark like iperf or netperf. :-/

Well, it is the exact case that came up in a real world scenario
for cell: On a network intensive application where the SPUs are
supposed to do all the work, we ended up not getting enough
data in and out through gbit ethernet because the PPU spent

			  ^^^^^^^^^^^^^
Which isn't loopback... :)

I have no objection to improving copy_tofrom_user, memcpy and
copy_page.  I just want to make sure that we don't make things worse
on some platform.

In fact, Mark and I dug up some experiments I had done 5 or 6 years
ago and just ran through all the copy loops I tried back then, on
QS22, POWER6, POWER5+, POWER5, POWER4, 970, and POWER3, and compared
them to the current kernel routines and the proposed new Cell
routines.  So far we have just looked at the copy_page case (i.e. 4kB
on a 4kB alignment) for cache-cold and cache-hot cases.
Interestingly, some of the routines I discarded back then turn out to
do really well on most of the modern platforms, and quite a lot better
on Cell than Gunnar's code does (~10GB/s vs. ~5.5GB/s in the hot-cache
case, IIRC).  Mark is going to summarise the results and also measure
the speed for smaller copies and misaligned copies.
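
For readers who want to try a comparison of this kind, here is a minimal userspace sketch of a hot-cache timing loop for one 4kB-aligned page. It is not the harness used for the numbers above. The names page_copy_fn, copy_page_memcpy, align_to_page and bench_hot_page_copy are made up for illustration, the only candidate routine shown is the C library's memcpy, and clock() measures CPU time at coarse resolution. The cold-cache case would need the buffers evicted between iterations, which is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAGE_BYTES 4096u

/* Hypothetical signature shared by every page-copy routine under test. */
typedef void (*page_copy_fn)(void *dst, const void *src);

/* Baseline candidate: whatever the C library's memcpy does. */
static void copy_page_memcpy(void *dst, const void *src)
{
    memcpy(dst, src, PAGE_BYTES);
}

/* Round p up to the next 4kB boundary. */
static unsigned char *align_to_page(unsigned char *p)
{
    uintptr_t a = ((uintptr_t)p + PAGE_BYTES - 1) & ~(uintptr_t)(PAGE_BYTES - 1);
    return (unsigned char *)a;
}

/*
 * Time `iters` back-to-back copies of one 4kB-aligned page, so that source
 * and destination stay cache-hot. Returns bytes per second of CPU time, or
 * 0.0 if allocation failed or the run was too short for clock() to resolve.
 */
static double bench_hot_page_copy(page_copy_fn fn, unsigned long iters)
{
    unsigned char *raw = malloc(3 * PAGE_BYTES); /* room for 2 aligned pages */
    unsigned char *src, *dst;
    volatile unsigned char sink = 0;
    unsigned long i;
    clock_t start, end;
    double secs;

    if (raw == NULL)
        return 0.0;
    src = align_to_page(raw);
    dst = src + PAGE_BYTES;
    memset(src, 0xA5, PAGE_BYTES);
    fn(dst, src);                     /* warm both pages before timing */

    start = clock();
    for (i = 0; i < iters; i++) {
        fn(dst, src);
        sink ^= dst[i % PAGE_BYTES];  /* consume the result of each copy */
    }
    end = clock();

    assert(memcmp(dst, src, PAGE_BYTES) == 0);
    free(raw);

    secs = (double)(end - start) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return 0.0;
    return (double)iters * PAGE_BYTES / secs;
}
```

Other candidate loops would be plugged in as additional page_copy_fn implementations and run through the same function on each machine.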

As for the distribution of sizes, I think it would be worthwhile to
run a fresh set of tests.  As I said, my previous results showed most
copies to be either small (<= 128B) or a multiple of 4k, and I think
that was true for copy_tofrom_user as well as memcpy, but that was a
while ago.
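
A fresh set of tests could bucket each observed copy length roughly as follows. This is only a sketch of the three classes mentioned above (small, multiple of 4k, everything else). The names classify_copy, copy_hist and copy_hist_record are hypothetical, and how the lengths get collected from a running kernel is left open.

```c
#include <assert.h>
#include <stddef.h>

#define SMALL_COPY_MAX 128u   /* "small" threshold taken from the text */
#define PAGE_BYTES     4096u

enum copy_class {
    COPY_SMALL,          /* len <= 128 bytes */
    COPY_PAGE_MULTIPLE,  /* larger, and a multiple of 4kB */
    COPY_OTHER,          /* everything else */
    COPY_NCLASSES
};

/* Map one copy length to its size class. */
static enum copy_class classify_copy(size_t len)
{
    if (len <= SMALL_COPY_MAX)
        return COPY_SMALL;
    if (len % PAGE_BYTES == 0)
        return COPY_PAGE_MULTIPLE;
    return COPY_OTHER;
}

/* Running count of copies seen in each class. */
struct copy_hist {
    unsigned long count[COPY_NCLASSES];
};

/* Record one observed copy length in the histogram. */
static void copy_hist_record(struct copy_hist *h, size_t len)
{
    assert(h != NULL);
    h->count[classify_copy(len)]++;
}
```

If COPY_OTHER turns out to hold a significant share of the bytes copied, that would argue for caring about large misaligned copies as well.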

much of its time in copy_to_user.

Going to 10gbit will make the problem even more apparent.

Is this application really transferring bulk data and using buffers
that aren't a multiple of the page size?  Do you know whether the
copies ended up being misaligned?

Of course, if we really want the fastest copy possible, the thing to
do is to use VMX loads and stores on 970, POWER6 and Cell.  The
overhead of setting up to use VMX in the kernel would probably kill
any advantage, though -- at least, that's what I found when I tried
using VMX for copy_page in the kernel on 970 a few years ago.

Doing some static compile-time analysis, I found that most
of the call sites (which are not necessarily most of
the run time calls) pass either a small constant size of
less than a few cache lines, or have a variable size but are
not at all performance critical.

Since the prefetching and cache line size awareness was
most of the improvement for cell (AFAIU), maybe we can
annotate the few interesting cases, say by introducing a
new copy_from_user_large() function that can be easily
optimized for large transfers on a given CPU, while
the remaining code keeps optimizing for small transfers
and may even get rid of the full page copy optimization
in order to save a branch.

Let's see what Mark comes up with.  We may be able to find a way to do
it that works well across all current CPUs and also is OK for small
copies.  If not we might need to do what you suggest.
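
To make the suggested split concrete, here is a rough userspace sketch of the two paths as separate functions, with the choice made at the call site rather than by a branch on the length. It is not kernel code and not the proposed patch. The names sketch_copy_small and sketch_copy_large, the 128-byte line size (what Cell's PPE uses) and the prefetch distance are all assumptions for illustration. __builtin_prefetch is the GCC/Clang builtin, and there is no fault handling of the kind copy_from_user needs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_CACHE_LINE     128u  /* assumed line size (Cell PPE) */
#define SKETCH_PREFETCH_LINES 4u    /* hypothetical prefetch distance */
#define SKETCH_PREFETCH_BYTES (SKETCH_PREFETCH_LINES * SKETCH_CACHE_LINE)

/* Small-copy path: no setup cost, no prefetch, no alignment checks. */
static void sketch_copy_small(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
}

/* Large-copy path: one cache line per iteration, prefetching ahead. */
static void sketch_copy_large(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    while (n >= SKETCH_CACHE_LINE) {
        /* Only prefetch addresses that are still inside the source. */
        if (n > SKETCH_PREFETCH_BYTES)
            __builtin_prefetch(s + SKETCH_PREFETCH_BYTES, 0, 0);
        memcpy(d, s, SKETCH_CACHE_LINE);
        d += SKETCH_CACHE_LINE;
        s += SKETCH_CACHE_LINE;
        n -= SKETCH_CACHE_LINE;
    }
    sketch_copy_small(d, s, n);     /* tail shorter than one line */
}
```

The few bulk-transfer call sites would call the large variant explicitly, and everything else would keep the small one.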

Regards,
Paul.
