Thread: 3 messages, 3 authors, 2004-12-13

Re: Write Combining on PowerPC

From: Lawrence E. Bakst <hidden>
Date: 2004-12-13 08:53:31

At 1:23 PM -0800 12/10/04, Kendall Bennett wrote:
> Hi Guys,
>
> We are working on some PowerPC machines and noticed that the boxes don't
> appear to support the equivalent of Write Combining that we get on x86
> boxes. Copies to Video Memory on our Motorola Sandpoint box run at about
> 10 MB/s, which is terribly, terribly slow!
>
> Does anyone know if it is possible to do something similar to Write
> Combining for the PowerPC architecture, to speed up CPU access to the
> linear framebuffer? Part of the problem is that for video overlay support
> (not motion compensation) you have to dump the entire YUV frame into
> video memory for the hardware overlay, and even on a 1GHz PPC box playing
> an MPEG2 stream is not possible as X takes up over 80% of the CPU just to
> copy the YUV data to video memory!

1. As a previous poster mentioned, many PPCs have write combining, but they usually call it store gathering. I was just reading about it in the IBM 970FX documentation.

2. What you need are cache line reads or writes through your bridge to the video memory.

3. If your frame buffer is marked non-cacheable, which is usually the case, see if you can set up a second aperture that is cached. Otherwise I don't think the store gathering will work. I don't know your board or processor, but you should experiment with cache modes to see which, if any, works best.

4. Assuming you can get a cacheable aperture, remember that when you write a complete image to frame buffer memory you waste 50% of your bandwidth reading cache lines from the frame buffer into your cache, only to overwrite them. You can use dcbz to zero a cache line in the cache without reading it, and then write it. This should double your bandwidth to about 20 MB/sec.

5. How good is your copy loop? If you have floating point registers, you can often use them to increase your efficiency. There may be other ways to make the copy loop more efficient using processor-specific instructions that generate more efficient memory loads and stores. Try loop unrolling. Also make sure you prefetch the source using dcbt or a similar instruction. You have to experiment to see how far ahead of its use you need to prefetch the data.

6. Use small test programs to get it right.

7. You don't mention your processor type/speed, bus speeds and memory speed so it's pretty hard to tell what efficiency you might be able to achieve.

8. I make no comment about the efficiency of X. It's not what I would use for video applications, although I am sure there are those who have hacked it to work there.
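
To make points 4 and 5 concrete, here is a rough sketch of a line-at-a-time copy loop in C. It is only a starting point for the small test programs in point 6, not tuned code for any particular board. The names copy_lines, LINE_BYTES, PREFETCH_LINES and USE_DCBZ are made up for this example. The 32-byte line size is an assumption (the 970 uses 128-byte lines), and the prefetch distance of 4 lines is a guess you would have to tune by measurement. The dcbz path is off unless you define USE_DCBZ yourself, and it should only be enabled when the destination is mapped cacheable and LINE_BYTES matches the real line size. __builtin_prefetch is the GCC/Clang builtin, which the compiler can turn into dcbt on PowerPC.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 32-byte cache lines; check your CPU manual (the 970 uses 128). */
#define LINE_BYTES 32
/* Assumption: how many lines ahead to prefetch; tune this by experiment. */
#define PREFETCH_LINES 4

/* Hypothetical opt-in switch. Define USE_DCBZ only on PowerPC, only if the
 * destination is mapped cacheable, and only if LINE_BYTES is the true line
 * size; otherwise dcbz can fault or zero more memory than this loop expects. */
#if defined(USE_DCBZ) && defined(__powerpc__)
#define ZERO_LINE(p) __asm__ volatile("dcbz 0,%0" : : "r"(p) : "memory")
#else
#define ZERO_LINE(p) ((void)0)
#endif

/* Copy n bytes from src to dst one cache line at a time.
 * dst must be LINE_BYTES-aligned and n must be a multiple of LINE_BYTES.
 * Returns 0 on success, -1 if those preconditions are not met. */
static int copy_lines(void *dst, const void *src, size_t n)
{
    if (((uintptr_t)dst % LINE_BYTES) != 0 || (n % LINE_BYTES) != 0)
        return -1;

    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t off = 0; off < n; off += LINE_BYTES) {
        /* Hint a source line a few lines ahead, staying inside the buffer. */
        size_t ahead = off + (size_t)PREFETCH_LINES * LINE_BYTES;
        if (ahead < n)
            __builtin_prefetch(s + ahead, 0, 0);

        /* With USE_DCBZ: establish the destination line in the cache as
         * zeros so it is not read from the frame buffer first. */
        ZERO_LINE(d + off);

        /* Fixed-size copy of one line; the compiler normally expands this
         * into a short unrolled run of loads and stores. */
        memcpy(d + off, s + off, LINE_BYTES);
    }
    return 0;
}
```

Time it over a full frame with and without USE_DCBZ, and with a few values of PREFETCH_LINES, before believing any of the numbers above. Whether hand-written 8-byte floating point loads and stores beat what the compiler generates for the memcpy is also something to measure rather than assume.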

Best,

leb

> Obviously bus mastering will help solve this problem, but it would be
> better if there was a way to enable faster CPU access to the
> framebuffer as well.
>
> Regards,
>
> ---
> Kendall Bennett
> Chief Executive Officer
> SciTech Software, Inc.
> Phone: (530) 894 8400
> http://www.scitechsoft.com
>
> ~ SciTech SNAP - The future of device driver technology! ~


_______________________________________________
Linuxppc-embedded mailing list
Linuxppc-embedded@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-embedded