Re: ppc44x - how do i optimize driver for tlb hits | linuxppc-dev

From: Josh Boyer <hidden>
Date: 2010-09-24 10:30:41

On Fri, Sep 24, 2010 at 02:43:52PM +1000, Benjamin Herrenschmidt wrote:

quoted

The DMA is what I use in the "real world case" to get data into and out 
of these buffers.  However, I can disable the DMA completely and do only
the kmalloc.  In this case I still see the same poor performance.  My
prefetching is part of my algo, using the dcbt instruction.  I know the
prefetches are effective because without them the algo is much slower.
So yes, my prefetches are explicit.

Could be some "effect" of the cache structure, L2 cache, cache geometry
(number of ways etc...). You might be able to alleviate that by changing
the "stride" of your prefetch.

Unfortunately, I'm not familiar enough with the 440 microarchitecture
and its caches to be able to help you much here.
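
For illustration only, here is a minimal sketch of what "changing the
stride" of an explicit prefetch could look like. It is not the original
poster's code. The function name and the 32-byte line size are
assumptions made for the example, and __builtin_prefetch is the GCC/Clang
builtin, which on PowerPC targets is normally lowered to dcbt.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes; check the core manual for your part. */
#define ASSUMED_LINE_SIZE 32u

/*
 * Hypothetical example: sum a buffer while prefetching `stride_bytes`
 * ahead of the current read position, once per cache line.
 * Varying `stride_bytes` is the tuning knob being discussed.
 */
uint64_t sum_with_prefetch(const uint8_t *buf, size_t len, size_t stride_bytes)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < len; i++) {
        /* Issue one prefetch per line, and never past the end of buf. */
        if ((i % ASSUMED_LINE_SIZE) == 0 && i + stride_bytes < len)
            __builtin_prefetch(buf + i + stride_bytes);
        sum += buf[i];
    }
    return sum;
}
```

Timing this loop over the same buffer with several stride values (for
example one, two, and four lines ahead) would show whether the prefetch
distance interacts badly with the cache geometry. The result of the sum
does not depend on the stride, only the timing does.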

Also, doesn't kmalloc have a limit to the size of the request it will
let you allocate?  I know in the distant past you could allocate 128K
with kmalloc, and 2M with an explicit call to get_free_pages.  Anything
larger than that had to use vmalloc.  The limit might indeed be higher
now, but a 4MB kmalloc buffer sounds very large, given that it would be
contiguous pages.  Two of them even less so.
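
To put those sizes in perspective, physically contiguous allocations are
handed out in power-of-two blocks of pages ("orders"). The following is a
rough userspace sketch of that arithmetic, assuming 4 KiB pages (44x
kernels can be built with other page sizes). The helper name is made up
for the example; it only mirrors the idea behind the kernel's get_order().

```c
#include <stddef.h>

/* Assumption for this sketch: 4 KiB pages. */
#define ASSUMED_PAGE_SIZE 4096UL

/*
 * Hypothetical helper: smallest order such that
 * (ASSUMED_PAGE_SIZE << order) >= size.
 */
unsigned int order_for_size(size_t size)
{
    unsigned int order = 0;
    size_t span = ASSUMED_PAGE_SIZE;

    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}
```

Under that assumption 128K is order 5, 2M is order 9, and 4M is order 10,
i.e. 1024 contiguous pages. Whether a request of that order is allowed at
all depends on the maximum order the kernel was configured with, and
whether it succeeds at runtime depends on memory fragmentation.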

quoted

Ok, I will give that a try ... in addition, is there an easy way to use
any sort of gprof-like tool to see the system performance?  What about
looking at the 44x performance counters in some meaningful way?  All
the experiments point to the fetching being slower in the full program
as opposed to the algo in a testbench, so I want to determine what it is
that could cause that.

Does it have any useful performance counters? I didn't think it did but
I may be mistaken.

No, it doesn't.

josh
