Re: ppc44x - how do i optimize driver for tlb hits
From: Josh Boyer <hidden>
Date: 2010-09-24 10:30:41
On Fri, Sep 24, 2010 at 02:43:52PM +1000, Benjamin Herrenschmidt wrote:
> > The DMA is what I use in the "real world case" to get data into and out of these buffers. However, I can disable the DMA completely and do only the kmalloc. In this case I still see the same poor performance. My prefetching is part of my algo using the dcbt instructions. I know the instructions are effective b/c without them the algo is much less performant. So yes, my prefetches are explicit.
>
> Could be some "effect" of the cache structure, L2 cache, cache geometry (number of ways etc...). You might be able to alleviate that by changing the "stride" of your prefetch. Unfortunately, I'm not familiar enough with the 440 micro architecture and its caches to be able to help you much here.
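For experimenting with the prefetch "stride", something along these lines may be a useful starting point. It is only a sketch: it reads "stride" as how many cache lines ahead of the current read the prefetch is issued, it assumes a 32-byte cache line (check the manual for your part), and the function name sum_with_prefetch is made up for illustration. __builtin_prefetch is the GCC builtin, which on PowerPC is normally emitted as dcbt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte cache lines. Adjust for the actual core. */
#define CACHE_LINE_BYTES 32u

/*
 * Sum a buffer one cache line at a time, issuing a read prefetch
 * `lines_ahead` cache lines in front of the line being consumed.
 * Varying `lines_ahead` is the knob to experiment with.
 */
uint32_t sum_with_prefetch(const uint8_t *buf, size_t len, size_t lines_ahead)
{
    uint32_t sum = 0;
    size_t ahead = lines_ahead * CACHE_LINE_BYTES;

    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i += CACHE_LINE_BYTES) {
        /* Only prefetch addresses that are still inside the buffer. */
        if (i + ahead < len)
            __builtin_prefetch(buf + i + ahead, 0, 0);

        /* Consume the current line (the last one may be partial). */
        size_t end = (i + CACHE_LINE_BYTES < len) ? i + CACHE_LINE_BYTES : len;
        for (size_t j = i; j < end; j++)
            sum += buf[j];
    }
    return sum;
}
```

Timing this over the real buffers with lines_ahead set to, say, 1, 2, 4 and 8 should show whether the prefetch distance matters in the full program. The result is the same for every value; only the timing changes.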
Also, doesn't kmalloc have a limit to the size of the request it will let you allocate? I know in the distant past you could allocate 128K with kmalloc, and 2M with an explicit call to get_free_pages. Anything larger than that had to use vmalloc. The limit might well be higher now, but a 4MB kmalloc buffer sounds very large, given that it would need physically contiguous pages. Getting two of them seems even less likely.
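To put rough numbers on that, here is a small userspace sketch of the page-order arithmetic. The values are assumptions, not something read from your kernel config: 4K pages (PAGE_SHIFT of 12) and a MAX_ORDER of 11, so the largest contiguous block is order 10. The helper names are made up; order_for_size only mimics the kind of calculation the kernel's get_order() does.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for illustration only; check the target kernel's config. */
#define PAGE_SHIFT_ASSUMED 12u  /* 4 KiB pages */
#define MAX_ORDER_ASSUMED  11u  /* valid orders are 0 .. 10 */

/* Smallest order such that (page size << order) >= size. */
unsigned int order_for_size(size_t size)
{
    size_t page_size = (size_t)1 << PAGE_SHIFT_ASSUMED;
    size_t pages;
    unsigned int order = 0;

    assert(size > 0);

    /* Round the request up to whole pages. */
    pages = (size + page_size - 1) >> PAGE_SHIFT_ASSUMED;

    /* Round the page count up to the next power of two. */
    while (((size_t)1 << order) < pages)
        order++;
    return order;
}

/* Nonzero if a contiguous request of this size is even possible. */
int fits_max_order(size_t size)
{
    return order_for_size(size) < MAX_ORDER_ASSUMED;
}
```

Under those assumptions 128K is order 5, 2M is order 9, and 4M is order 10, which is the very top of what the page allocator can hand out. Anything over 4M would not fit at all, and an order-10 request can easily fail once memory is fragmented.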
> > Ok, I will give that a try ... in addition, is there an easy way to use any sort of gprof like tool to see the system performance? What about looking at the 44x performance counters in some meaningful way? All the experiments point to the fetching being slower in the full program as opposed to the algo in a testbench, so I want to determine what it is that could cause that.
>
> Does it have any useful performance counters? I didn't think it did but I may be mistaken.
No, it doesn't.

josh