Re: [RFC PATCH 0/7] powerpc/64s/radix TLB flush performance improvements
From: Nicholas Piggin <npiggin@gmail.com>
Date: 2017-11-02 03:28:26
On Thu, 2 Nov 2017 08:49:49 +0530 Anshuman Khandual [off-list ref] wrote:
> On 11/01/2017 07:09 PM, Nicholas Piggin wrote:
>> On Wed, 1 Nov 2017 17:35:51 +0530 Anshuman Khandual [off-list ref] wrote:
>>> On 10/31/2017 12:14 PM, Nicholas Piggin wrote:
>>>> Here's a random mix of performance improvements for radix TLB flushing code. The main aims are to reduce the amount of translation that gets invalidated, and to reduce global flushes where we can do local.
>>>>
>>>> To that end, a parallel kernel compile benchmark using powerpc:tlbie tracepoint shows a reduction in tlbie instructions from about 290,000 to 80,000, and a reduction in tlbiel instructions from 49,500,000 to 15,000,000.
>>>>
>>>> Looks great, but unfortunately does not translate to a statistically significant performance improvement! The needle on TLB misses does not move much, I suspect because a lot of the flushing is done at startup and shutdown, and because a significant cost of TLB flushing itself is in the barriers.
>>>
>>> Does a memory barrier initiate a single global invalidation with tlbie?
>>
>> I'm not quite sure what you're asking, and I don't know the details of how the hardware handles it, but from the measurements in patch 1 of the series we can see there is a benefit for both tlbie and tlbiel of batching them up between barriers.
>
> Ahh, I might have got the statement "a significant cost of TLB flushing itself is in the barriers" wrong. I guess you were referring to the total cost of multiple TLB flushes with memory barriers in between each of them, which is what causes the high execution cost. This got reduced by packing multiple tlbie(l) instructions between a single pair of memory barriers.
Yes, that did get reduced for the va range flush in my patches. However, the big reduction in the number of tlbiel calls came from more use of range flushes and less use of PID flushes, and the PID flushes already had this optimization. So despite the tlbiel count going down, the number of barriers probably has not gone down a great deal on this workload, which may explain why the performance numbers are basically in the noise.

Thanks,
Nick
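
The accounting in the reply above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the kernel's code: the names `FlushCost`, `pid_flush`, `range_flush_batched` and `range_flush_unbatched` are hypothetical, and the figures used (128 tlbiel instructions per PID flush, 4 pages per range flush, 1000 flush events) are illustrative values, not measurements from the series. It assumes that a PID flush and a batched range flush each issue all of their tlbiel instructions inside one barrier sequence.

```python
from dataclasses import dataclass


@dataclass
class FlushCost:
    """Counts for a toy cost model; not real hardware accounting."""
    tlbiel: int = 0    # invalidate instructions issued
    barriers: int = 0  # barrier sequences wrapped around them

    def __add__(self, other):
        return FlushCost(self.tlbiel + other.tlbiel,
                         self.barriers + other.barriers)


def pid_flush(tlbiel_per_flush):
    # Assumption: a PID flush issues a fixed number of tlbiel
    # instructions, all inside a single barrier sequence.
    return FlushCost(tlbiel=tlbiel_per_flush, barriers=1)


def range_flush_batched(pages):
    # One tlbiel per page, batched inside a single barrier sequence.
    return FlushCost(tlbiel=pages, barriers=1)


def range_flush_unbatched(pages):
    # One tlbiel per page, each with its own barrier sequence.
    return FlushCost(tlbiel=pages, barriers=pages)


def total(costs):
    return sum(costs, FlushCost())


# Illustrative workload: 1000 flush events, each touching 4 pages.
events = 1000
as_pid = total(pid_flush(128) for _ in range(events))
as_range = total(range_flush_batched(4) for _ in range(events))

print(as_pid)    # FlushCost(tlbiel=128000, barriers=1000)
print(as_range)  # FlushCost(tlbiel=4000, barriers=1000)
```

In this model, replacing PID flushes with batched range flushes cuts the tlbiel count sharply while leaving the barrier count unchanged, which is the pattern the reply suggests for this workload. Only the unbatched variant has a barrier count that scales with the number of pages.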