Re: [PATCH] MIPS: Cleanup R10000_LLSC_WAR logic in atomic.h
From: Maciej W. Rozycki <hidden>
Date: 2017-11-14 17:53:30
On Mon, 6 Nov 2017, Joshua Kinard wrote:
quoted
quoted
@@ -563,17 +464,17 @@ static __inline__ long atomic64_sub_if_positive(long i, atomic64_t * v) smp_mb__before_llsc(); - if (kernel_uses_llsc && R10000_LLSC_WAR) { + if (kernel_uses_llsc) { long temp; __asm__ __volatile__( - " .set arch=r4000 \n" + " .set "MIPS_ISA_LEVEL" \n" "1: lld %1, %2 # atomic64_sub_if_positive\n" " dsubu %0, %1, %3 \n" " bltz %0, 1f \n" " scd %0, %2 \n" " .set noreorder \n" - " beqzl %0, 1b \n" + "\t" __scbeqz " %0, 1b \n" " dsubu %0, %1, %3 \n" " .set reorder \n" "1: \n"This is obviously a preexisting bug, which has only been made more obvious with your merge of this code, but this can't be right, because the final DSUBU instruction which calculates `result', later returned, in %0 will not be executed with BEQZL in the fall-through case, however it will be in the BEQZ case.Oh, I see. Because the subu/dsubu hides in the branch-delay slot, and thus, in a branch-likely case, gets skipped?
AFAICT the only reason for the addition in the delay slot is to replay the operation made on the value fetched with LLD, recalculating the result of the subtraction lost due to SCD reusing the same register for its result. And it only matters for the fall-through case, because otherwise the subtraction result is immediately overwritten by the value fetched by LLD on the next iteration. Actually I overlooked how SCD handles its source again and made a bug in the replacement code proposed, sorry. See below for an update.
quoted
It may make sense to split the temporary set by SCD from the result and then the final DSUBU won't be needed at all as the result will have already been calculated by the first DSUBU. In fact ISTM %1 can simply be used, i.e.: __asm__ __volatile__( " .set "MIPS_ISA_LEVEL" \n" "1: lld %1, %2 # atomic64_sub_if_positive\n" " dsubu %0, %1, %3 \n" " bltz %0, 1f \n" " scd %1, %2 \n" "\t" __scbeqz " %1, 1b \n" "1: \n" " .set mips0 \n" : "=&r" (result), "=&r" (temp), "+" GCC_OFF_SMALL_ASM() (v->counter) : "Ir" (i)); making this short and sweet, also avoiding the nasty `noreorder' piece and therefore making it one place less to concern about with porting Linux to the microMIPSr6 ISA. I haven't gone through the remaining updates; it's just this piece that has caught my eye as I glanced over your change.I rolled these changes (and James' earlier suggestion to use the non-memory barrier bottom bits) into a v2 patch that I just sent in (also for the non-64bit case using subu). I haven't tested the changes on my Octane yet, but am assuming it'll work. Saving an instruction or two on a heavily-executed bit of code will likely yield some small benefits.
So we do have to make the result of DSUBU available to SCD and I propose to make an obvious update to the piece of code previously posted, which still does not require `noreorder' hacks: __asm__ __volatile__( " .set "MIPS_ISA_LEVEL" \n" "1: lld %1, %2 # atomic64_sub_if_positive\n" " dsubu %0, %1, %3 \n" " move %1, %0 \n" " bltz %0, 1f \n" " scd %1, %2 \n" "\t" __scbeqz " %1, 1b \n" "1: \n" " .set mips0 \n" : "=&r" (result), "=&r" (temp), "+" GCC_OFF_SMALL_ASM() (v->counter) : "Ir" (i)); and uses the same instruction count, in terms of both the code size and the number of instructions actually executed, because previously BLTZ would have its delay slot filled with a NOP and now the MOVE instruction will go there. Maciej