Re: [PATCH] MIPS: Cleanup R10000_LLSC_WAR logic in atomic.h
From: Maciej W. Rozycki <hidden>
Date: 2017-11-17 00:32:32
On Tue, 14 Nov 2017, Joshua Kinard wrote:
quoted
So we do have to make the result of DSUBU available to SCD and I propose to make an obvious update to the piece of code previously posted, which still does not require `noreorder' hacks: __asm__ __volatile__( " .set "MIPS_ISA_LEVEL" \n" "1: lld %1, %2 # atomic64_sub_if_positive\n" " dsubu %0, %1, %3 \n" " move %1, %0 \n" " bltz %0, 1f \n" " scd %1, %2 \n" "\t" __scbeqz " %1, 1b \n" "1: \n" " .set mips0 \n" : "=&r" (result), "=&r" (temp), "+" GCC_OFF_SMALL_ASM() (v->counter) : "Ir" (i)); and uses the same instruction count, in terms of both the code size and the number of instructions actually executed, because previously BLTZ would have its delay slot filled with a NOP and now the MOVE instruction will go there.I thought the delay slot for branch instructions came after the instruction? Shouldn't the "move" go after "bltz", or is this a case where "bltz" has its delay slot before?
In terms of the program flow you want the move operation ahead of the branch. Of course we have the architectural peculiarity of branch delay slots, but in the MIPS assembly language they have been intended to be generally hidden from the programmer. This is why the original MIPSCO assembler scheduled delay slots, which included branch delay slots, but also load delay slots, coprocessor transfer delay slots, etc., where the given architecture level had them, so that the programmer could write code as if there were no delay slots and let the tool optimise it. And GAS normally does that as well, where possible by swapping a branch with the preceding instruction, or otherwise, i.e. where there is a data dependency or the instruction is not allowed in a delay slot for some reason, by inserting a NOP. GAS also has a way to disable delay slot scheduling, with the `.set noreorder' pseudo-op (with `.set reorder' undoing the effect). In this mode of operation, sometimes called the `noreorder' mode, GAS assembles source code as it is and it is the programmer's responsibility to satisfy various pipeline requirements, including branch delay slots in particular, i.e. an instruction has to be put there explicitly, even if it is a NOP. This mode will typically be used by a compiler for its assembly output. This mode is not recommended to use in handcoded assembly unless specifically required for a good reason. One such reason is optimising code where there is a data dependency between a branch and its delay-slot instruction, e.g.: .set noreorder bnez $2, 0b addiu $2, -1 .set reorder which avoids clobbering a temporary register and is smaller/quicker then a functional equivalent that does not schedule the delay slot manually like: move $3, $2 addiu $2, -1 bnez $3, 0b (in this case GAS will move the ADDIU instruction into the delay slot of the BNEZ instruction, because there is no data dependency between the two instructions). And of course if you just write: addiu $2, -1 bnez $2, 0b then the semantics of this sequence will be different from the first one above, because the addition precedes rather than following the branch and $2 is a data dependency between the two instructions. This data dependency will also make GAS put a NOP into the delay slot of the BNEZ instruction instead of moving ADDIU there. So in the piece I proposed at the top GAS will move the MOVE instruction into the delay slot of BLTZ -- that is unless it is in a dumb mode where it always uses NOPs, as recently discovered in the discussion here: <https://www.linux-mips.org/cgi-bin/mesg.cgi?a=linux-mips&i=alpine.DEB.2.00.1711151425100.3893%40tp.orcam.me.uk>. But that's just been a bug in our build system, which will hopefully be fixed soon; we ought to assume smart scheduling normally (delay-slot scheduling makes debugging trickier sometimes, in which case using the dumb mode can help).
And I am safe in assuming the same change also applies for the non-64-bit case earlier that uses sc and subu? The changes to the rest of the code also look good? Don't want to miss any other quirky bugs :)
Yeah, `atomic_sub_if_positive' suffers from the same problem, and the rest looks good to me; also none of the other pieces of code uses the `noreorder' mode. Maciej