Re: [PATCH] MIPS: Cleanup R10000_LLSC_WAR logic in atomic.h

From: Maciej W. Rozycki <hidden>
Date: 2017-11-17 00:32:32

On Tue, 14 Nov 2017, Joshua Kinard wrote:

quoted

 So we do have to make the result of DSUBU available to SCD and I propose 
to make an obvious update to the piece of code previously posted, which 
still does not require `noreorder' hacks:

		__asm__ __volatile__(
		"	.set	"MIPS_ISA_LEVEL"			\n"
		"1:	lld	%1, %2		# atomic64_sub_if_positive\n"
		"	dsubu	%0, %1, %3				\n"
		"	move	%1, %0					\n"
		"	bltz	%0, 1f					\n"
		"	scd	%1, %2					\n"
		"\t" __scbeqz "	%1, 1b					\n"
		"1:							\n"
		"	.set	mips0					\n"
		: "=&r" (result), "=&r" (temp),
		  "+" GCC_OFF_SMALL_ASM() (v->counter)
		: "Ir" (i));

and uses the same instruction count, in terms of both the code size and 
the number of instructions actually executed, because previously BLTZ 
would have its delay slot filled with a NOP and now the MOVE instruction 
will go there.

I thought the delay slot for branch instructions came after the instruction?
Shouldn't the "move" go after "bltz", or is this a case where "bltz" has its
delay slot before?

In terms of the program flow you want the move operation ahead of the
branch.

Of course we have the architectural peculiarity of branch delay slots,
but in the MIPS assembly language they have been intended to be generally
hidden from the programmer. This is why the original MIPSCO assembler
scheduled delay slots, which included branch delay slots, but also load
delay slots, coprocessor transfer delay slots, etc., where the given
architecture level had them, so that the programmer could write code as if
there were no delay slots and let the tool optimise it.

And GAS normally does that as well, where possible by swapping a branch
with the preceding instruction, or otherwise, i.e. where there is a data
dependency or the instruction is not allowed in a delay slot for some
reason, by inserting a NOP.

GAS also has a way to disable delay slot scheduling, with the `.set
noreorder' pseudo-op (with `.set reorder' undoing the effect). In this
mode of operation, sometimes called the `noreorder' mode, GAS assembles
source code as it is and it is the programmer's responsibility to satisfy
various pipeline requirements, including branch delay slots in particular,
i.e. an instruction has to be put there explicitly, even if it is a NOP.

This mode will typically be used by a compiler for its assembly output.
This mode is not recommended to use in handcoded assembly unless
specifically required for a good reason. One such reason is optimising
code where there is a data dependency between a branch and its delay-slot
instruction, e.g.:

.set noreorder
bnez $2, 0b
addiu $2, -1
.set reorder

which avoids clobbering a temporary register and is smaller/quicker then a
functional equivalent that does not schedule the delay slot manually like:

move $3, $2
addiu $2, -1
bnez $3, 0b

(in this case GAS will move the ADDIU instruction into the delay slot of
the BNEZ instruction, because there is no data dependency between the
two instructions). And of course if you just write:

addiu $2, -1
bnez $2, 0b

then the semantics of this sequence will be different from the first one
above, because the addition precedes rather than following the branch and
$2 is a data dependency between the two instructions. This data
dependency will also make GAS put a NOP into the delay slot of the BNEZ
instruction instead of moving ADDIU there.

So in the piece I proposed at the top GAS will move the MOVE instruction
into the delay slot of BLTZ -- that is unless it is in a dumb mode where
it always uses NOPs, as recently discovered in the discussion here:
<https://www.linux-mips.org/cgi-bin/mesg.cgi?a=linux-mips&i=alpine.DEB.2.00.1711151425100.3893%40tp.orcam.me.uk>.
But that's just been a bug in our build system, which will hopefully be
fixed soon; we ought to assume smart scheduling normally (delay-slot
scheduling makes debugging trickier sometimes, in which case using the
dumb mode can help).

And I am safe in assuming the same change also applies for the non-64-bit case
earlier that uses sc and subu?  The changes to the rest of the code also look
good?  Don't want to miss any other quirky bugs :)

 Yeah, `atomic_sub_if_positive' suffers from the same problem, and the 
rest looks good to me; also none of the other pieces of code uses the 
`noreorder' mode.

  Maciej

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help