Thread (8 messages) 8 messages, 3 authors, 2017-11-17

Re: [PATCH] MIPS: Cleanup R10000_LLSC_WAR logic in atomic.h

From: Maciej W. Rozycki <hidden>
Date: 2017-11-14 17:53:30

On Mon, 6 Nov 2017, Joshua Kinard wrote:
quoted
quoted
@@ -563,17 +464,17 @@ static __inline__ long atomic64_sub_if_positive(long i, atomic64_t * v)

     smp_mb__before_llsc();

-    if (kernel_uses_llsc && R10000_LLSC_WAR) {
+    if (kernel_uses_llsc) {
             long temp;

             __asm__ __volatile__(
-            "       .set    arch=r4000                              \n"
+            "       .set    "MIPS_ISA_LEVEL"                        \n"
             "1:     lld     %1, %2          # atomic64_sub_if_positive\n"
             "       dsubu   %0, %1, %3                              \n"
             "       bltz    %0, 1f                                  \n"
             "       scd     %0, %2                                  \n"
             "       .set    noreorder                               \n"
-            "       beqzl   %0, 1b                                  \n"
+            "\t" __scbeqz " %0, 1b                                  \n"
             "        dsubu  %0, %1, %3                              \n"
             "       .set    reorder                                 \n"
             "1:                                                     \n"
 This is obviously a preexisting bug, which has only been made more
obvious with your merge of this code, but this can't be right, because the
final DSUBU instruction which calculates `result', later returned, in %0
will not be executed with BEQZL in the fall-through case, however it will
be in the BEQZ case.
Oh, I see.  Because the subu/dsubu hides in the branch-delay slot, and thus, in
a branch-likely case, gets skipped?
 AFAICT the only reason for the addition in the delay slot is to replay 
the operation made on the value fetched with LLD, recalculating the result 
of the subtraction lost due to SCD reusing the same register for its 
result.  And it only matters for the fall-through case, because otherwise 
the subtraction result is immediately overwritten by the value fetched by 
LLD on the next iteration.

 Actually I overlooked how SCD handles its source again and made a bug in 
the replacement code proposed, sorry.  See below for an update.
quoted
 It may make sense to split the temporary set by SCD from the result and
then the final DSUBU won't be needed at all as the result will have
already been calculated by the first DSUBU.  In fact ISTM %1 can simply be
used, i.e.:

              __asm__ __volatile__(
              "       .set    "MIPS_ISA_LEVEL"                        \n"
              "1:     lld     %1, %2          # atomic64_sub_if_positive\n"
              "       dsubu   %0, %1, %3                              \n"
              "       bltz    %0, 1f                                  \n"
              "       scd     %1, %2                                  \n"
              "\t" __scbeqz " %1, 1b                                  \n"
              "1:                                                     \n"
              "       .set    mips0                                   \n"
              : "=&r" (result), "=&r" (temp),
                "+" GCC_OFF_SMALL_ASM() (v->counter)
              : "Ir" (i));

making this short and sweet, also avoiding the nasty `noreorder' piece and
therefore making it one place less to concern about with porting Linux to
the microMIPSr6 ISA.

 I haven't gone through the remaining updates; it's just this piece that
has caught my eye as I glanced over your change.
I rolled these changes (and James' earlier suggestion to use the non-memory
barrier bottom bits) into a v2 patch that I just sent in (also for the
non-64bit case using subu).  I haven't tested the changes on my Octane yet, but
am assuming it'll work.  Saving an instruction or two on a heavily-executed bit
of code will likely yield some small benefits.
 So we do have to make the result of DSUBU available to SCD and I propose 
to make an obvious update to the piece of code previously posted, which 
still does not require `noreorder' hacks:

		__asm__ __volatile__(
		"	.set	"MIPS_ISA_LEVEL"			\n"
		"1:	lld	%1, %2		# atomic64_sub_if_positive\n"
		"	dsubu	%0, %1, %3				\n"
		"	move	%1, %0					\n"
		"	bltz	%0, 1f					\n"
		"	scd	%1, %2					\n"
		"\t" __scbeqz "	%1, 1b					\n"
		"1:							\n"
		"	.set	mips0					\n"
		: "=&r" (result), "=&r" (temp),
		  "+" GCC_OFF_SMALL_ASM() (v->counter)
		: "Ir" (i));

and uses the same instruction count, in terms of both the code size and 
the number of instructions actually executed, because previously BLTZ 
would have its delay slot filled with a NOP and now the MOVE instruction 
will go there.

  Maciej
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help