Re: powerpc Linux scv support and scv system call ABI proposal

From: Adhemerval Zanella <hidden>
Date: 2020-01-28 17:28:28


On 28/01/2020 11:05, Nicholas Piggin wrote:

Florian Weimer's on January 28, 2020 11:09 pm:

quoted

* Nicholas Piggin:

quoted

* Proposal is for PPC_FEATURE2_SCV to indicate 'scv 0' support, all other
  vectors will return -ENOSYS, and the decision for how to add support for
  a new vector deferred until we see the next user.

Seems reasonable.  We don't have to decide this today.

quoted

* Proposal is for scv 0 to provide the standard Linux system call ABI with some
  differences:

- LR is volatile across scv calls. This is necessary for support because the
  scv instruction clobbers LR.

I think we can express this in the glibc system call assembler wrapper
generators.  The mcount profiling wrappers already have this property.

But I don't think we are so lucky for the inline system calls.  GCC
recognizes an "lr" clobber with inline asm (even though it is not
documented), but it generates rather strange assembler output as a
result:

long
f (long x)
{
  long y;
  asm ("#" : "=r" (y) : "r" (x) : "lr");
  return y;
}

	.abiversion 2
	.section	".text"
	.align 2
	.p2align 4,,15
	.globl f
	.type	f, @function
f:
.LFB0:
	.cfi_startproc
	mflr 0
	.cfi_register 65, 0
#APP
 # 5 "t.c" 1
	#
 # 0 "" 2
#NO_APP
	std 0,16(1)
	.cfi_offset 65, 16
	ori 2,2,0
	ld 0,16(1)
	mtlr 0
	.cfi_restore 65
	blr
	.long 0
	.byte 0,0,0,1,0,0,0,0
	.cfi_endproc
.LFE0:
	.size	f,.-f


That's with GCC 8.3 at -O2.  I don't understand what the ori is about.

ori 2,2,0 is the group terminating nop hint for POWER8 type cores
which had dispatch grouping rules.

It is worth noting that it aims to mitigate a load-hit-store CPU stall
on some powerpc chips.

quoted

I don't think we can save LR in a regular register around the system
call, explicitly in the inline asm statement, because we still have to
generate proper unwinding information using CFI directives, something
that you cannot do from within the asm statement.

Supporting this in GCC should not be impossible, but someone who
actually knows this stuff needs to look at it.

The generated assembler actually seems okay to me. If we compile
something like a syscall with -mcpu=power9:

long
f (long _r3, long _r4, long _r5, long _r6, long _r7, long _r8, long _r0)
{
  register long r0 asm ("r0") = _r0;
  register long r3 asm ("r3") = _r3;
  register long r4 asm ("r4") = _r4;
  register long r5 asm ("r5") = _r5;
  register long r6 asm ("r6") = _r6;
  register long r7 asm ("r7") = _r7;
  register long r8 asm ("r8") = _r8;

  asm ("# scv" : "=r"(r3) : "r"(r0), "r"(r4), "r"(r5), "r"(r6), "r"(r7), "r"(r8) : "lr", "ctr", "cc", "xer");

  return r3;
}


f:
.LFB0:
        .cfi_startproc
        mflr 0
        std 0,16(1)
        .cfi_offset 65, 16
        mr 0,9
#APP
 # 12 "a.c" 1
        # scv
 # 0 "" 2
#NO_APP
        ld 0,16(1)
        mtlr 0
        .cfi_restore 65
        blr
        .long 0
        .byte 0,0,0,1,0,0,0,0
        .cfi_endproc

That gets the LR save/restore right when we're also using r0.

quoted

- CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the
  system call exit to avoid restoring the CR register.

This sounds reasonable, but I don't know what kind of knock-on effects
this has.  The inline system call wrappers can handle this with minor
tweaks.

Okay, good. In the end we would have to check code trace through the
kernel and libc of course, but I think there's little to no opportunity
to take advantage of current extra non-volatile cr regs.

mtcr has to write 8 independently renamed registers so it's cracked into
2 insns on POWER9 (and likely to always be a bit troublesome). It's not
much in the scheme of a system call, but while we can tweak the ABI...

We don't really need a mfcr/mfocr to implement the Linux syscall ABI on
powerpc; we can use a 'bns+' plus a neg instead, as in:

--
#define internal_syscall6(name, err, nr, arg1, arg2, arg3, arg4, arg5,  \
                          arg6)                                         \
  ({                                                                    \
    register long int r0  __asm__ ("r0") = (long int) (name);           \
    register long int r3  __asm__ ("r3") = (long int) (arg1);           \
    register long int r4  __asm__ ("r4") = (long int) (arg2);           \
    register long int r5  __asm__ ("r5") = (long int) (arg3);           \
    register long int r6  __asm__ ("r6") = (long int) (arg4);           \
    register long int r7  __asm__ ("r7") = (long int) (arg5);           \
    register long int r8  __asm__ ("r8") = (long int) (arg6);           \
    __asm__ __volatile__                                                \
      ("sc\n\t"                                                         \
       "bns+ 1f\n\t"                                                    \
       "neg %1, %1\n\t"                                                 \
       "1:\n\t"                                                         \
       : "+r" (r0), "+r" (r3), "+r" (r4), "+r" (r5), "+r" (r6),         \
         "+r" (r7), "+r" (r8)                                           \
       :                                                                \
       : "r9", "r10", "r11", "r12",                                     \
         "cr0", "memory");                                              \
    r3;                                                                 \
  })
--

And change INTERNAL_SYSCALL_ERROR_P to check for the expected invalid
range (((unsigned long) (val) >= (unsigned long) -4095)) and 
INTERNAL_SYSCALL_ERRNO to return a negative value (since the value will
be negated by INTERNAL_SYSCALL_ERROR_P).

Because the powerpc kernel ABI uses a different convention to signal
errors, glibc also has to reimplement the vDSO symbol call in an
arch-specific way instead of as a straight function call (since it might
fall back to a syscall).

Even for a POWER-specific system call that uses all result bits, either
it should not fail or it would require an arch-specific implementation
to set up the expected error value (since the information would require
another source or a pre-defined value).

In fact I think we can make the assumption that INTERNAL_SYSCALL returns
a negative errno value in case of an error, and make all the handling
that checks for a syscall failure and sets errno generic. This will
require changing ia64, mips, nios2, and sparc though.
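
A minimal sketch of what that generic handling could look like, assuming
every port hands back the raw kernel value with failures encoded as -errno.
The helper names (syscall_error_p, syscall_errno, syscall_ret) are
hypothetical and are not the actual glibc macros; the range check is the
one quoted above.

```c
#include <errno.h>

/* Hypothetical helper: a raw result is an error when it falls in
   [-4095, -1], i.e. in the top 4095 values of the unsigned range.  */
static int
syscall_error_p (long val)
{
  return (unsigned long) val >= (unsigned long) -4095L;
}

/* Hypothetical helper: recover the positive errno from a failing
   raw result by negating it.  */
static int
syscall_errno (long val)
{
  return (int) -val;
}

/* Hypothetical generic tail shared by all ports: set errno and
   return -1 on failure, otherwise pass the result through.  */
static long
syscall_ret (long val)
{
  if (syscall_error_p (val))
    {
      errno = syscall_errno (val);
      return -1;
    }
  return val;
}
```

With this shape, only the instruction sequence that produces the raw value
stays arch-specific; the failure check and the errno setting do not.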

quoted

- Error handling: use of CR0[SO] to indicate error requires a mtcr / mtocr
  instruction on the kernel side, and it is currently not implemented well
  in glibc, requiring a mfcr (mfocr should be possible and asm goto support
  would allow a better implementation). Is it worth continuing this style of
  error handling? Or just move to -ve return means error? Using a different
  bit would allow the kernel to piggy back the CR return code setting with
  a test for the error case exit.

GCC does not model the condition registers, so for inline system calls,
we have to produce a value anyway that the subsequent C code can check.
The assembler syscall wrappers do not need to do this, of course, but
I'm not sure which category of interfaces is more important.

Right. asm goto can improve this kind of pattern if it's inlined
into the C code which tests the result, it can branch using the flags
to the C error handling label, rather than move flags into GPR, test
GPR, branch. However...

quoted

But the kernel uses the -errno convention internally, so I think it
would make sense to pass this to userspace and not convert back and
forth.  This would align with what most of the architectures do, and
also avoids the GCC oddity.

Yes I would be interested in opinions for this option. It seems like
matching other architectures is a good idea. Maybe there are some
reasons not to.

quoted

- Should this be for 64-bit only? 'scv 1' could be reserved for 32-bit
  calls if there was interest in developing an ABI for 32-bit programs.
  Marginal benefit in avoiding compat syscall selection.

We don't have an ELFv2 ABI for 32-bit.  I doubt it makes sense to
provide an ELFv1 port for this given that it's POWER9-specific.

Okay. There's no reason not to enable this for BE, at least for the
kernel it's no additional work so it probably remains enabled (unless
there is something really good we could do with the ABI if we exclude
ELFv1 but I don't see anything).

But if glibc only builds for ELFv2 support that's probably reasonable.

quoted

From the glibc perspective, the major question is how we handle run-time
selection of the system call instruction sequence.  On i386, we use a
function pointer in the TCB to call an instruction sequence in the vDSO.
That's problematic from a security perspective.  I expect that on
POWER9, using a pointer in read-only memory would be equally
non-attractive due to a similar lack of PC-relative addressing.  We
could use the HWCAP bit in the TCB, but that would add another (easy to
predict) conditional branch to every system call.

I would have to defer to glibc devs on this. Conditional branch
should be acceptable I think, scv improves speed as much as several
mispredicted branches (about 90 cycles).

quoted

I don't think it matters whether both system call variants use the same
error convention because we could have different error code extraction
code on the two branches.

That's one less difficulty.
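
As an illustration of per-branch error extraction, here is a small sketch
that folds both conventions into a single -errno style result. It assumes
'sc' keeps reporting a positive errno with CR0[SO] set and that 'scv 0'
ends up returning -errno directly (still an open question in this thread).
The struct and function names are hypothetical; the sketch models the
register state in plain C rather than issuing either instruction.

```c
#include <stdbool.h>

/* Hypothetical model of what each instruction leaves behind:
   the value in r3 and, for 'sc' only, the CR0[SO] error flag.  */
struct raw_result
{
  long r3;
  bool cr0_so;
};

/* Fold both conventions into "negative errno means failure".  */
static long
normalize_result (bool via_scv, struct raw_result res)
{
  if (via_scv)
    return res.r3;              /* Assumed: scv 0 already gives -errno.  */
  return res.cr0_so ? -res.r3   /* sc: positive errno flagged by CR0[SO].  */
                    : res.r3;
}
```

Everything after this point can then use one shared failure check, whichever
branch was taken.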

We already had to push a similar hack where glibc used to abort transactions
prior to syscalls to avoid some side effects in the kernel (commit
56cf2763819d2f). It was eventually removed from syscall handling by
f0458cf4f9ff3d870, where we only enable TLE if the kernel supports
PPC_FEATURE2_HTM_NOSC.

The transaction syscall abort used to read a variable directly from the TCB,
so this could be an option. I would expect that we could optimize it so that,
if glibc is building against a recent kernel and the compiler is targeting
an ISA 3.0+ CPU, we could remove the 'sc' code.
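
A rough sketch of that selection, under stated assumptions: the TCB layout,
the MIN_KERNEL_HAS_SCV configuration macro and the function name are all
hypothetical, and the PPC_FEATURE2_SCV value is copied from the Linux
powerpc uapi header only so the example builds anywhere. _ARCH_PWR9 is the
macro GCC predefines when targeting POWER9.

```c
/* Assumed bit value, as in the Linux powerpc <asm/cputable.h>.  */
#ifndef PPC_FEATURE2_SCV
# define PPC_FEATURE2_SCV 0x00100000UL
#endif

/* Hypothetical configure-time knob: nonzero when the minimum
   supported kernel is known to implement scv 0.  */
#ifndef MIN_KERNEL_HAS_SCV
# define MIN_KERNEL_HAS_SCV 0
#endif

/* When both the compiler target and the minimum kernel guarantee
   scv, the 'sc' path can be dropped at build time.  */
#if defined (_ARCH_PWR9) && MIN_KERNEL_HAS_SCV
# define ALWAYS_SCV 1
#else
# define ALWAYS_SCV 0
#endif

/* Hypothetical stand-in for the TCB field holding a cached copy
   of AT_HWCAP2.  */
struct tcb_sketch
{
  unsigned long hwcap2;
};

/* Return nonzero when the scv 0 sequence should be used.  */
static int
select_scv (const struct tcb_sketch *tcb)
{
  if (ALWAYS_SCV)
    return 1;   /* Constant-folded; the runtime test disappears.  */
  return (tcb->hwcap2 & PPC_FEATURE2_SCV) != 0;
}
```

In the default configuration this costs one load from the TCB and one
easily predicted branch per system call.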
