RE: linux interprets an fcntl int arg as long

From: David Laight <hidden>
Date: 2022-11-01 12:20:02
Also in: linux-arm-kernel, lkml

From: 'Szabolcs Nagy' <redacted>

Sent: 01 November 2022 11:45

The 11/01/2022 10:02, David Laight wrote:

quoted

From: Szabolcs Nagy

quoted

Sent: 01 November 2022 09:11

The 10/31/2022 21:46, Theodore Ts'o wrote:

quoted

On Mon, Oct 31, 2022 at 12:44:59PM +0000, Szabolcs Nagy wrote:

quoted

and such fcntl call can happen with c code that just passes
F_SEAL_WRITE since it is an int and e.g. with aarch64 pcs rules
it is passed in a register where top bits can be non-zero
(unlikely in practice but valid).

In Linux's aarch64 ABI, an int is a 4-byte value.  It is *not* an
8-byte value.  So passing in "F_SEAL_WRITE | 0xF00000000" as an int
(as in your example) is simply not valid thing for the userspace
program to do.

Now, if there is a C program which has "int c = F_SEAL_WRITE", if the
PCS allows the compiler to pass a function paramter c --- for example
f(a, b, c) --- where the 4-byte paramter 'c' is placed in a 64-bit
register where the high bits of the 64-bit register contains non-zero
garbage values, I would argue that this is a bug in the PCS and/or the
compiler.

the callee uses va_arg(ap, type) to get the argument,
and if the type is wider than what was actually passed
then anything can happen. in practice what happens is
that the top bits can be non-zero.

many pcs are affected (aarch64 is the one i know well,
but at least x86_64, arm are affected too). and even if
it was aarch64 pcs only, it is incompetent to say that
the pcs is wrong: that's a constraint we are working with.

the kernel must not read a wider type than what it
documents as argument to variadic functions in the c api.
(it does not make much sense to expect anything there
anyway, but it can break userspace)

The Linux kernel just assumes that the varargs call looks like
a non-varags call with the same parameters.
(It doesn't use va_arg())
All syscall arguments are passed in registers (unlike BSDs
where they can also be on the user stack).
On 64bit systems the same registers are expected to be used
for 64bit and 32bit integers and for pointers.
32bit values usually get masked because they get passed to
a function with an 'int' argument.

If any fcntl() calls require a 64bit value and the C ABI
might leave non-zero high bits in an register containing
a 32bit value (esp. to a varargs function) then the calling
code will need to cast such arguments to 64 bits.

the entire point of my mail is that it is not possible
to tell in the libc if the vararg is pointer or int.

so in case a user passed an int, the libc cannot fix
that up, like it usually does for other cases where
linux syscall abi is incompatible with the c api.

let me go through step by step what is going on:

...

long internal_syscall(int, long, long, long, long, long, long);
int fcntl(int fd, int cmd, ...)
{
	va_list ap;
	va_start(ap, cmd);
	/* this is non-conforming C: wrong type in va_arg,
	   but that's not relevant since libc can implement
	   this as target specific asm, the important bit is
	   that the correct type is not known: libc cannot
	   replicate the kernel side dispatch logic because
	   new cmd can be introduced in the future with
	   arbitrary type arg.

	   top 32bits of arg are non-zero, libc cannot
	   zero them here as arg may be long or pointer.  */
	long arg = va_arg(ap, long);

Here libc has to assume that int, long and pointer are passed
the same way.
This is true for everything Linux actually runs on.
But wouldn't be if, for example, a 64bit arch just pushed
32bit arguments on stack.
Or if a 32 bit one passed integer and pointer args in different
types of registers.
But all 64bit ones use the same GP registers for int/long/pointer.

kernel code:
------------
SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
{

That is just a wrapper and calls do_fcntl().
which needs changing to be add:
	arg &= ~0U;
before the switch(cmd) {

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help