Thread (8 messages) 8 messages, 4 authors, 2021-02-19

Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack

From: Andy Lutomirski <luto@kernel.org>
Date: 2021-02-19 00:29:34
Also in: kvm, lkml, virtualization

On Thu, Feb 18, 2021 at 11:21 AM Joerg Roedel [off-list ref] wrote:
On Thu, Feb 18, 2021 at 09:49:06AM -0800, Andy Lutomirski wrote:
quoted
I don't understand what this means.  The whole entry mechanism on x86
is structured so that we call a C function *and return from that C
function without longjmp-like magic* with the sole exception of
unwind_stack_do_exit().  This means that you can match up enters and
exits, and that unwind_stack_do_exit() needs to unwind correctly.  In
the former case, it's normal C and we can use normal local variables.
In the latter case, we know exactly what state we're trying to restore
and we can restore it directly without any linked lists or similar.
Okay, the unwinder will likely get confused by this logic.
quoted
What do you have in mind that requires a linked list?
Cases when there are multiple IST vectors besides NMI that can hit while
the #VC handler is still on its own IST stack. #MCE comes to mind, but
that is broken anyway. At some point #VC itself will be one of them, but
when that happens the code will kill the machine.
This leaves #HV in the list, and I am not sure how that is going to be
handled yet. I think the goal is that the #HV handler is not allowed to
cause any #VC exception. In that case the linked-list logic will not be
needed.
Can you give me an example, even artificial, in which the linked-list
logic is useful?
quoted
quoted
I don't see how this would break, can you elaborate?

What I think happens is:

SYSCALL gap (RSP is from userspace and untrusted)

        -> #VC - Handler on #VC IST stack detects that it interrupted
           the SYSCALL gap and switches to the task stack.
Can you point me to exactly what code you're referring to?  I spent a
while digging through the code and macro tangle and I can't find this.
See the entry code in arch/x86/entry/entry_64.S, macro idtentry_vc. It
creates the assembly code for the handler. At some point it calls
vc_switch_off_ist(), which is a C function in arch/x86/kernel/traps.c.
This function tries to find a new stack for the #VC handler.

The first thing it does is checking whether the exception came from the
SYSCALL gap and just uses the task stack in that case.

Then it will check for other kernel stacks which are safe to switch
to. If that fails it uses the fall-back stack (VC2), which will direct
the handler to a separate function which, for now, just calls panic().
Not safe are the entry or unknown stacks.
Can you explain your reasoning in considering the entry stack unsafe?
It's 4k bytes these days.

You forgot about entry_SYSCALL_compat.

Your 8-byte alignment is confusing to me.  In valid kernel code, SP
should be 8-byte-aligned already, and, if you're trying to match
architectural behavior, the CPU aligns to 16 bytes.

We're not robust against #VC, NMI in the #VC prologue before the magic
stack switch, and a new #VC in the NMI prologue.  Nor do we appear to
have any detection of the case where #VC nests directly inside its own
prologue.  Or did I miss something else here?

If we get NMI and get #VC in the NMI *asm*, the #VC magic stack switch
looks like it will merrily run itself in the NMI special-stack-layout
section, and that sounds really quite bad.
The function then copies pt_regs and returns the new stack pointer to
assembly code, which then writes it to %RSP.
quoted
Unless AMD is more magic than I realize, the MOV SS bug^Wfeature means
that #DB is *not* always called in safe places.
You are right, forgot about this. The MOV SS bug can very well
trigger a #VC(#DB) exception from the syscall gap.
quoted
quoted
And with SNP we need to be able to at least detect a malicious HV so we
can reliably kill the guest. Otherwise the HV could potentially take
control over the guest's execution flow and make it reveal its secrets.
True.  But is the rest of the machinery to be secure against EFLAGS.IF
violations and such in place yet?
Not sure what you mean by EFLAGS.IF violations, probably enabling IRQs
while in the #VC handler? The #VC handler _must_ _not_ enable IRQs
anywhere in its call-path. If that ever happens it is a bug.
I mean that, IIRC, a malicious hypervisor can inject inappropriate
vectors at inappropriate times if the #HV mechanism isn't enabled.
For example, it could inject a page fault or an interrupt in a context
in which we have the wrong GSBASE loaded.

But the #DB issue makes this moot.  We have to use IST unless we turn
off SCE.  But I admit I'm leaning toward turning off SCE until we have
a solution that seems convincingly robust.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help