Thread (118 messages) 118 messages, 12 authors, 2016-06-23

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

From: Andy Lutomirski <luto@amacapital.net>
Date: 2016-06-22 17:59:46
Also in: linux-s390, lkml

On Wed, Jun 22, 2016 at 9:30 AM, Josh Poimboeuf [off-list ref] wrote:
On Mon, May 23, 2016 at 08:52:12PM -0700, Andy Lutomirski wrote:
quoted
On May 23, 2016 7:28 PM, "Josh Poimboeuf" [off-list ref] wrote:
quoted
quoted
Maybe I'm coming around to liking this idea.
Ok, good :-)
quoted
In an ideal world (DWARF support, high-quality unwinder, nice pretty
printer, etc), unwinding across a kernel exception would look like:

 - some_func
 - some_other_func
 - do_page_fault
 - page_fault

After page_fault, the next unwind step takes us to the faulting RIP
(faulting_func) and reports that all GPRs are known.  It should
probably learn this fact from DWARF if DWARF is available, instead of
directly decoding pt_regs, due to a few funny cases in which pt_regs
may be incomplete.  A nice pretty printer could now print all the
regs.

 - faulting_func
 - etc.

For this to work, we don't actually need the unwinder to explicitly
know where pt_regs is.
That's true (but only for DWARF).
quoted
Food for thought, though: if user code does SYSENTER with TF set,
you'll end up with partial pt_regs.  There's nothing the kernel can do
about it.  DWARF will handle it without any fanfare, but I don't know
if it's going to cause trouble for you pre-DWARF.
In this case it should see the stack pointer is past the pt_regs offset,
so it would just report it as an empty stack.
OK
quoted
quoted
I'm also not sure it makes sense to apply this before the unwinder
that can consume it is ready.  Maybe, if it would be consistent with
your plans, it would make sense to rewrite the unwinder first, then
apply this and teach live patching to use the new unwinder, and *then*
add DWARF support?
For the purposes of livepatch, the reliable unwinder needs to detect
whether an interrupt/exception pt_regs frame exists on a sleeping task
(or current).  This patch would allow us to do that.

So my preferred order of doing things would be:

1) Brian Gerst's switch_to() cleanup and any related unwinder fixes
2) this patch for annotating pt_regs stack frames
3) reliable unwinder, similar to what I already posted, except it relies
   on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with
   the new inactive task frame format of #1
4) livepatch consistency model which uses the reliable unwinder
5) rewrite unwinder, and port all users to the new interface
6) add DWARF unwinder

1-4 are pretty much already written, whereas 5 and 6 will take
considerably more work.
Fair enough.
quoted
quoted
quoted
+       /*
+        * Create a stack frame for the saved pt_regs.  This allows frame
+        * pointer based unwinders to find pt_regs on the stack.
+        */
+       .macro CREATE_PT_REGS_FRAME regs=%rsp
+#ifdef CONFIG_FRAME_POINTER
+       pushq   \regs
+       pushq   $pt_regs+1
Why the +1?
Some unwinders like gdb are smart enough to report the function which
contains the instruction *before* the return address.  Without the +1,
they would show the wrong function.
Lovely.  Want to add a comment?
quoted
quoted
quoted
+       pushq   %rbp
+       movq    %rsp, %rbp
+#endif
+       .endm
I keep wanting this to be only two pushes and to fudge rbp to make it
work, but I don't see a good way.  But let's call it
CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to
nested_frame or similar.
Or, if we aren't going to annotate syscall pt_regs, we could give it a
more specific name.  CREATE_INTERRUPT_FRAME and interrupt_frame()?
CREATE_INTERRUPT_FRAME is confusing because it applies to idtentry,
too.  CREATE_PT_REGS_FRAME is probably fine.
quoted
quoted
quoted
+
+/* fake function which allows stack unwinders to detect pt_regs frames */
+#ifdef CONFIG_FRAME_POINTER
+ENTRY(pt_regs)
+       nop
+       nop
+ENDPROC(pt_regs)
+#endif /* CONFIG_FRAME_POINTER */
Why is this two bytes long?  Is there some reason it has to be more
than one byte?
Similar to above, this is related to the need to support various
unwinders.  Whether the unwinder displays the ret_addr or the
instruction preceding it, either way the instruction needs to be inside
the function for the function to be reported.
OK.
Andy,

So I got a chance to look at this some more.  I'm thinking that to make
this feature more consistently useful, we shouldn't only annotate
pt_regs frames for calls to handlers; other calls should be annotated as
well: preempt_schedule_irq, CALL_enter_from_user_mode,
prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
etc.  That way, the unwinder will always be able to find pt_regs from an
interrupt/exception, even if starting from one of these other calls.

But then, things get ugly.  You have to either setup and tear down the
frame for every possible call, or do a higher-level setup/teardown
across multiple calls, which invalidates several assumptions in the
entry code about the location of pt_regs on the stack.

Also problematic is that several of the macros (like TRACE_IRQS_IRETQ)
make assumptions about the location of pt_regs.  And they're used by
both syscall and interrupt code.  So if we didn't create a frame pointer
header for syscalls, we'd basically need two versions of the macros: one
for irqs/exceptions and one for syscalls.

So I think the cleanest way to handle this is to always allocate two
extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK.  Then all
entry code can assume that pt_regs is at a constant location, and all
the above problems go away.  Another benefit is that we'd only need two
saves instead of three -- the pointer to pt_regs is no longer needed
since pt_regs is always immediately after the frame header.

I worked up a patch to implement this -- see below.  It writes the frame
pointer in all entry paths, including syscalls.  This helps keep the
code simple.

The downside is a small performance penalty: with getppid()-in-a-loop on
my laptop, the average syscall went from 52ns to 53ns, which is about a
2% slowdown.  But I doubt it would be measurable in a real-world
workload.

It looks like about half the slowdown is due to the extra stack
allocation (which presumably adds a little d-cache pressure on the stack
memory) and the other half is due to the stack writes.

I could remove the writes from the syscall path but it would only save
about half a ns, and it would make the code less robust.  Plus it's nice
to have the consistency of having *all* pt_regs frames annotated.
This is a bit messy, and I'm not really sure that the entry code
should be have to operate under constraints like this.  Also,
convincing myself this works for NMI sounds unpleasant.

Maybe we should go back to my idea of just listing the call sites in a table.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help