Thread (118 messages) 118 messages, 12 authors, 2016-06-23

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

From: Andy Lutomirski <luto@amacapital.net>
Date: 2016-05-19 23:40:16
Also in: linuxppc-dev, lkml

On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf [off-list ref] wrote:
On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
quoted
On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf [off-list ref] wrote:
quoted
On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
quoted
On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" [off-list ref] wrote:
quoted
On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
quoted
On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf [off-list ref] wrote:
quoted
quoted
I suppose we could try to rejigger the code so that rbp points to
pt_regs or similar.
I think we should avoid doing something like that because it would break
gdb and all the other unwinders who don't know about it.
How so?

Currently, rbp in the entry code is meaningless.  I'm suggesting that,
when we do, for example, 'call \do_sym' in idtentry, we point rbp to
the pt_regs.  Currently it points to something stale (which the
dump_stack code might be relying on.  Hmm.)  But it's probably also
safe to assume that if you unwind to the 'call \do_sym', then pt_regs
is the next thing on the stack, so just doing the section thing would
work.
Yes, rbp is meaningless on the entry from user space.  But if an
in-kernel interrupt occurs (e.g. page fault, preemption) and you have
nested entry, rbp keeps its old value, right?  So the unwinder can walk
past the nested entry frame and keep going until it gets to the original
entry.
Yes.

It would be nice if we could do better, though, and actually notice
the pt_regs and identify the entry.  For example, I'd love to see
"page fault, RIP=xyz" printed in the middle of a stack dump on a
crash.

Also, I think that just following rbp links will lose the
actual function that took the page fault (or whatever function
pt_regs->ip actually points to).
Hm.  I think we could fix all that in a more standard way.  Whenever a
new pt_regs frame gets saved on entry, we could also create a new stack
frame which points to a fake kernel_entry() function.  That would tell
the unwinder there's a pt_regs frame without otherwise breaking frame
pointers across the frame.

Then I guess we wouldn't need my other solution of putting the idt
entries in a special section.

How does that sound?
Let me try to understand.

The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
points to (prev rbp, prev rip) on the stack, and you can follow the
chain back.  Right now, on a user access page fault or similar, we
have rbp (probably) pointing to the interrupted frame, and the
interrupted rip isn't saved anywhere that a naive unwinder can find
it.  (It's in pt_regs, but the rbp chain skips right over that.)

We could change the entry code so that an interrupt / idtentry does:

push pt_regs
push kernel_entry
push %rbp
mov %rsp, %rbp
call handler
pop %rbp
addq $8, %rsp

or similar.  That would make it appear that the actual C handler was
caused by a dummy function "kernel_entry".  Now the unwinder would get
to kernel_entry, but it *still* wouldn't find its way to the calling
frame, which only solves part of the problem.  We could at least teach
the unwinder how kernel_entry works and let it decode pt_regs to
continue unwinding.  This would be nice, and I think it could work.

I think I like this, except that, if it used a separate section, it
could potentially be faster, as, for each actual entry type, the
offset from the C handler frame to pt_regs is a foregone conclusion.
But this is pretty simple and performance is already abysmal in most
handlers.

There's an added benefit to using a separate section, though: we could
also annotate the calls with what type of entry they were so the
unwinder could print it out nicely.

I could be convinced either way.
Ok, I took a stab at this.  See the patch below.

In addition to annotating interrupt/exception pt_regs frames, I also
annotated all the syscall pt_regs, for consistency.

As you mentioned, it will affect performance a bit, but I think it will
be insignificant.

I think I like this approach better than putting the
interrupt/idtentry's in a special section, because this is much more
precise.  Especially now that I'm annotating pt_regs syscalls.

Also I think with a few minor changes we could implement your idea of
annotating the calls with what type of entry they are.  But I don't
think that's really needed, because the name of the interrupt/idtentry
is already on the stack trace.

Before:

  [<ffffffff8143c243>] dump_stack+0x85/0xc2
  [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
  [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
  [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
  [<ffffffff81887058>] async_page_fault+0x28/0x30
  [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
  [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
  [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
  [<ffffffff81285e32>] __vfs_read+0xe2/0x140
  [<ffffffff81286378>] vfs_read+0x98/0x140
  [<ffffffff812878c8>] SyS_read+0x58/0xc0
  [<ffffffff81884dbc>] entry_SYSCALL_64_fastpath+0x1f/0xbd

After:

  [<ffffffff8143c243>] dump_stack+0x85/0xc2
  [<ffffffff81073596>] __do_page_fault+0x576/0x5a0
  [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0
  [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0
  [<ffffffff81887422>] async_page_fault+0x32/0x40
  [<ffffffff81887861>] pt_regs+0x1/0x10
  [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440
  [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290
  [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770
  [<ffffffff81285e32>] __vfs_read+0xe2/0x140
  [<ffffffff81286378>] vfs_read+0x98/0x140
  [<ffffffff812878c8>] SyS_read+0x58/0xc0
  [<ffffffff81884dc6>] entry_SYSCALL_64_fastpath+0x29/0xdb
  [<ffffffff81887861>] pt_regs+0x1/0x10

Note this example is with today's unwinder.  It could be made smarter to
get the RIP from the pt_regs so the '?' could be removed from
copy_page_to_iter().

Thoughts?
I think we should do that.  The silly sample patch I sent you (or at
least that I think I sent you) did that, and it worked nicely.
quoted hunk ↗ jump to hunk
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 9a9e588..f54886a 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -201,6 +201,32 @@ For 32-bit we have the following conventions - kernel is built with
        .byte 0xf1
        .endm

+       /*
+        * Create a stack frame for the saved pt_regs.  This allows frame
+        * pointer based unwinders to find pt_regs on the stack.
+        */
+       .macro CREATE_PT_REGS_FRAME regs=%rsp
+#ifdef CONFIG_FRAME_POINTER
+       pushq   \regs
+       pushq   $pt_regs+1
+       pushq   %rbp
+       movq    %rsp, %rbp
+#endif
+       .endm
I don't love this part.  It's going to hurt performance, and, given
that we need to change the unwinder anyway to make it useful, let's
just emit a table somewhere in .rodata and use it directly.
quoted hunk ↗ jump to hunk
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -199,6 +199,7 @@ entry_SYSCALL_64_fastpath:
        ja      1f                              /* return -ENOSYS (already in pt_regs->ax) */
        movq    %r10, %rcx

+       CREATE_PT_REGS_FRAME
        /*
         * This call instruction is handled specially in stub_ptregs_64.
         * It might end up jumping to the slow path.  If it jumps, RAX
@@ -207,6 +208,8 @@ entry_SYSCALL_64_fastpath:
        call    *sys_call_table(, %rax, 8)
 .Lentry_SYSCALL_64_after_fastpath_call:

+       REMOVE_PT_REGS_FRAME
+
        movq    %rax, RAX(%rsp)
 1:
This one is particular is quite hot, so I'd much rather avoid it.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help