Thread (118 messages) 118 messages, 12 authors, 2016-06-23

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

From: Andy Lutomirski <luto@amacapital.net>
Date: 2016-06-22 18:26:43
Also in: linuxppc-dev, lkml

On Wed, Jun 22, 2016 at 11:22 AM, Josh Poimboeuf [off-list ref] wrote:
On Wed, Jun 22, 2016 at 10:59:23AM -0700, Andy Lutomirski wrote:
quoted
quoted
So I got a chance to look at this some more.  I'm thinking that to make
this feature more consistently useful, we shouldn't only annotate
pt_regs frames for calls to handlers; other calls should be annotated as
well: preempt_schedule_irq, CALL_enter_from_user_mode,
prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
etc.  That way, the unwinder will always be able to find pt_regs from an
interrupt/exception, even if starting from one of these other calls.

But then, things get ugly.  You have to either setup and tear down the
frame for every possible call, or do a higher-level setup/teardown
across multiple calls, which invalidates several assumptions in the
entry code about the location of pt_regs on the stack.

Also problematic is that several of the macros (like TRACE_IRQS_IRETQ)
make assumptions about the location of pt_regs.  And they're used by
both syscall and interrupt code.  So if we didn't create a frame pointer
header for syscalls, we'd basically need two versions of the macros: one
for irqs/exceptions and one for syscalls.

So I think the cleanest way to handle this is to always allocate two
extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK.  Then all
entry code can assume that pt_regs is at a constant location, and all
the above problems go away.  Another benefit is that we'd only need two
saves instead of three -- the pointer to pt_regs is no longer needed
since pt_regs is always immediately after the frame header.

I worked up a patch to implement this -- see below.  It writes the frame
pointer in all entry paths, including syscalls.  This helps keep the
code simple.

The downside is a small performance penalty: with getppid()-in-a-loop on
my laptop, the average syscall went from 52ns to 53ns, which is about a
2% slowdown.  But I doubt it would be measurable in a real-world
workload.

It looks like about half the slowdown is due to the extra stack
allocation (which presumably adds a little d-cache pressure on the stack
memory) and the other half is due to the stack writes.

I could remove the writes from the syscall path but it would only save
about half a ns, and it would make the code less robust.  Plus it's nice
to have the consistency of having *all* pt_regs frames annotated.
This is a bit messy, and I'm not really sure that the entry code
should be have to operate under constraints like this.  Also,
convincing myself this works for NMI sounds unpleasant.

Maybe we should go back to my idea of just listing the call sites in a table.
So are you suggesting something like:

        .macro ENTRY_CALL func pt_regs_offset=0
        call \func
1:      .pushsection .entry_calls, "a"
        .long 1b - .
        .long \pt_regs_offset
        .popsection
        .endm

and then change every call in the entry code to ENTRY_CALL?
Yes, exactly, modulo whether the section name is good.  hpa is
probably the authority on that.

--Andy
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help