Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
From: Andy Lutomirski <luto@amacapital.net>
Date: 2016-06-22 18:26:43
Also in:
linuxppc-dev, lkml
On Wed, Jun 22, 2016 at 11:22 AM, Josh Poimboeuf [off-list ref] wrote:
On Wed, Jun 22, 2016 at 10:59:23AM -0700, Andy Lutomirski wrote:quoted
quoted
So I got a chance to look at this some more. I'm thinking that to make this feature more consistently useful, we shouldn't only annotate pt_regs frames for calls to handlers; other calls should be annotated as well: preempt_schedule_irq, CALL_enter_from_user_mode, prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS, etc. That way, the unwinder will always be able to find pt_regs from an interrupt/exception, even if starting from one of these other calls. But then, things get ugly. You have to either setup and tear down the frame for every possible call, or do a higher-level setup/teardown across multiple calls, which invalidates several assumptions in the entry code about the location of pt_regs on the stack. Also problematic is that several of the macros (like TRACE_IRQS_IRETQ) make assumptions about the location of pt_regs. And they're used by both syscall and interrupt code. So if we didn't create a frame pointer header for syscalls, we'd basically need two versions of the macros: one for irqs/exceptions and one for syscalls. So I think the cleanest way to handle this is to always allocate two extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK. Then all entry code can assume that pt_regs is at a constant location, and all the above problems go away. Another benefit is that we'd only need two saves instead of three -- the pointer to pt_regs is no longer needed since pt_regs is always immediately after the frame header. I worked up a patch to implement this -- see below. It writes the frame pointer in all entry paths, including syscalls. This helps keep the code simple. The downside is a small performance penalty: with getppid()-in-a-loop on my laptop, the average syscall went from 52ns to 53ns, which is about a 2% slowdown. But I doubt it would be measurable in a real-world workload. It looks like about half the slowdown is due to the extra stack allocation (which presumably adds a little d-cache pressure on the stack memory) and the other half is due to the stack writes. I could remove the writes from the syscall path but it would only save about half a ns, and it would make the code less robust. Plus it's nice to have the consistency of having *all* pt_regs frames annotated.This is a bit messy, and I'm not really sure that the entry code should be have to operate under constraints like this. Also, convincing myself this works for NMI sounds unpleasant. Maybe we should go back to my idea of just listing the call sites in a table.So are you suggesting something like: .macro ENTRY_CALL func pt_regs_offset=0 call \func 1: .pushsection .entry_calls, "a" .long 1b - . .long \pt_regs_offset .popsection .endm and then change every call in the entry code to ENTRY_CALL?
Yes, exactly, modulo whether the section name is good. hpa is probably the authority on that. --Andy