Re: A list of visual Profiler UIs for linux perf

From: Stephen Brennan <hidden>
Date: 2021-09-09 18:45:54

Brendan Gregg [off-list ref] writes:

G'Day Stephen,

On Thu, Sep 9, 2021 at 5:13 AM Stephen Brennan
[off-list ref] wrote:


Hi Mark & Brendan,

Thanks for this thread - it's very useful.

Firefox Profiler is news to me and looks exciting. However, I can't see
any clear documentation on their part that the tool is client-side only.
I can't always put internal flamegraphs into a web form on the
assumption that they won't be uploaded somewhere. Do you know if there's
anything explicit that says data won't be shared? (unless I explicitly
upload it to create a link)

Flamescope is quite exciting as well! I can see how the time dimension
can be incredibly useful. Brendan, I had a couple questions regarding it:
1) Is the box color scale in terms of "number of samples in that time
interval"? If so, it would only really be useful for cpu-cycles or
instructions, correct? Something like the cpu-clock which tries to
regularly sample at a set frequency would just look monochrome?

Exclude idle stacks and then cpu-cycles works. Most of our samples are
cpu-cycles based (it's the only thing available in most of EC2).
FlameScope should already filter them:

app/perf/regexp.py:
idle_stack = re.compile("(cpuidle|cpu_idle|cpu_bringup_and_idle|native_safe_halt|xen_hypercall_sched_op|xen_hypercall_vcpu_op)")
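
For illustration, a minimal sketch of how a pattern like this could be used to drop idle stacks. It assumes the stacks have already been parsed into lists of frame names; the `drop_idle` helper and the sample stacks are hypothetical and are not FlameScope's actual code.

```python
import re

# Pattern copied from the app/perf/regexp.py line quoted in this thread.
idle_stack = re.compile(
    "(cpuidle|cpu_idle|cpu_bringup_and_idle|native_safe_halt"
    "|xen_hypercall_sched_op|xen_hypercall_vcpu_op)"
)

def drop_idle(stacks):
    """Return only the stacks in which no frame matches the idle pattern.

    Hypothetical helper: each stack is assumed to be a list of frame names.
    """
    return [s for s in stacks if not any(idle_stack.search(f) for f in s)]

# Made-up sample data: one idle stack and one busy stack.
stacks = [
    ["swapper", "cpu_startup_entry", "do_idle", "cpuidle_enter", "native_safe_halt"],
    ["java", "Interpreter", "memcpy"],
]
busy = drop_idle(stacks)
print(len(busy), "non-idle stack(s) kept")
```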

Ah that makes sense, I always visually ignored the idle stacks, to the
point that I forgot they existed in most of these profiles.

Looking at idle stacks actually always gives me grief, because when I
see them, I feel compelled to compare them to the %idle time accounted
by the kernel. I used to have this naive hope/belief that a `perf record
-e cycles -F 1000` would give me exactly 1000 samples each second, and
so I could compare the percentage of idle stacks with the %idle time.
But due to frequency scaling during idle, that's usually not the case.
I've tried looking at cpu-clock (which has its own downsides, like
firing in an IRQ context rather than NMI) to get a consistent frequency.
This works but isn't great, since I like the benefits of an NMI event.

Flamescope seems to make this frustration (why aren't my samples at an
exact rate???) less of an issue, since you can see the sample count over
time. You can see the idle periods and the times of heavy utilization,
so it matters less whether the sample frequency is clock-like in
precision.
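
A rough sketch of the "sample count over time" idea described above: each sample timestamp is bucketed into a (second, sub-second row) cell and the cells are counted. The `heatmap_counts` name, the 50-row default, and the input format (timestamps as floats in seconds) are assumptions for illustration, not FlameScope's implementation.

```python
from collections import Counter

def heatmap_counts(timestamps, rows=50):
    """Count samples per (second, sub-second row) cell.

    timestamps: sample times in seconds, as floats (assumed input format).
    rows: number of sub-second buckets per one-second column.
    """
    counts = Counter()
    if not timestamps:
        return counts
    start = min(timestamps)
    for t in timestamps:
        offset = t - start
        second = int(offset)
        # Clamp to guard against floating-point rounding at the bucket edge.
        row = min(int((offset - second) * rows), rows - 1)
        counts[(second, row)] += 1
    return counts

# Made-up timestamps: three samples in the first second, one in the next.
counts = heatmap_counts([10.0, 10.01, 10.03, 11.5])
for cell, n in sorted(counts.items()):
    print(cell, n)
```

Cells with no entry are periods with no samples, which is how idle stretches would show up regardless of whether the sampling rate was perfectly regular.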

I've also used it for other non-CPU events including off-CPU spans by
adapting it to sample equivalents.


2) I'm curious if you've considered directly using perf.data in
Flamescope, rather than perf.script? I've recently discovered the
"--symfs" and "--kallsyms" options for perf. By using perf buildid-list,
you can identify all DSOs, capture their symbol tables, and create a
minimal bundle of files to allow the perf.data to be read with useful
symbols on any system. Since perf.data contains more information,
usually with less disk space, I've started taking this approach to make
capturing, transferring, and analyzing larger recordings (especially
from customers) easier as well as more flexible and efficient. All the
same analysis can be done via the Python engine in perf-script, without
need to worry about text parsing.
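
A minimal sketch of the first step of that bundling approach: working out which DSO files to collect from `perf buildid-list` output. It assumes each output line has the form "<build-id> <path>"; the `dsos_to_bundle` helper is hypothetical and the build IDs in the sample are made up.

```python
def dsos_to_bundle(buildid_list_output):
    """Map DSO path -> build ID for entries that look like real files.

    Assumes lines of the form "<build-id> <path>". Bracketed
    pseudo-entries such as [kernel.kallsyms] and [vdso] are skipped
    because they are not files that can simply be copied.
    """
    dsos = {}
    for line in buildid_list_output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # blank or malformed line
        build_id, path = parts
        if path.startswith("["):
            continue
        dsos[path] = build_id
    return dsos

# Made-up sample output for illustration only.
sample = """\
0123456789abcdef0123456789abcdef01234567 [kernel.kallsyms]
89abcdef0123456789abcdef0123456789abcdef /usr/lib64/libc.so.6
fedcba9876543210fedcba9876543210fedcba98 [vdso]
"""
dsos = dsos_to_bundle(sample)
for path, build_id in dsos.items():
    print(build_id, path)
```

The listed files could then be copied under a directory that is later passed via --symfs, with the kernel symbols saved separately for --kallsyms.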

We do gzip the perf script outputs. Just checking the README, I should
probably change 'perf script --header' to use -F to specify the
fields, to make it more future proof.

I haven't explored the buildid-list path since we have Java apps with
massive symbol tables that can be 100s of Mbytes of text, and other
binaries that use a mix of ELF symbol tables and DWARF debuginfo. I've
assumed this will be too big to include, but haven't tried yet. Maybe
it's better suited for some apps with smaller symbol tables?

Got it! My main use case is debugging kernel bugs from external
customers, who wouldn't want to share their application symbols anyway.
We frequently make do with just the kernel symbols and application
names, but adding in symbols from a few system utilities and libraries
can be very useful and only takes a few MiBs usually.

I've wanted a perf.data format which includes all symbols resolved for a
while now, and maybe some day I'll know enough perf innards to implement
it. The perf.data + symbol tables in a tarball has worked alright. But
my ideal would be a way to have perf (1) do dwarf stack walking and
symbol table lookups, and (2) store that data back into the perf.data
file. Then analysis could be reliably done on another machine, and it
would include all the data from the original recording. (For example,
I've had missing events due to a PERF_RECORD_THROTTLE event coming in,
which perf.script files show me.)

Anyhow, I'm probably over-optimizing at this point. Thanks for sharing
your motivation and use case!

Stephen

Brendan
