Re: Reducing impact of IO/CPU overload during `perf record` - bad handling... | linux-perf-users

Re: Reducing impact of IO/CPU overload during `perf record` - bad handling of excessive yield?

From: Milian Wolff <hidden>
Date: 2021-09-09 20:39:04

On Thursday, 9 September 2021 21:28:10 CEST Ian Rogers wrote:

On Thu, Sep 9, 2021 at 7:14 AM Milian Wolff [off-list ref] wrote:


Hey there!

I'm trying to profile an application that suffers from overcommit and
excessive calls to `yield` (via OMP). Generally, it is putting a lot of
strain on the system. Sadly, I cannot make `perf record` work without
dropping (a lot of) chunks / events.

Usually I'm using this command line:

perf record --call-graph dwarf -z -m 16M ...

<snip>


Is this a deficiency with the AMD perf subsystem? Or is this a generic
issue with perf? I understand that it has to enable/disable the PMU when
it's switching tasks, but potentially there are some ways to optimize
this behavior?

Hi Milian,

Hey Ian!

By using dwarf call graphs your samples are writing a dump of the
stack into the perf event ring buffer that will be processed in the
userspace perf command. The default stack dump size is 8kb and you can
lower it - for example with "--call-graph dwarf,4096".

I'm well aware of the overhead imposed by `--call-graph dwarf`, but it is a 
requirement where I'm coming from. I don't know a single "normal" Linux 
distribution which enables frame pointers for system libraries, for example.

I suspect that
most of the overhead you are seeing is from these stack dumps. There
is a more complete description in the man page:
https://man7.org/linux/man-pages/man1/perf-record.1.htm

Which are always worth improving:
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/Documentation/perf-record.txt?h=perf/core

Reducing the size of the call stack is not really an option for me either. 
Rather, I would like to influence perf so that it can cope with the load somehow.

As I said above, so far `-m 16M` has often helped to prevent lost chunks. Then 
`-z` and the implied `--aio` also drastically improve the situation 
compared to the not-so-distant past.

But in my situation, neither is sufficient. Are there really no other options 
available to me? As I said, I would even be willing to impose a runtime 
penalty on the profiled application. Ideally, I'd just let perf hog up a full 
core or two. Zstd should easily crunch through ~500MB/s according to the 
benchmarks. Meaning I should - in theory - be able to compress up to 500MB / 
8kB = 62500 samples per second. This should be more than enough to accommodate 
a sampling rate of 1000Hz across 24 threads, no?
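
The estimate above can be written out as a rough sketch. It assumes that every sample carries a full 8kB stack dump, that all 24 threads are sampled at the full 1000Hz, and that the ~500MB/s zstd figure holds for this kind of data; it ignores per-record headers and any other event types. The constant and function names are made up for illustration only.

```python
# Back-of-the-envelope check of the figures in the message above.
# All constants are taken from the message; real perf records are
# somewhat larger than the raw stack dump (headers, registers, etc.).
STACK_DUMP_BYTES = 8_000           # "8kB" per sample, decimal units as in the message
ZSTD_BYTES_PER_SEC = 500_000_000   # "~500MB/s" zstd throughput from the benchmarks
SAMPLE_HZ = 1000                   # sampling rate per thread
THREADS = 24                       # number of sampled threads


def max_samples_per_sec(throughput_bytes: int, sample_bytes: int) -> int:
    """Samples per second the compressor could absorb at the given throughput."""
    return throughput_bytes // sample_bytes


def required_bytes_per_sec(hz: int, threads: int, sample_bytes: int) -> int:
    """Raw data rate produced if every thread emits one full sample per tick."""
    return hz * threads * sample_bytes


budget = max_samples_per_sec(ZSTD_BYTES_PER_SEC, STACK_DUMP_BYTES)
needed_samples = SAMPLE_HZ * THREADS
needed_bytes = required_bytes_per_sec(SAMPLE_HZ, THREADS, STACK_DUMP_BYTES)

print(f"compressor budget: {budget} samples/s")
print(f"required:          {needed_samples} samples/s ({needed_bytes / 1e6:.0f} MB/s)")
```

Under these assumptions the workload needs 24000 samples/s, or about 192MB/s of raw stack data, which is well below the 62500 samples/s budget. This only bounds the compression step; it says nothing about whether the ring buffer is drained quickly enough.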

Thanks

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts

