Re: Reducing impact of IO/CPU overload during `perf record` - bad handling of excessive yield?
From: Milian Wolff <hidden>
Date: 2021-09-09 20:39:04
On Thursday, 9 September 2021 21:28:10 CEST Ian Rogers wrote:
On Thu, Sep 9, 2021 at 7:14 AM Milian Wolff [off-list ref] wrote:
Hey there! I'm trying to profile an application that suffers from overcommit and excessive calls to `yield` (via OMP). Generally, it is putting a lot of strain on the system. Sadly, I cannot make `perf record` work without dropping (a lot of) chunks / events. Usually I'm using this command line:

    perf record --call-graph dwarf -z -m 16M ...
<snip>
Is this a deficiency with the AMD perf subsystem? Or is this a generic issue with perf? I understand that it has to enable/disable the PMU when it's switching tasks, but potentially there are some ways to optimize this behavior?

Hi Milian,
Hey Ian!
By using dwarf call graphs your samples are writing a dump of the stack into the perf event ring buffer that will be processed in the userspace perf command. The default stack dump size is 8kb and you can lower it - for example with "--call-graph dwarf,4096".
I'm well aware of the overhead imposed by `--call-graph dwarf`, but it is a requirement where I'm coming from. For example, I don't know of a single "normal" Linux distribution that enables frame pointers for system libraries.
I suspect that most of the overhead you are seeing is from these stack dumps. There is a more complete description in the man page:
https://man7.org/linux/man-pages/man1/perf-record.1.htm
The documentation is always worth improving:
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/Documentation/perf-record.txt?h=perf/core
Reducing the size of the call stack is not really an option for me either. Rather, I would like to influence perf so that it can cope with the load. As I said above, so far `-m 16M` has often helped to prevent lost chunks. `-z` and the implied `--aio` also drastically improve the situation compared to the not-so-distant past. But in my situation, neither is sufficient.

Are there really no other options available to me? As I said, I would even be willing to impose a runtime penalty on the profiled application. Ideally, I'd just let perf hog a full core or two.

Zstd should easily crunch through ~500MB/s according to the benchmarks, meaning I should - in theory - be able to compress up to 500MB / 8kB = 62500 samples per second. This should be more than enough to accommodate a sampling rate of 1000Hz across 24 threads, no?

Thanks

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts
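As a rough check of the throughput estimate in the message above, the arithmetic can be sketched in a few lines of Python. All inputs are the figures quoted in the message (about 500 MB/s of zstd throughput, an 8 kB stack dump per sample, 1000 Hz across 24 threads); the function and constant names are hypothetical, and the sketch assumes decimal units and that every sample carries a full-size stack dump and nothing else.

```python
# Back-of-the-envelope check of the compression budget described above.
# Assumptions: decimal units (1 MB = 10**6 bytes, 1 kB = 10**3 bytes) and
# one full 8 kB stack dump per sample; other record overhead is ignored.

ZSTD_THROUGHPUT_BYTES_PER_S = 500 * 10**6  # ~500 MB/s, figure quoted above
STACK_DUMP_BYTES = 8 * 10**3               # 8 kB per sample, as in the estimate
SAMPLE_RATE_HZ = 1000                      # per-thread sampling rate
THREADS = 24                               # number of sampled threads


def max_samples_per_second(throughput_bytes_per_s: int, sample_bytes: int) -> int:
    """Samples per second the compressor could absorb at the given throughput."""
    return throughput_bytes_per_s // sample_bytes


def required_samples_per_second(rate_hz: int, threads: int) -> int:
    """Samples per second produced when every thread is sampled at rate_hz."""
    return rate_hz * threads


capacity = max_samples_per_second(ZSTD_THROUGHPUT_BYTES_PER_S, STACK_DUMP_BYTES)
required = required_samples_per_second(SAMPLE_RATE_HZ, THREADS)
headroom = capacity / required

print(f"capacity: {capacity} samples/s")
print(f"required: {required} samples/s")
print(f"headroom: {headroom:.2f}x")
```

Under these assumptions the compressor alone would have roughly 2.6x headroom, which is the point of the estimate: compression throughput by itself should not be the limiting factor.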