Thread (52 messages) 52 messages, 8 authors, 2025-08-27

Re: [PATCH 12/19] perf: Ignore event state for group validation

From: Ian Rogers <irogers@google.com>
Date: 2025-08-26 18:49:01
Also in: amd-gfx, dmaengine, dri-devel, imx, intel-gfx, intel-xe, linux-alpha, linux-amlogic, linux-arm-kernel, linux-arm-msm, linux-cxl, linux-fpga, linux-iommu, linux-mips, linux-perf-users, linux-pm, linux-riscv, linux-rockchip, linux-s390, linux-sh, lkml, loongarch, sparclinux

On Tue, Aug 26, 2025 at 8:32 AM Robin Murphy [off-list ref] wrote:
On 2025-08-26 2:03 pm, Peter Zijlstra wrote:
quoted
On Wed, Aug 13, 2025 at 06:01:04PM +0100, Robin Murphy wrote:
quoted
It may have been different long ago, but today it seems wrong for these
drivers to skip counting disabled sibling events in group validation,
given that perf_event_enable() could make them schedulable again, and
thus increase the effective size of the group later. Conversely, if a
sibling event is truly dead then it stands to reason that the whole
group is dead, so it's not worth going to any special effort to try to
squeeze in a new event that's never going to run anyway. Thus, we can
simply remove all these checks.
So currently you can do sort of a manual event rotation inside an
over-sized group and have it work.

I'm not sure if anybody actually does this, but its possible.

Eg. on a PMU that supports only 4 counters, create a group of 5 and
periodically cycle which of the 5 events is off.
I'm not sure this is true, I thought this would fail in the
perf_event_open when adding the 5th event and there being insufficient
counters for the group. Not all PMUs validate a group will fit on the
counters, but I thought at least Intel's core PMU would validate and
not allow this. Fwiw, the metric code is reliant on this behavior as
by default all events are placed into a weak group:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/metricgroup.c?h=perf-tools-next#n631
Weak groups are really just groups that when the perf_event_open fails
retry with the grouping removed. PMUs that don't fail the
perf_event_open are problematic as the reads just report "not counted"
and the metric doesn't work. Sometimes the PMU can't help it due to
errata. There are a bunch of workarounds for those cases carried in
the perf tool, but in general weak groups working is relied upon:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/pmu-events/pmu-events.h?h=perf-tools-next#n16

Thanks,
Ian
quoted
So I'm not against changing this, but changing stuff like this always
makes me a little fearful -- it wouldn't be the first time that when it
finally trickles down to some 'enterprise' user in 5 years someone comes
and finally says, oh hey, you broke my shit :-(
Eww, I see what you mean... and I guess that's probably lower-overhead
than actually deleting and recreating the sibling event(s) each time,
and potentially less bother then wrangling multiple groups for different
combinations of subsets when one simply must still approximate a complex
metric that requires more counters than the hardware offers.

I'm also not keen to break anything that wasn't already somewhat broken,
especially since this patch is only intended as cleanup, so either we
could just drop it altogether, or perhaps I can wrap the existing
behaviour in a helper that can at least document this assumption and
discourage new drivers from copying it. Am I right that only
PERF_EVENT_STATE_{OFF,ERROR} would matter for this, though, and my
reasoning for state <= PERF_EVENT_STATE_EXIT should still stand? As for
the fiddly discrepancy with enable_on_exec between arm_pmu and others
I'm not really sure what to think...

Thanks,
Robin.
  
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help