Re: [PATCH 12/19] perf: Ignore event state for group validation
From: Ian Rogers <irogers@google.com>
Date: 2025-08-27 15:15:44
Also in:
amd-gfx, dmaengine, dri-devel, imx, intel-gfx, intel-xe, linux-alpha, linux-amlogic, linux-arm-kernel, linux-arm-msm, linux-cxl, linux-fpga, linux-iommu, linux-mips, linux-perf-users, linux-pm, linux-riscv, linux-rockchip, linux-s390, linux-sh, lkml, loongarch, sparclinux
On Wed, Aug 27, 2025 at 1:18 AM Mark Rutland [off-list ref] wrote:
On Tue, Aug 26, 2025 at 11:48:48AM -0700, Ian Rogers wrote:quoted
On Tue, Aug 26, 2025 at 8:32 AM Robin Murphy [off-list ref] wrote:quoted
On 2025-08-26 2:03 pm, Peter Zijlstra wrote:quoted
On Wed, Aug 13, 2025 at 06:01:04PM +0100, Robin Murphy wrote:quoted
It may have been different long ago, but today it seems wrong for these drivers to skip counting disabled sibling events in group validation, given that perf_event_enable() could make them schedulable again, and thus increase the effective size of the group later. Conversely, if a sibling event is truly dead then it stands to reason that the whole group is dead, so it's not worth going to any special effort to try to squeeze in a new event that's never going to run anyway. Thus, we can simply remove all these checks.So currently you can do sort of a manual event rotation inside an over-sized group and have it work. I'm not sure if anybody actually does this, but its possible. Eg. on a PMU that supports only 4 counters, create a group of 5 and periodically cycle which of the 5 events is off.I'm not sure this is true, I thought this would fail in the perf_event_open when adding the 5th event and there being insufficient counters for the group.We're talking specifically about cases where the logic in a pmu's pmu::event_init() callback doesn't count events in specific states, and hence the 5th even doesn't get rejected when it is initialised. For example, in arch/x86/events/core.c, validate_group() uses collect_events(), which has: for_each_sibling_event(event, leader) { if (!is_x86_event(event) || event->state <= PERF_EVENT_STATE_OFF) continue; if (collect_event(cpuc, event, max_count, n)) return -EINVAL; n++; } ... and so where an event's state is <= PERF_EVENT_STATE_OFF at init time, that event is not counted to see if it fits into HW counters.
Hmm.. Thinking out loud. So it looked like perf with weak groups could be broken then:
$ sudo perf stat -vv -e '{instructions,cycles}:W' true
...
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0x400000001
(cpu_core/PERF_COUNT_HW_INSTRUCTIONS/)
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP
disabled 1
inherit 1
enable_on_exec 1
------------------------------------------------------------
sys_perf_event_open: pid 3337764 cpu -1 group_fd -1 flags 0x8 = 5
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0x400000000
(cpu_core/PERF_COUNT_HW_CPU_CYCLES/)
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP
inherit 1
------------------------------------------------------------
sys_perf_event_open: pid 3337764 cpu -1 group_fd 5 flags 0x8 = 7
...
Note, the group leader (instructions) is disabled because of: https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/stat.c?h=perf-tools-next#n761
/*
* Disabling all counters initially, they will be enabled
* either manually by us or by kernel via enable_on_exec
* set later.
*/
if (evsel__is_group_leader(evsel)) {
attr->disabled = 1;
but the checking of being disabled (PERF_EVENT_STATE_OFF) is only done on siblings in the code you show above. So yes, you can disable the group events to allow the perf_event_open to succeed but not on the leader which is always checked (no PERF_EVENT_STATE_OFF check): https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/arch/x86/events/core.c?h=perf-tools-next#n1204
if (is_x86_event(leader)) {
if (collect_event(cpuc, leader, max_count, n))
return -EINVAL;
Thanks, Ian