Re: [PATCH v20 11/11] perf: arm_pmuv3: Add support for the Branch Record... | linux-arm-kernel

Re: [PATCH v20 11/11] perf: arm_pmuv3: Add support for the Branch Record Buffer Extension (BRBE)

From: Mark Rutland <mark.rutland@arm.com>
Date: 2025-02-25 19:46:58
Also in: kvmarm, linux-doc, linux-perf-users, lkml

On Tue, Feb 25, 2025 at 12:38:13PM +0000, Leo Yan wrote:

On Mon, Feb 24, 2025 at 07:31:52PM -0600, Rob Herring wrote:

[...]

When event rotation happens without a context switch, in theory we
should be able to use the branch records directly (no invalidation, no
injection) for all events.

No; that only works in *some* cases, and will produce incorrect results
in others.

For example, consider filtering. Imagine a PMU with a single counter,
and two events, where event-A filters for calls-and-returns and event-B
filters for calls-only. When switching from event-A to event-B, it's
theoretically possible to keep the existing records around, knowing that
the returns can be filtered out later. When switching from event-B to
event-A we cannot keep the existing records, since there are gaps
whenever a return should have been recorded.
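The asymmetry in this example amounts to a subset test on the filters. The following is a minimal sketch of that test, not code from the patch: the `FILT_*` bits and `can_keep_records()` are hypothetical names made up for illustration and are not the kernel's `PERF_SAMPLE_BRANCH_*` values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical filter bits, for illustration only. */
#define FILT_CALL   (1u << 0)
#define FILT_RETURN (1u << 1)

/*
 * Records captured under old_filter are complete for new_filter only if
 * every branch type new_filter asks for was already being captured,
 * i.e. new_filter is a subset of old_filter. Otherwise the existing
 * records have gaps and must be discarded.
 */
static bool can_keep_records(uint32_t old_filter, uint32_t new_filter)
{
	return (new_filter & ~old_filter) == 0;
}
```

With these definitions, switching from event-A (`FILT_CALL | FILT_RETURN`) to event-B (`FILT_CALL`) passes the test, while switching from event-B to event-A does not.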

Seems to me, the problem is not caused by event rotation.  We need to
calculate a correct filter in the first place - the BRBE driver should
calculate a superset for all filters of events for a session.  Then,
generate branch records based on each event's specific filter.

The driver doesn't have enough information. If it is told to schedule
event A, it doesn't know anything about event B. It could in theory
try to remember event B if event B had already been scheduled, but it
never knows when event B is gone.

E.g., I tried the command below to enable 10 events in a perf session:

  perf record -e armv9_nevis/r04/ -e armv9_nevis/r05/ \
              -e armv9_nevis/r06/ -e armv9_nevis/r07/ \
              -e armv9_nevis/r08/ -e armv9_nevis/r09/ \
              -e armv9_nevis/r10/ -e armv9_nevis/r11/ \
              -e armv9_nevis/r12/ -e armv9_nevis/r13/ \
              -- sleep 1

For Arm PMU, the flow below is invoked for every event on every
affined CPU in the initialization phase:

  armpmu_event_init() {
    armv8pmu_set_event_filter();
  }

Shouldn't we calculate a superset branch filter for all events, store
it into a per-CPU data structure and then apply the filter on BRBE?

Should we? No.

*NONE* of the events in your example are CPU-bound, and the call to
armpmu_event_init() can happen on an arbitrary CPU which the relevant
event never actually runs on, while other unrelated events may run on
that CPU.

It makes no sense for armv8pmu_set_event_filter() to write to a per-cpu
structure. That's purely there to determine what the filters *should* be
when *that specific event* is programmed into hardware.

As Rob and I have pointed out already, the *only* thing that can be
relevant to deciding the configuration of HW filtering is the set of
events which are *active* on that CPU.

There are a number of cases of that shape given the set of configurable
filters. In theory it's possible to retain those in some cases, but I
don't think that the complexity is justified.

Similarly, whenever kernel branches are recorded it's necessary to drop
the stale branches whenever branch recording is paused, as there's
necessarily a blackout period and hence a gap in the records.

If we save the BRBE records when a process is switched out and then
restore them when the process is switched in, can we keep a decent
branch record for performance profiling?

Keep in mind that there's only 64 branches recorded at most. How many
branches in a context switch plus reconfiguring the PMU? Not a small
percentage of 64 I think. In traces where freeze on overflow was not
working (there's an example in v18), just the interrupt entry until
BRBE was stopped was a significant part of the trace. A context switch
is going to be similar.

That is true for tracing with kernel mode enabled.  But we will have no
such noise for userspace-only tracing.

As mentioned elsewhere, it's not a problem for x86, so why is it
magically a problem for arm64?

Do you have a reason why you think we *must* keep events around?

What I am really concerned about are cases when a process is preempted or
migrated.  The driver doesn't save and restore branch records for these
cases, it just invalidates all records when a task is scheduled in.

As a result, if an event overflow is close to context switch, it is
likely to capture incomplete branch records.  For userspace-only
tracing, there is a risk of capturing empty branch records after
preemption and migration.

There's the same risk if something else is recording kernel branches
when you are recording userspace only. I think the user has to be
aware if other things like context switches are perturbing their data.

I am confused by the description above.  Does it refer to branch
recording across different sessions?  It is fine for me that the branch
data is interleaved by different sessions (e.g. one is global tracing
and another is only per-thread tracing).

Imagine that there's an existing process with some pid ${PID}, and
concurrently, the following commands are run, either by the same user or
different users with appropriate permissions:

	# Trying to record user branches only
	perf record -j any,u -e cycles -p ${PID}

	# Trying to record kernel branches only
	perf record -j any,k -e cycles -p ${PID}

Whatever you do, the task trying to record user branches only will lose
some records:

* If we make the events mutually exclusive, the branches will only be
  recorded when the user event is installed.

* If we merge the HW filters and later apply a SW filter, it's very
  likely that kernel branches taken after exception entry have filled
  all the records, and there are no user branches left to sample.
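The second case can be illustrated with a toy model, which is an assumption-laden sketch rather than the driver's behaviour: it treats the branch buffer as a 64-entry ring (the record count mentioned earlier in the thread) that overwrites the oldest entry, records user and kernel branches under a merged filter, and then applies a software filter for user branches. All names here are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_RECORDS 64	/* record count quoted in the thread */

/* Toy model of a branch buffer that keeps only the newest records. */
struct branch_buf {
	bool is_kernel[NR_RECORDS];
	size_t head;	/* next slot to overwrite */
	size_t count;	/* number of valid records, at most NR_RECORDS */
};

static void record_branch(struct branch_buf *b, bool kernel)
{
	b->is_kernel[b->head] = kernel;
	b->head = (b->head + 1) % NR_RECORDS;
	if (b->count < NR_RECORDS)
		b->count++;
}

/* Software filter: count the records that are user branches. */
static size_t count_user_records(const struct branch_buf *b)
{
	size_t i, n = 0;

	for (i = 0; i < b->count; i++)
		if (!b->is_kernel[i])
			n++;
	return n;
}

/*
 * Record user_branches user branches followed by kernel_branches kernel
 * branches (e.g. after exception entry), then report how many user
 * records survive for the user-only event.
 */
static size_t user_records_left(size_t user_branches, size_t kernel_branches)
{
	struct branch_buf b = { 0 };
	size_t i;

	for (i = 0; i < user_branches; i++)
		record_branch(&b, false);
	for (i = 0; i < kernel_branches; i++)
		record_branch(&b, true);
	return count_user_records(&b);
}
```

In this model, once 64 or more kernel branches follow the user branches, `user_records_left()` returns 0: the software filter has nothing left to keep.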

We might need to consider an intact branch record for the single perf
session case.  E.g. if userspace program calls:

    func_a -> func_b -> func_c

In a case for only userspace tracing, we will have no chance to preserve
the call sequence of these functions after the program is switched out.

If those functions are small, it's very likely that they'll all be in
the branch history. If they're so large that they're not executed in one
scheduling quantum, do you expect them to fall within the same event
period?

I think that you're making a big deal out of an edge case that doesn't
matter much in practice.

Mark.
