Re: [PATCH v10 4/5] arm64: perf: Enable PMU counter userspace access for perf event
From: Rob Herring <robh@kernel.org>
Date: 2021-10-14 19:25:05
Also in:
linux-arm-kernel, lkml
On Thu, Oct 14, 2021 at 11:58 AM Mark Rutland [off-list ref] wrote:
Hi Rob, This looks pretty good! I have one largish query below, and otherwise only trivialities that I'm happy to fix up. On Tue, Sep 14, 2021 at 03:47:59PM -0500, Rob Herring wrote:quoted
Arm PMUs can support direct userspace access of counters which allows for low overhead (i.e. no syscall) self-monitoring of tasks. The same feature exists on x86 called 'rdpmc'. Unlike x86, userspace access will only be enabled for thread bound events. This could be extended if needed, but simplifies the implementation and reduces the chances for any information leaks (which the x86 implementation suffers from). PMU EL0 access will be enabled when an event with userspace access is part of the thread's context. This includes when the event is not scheduled on the PMU. There's some additional overhead clearing dirty counters when access is enabled in order to prevent leaking disabled counter data from other tasks. Unlike x86, enabling of userspace access must be requested with a new attr bit: config1:1. If the user requests userspace access and 64-bit counters, then chaining will be disabled and the user will get the maximum size counter the underlying h/w can support. The modes for config1 are as follows: config1 = 0 : user access disabled and always 32-bit config1 = 1 : user access disabled and always 64-bit (using chaining if needed) config1 = 2 : user access enabled and always 32-bit config1 = 3 : user access enabled and counter size matches underlying counter.We probably need to note somewhere (i.e. in the next patch) that we mean *logically* 32-bit, and this could be a biased 64-bit counter, so userspace needs to treat the upper 32-bits of counters as UNKNOWN.
Okay, though this detail doesn't matter if the user uses the correct read loop (now in libperf).
For the `config1 = 3` case (potentially) overriding the usual long semantic, I'm struggling to understand why we need that rather than forcing the use of a 64-bit counter, because in that case: * For a CPU_CYCLES event: __armv8_pmuv3_map_event() will always pick 64-bits get_event_idx() may fail to allocate a 64-bit counter. * For other events: __armv8_pmuv3_map_event() will pick 32/64 based on long counter support get_event_idx() will only fail if there are no counters free. Whereas if __armv8_pmuv3_map_event() returned an error for the latter when long counter support is not implemented, we'd have consistent `long` semantics, and the CPU_CYCLES behaviour would be identical. What's the rationale for `3` leaving the choice to the kernel?
It's the give me the maximum sized counter the h/w can support choice. That's easier for userspace to implement. Bit 1 is more of a hint that the user wants userspace access rather than a requirement.
If the problem is discoverability, I'd be happy to add something to sysfs to describe whether the PMU has long event support.
Checking sysfs or a try for 64-bit support then fall back to 32-bit support isn't much difference. Keep in mind that x86 always succeeds here. Every userspace user will have to add whatever dance we create here. For example, each libperf test with user access (there's only 2 in my tree, but there's a series adding more) has to have an '#ifdef __aarch64__' for whatever we do here. I was seeking to minimize that. Right now, that's just a set config1 to 0x3. Also, note that libperf will opportunistically use a userspace read instead of read(). The user just has to mmap the event and libperf will use a userspace read when enabled which ultimately depends on what the mmapped page says. [...]
quoted
@@ -995,9 +1050,23 @@ static int __armv8_pmuv3_map_event(struct perf_event *event, &armv8_pmuv3_perf_cache_map, ARMV8_PMU_EVTYPE_EVENT); - if (armv8pmu_event_is_64bit(event)) + /* + * At this point, the counter is not assigned. If a 64-bit counter is + * requested, we must make sure the h/w has 64-bit counters if we set + * the event size to 64-bit because chaining is not supported with + * userspace access. This may still fail later on if the CPU cycle + * counter is in use. + */ + if (armv8pmu_event_is_64bit(event) && + (!armv8pmu_event_want_user_access(event) || + armv8pmu_has_long_event(armpmu) || (hw_event_id == ARMV8_PMUV3_PERFCTR_CPU_CYCLES))) event->hw.flags |= ARMPMU_EVT_64BIT;If we can follow my suggestion in reply to the cover text, we can make this: if (armv8pmu_event_is_64bit(event)) event->hw.flags |= ARMPMU_EVT_64BIT; /* * User events must be allocated into a single counter, and so * must not be chained. * * Most 64-bit events require long counter support, but 64-bit * CPU_CYCLES events can be placed into the dedicated cycle * counter when this is free. * if (armv8pmu_event_want_user_access()) { if (armv8pmu_event_is_64bit(event) && (hw_event_id != ARMV8_PMUV3_PERFCTR_CPU_CYCLES) && !armv8pmu_has_long_event(armpmu)) return -EINVAL; }quoted
+ /* Userspace counter access only enabled if requested and a per task event */ + if (sysctl_perf_user_access && armv8pmu_event_want_user_access(event) && + (event->attach_state & PERF_ATTACH_TASK)) + event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;Can we please explicitly reject !PERF_ATTACH_TASK case? If the user requested something we don't intend to support, I'd rather return -EINVAL here, rather than continue on.
This is similar to the 64-bit case though I'm somewhat less concerned here given per cpu events aren't too useful in this case and the setup is a bit different already. Rob