[PATCH] perf tools: Add ARM Statistical Profiling Extensions (SPE) support

From: Kim Phillips <hidden>
Date: 2017-08-18 22:22:53
Also in: lkml

On Fri, 18 Aug 2017 17:59:25 +0100
Mark Rutland [off-list ref] wrote:

On Mon, Jul 17, 2017 at 07:48:22PM -0500, Kim Phillips wrote:

quoted

On Fri, 30 Jun 2017 15:02:41 +0100 Mark Rutland [off-list ref] wrote:

quoted

On Wed, Jun 28, 2017 at 08:43:10PM -0500, Kim Phillips wrote:

<snip>

quoted

 	if (evlist) {
 		evlist__for_each_entry(evlist, evsel) {
 			if (cs_etm_pmu &&
 			    evsel->attr.type == cs_etm_pmu->type)
 				found_etm = true;
+			if (arm_spe_pmu &&
+			    evsel->attr.type == arm_spe_pmu->type)
+				found_spe = true;

Given ARM_SPE_PMU_NAME is defined as "arm_spe_0", this won't detect all
SPE PMUs in heterogeneous setups (e.g. this'll fail to match "arm_spe_1"
and so on).

Can we not find all PMUs with a "arm_spe_" prefix?

... or, taking a step back, do we need some sysfs "class" attribute to
identify multi-instance PMUs?

Since there is one SPE per core, and it looks like the buffer full
interrupt line is the only difference between the SPE device node
specification in the device tree, I guess I don't understand why the
driver doesn't accept a singular "arm_spe" from the tool, and manage
interrupt handling accordingly.

There are also differences which can be probed from the device, which

The only thing I see is PMSIDR fields describing things like minimum
recommended sampling interval.  So if CPU A's SPE has that as 256, and
CPU B's is 512, then deny the user asking for a 256 interval across the
two CPUs.  Or, better yet, issue a warning stating the driver has raised
the interval to the lowest common denominator of all CPU SPEs involved
(512 in the above case).

are not described in the DT (but are described in sysfs). Some of these
are exposed under sysfs.

There may be further differences in subsequent revisions of the
architecture, too.

Future SPE lowest common denominator rules can be amended
according to the capabilities of the new system.

So the safest bet is to expose them separately, as we
do for other CPU-affine PMUs in heterogeneous systems.

Yes, perf is very hard to use on heterogeneous systems for this reason.
Cycles are cycles, it doesn't matter whether they're on an A53 or an
A72.

Meanwhile, this type of driver behaviour - and the fact that the drivers
are mute - hurts usability in heterogeneous environments, and can
easily be avoided.

quoted

Also, if a set of CPUs are missing SPE support, and the user doesn't
explicitly define a CPU affinity to outside that partition, then
decline to run, stating why.

It's possible for userspace to do this regardless; look for the set of
SPE PMUs, and then look at their masks.

The driver still has to check if what the user is asking for, is
doable.  They also may not be using the perf tool.

quoted

+	/*
+	 * To obtain the auxtrace buffer file descriptor, the auxtrace event
+	 * must come first.
+	 */
+	perf_evlist__to_front(evlist, arm_spe_evsel);

Huh? *what* needs the auxtrace buffer fd?

This seems really fragile. Can't we store this elsewhere?

It's copied from the bts code, and the other auxtrace record users do
the same; it looks like auxtrace record has implicit dependencies on it?

Is it definitely required? What happens if this isn't done?

It says it wouldn't obtain the auxtrace buffer file descriptor.

quoted

+static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused)
+{
+	u64 ts;
+
+	asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));
+
+	return ts;
+}

I do not think it's a good idea to read the counter directly like this.

What is this "reference" intended to be meaningful relative to?

AFAICT, it's just a nonce the perf tool uses to track unique events,
and I thought this better than the ETM driver's heavier get-random
implementation.

quoted

Why do we need to do this in userspace?

Can we not ask the kernel to output timestamps instead?

Why?  This gets the job done faster.

I had assumed that this needed to be correlated with the timestamps in
the event.

If this is a nonce, please don't read the counter directly in this way.
It may be trapped/emulated by a higher EL, making it very heavyweight.
The counter is only exposed so that the VDSO can use it, and that will
avoid using it in cases where it is unsafe.

Got it, thanks.

quoted

+	packet->type = ARM_SPE_EVENTS;
+	packet->index = events_len;

Huh? The events packet has no "index" field, so why do we need this?

To identify Events with a less number of comparisons in arm_spe_pkt_desc():
E.g., the LLC-ACCESS, LLC-REFILL, and REMOTE-ACCESS events are
identified iff index > 1.

It would be clearer to do the additional comparisons there.

Does this make a measureable difference in practice?

It should - I'll add a comment.

quoted

+int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,
+		     size_t buf_len)
+{
+	int ret, ns, el, index = packet->index;
+	unsigned long long payload = packet->payload;
+	const char *name = arm_spe_pkt_name(packet->type);
+
+	switch (packet->type) {

quoted

+	case ARM_SPE_ADDRESS:
+		switch (index) {
+		case 0:
+		case 1: ns = !!(packet->payload & NS_FLAG);
+			el = (packet->payload & EL_FLAG) >> 61;
+			payload &= ~(0xffULL << 56);
+			return snprintf(buf, buf_len, "%s %llx el%d ns=%d",
+				        (index == 1) ? "TGT" : "PC", payload, el, ns);

quoted

Could we please add a '0x' prefix to hex numbers, and use 0x%016llx so
that things get padded consistently?

I've added the 0x prefix, but prefer to not fix the length to 016: I
don't see any direct benefit, rather see benefits to having the length
vary, for output size control and less obvious reasons, e.g., sorting
address lines by their length to get a sense of address groups caught
during the run.  FWIW, Intel doesn't do the 016 either.

With padding, sorting will also place address groups together, so I'm
not sure I follow.

sorting by *line length* can be done to easily assess the address
groups in a dump:

$ grep -w PC dump | awk '{ print length, $0 }' | sort -nu
77 .  00000080:  b0 00 00 00 00 00 00 00 a0                      PC 0x0 el1 ns=1
82 .  00000000:  b0 94 61 43 00 00 00 00 80                      PC 0x436194 el0 ns=1
88 .  00000000:  b0 50 20 ac a7 ff ff 00 80                      PC 0xffffa7ac2050 el0 ns=1
89 .  00000040:  b0 80 2d 08 08 00 00 01 a0                      PC 0x1000008082d80 el1 ns=1

Padding makes it *much* easier to scan over the output by eye, as
columns of event data will always share the same alignment.

Addresses are already technically misaligned by virtue of their being
prepended with "PC" (2 chars) vs. "TGT" (3 chars):

82 .  00000000:  b0 94 61 43 00 00 00 00 80                      PC 0x436194 el0 ns=1
83 .  0000001e:  b1 68 61 43 00 00 00 00 80                      TGT 0x436168 el0 ns=1

89 .  00000040:  b0 80 2d 08 08 00 00 01 a0                      PC 0x1000008082d80 el1 ns=1
91 .  000005de:  b1 ec 9a 92 08 00 00 ff a0                      TGT 0xff000008929aec el1 ns=1

If you're talking about the postpended "elX ns=Y", well, that less
significant given the variable length is more quickly detected by the
eye - giving the astute reader hints of which execution level the
address is in - and can be parsed using variable length delimeters.

OTOH, we can rename the tokens, e.g., 

current PC  -> {NS,SE}EL{0,1,2,3}PC 0xAAAA
current TGT -> {NS,SE}EL{0,1,2,3}BT 0xAAAA

Where "BT" -> "Branch Target", which admittedly is less obvious to the
uninitiated.

So the last sample above would be:

89 .  00000040:  b0 80 2d 08 08 00 00 01 a0                      NSEL1PC 0x1000008082d80
91 .  000005de:  b1 ec 9a 92 08 00 00 ff a0                      NSEL1BT 0xff000008929aec

Is that better though?

Are there others opinionated here?

I'll get to the v2 review comments later.

Thanks for your feedback!

Kim

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help