[PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension

From: Kim Phillips <hidden>
Date: 2017-06-29 01:00:01
Also in: lkml

On Wed, 28 Jun 2017 12:26:02 +0100
Mark Rutland [off-list ref] wrote:

On Tue, Jun 27, 2017 at 04:07:58PM -0500, Kim Phillips wrote:

I'm close to finishing the bts version of userspace, and have been
testing a bit more thoroughly, so now I consistently see the excessive
PADding when recording a CPU that's idle. I.e., when I taskset the perf
record to the same CPU I specify to record's -C (taskset -c n perf
record -C n), I get at most twenty-odd PAD bytes at the end of
the AUX buffers in the perf.data file.  If, OTOH, I taskset -c n perf
record -C m, where m != n, I get a couple of valid event records in the
buffer, and the rest of the buffer is filled with PADding.

It wouldn't be a problem except that it wastes too much space
sometimes.  Here is a good output buffer sample from a --mmap-pages=,12
run, with only 4 PADs tacked onto the end:

0xd190 [0x30]: PERF_RECORD_AUXTRACE size: 0x48  offset: 0  ref: 0xe914f7e3ce  idx: 0  tid: -1  cpu: 2
.
. ... ARM SPE data: size 72 bytes
.  00000000:  4a 01                                           B COND

[...]

.  0000003b:  71 a5 39 e1 14 e9 00 00 00                      TS 1001077684645
.  00000044:  00                                              PAD
.  00000045:  00                                              PAD
.  00000046:  00                                              PAD
.  00000047:  00                                              PAD
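
As a cross-check on the dump above, here is a minimal Python sketch of how the TS line can be decoded by hand. It assumes, based only on what the dump shows, that a timestamp packet is a 0x71 header byte followed by a 64-bit little-endian payload; decode_ts_packet is a hypothetical helper name, not something perf provides.

```python
import struct


def decode_ts_packet(packet: bytes) -> int:
    """Decode a 9-byte timestamp packet: 0x71 header + 64-bit LE payload.

    The layout is an assumption read off the dump above, not taken from
    the perf sources.
    """
    if len(packet) != 9 or packet[0] != 0x71:
        raise ValueError("not a 9-byte packet starting with 0x71")
    (value,) = struct.unpack("<Q", packet[1:])
    return value


# Bytes copied from offset 0x3b of the first AUX buffer dump.
raw = bytes.fromhex("71 a5 39 e1 14 e9 00 00 00")
print(decode_ts_packet(raw))  # prints 1001077684645
```

The result matches the "TS 1001077684645" annotation printed next to those bytes.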

whereas this one - from later on in the same run - is over 99% PADs: 

0xd250 [0x30]: PERF_RECORD_AUXTRACE size: 0x5fc0  offset: 0xfffff4ae0044  ref: 0xe91cead1dd  idx: 0  tid: -1  cpu: 2
.
. ... ARM SPE data: size 24512 bytes
.  00000000:  4a 00                                           B

[...]

.  000000b0:  71 8f 4e e1 14 e9 00 00 00                      TS 1001077689999
.  000000b9:  00                                              PAD
...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...
.  00005fbf:  00                                              PAD
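
To put numbers on the two dumps, the PAD share can be worked out from the buffer size and the offset of the first PAD byte. This is only a sketch: pad_fraction is a hypothetical helper, and it assumes that everything from the first PAD byte to the end of the buffer is padding, as the dumps suggest.

```python
def pad_fraction(buffer_size: int, first_pad_offset: int) -> float:
    """Fraction of an AUX buffer that is padding.

    Assumes every byte from first_pad_offset to the end is a PAD byte.
    """
    pad_bytes = buffer_size - first_pad_offset
    return pad_bytes / buffer_size


# First dump: size 0x48 (72 bytes), first PAD at offset 0x44.
good = pad_fraction(0x48, 0x44)
# Second dump: size 0x5fc0 (24512 bytes), first PAD at offset 0xb9.
bad = pad_fraction(0x5FC0, 0xB9)

print(f"good buffer: {good:.1%} PAD")  # prints "good buffer: 5.6% PAD"
print(f"bad buffer:  {bad:.1%} PAD")   # prints "bad buffer:  99.2% PAD"
```

So the second buffer carries 185 bytes of records and 24327 bytes of padding.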

Interesting.

If you cat /proc/interrupts, do you see many more SPE interrupts on CPU
n than on m?

When n == m, I see approx. 1 IRQ per SPE buffer full.

When n != m, I see neither CPU n nor m incur SPE interrupts; the
workload ran but didn't get recorded, or, rather, 'idleness' got
recorded instead.
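
For anyone repeating the /proc/interrupts comparison, a rough Python sketch of the per-CPU tally follows. The sample text is made up for illustration (the IRQ numbers, counts and chip columns are not from my system); the only name taken from this thread is the "arm_spe_pmu" action name, and irq_counts_per_cpu is a hypothetical helper.

```python
def irq_counts_per_cpu(interrupts_text: str, action_name: str) -> dict:
    """Sum per-CPU counts of all /proc/interrupts rows naming action_name."""
    lines = interrupts_text.splitlines()
    cpus = lines[0].split()              # header row: CPU0 CPU1 ...
    totals = {cpu: 0 for cpu in cpus}
    for line in lines[1:]:
        if action_name not in line:
            continue
        fields = line.split()
        # fields[0] is the "N:" label; the next len(cpus) fields are counts.
        for cpu, count in zip(cpus, fields[1:1 + len(cpus)]):
            totals[cpu] += int(count)
    return totals


# Hypothetical sample; on a live system use open("/proc/interrupts").read().
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
  3:       1200       1180       1150       1175     GICv3  30 Level     arch_timer
  6:          0          7          0          0     GICv3  21 Level     arm_spe_pmu
"""

print(irq_counts_per_cpu(SAMPLE, "arm_spe_pmu"))
```

Taking one snapshot before and one after the perf record run and subtracting the two gives the number of SPE interrupts each CPU took during the run.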

Otherwise, I wonder if this is some odd interaction with idle. Can you
try to forcefully load that other CPU?

e.g. run something like:

	taskset -c <n> sh -c 'while true; do :; done'

... in parallel with the tracer.

If I do a:

taskset -c 1 sh -c 'while true; do echo blah > /dev/null; done' &
taskset -c 0 perf record -C 1 ...

then non-idleness and non-PADdingness get recorded.

For reference, what was your event sample period (i.e. the value of
perf_event_attr::sample_period)?

Did you modify that at all with PERF_EVENT_IOC_PERIOD?

If that's the same as 'perf record -c <period>', then, yes, I set
the period to values such as 512, 1024.

Meanwhile, when using fvp-base.dtb, my model setup stops booting the
kernel after "smp: Bringing up secondary CPUs ...".  If I however take
the second SPE node from fvp-base.dts and add it to my working device
tree, I get this during the driver probe:

[    1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
[    1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
[    1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu)

Looks like you've screwed up your IRQ partitions, so you are effectively
registering the same device twice, which then blows up due to lack of shared
irqs.

Either remove one of the devices, or use IRQ partitions to restrict them
to unique sets of CPUs.

Right, but since I want to get parity with what you're running -
fvp-base.dtb - I tried to debug the hang after "smp: Bringing up
secondary CPUs ...", and could only trace it to the PSCI driver
hitting one of these cases:

case PSCI_RET_INVALID_PARAMS:
case PSCI_RET_INVALID_ADDRESS:

Sounds like your DT is describing CPUs that don't exist (or perhaps the
same CPU several times). Certainly, PSCI and the kernel disagree on
which CPUs exist.

What exact DT are you using?

The one this commit to linux-will's perf/spe branch provides:

commit 2a73de57eaf61cdfd61be1e20a46e4a2c326775f
Author: Marc Zyngier [off-list ref]
Date:   Tue Mar 11 18:14:45 2014 +0000

    arm64: dts: add model device-tree
    
    Signed-off-by: Marc Zyngier [off-list ref]
    Signed-off-by: Will Deacon [off-list ref]

Are you using the bootwrapper, or ATF? I'm guessing you're using the
bootwrapper.

I'm using the wrapper to wrap arm-trusted-firmware (ATF?) objects, so,
both?  I noticed the wrapper I was using was pretty old, so I updated
it.

arm-trusted-firmware, btw, has just been updated to enable SPE at lower
ELs, so I don't have to use a hacked-up version anymore.

I also updated my BL33 to the latest upstream u-boot
vexpress_aemv8a_dram_defconfig, and at least now the kernel continues
to boot, even though it can't bring up 6 of the 7 secondary CPUs.

Which version of the bootwrapper are you using? If it doesn't have
commit:

  ccdc936924b3682d ("Dynamically determine the set of CPUs")

... have you configured it appropriately with --with-cpu-ids?

How is your model configured?

CLUSTER0_NUM_CORES=4
CLUSTER1_NUM_CORES=4

Which CPU IDs does it think exist?

1,2,3,4,0x100,0x101,0x102,0x103

...which are different from the ones in the above device tree:

0,0x100,0x200,0x300,0x10000,0x10100,0x10200,0x10300

So I imagine that's the problem, thanks!
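
To make the mismatch explicit, here is a small Python sketch that diffs the two ID lists exactly as quoted above. It only does a set comparison of the numbers as written and makes no attempt to interpret them as affinity fields; compare_cpu_ids is a hypothetical helper.

```python
def compare_cpu_ids(model_ids: set, dt_ids: set) -> dict:
    """Split two CPU ID sets into shared, model-only and DT-only IDs."""
    return {
        "common": sorted(model_ids & dt_ids),
        "model_only": sorted(model_ids - dt_ids),
        "dt_only": sorted(dt_ids - model_ids),
    }


# IDs the model reports, as listed above.
model_ids = {1, 2, 3, 4, 0x100, 0x101, 0x102, 0x103}
# IDs described by the device tree, as listed above.
dt_ids = {0, 0x100, 0x200, 0x300, 0x10000, 0x10100, 0x10200, 0x10300}

diff = compare_cpu_ids(model_ids, dt_ids)
for name, ids in diff.items():
    print(name, [hex(i) for i in ids])
```

Only 0x100 appears in both lists, so seven of the eight CPUs the device tree describes have no counterpart on the model side.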

I don't see how to tell the model to put the CPUs at different
addresses, only a lot of GICv3 redistributor switches?  btw, where can
I get updates to the run-model.sh scripts?  Answer off-list?

Note: it's yet another place I have to manually instrument the error
path in a kernel driver in lieu of it being more naturally verbose by
itself; I *implore* you to reconsider adding proper user messaging to
arm_spe_pmu_event_init().

Given this is a FW configuration issue (i.e. a system-level error), I'm
more than happy to make the PSCI driver messages more helpful where
possible.

That's completely orthogonal to the SPE debug messages for requests made
by the user.

I respectfully disagree, given the current state of the interfaces
involved.

I can't tell which part of the fvp-base device tree is not liked by the
firmware; I tried different combinations of the PSCI node, different CPU
enumerations (cpu@100 vs cpu@1), removing idle-states properties...any
hints appreciated.

The bootwrapper doesn't support idle. So no idle-states should be in the
DT.

If you can share your DT, bootwrapper configuration, and model
configuration, I can try to debug this with you.

I reverted the wrapper's ccdc936924b3682d ("Dynamically determine the
set of CPUs") commit you mentioned above, and specified the cpu-ids
manually, and am now running with 8 CPUs, although Linux enumerates
them as 0,1,8,9,10,11,12,13?

Thanks for your continued support,

Kim
