Re: [RFC] perf: proposed perf_event_open() manpage

From: Namhyung Kim <namhyung@kernel.org>
Date: 2012-10-24 06:54:53
Also in: linux-perf-users, lkml


Hi Vince,

Great work!

On Tue, 23 Oct 2012 11:35:13 -0400 (EDT), Vince Weaver wrote:

Hello

attached is a proposed manpage for the perf_event_open() system call.

I'd appreciate any review or comments, especially for the parts marked
as FIXME or "[To be documented]"

This system call has a complicated interface and I'm sure I've missed
or glossed over various important features, so your feedback is needed and 
appreciated.

The eventual goal is to have this included with the Linux man-pages 
project.

[snip]

.BI "int perf_event_open(struct perf_event_attr *" hw_event ,

hw_event?  That name looks unusual.  How about 'attr'?

.BI "                    pid_t " pid ", int " cpu ", int " group_fd ,
.BI "                    unsigned long " flags  );
.fi

[snip]

.SS Arguments
.P
The argument
.I pid
allows events to be attached to processes in various ways.
If
.I pid
is 0, measurements happen on the current task, if
.I pid
is greater than 0, the process indicated by
.I pid
is measured, and if
.I pid
is less than 0, all processes are counted.

Is that true?  Shouldn't pid be exactly -1 to count all processes?

The
.I cpu
argument allows measurements to be specific to a CPU.
If
.I cpu
is greater than or equal to 0,
measurements are restricted to the specified CPU;
if
.I cpu
is \-1, the events are measured on all CPUs.
.P
Note that the combination of
.IR pid " == \-1"
and
.IR cpu " == \-1"
is not valid.
.P
A
.IR pid " > 0"

s/>/>=/ ?

and
.IR cpu " == \-1"
setting measures per-process and follows that process to whatever CPU the
process gets scheduled to.
Per-process events can be created by any user.
.P
A
.IR pid " == \-1"
and
.IR cpu " >= 0"
setting is per-CPU and measures all processes on the specified CPU.
Per-CPU events need the
.B CAP_SYS_ADMIN
capability.

Or the value of perf_event_paranoid must be less than 1.
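
To make the pid/cpu rules above easier to check, here is a minimal sketch in C.  The enum, classify_target() and sys_perf_event_open() are names I made up for illustration; only the syscall itself is real (glibc provides no wrapper for it).  Treating pid < -1 and cpu < -1 as invalid is my assumption, and the pid >= 0 with cpu >= 0 case is my reading of the text, not something the draft spells out.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical labels for the pid/cpu combinations discussed above. */
enum perf_scope {
    SCOPE_INVALID,       /* pid == -1 and cpu == -1, or out-of-range values */
    SCOPE_TASK_ANY_CPU,  /* pid >= 0, cpu == -1: follows the task across CPUs */
    SCOPE_TASK_ONE_CPU,  /* pid >= 0, cpu >= 0: task, only while on that CPU */
    SCOPE_PER_CPU        /* pid == -1, cpu >= 0: all tasks on one CPU */
};

/* Classify a (pid, cpu) pair the way the Arguments text describes it. */
enum perf_scope classify_target(pid_t pid, int cpu)
{
    if (pid < -1 || cpu < -1)
        return SCOPE_INVALID;   /* assumption: only -1 is a special value */
    if (pid == -1 && cpu == -1)
        return SCOPE_INVALID;   /* stated as not valid in the draft */
    if (pid == -1)
        return SCOPE_PER_CPU;   /* needs CAP_SYS_ADMIN or paranoid < 1 */
    if (cpu == -1)
        return SCOPE_TASK_ANY_CPU;
    return SCOPE_TASK_ONE_CPU;
}

/* Thin wrapper around the raw system call; returns an fd or -1. */
long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                         int group_fd, unsigned long flags)
{
    assert(attr != NULL);
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}
```

For example, classify_target(0, -1) describes the common "measure myself wherever I run" case, and classify_target(-1, 0) describes the per-CPU case that needs extra privileges.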

.TP
.RB "dynamic PMU"
Since Linux 2.6.39,
.BR perf_event_open()
can support multiple PMUs.
To enable this, a value exported by the kernel can be used in the
.I type
field to indicate which PMU to use.
The value to use can be found in the sysfs filesystem:
there is a subdirectory per PMU instance under
.IR /sys/devices .

The right place is /sys/bus/event_source/devices.

In each sub-directory there is a
.I type
file whose content is an integer that can be used in the
.I type
field.
For instance,
.I /sys/devices/cpu/type

/sys/bus/event_source/devices/cpu/type

contains the value for the core CPU PMU, which is usually 4.
.RE
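
A short example might help readers here.  The following is only a sketch of reading a PMU's type value from its sysfs "type" file; parse_pmu_type() and read_pmu_type() are hypothetical helper names, and the path in the comment is the one suggested above.

```c
#include <assert.h>
#include <stdio.h>

/* Parse the integer PMU type from an already-open "type" file.
 * Returns the value, or -1 if no integer could be read. */
int parse_pmu_type(FILE *f)
{
    int type;

    assert(f != NULL);
    if (fscanf(f, "%d", &type) != 1)
        return -1;
    return type;
}

/* Read the PMU type from a sysfs path such as
 * /sys/bus/event_source/devices/cpu/type.  Returns -1 on any error. */
int read_pmu_type(const char *path)
{
    FILE *f;
    int type;

    assert(path != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    type = parse_pmu_type(f);
    fclose(f);
    return type;
}
```

The returned value would then be stored in the type field of struct perf_event_attr before calling perf_event_open().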

[snip]

.TP
.IR sample_period ", " sample_freq
A "sampling" counter is one that generates an interrupt
every N events, where N is given by
.IR sample_period .
A sampling counter has
.IR sample_period " > 0."

How about adding this here:

"When an (overflow) interrupt is generated, the requested data (a sample)
is recorded."

The
.I sample_type
field controls what data is recorded on each interrupt.

.I sample_freq
can be used if you wish to use frequency rather than period.
In this case you set the
.I freq
flag.
The kernel will adjust the sampling period
to try and achieve the desired rate.
The rate of adjustment is a
timer tick.

Is that true?  I thought it'd be adjusted whenever an overflow occurs.
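
Either way, an example of the two modes could make the period/frequency distinction clearer.  This is a sketch under my own assumptions: the helper names are hypothetical, and the cycles event is chosen only for illustration.  The fields (sample_period, sample_freq, freq) are the ones the draft names.

```c
#include <assert.h>
#include <string.h>
#include <linux/perf_event.h>

/* Zero the attr and select the hardware cycles event (illustrative choice). */
void init_cycles_attr(struct perf_event_attr *attr)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
}

/* Period mode: request a sample every n events. */
void set_sample_period(struct perf_event_attr *attr, unsigned long long n)
{
    assert(attr != NULL);
    attr->freq = 0;
    attr->sample_period = n;
}

/* Frequency mode: set the freq flag and give a target rate; the kernel
 * then adjusts the period to try to achieve it. */
void set_sample_freq(struct perf_event_attr *attr, unsigned long long hz)
{
    assert(attr != NULL);
    attr->freq = 1;
    attr->sample_freq = hz;
}
```

Note that sample_period and sample_freq share storage in the structure, so the freq flag is what tells the kernel how to interpret the value.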


.TP
.I "sample_type"
The various bits in this field specify which values to include
in the overflow packets.

I guess "overflow packets" here means samples.  It would be better to
use one term consistently for the same thing.

They will be recorded in a ring-buffer,
which is available to user-space using
.BR mmap (2).
The order in which the values are saved in the
overflow packets is documented in the MMAP Layout subsection below;
it is not the
.I "enum perf_event_sample_format"
order.
.RS
.TP
.B PERF_SAMPLE_IP
instruction pointer
.TP
.B PERF_SAMPLE_TID
thread id
.TP
.B PERF_SAMPLE_TIME
time
.TP
.B PERF_SAMPLE_ADDR
address
.TP
.B PERF_SAMPLE_READ
[To be documented]

It lets an event group sample on the leader only.  The values of the
other members are read when an interrupt occurs on the leader.

Jiri is working on it.

.TP
.B PERF_SAMPLE_CALLCHAIN
[To be documented]

callchain (or stack backtrace)

.TP
.B PERF_SAMPLE_ID
[To be documented]

unique(?) id for the opened event.

.TP
.B PERF_SAMPLE_CPU
[To be documented]

cpu number

.TP
.B PERF_SAMPLE_PERIOD
[To be documented]

event count

.TP
.B PERF_SAMPLE_STREAM_ID
[To be documented]
.TP
.B PERF_SAMPLE_RAW
[To be documented]

additional data - usually for tracepoint events

.TP
.BR PERF_SAMPLE_BRANCH_STACK " (Since Linux 3.4)"
[To be documented]

requested branch stack - only supported on Intel machines that have the
LBR feature(?).  See branch_sample_type.

.RE
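
It may also be worth showing how the bits are combined.  A minimal sketch, where request_basic_samples() is a hypothetical helper and the particular selection of bits is only an illustration:

```c
#include <assert.h>
#include <string.h>
#include <linux/perf_event.h>

/* Ask for a common set of values in each sample.  The bits are OR-ed
 * together; the order they appear in the recorded sample is defined by
 * the record layout, not by the order written here. */
void request_basic_samples(struct perf_event_attr *attr, int want_callchain)
{
    assert(attr != NULL);
    attr->sample_type = PERF_SAMPLE_IP      /* instruction pointer */
                      | PERF_SAMPLE_TID     /* thread id */
                      | PERF_SAMPLE_TIME    /* time */
                      | PERF_SAMPLE_CPU     /* cpu number */
                      | PERF_SAMPLE_PERIOD; /* event count */
    if (want_callchain)
        attr->sample_type |= PERF_SAMPLE_CALLCHAIN; /* stack backtrace */
}
```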

[snip]

.SS /proc/sys/kernel/perf_event_paranoid

The
.I /proc/sys/kernel/perf_event_paranoid
file can be set to restrict access to the performance counters.
2
means no measurements allowed,

This is not true.  A value of 2 still allows user-mode measurements,
just not kernel-mode ones.

$ cat /proc/sys/kernel/perf_event_paranoid 
2

$ perf stat usleep 1
  Error: You may not have permission to collect stats.
	 Consider tweaking /proc/sys/kernel/perf_event_paranoid or running as root.
Not all events could be opened.

$ perf stat -e cycles:u usleep 1

 Performance counter stats for 'usleep 1':

           253,055 cycles:u                  #    0.000 GHz                    

       0.001988538 seconds time elapsed

1
means normal counter access,

This includes kernel mode measurements.

0
means you can access CPU-specific data, and

But cannot access raw tracepoint samples.

\-1
means no restrictions.


Thanks,
Namhyung
