Re: [RFC][PATCH 00/21] tracing: Inter-event (e.g. latency) support

From: Namhyung Kim <namhyung@kernel.org>
Date: 2017-02-10 04:16:28
Also in: lkml

Hi Tom,

On Wed, Feb 08, 2017 at 11:24:56AM -0600, Tom Zanussi wrote:

This patchset adds support for 'inter-event' quantities to the trace
event subsystem.  The most important example of inter-event quantities
are latencies, or the time differences between two events.

One of the main motivations for adding this capability is to provide a
general-purpose base that existing existing tools such as the -RT
latency_hist patchset can be built upon, while at the same time
providing a simple way for users to track latencies (or any
inter-event quantity) generically between any two events.

Previous -RT latency_hist patchsets that take advantage of the trace
event subsystem have been submitted, but they essentially hard-code
special-case tracepoints and application logic in ways that can't be
reused.  It seemed to me that rather than providing a one-off patchset
devoted specifically to generating the specific histograms in the
latency_hist patchset, it should be possible to build the same
functionality on top of a generic layer allowing users to do similar
things for other non-latency_hist applications.

In addition to preliminary patches that add some basic missing
functionality such as a common ringbuffer-derived timestamp and
dynamically-creatable tracepoints, the overall patchset is divided up
into a few different areas that combine to produce the overall goal
(The Documentation patch explains all the details):

Looks very nice!

  - variables and simple expressions required to calculate a latency

    In order to calculate a latency or any inter-event value,
    something from one event needs to be saved and later retrieved,
    and some operation such as subtraction or addition is performed on
    it.  This means some minimal form of variables and expressions,
    which the first set of patches implements.  Saving and retrieving
    events to use in a latency calculation is normally done using a
    hash table, and that's exactly what we have with trace event hist
    triggers, so that's where variables are instantiated, set, and
    retrieved.  Basically, variables are set on one entry and
    retrieved and used by a 'matching' event.

  - 'synthetic' events, combining variables from other events

    The trace event interface is based on pseudo-files associated with
    individual events, so it wouldn't really make sense to have
    quantities derived from multiple events attached to any one of
    those events.  For that reason, the patchset implements a means of
    combining variables from other events into a separate 'synthetic'
    event, which can be treated as if it were just like any other
    trace event in the system.

  - 'actions' generating synthetic events, among other things

    Variables and synthetic events provide the data and data structure
    for new events, but something still needs to actually generate an
    event using that data.  'Actions' are expanded to provide that
    capability.  Though it hasn't been explicitly called as much
    before, the default 'action' currently for a hist trigger is to
    update the matching histogram entry's sum values.  This patchset
    essentially expands that to provide a new 'onmatch.trace(event)'
    action that can be used to have one event generate another.  The
    mechanism is extensible to other actions, and in fact the patchset
    also includes another, 'onmax(var).save(field,...)' that can be
    used to save context whenever a value exceeds the previous maximum
    (something also needed by latency_hist).

I'm submitting the patchset (based on tracing/for-next) as an RFC not
only to get comments, but because there are still some problems I
haven't fixed yet...

Here are some examples that should make things less abstract.

  ====
  Example - wakeup latency
  ====

  This basically implements the -RT latency_hist 'wakeup_latency'
  histogram using the synthetic events, variables, and actions
  described.  The output below is from a run of cyclictest using the
  following command:

    # rt-tests/cyclictest -p 80 -n -s -t 2

  What we're measuring the latency of is the time between when a
  thread (of cyclictest) is awakened and when it's scheduled in.  To
  do that we add triggers to sched_wakeup and sched_switch with the
  appropriate variables, and on a matching sched_switch event,
  generate a synthetic 'wakeup_latency' event.  Since it's just
  another trace event like any other, we can also define a histogram
  on that event, the output of which is what we see displayed when
  reading the wakeup_latency 'hist' file.

  First, we create a synthetic event called wakeup_latency, that
  references 3 variables from other events:

    # echo 'wakeup_latency lat=sched_switch:wakeup_lat \
                           pid=sched_switch:woken_pid \
                           prio=sched_switch:woken_prio' >> \
            /sys/kernel/debug/tracing/synthetic_events

  Next we add a trigger to sched_wakeup, which saves the value of the
  'common_timestamp' when that event is hit in a variable, ts0.  Note
  that this happens only when 'comm==cyclictest'.

  Also, 'common_timestamp' is a new field defined on every event (if
  needed - if there are no users of timestamps in a trace, timestamps
  won't be saved and there's no additional overhead from that).

    #  echo 'hist:keys=pid:ts0=common_timestamp.usecs if \
             comm=="cyclictest"' >> \
             /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger

  Next, we add a trigger to sched_switch.  When the pid being switched
  to matches the pid woken up by a previous sched_wakeup event, this
  event grabs the ts0 saved on that event, takes the difference
  between it and the current sched_switch's common_timestamp, and
  assigns it to a new 'wakeup_lat' variable.  It also saves a couple
  other variables and then invokes the onmatch().trace() action which
  generates a new wakeup_latency event using those variables.

    # echo 'hist:keys=woken_pid=next_pid:woken_prio=next_prio:\
       wakeup_lat=common_timestamp.usecs-ts0:onmatch().trace(wakeup_latency) \
            if next_comm=="cyclictest"' >> \
            /sys/kernel/debug/tracing/events/sched/sched_switch/trigger

As Masami said, I think the syntax is a bit hard to understand.  Also
it'd be nice to access an event field directly (i.e. not by adding a
field in a hist).  Maybe we can use a prefix like '$' to identify hist
fields..

How about below?

  # echo 'wakeup_latency \
  		lat=sched_switch.$wakeup_lat  \
		pid=sched_switch.next_pid     \
		prio=sched_switch.next_prio' >> \
	/sys/kernel/debug/tracing/synthetic_events

  # echo 'hist: \
  		keys=pid: \
		ts0=common_timestamp.usec \
		if comm=="cyclictest"' >> \
	/sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger

  # echo 'hist: \
		keys=next_pid: \
		wakeup_lat=common_timestamp.usec-$ts0: \
		onmatch(sched_wakeup).trace(wakeup_latency) \
		if next_comm=="cyclictest"' >> \
	/sys/kernel/debug/tracing/events/sched/sched_switch/trigger

By passing an event name to 'onmatch', we can know where to find $ts0
easily IMHO.

Thanks,
Namhyung

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help