Re: [RFC PATCH 0/5] Generic IPI sending tracepoint
From: Valentin Schneider <vschneid@redhat.com>
Date: 2022-10-11 16:17:35
Also in:
linux-alpha, linux-arm-kernel, linux-mips, linux-riscv, linux-s390, linux-sh, lkml, loongarch, sparclinux
+Cc Douglas On 07/10/22 17:01, Marcelo Tosatti wrote:
Hi Valentin, On Fri, Oct 07, 2022 at 04:41:40PM +0100, Valentin Schneider wrote:quoted
Background ========== As for the targeted CPUs, the existing tracepoint does export them, albeit in cpumask form, which is quite inconvenient from a tooling perspective. For instance, as far as I'm aware, it's not possible to do event filtering on a cpumask via trace-cmd.https://man7.org/linux/man-pages/man1/trace-cmd-set.1.html -f filter Specify a filter for the previous event. This must come after a -e. This will filter what events get recorded based on the content of the event. Filtering is passed to the kernel directly so what filtering is allowed may depend on what version of the kernel you have. Basically, it will let you use C notation to check if an event should be processed or not. ==, >=, <=, >, <, &, |, && and || The above are usually safe to use to compare fields. This looks overkill to me (consider large number of bits set in mask). +#define trace_ipi_send_cpumask(callsite, mask) do { \ + if (static_key_false(&__tracepoint_ipi_send_cpu.key)) { \ + int cpu; \ + for_each_cpu(cpu, mask) \ + trace_ipi_send_cpu(callsite, cpu); \ + } \ +} while (0)
Indeed, I expected pushback on this :-)
I went for this due to how much simpler an int is to process/use compared
to a cpumask. There is the trigger example I listed above, but the
consumption of the trace event itself as well.
Consider this event collected on an arm64 QEMU instance (output from trace-cmd)
<...>-234 [001] 37.251567: ipi_raise: target_mask=00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000004 (Function call interrupts)
That sort of formatting has been an issue downstream for things like LISA
[1] where events are aggregated into Pandas tables, and we need to play
silly games for performance reason because bitmasks aren't a native Python
type.
I had a look at libtraceevent to see how this data is exposed and if the
answer would be better tooling:
tep_get_field_val() just yields an unsigned long long of value 0x200018,
which AFAICT is just the [length, offset] thing associated with dynamic
arrays. Not really usable, and I don't see anything exported in the lib to
extract and use those values.
tep_get_field_raw() is better, it handles the dynamic array for us and
yields a pointer to the cpumask array at the tail of the record. With that
it's easy to get an output such as: cpumask[size=32]=[4,0,0,0,]. Still,
this isn't a native type for many programming languages.
In contrast, this is immediately readable and consumable by userspace tools
<...>-234 [001] 37.250882: ipi_send_cpu: callsite=__smp_call_single_queue+0x5c target_cpu=2
Thinking out loud, it makes way more sense to record a cpumask in the
tracepoint, but perhaps we could have a postprocessing step to transform
those into N events each targeting a single CPU?
[1]: https://github.com/ARM-software/lisa/blob/37b51243a94b27ea031ff62bb4ce818a59a7f6ef/lisa/trace.py#L4756
quoted
Because of the above points, this is introducing a new tracepoint. Patches ======= This results in having trace events for: o smp_call_function*() o smp_send_reschedule() o irq_work_queue*() This is incomplete, just looking at arm64 there's more IPI types that aren't covered: IPI_CPU_STOP, IPI_CPU_CRASH_STOP, IPI_TIMER, IPI_WAKEUP, ... But it feels like a good starting point.Can't you have a single tracepoint (or variant with cpumask) that would cover such cases as well? Maybe (as parameters for tracepoint): * type (reschedule, smp_call_function, timer, wakeup, ...). * function address: valid for smp_call_function, irq_work_queue types.
That's a good point, I wasn't sure about having a parameter serving as discriminant for another, but the function address would be either valid or NULL which is fine. So perhaps: o callsite (i.e. _RET_IP_), serves as type o address of callback tied to IPI, if any o target CPUs
quoted
Another thing worth mentioning is that depending on the callsite, the _RET_IP_ fed to the tracepoint is not always useful - generic_exec_single() doesn't tell you much about the actual callback being sent via IPI, so there might be value in exploding the single tracepoint into at least one variant for smp_calls.Not sure i grasp what you mean by "exploding the single tracepoint...", but yes knowing the function or irq work function is very useful.
Sorry; I meant having several "specialized" tracepoints instead of a single one.