Thread (18 messages) 18 messages, 5 authors, 2023-05-11

Re: [RFC PATCH 1/1] smp: Add tracepoints for functions called with smp_call_function*()

From: Leonardo Brás <hidden>
Date: 2023-05-10 20:28:30
Also in: lkml

On Thu, 2023-05-04 at 12:59 +0100, Valentin Schneider wrote:
quoted hunk ↗ jump to hunk
+Daniel

On 03/05/23 16:59, Peter Zijlstra wrote:
quoted
On Wed, Apr 19, 2023 at 12:45:08AM -0300, Leonardo Brás wrote:
quoted
IIUC the last commits add tracepoints that are collected in the
requesting CPU, at the moment of scheduling the IPI, which are also useful in
some scenarios.

On my scenario, it could help a little, since it makes possible to filter what
all other cpus are scheduling on the requested cpu. OTOH it could be also be
misleading, as the requested cpu could be running something that was scheduled
way before.

The change I propose does exactly what my scenario need: track exactly which
function was running at given time in the requested CPU. With this info, we can
check which (if any) remotely requested function was running on given time
window.
I was thinking you could simply (graph)-trace
__flush_smp_call_function_queue() with a max_graph_depth or so (Steve
says that ought to work).

But even that might be too specific, your use case sounds more like what
we have the irq-off latency tracer for, and that thing will immediately
tell you what functions were being ran.
quoted
(An unrelated thing I just thought: We could even use the commits you pointed
together with my proposed change in order to measure how long does it take for a
requested function to run / complete in the requested cpu)
I don't think you could actually do that; the send tracepoints Valentin
added don't log the csd address, this means you cannot distinguish
two CSDs with the same function send from different CPUs.

To do this you'd need to add the csd address to the the ipi_send
tracepoints and your own (possibly replacing info -- because I don't
think that's too useful).

Valentin -- is any of this something you'd also find useful?
Conceptually yeah, however:

With isolcpus + NOHZ_FULL, *any* IPI sent to an isolated CPU is game over,
you interrupt the userspace task and you eat the full meal that is
NOHZ_FULL kernel entry. Pretty much any such IPI will show up when
evaluating your setup with rteval/rlta/whatever, so having an event at
IPI reception is redundant.

IIUC Leonardo's looking at a setup with isolcpus but not necessarily
NOHZ_FULL, so he wants to look at *which* IPIs (if any) are pushing the
isolated task over its deadline/period. I would argue any IPI received
there is at risk of doing that, so it's the same fight as with NOHZ_FULL.

With that said, I suppose this could still be helpful for e.g. osnoise to
hook into and point the finger at which CPU/context sent the problematic
IPI. Or more generally, as Leonardo suggested, to measure CSD IPI delivery
times.

One thing though is that trace_ipi_send_cpu*() is not used solely for
CSD's, cf. irq_work_raise() or smp_send_reschedule(). We might want to
split that into e.g. trace_csd_queue_cpu*() + trace_ipi_send*().

Something like so...

---
diff --git a/include/trace/events/smp.h b/include/trace/events/smp.h
new file mode 100644
index 0000000000000..8fc725a2b45b9
--- /dev/null
+++ b/include/trace/events/smp.h
@@ -0,0 +1,69 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM smp
+
+#if !defined(_TRACE_SMP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_SMP_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(csd_queue_cpu,
+
+	TP_PROTO(const unsigned int cpu,
+		 unsigned long callsite,
+		 smp_call_func_t func,
+		 call_single_data_t *csd),
+
+	TP_ARGS(cpu, callsite, func, csd),
+
+	TP_STRUCT__entry(
+		__field(unsigned int, cpu)
+		__field(void *, callsite)
+		__field(void *, func)
+		__field(void *, csd)
+	),
+
+	TP_fast_assign(
+		__entry->cpu = cpu;
+		__entry->callsite = (void *)callsite;
+		__entry->func = func;
+		__entry->csd  = csd;
+	),
+
+	TP_printk("cpu=%u callsite=%pS func=%pS csd=%p",
+		  __entry->cpu, __entry->callsite, __entry->func, __entry->csd)
+);
This is for the caller side, right?
+
+DECLARE_EVENT_CLASS(csd_function,
+
+	TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
+
+	TP_ARGS(func, csd),
+
+	TP_STRUCT__entry(
+		__field(void *,	func)
+		__field(void *,	csd)
+	),
+
+	TP_fast_assign(
+		__entry->func	= func;
+		__entry->csd	= csd;
+	),
+
+	TP_printk("func=%pS csd=%p", __entry->func, __entry->csd)
+);
+
+DEFINE_EVENT(csd_function, csd_function_entry,
+	TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
+	TP_ARGS(func, csd)
+);
+
+DEFINE_EVENT(csd_function, csd_function_exit,
+	TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
+	TP_ARGS(func, csd)
+);
Oh, this is what event_class is for. Thanks for the example :)
quoted hunk ↗ jump to hunk
+
+#endif /* _TRACE_SMP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/smp.c b/kernel/smp.c
index ab3e5dad6cfe9..7d28db303e9bc 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -27,6 +27,9 @@
 #include <linux/jump_label.h>
 
 #include <trace/events/ipi.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/smp.h>
+#undef CREATE_TRACE_POINTS
 
 #include "smpboot.h"
 #include "sched/smp.h"
@@ -121,6 +124,14 @@ send_call_function_ipi_mask(struct cpumask *mask)
 	arch_send_call_function_ipi_mask(mask);
 }
 
+static __always_inline void
+csd_do_func(smp_call_func_t func, void *info, call_single_data_t *csd)
+{
+	trace_csd_function_entry(func, csd);
+	func(info);
+	trace_csd_function_exit(func, csd);
+}
+
Good one, a helper to avoid calling those traces everywhere. 

Honest question: 
Since info == csd->info and func == csd->func, we could just pass csd, right?
I suppose the suggestion on the 3-argument version is to use the values already
fetched from memory instead of fetching them again. Is that correct?
quoted hunk ↗ jump to hunk
 #ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
 
 static DEFINE_STATIC_KEY_MAYBE(CONFIG_CSD_LOCK_WAIT_DEBUG_DEFAULT, csdlock_debug_enabled);
@@ -329,7 +340,7 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
 	 * even if we haven't sent the smp_call IPI yet (e.g. the stopper
 	 * executes migration_cpu_stop() on the remote CPU).
 	 */
-	if (trace_ipi_send_cpu_enabled()) {
+	if (trace_csd_queue_cpu_enabled()) {
 		call_single_data_t *csd;
 		smp_call_func_t func;
 
@@ -337,7 +348,7 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
 		func = CSD_TYPE(csd) == CSD_TYPE_TTWU ?
 			sched_ttwu_pending : csd->func;
 
-		trace_ipi_send_cpu(cpu, _RET_IP_, func);
+		trace_csd_queue_cpu(cpu, _RET_IP_, func, csd);
 	}
 
 	/*
@@ -375,7 +386,7 @@ static int generic_exec_single(int cpu, struct __call_single_data *csd)
 		csd_lock_record(csd);
 		csd_unlock(csd);
 		local_irq_save(flags);
-		func(info);
+		csd_do_func(func, info, NULL);
 		csd_lock_record(NULL);
 		local_irq_restore(flags);
 		return 0;
@@ -477,7 +488,7 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
 			}
 
 			csd_lock_record(csd);
-			func(info);
+			csd_do_func(func, info, csd);
 			csd_unlock(csd);
 			csd_lock_record(NULL);
 		} else {
@@ -508,7 +519,7 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
 
 				csd_lock_record(csd);
 				csd_unlock(csd);
-				func(info);
+				csd_do_func(func, info, csd);
 				csd_lock_record(NULL);
 			} else if (type == CSD_TYPE_IRQ_WORK) {
 				irq_work_single(csd);
@@ -522,8 +533,10 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
 	/*
 	 * Third; only CSD_TYPE_TTWU is left, issue those.
 	 */
-	if (entry)
-		sched_ttwu_pending(entry);
+	if (entry) {
+		csd = llist_entry(entry, typeof(*csd), node.llist);
+		csd_do_func(sched_ttwu_pending, entry, csd);
+	}
 }
 
 
@@ -624,7 +637,7 @@ EXPORT_SYMBOL(smp_call_function_single);
 
 /**
  * smp_call_function_single_async() - Run an asynchronous function on a
- * 			         specific CPU.
+ *				 specific CPU.
  * @cpu: The CPU to run on.
  * @csd: Pre-allocated and setup data structure
  *
@@ -728,7 +741,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 	int cpu, last_cpu, this_cpu = smp_processor_id();
 	struct call_function_data *cfd;
 	bool wait = scf_flags & SCF_WAIT;
-	int nr_cpus = 0, nr_queued = 0;
+	int nr_cpus = 0;
 	bool run_remote = false;
 	bool run_local = false;
 
@@ -786,21 +799,16 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 			csd->node.src = smp_processor_id();
 			csd->node.dst = cpu;
 #endif
+
+			trace_csd_queue_cpu(cpu, _RET_IP_, func, csd);
+
 			if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu))) {
 				__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
 				nr_cpus++;
 				last_cpu = cpu;
 			}
-			nr_queued++;
 		}
 
-		/*
-		 * Trace each smp_function_call_*() as an IPI, actual IPIs
-		 * will be traced with func==generic_smp_call_function_single_ipi().
-		 */
-		if (nr_queued)
-			trace_ipi_send_cpumask(cfd->cpumask, _RET_IP_, func);
-
 		/*
 		 * Choose the most efficient way to send an IPI. Note that the
 		 * number of CPUs might be zero due to concurrent changes to the
@@ -816,7 +824,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 		unsigned long flags;
 
 		local_irq_save(flags);
-		func(info);
+		csd_do_func(func, info, NULL);
 		local_irq_restore(flags);
 	}
 
I will rebase my patch on top of tip/smp/core and apply the suggested changes.
A v2 will be send shortly.

Thank you all for reviewing!
Leo

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help