Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel
From: Ze Gao <hidden>
Date: 2023-08-04 03:19:51
Also in:
linux-perf-users, lkml
On Fri, Aug 4, 2023 at 10:38 AM Ze Gao [off-list ref] wrote:
On Fri, Aug 4, 2023 at 10:21 AM Ze Gao [off-list ref] wrote:
On Thu, Aug 3, 2023 at 11:10 PM Steven Rostedt [off-list ref] wrote:
On Thu, 3 Aug 2023 04:33:48 -0400 Ze Gao [off-list ref] wrote:
Update the state char array and remove unused, stale macros. They are kernel-internal representations whose use is no longer encouraged.

Signed-off-by: Ze Gao <redacted>
---
 tools/perf/builtin-sched.c | 13 +------------
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 9ab300b6f131..8dc8f071721c 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -92,23 +92,12 @@ struct sched_atom {
 	struct task_desc *wakee;
 };
-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
+#define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"

Thinking about this more, this will always be wrong. Changing it only works for the kernel you made the change for; if it is run on another kernel, it's broken again.

Indeed. There is no easy way to maintain backward compatibility unless we stop using this bizarre 'prev_state' field. Basically all its users suffer from this. That's why I believe this needs a fix that alerts people not to use 'prev_state' anymore.
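To make the mismatch concrete, here is a minimal sketch that decodes the same raw 'prev_state' value with the string removed by the patch and the string it adds. The helper name state_char_from_table() and the OLD_/NEW_ macro names are hypothetical, and the indexing rule (0 maps to index 0, otherwise 1 plus the position of the lowest set bit) is an assumption about how such a table is consulted, not something stated in this thread.

```c
#include <stddef.h>
#include <string.h>

/* The two strings from the patch; the OLD_/NEW_ names are made up here. */
#define OLD_TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"	/* removed line */
#define NEW_TASK_STATE_TO_CHAR_STR "RSDTtXZPI"		/* added line */

/*
 * Hypothetical helper. Assumed indexing rule: prev_state == 0 selects
 * index 0, otherwise the index is 1 + position of the lowest set bit.
 * Returns '?' when the index falls outside the table.
 */
static char state_char_from_table(const char *table, unsigned int prev_state)
{
	size_t idx = 0;

	if (prev_state) {
		idx = 1;
		while (!(prev_state & 1u)) {
			prev_state >>= 1;
			idx++;
		}
	}

	return idx < strlen(table) ? table[idx] : '?';
}
```

Under that assumption, a raw value of 16 decodes to 'Z' with the old string and 'X' with the new one, so a tool built against one table misreports data recorded on a kernel that uses the other.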
I actually wrote code once that basically just did a:

	struct trace_seq s;

	trace_seq_init(&s);
	tep_print_event(tep, &s, record, "%s", TEP_PRINT_INFO);

then searched s.buffer for "prev_state=%s ", to find the state character. That's because the kernel should always be up to date (and why I said I needed that string in the print_fmt).

Turning to building the state char array from the print fmt string dynamically is a great idea. :)
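The search step described above could look roughly like the sketch below. It only covers the string handling: it assumes the text produced by tep_print_event() with TEP_PRINT_INFO is already available as a NUL-terminated string (s.buffer after terminating the trace_seq), and the helper name prev_state_char() is hypothetical, not an existing perf or libtraceevent function.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the first character that follows
 * "prev_state=" in an already formatted event info string, or '\0'
 * if the field is missing or empty. Any further characters in the
 * field are ignored.
 */
static char prev_state_char(const char *info)
{
	static const char key[] = "prev_state=";
	const char *p;

	if (!info)
		return '\0';

	p = strstr(info, key);
	if (!p)
		return '\0';

	p += sizeof(key) - 1;	/* skip past "prev_state=" */

	/* An empty field is followed directly by a separator or the end. */
	if (*p == ' ' || *p == '\t' || *p == '\n')
		return '\0';

	return *p;		/* '\0' here means the string ended */
}
```

Because the letter is taken from text the recording kernel's own print format produced, the tool no longer needs a copy of the kernel's state table.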
On second thought, I realize this is not perfect either, since it does not take offline use of perf into consideration. People might run perf on a different machine from the one where perf.data was recorded, in which case what we get from /sys/kernel/debug/tracing/events/sched/sched_switch/format is likely to differ from what is in perf.data. So let's parse it from TEP_PRINT_INFO of each record instead of building the state char array and relying on 'prev_state' again. At least this fixes all tools that have TEP_PRINT_INFO available.

Thanks,
Ze
As perf has a tep handle, this could be a helper function to extract the state if needed, and get rid of relying on the above character array.

I'll figure out how to make it happen. BTW, my last concern is whether there is any better way to notify userspace to avoid interpreting task state out of 'prev_state', because the awkward thing happens again.

By userspace, I mean all tools that consume 'prev_state' but don't have print fmt available, taking bpf tracepoint for example.

Regards,
Ze
Thanks,
Ze

-- Steve
 /* task state bitmask, copied from include/linux/sched.h */
 #define TASK_RUNNING 0
 #define TASK_INTERRUPTIBLE 1
 #define TASK_UNINTERRUPTIBLE 2
-#define __TASK_STOPPED 4
-#define __TASK_TRACED 8
-/* in tsk->exit_state */
-#define EXIT_DEAD 16
-#define EXIT_ZOMBIE 32
-#define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD)
-/* in tsk->state again */
-#define TASK_DEAD 64
-#define TASK_WAKEKILL 128
-#define TASK_WAKING 256
-#define TASK_PARKED 512
 enum thread_state {
 	THREAD_SLEEPING = 0,