Re: [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a... | linux-trace-kernel


Re: [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition

From: Valentin Schneider <vschneid@redhat.com>
Date: 2023-07-06 11:30:56
Also in: bpf, kvm, linux-doc, linux-mm, lkml

On 05/07/23 18:48, Nadav Amit wrote:

On Jul 5, 2023, at 11:12 AM, Valentin Schneider [off-list ref] wrote:

Deferral approach
=================

Storing each and every callback, like a secondary call_single_queue, turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible, so no signal of any form is sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.

Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
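
A minimal sketch of the coalescing idea described above, written as userspace
C with C11 atomics rather than kernel code. It assumes one pending-work bitmask
per CPU (a single CPU is modelled here); the work types, callbacks and function
names are hypothetical and are not taken from the series.

```c
#include <stdatomic.h>

/* Hypothetical work types: one bit per kind of deferrable IPI. */
enum deferred_work {
    WORK_SYNC_CORE = 1u << 0,  /* e.g. needed after kernel text patching */
    WORK_TLB_FLUSH = 1u << 1,  /* e.g. needed after a kernel TLB invalidation */
};

#define NR_WORK_TYPES 2

/* Counters so the effect of coalescing can be observed. */
int sync_core_calls, tlb_flush_calls;

static void do_sync_core(void) { sync_core_calls++; }
static void do_tlb_flush(void) { tlb_flush_calls++; }

/* Fixed mapping from work type (bit index) to callback. */
static void (*const work_fn[NR_WORK_TYPES])(void) = {
    do_sync_core,
    do_tlb_flush,
};

/* Pending-work bitmask of the one modelled CPU. */
static atomic_uint pending_work;

/* Sender side: set a bit instead of sending an IPI. Repeats coalesce. */
void defer_work(enum deferred_work w)
{
    atomic_fetch_or(&pending_work, (unsigned int)w);
}

/* Target side: called on kernel entry; runs each pending type once. */
void flush_deferred_work(void)
{
    unsigned int work = atomic_exchange(&pending_work, 0);

    for (int i = 0; i < NR_WORK_TYPES; i++)
        if (work & (1u << i))
            work_fn[i]();
}
```

Calling defer_work(WORK_TLB_FLUSH) a thousand times and then
flush_deferred_work() once runs the flush callback a single time, and the
storage needed stays one word per CPU however long the CPU remains in
userspace. That bounded state is what a per-callback queue cannot offer.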

I have some experience with a similar optimization. Overall, it can make
sense and, as you show, it can reduce the number of interrupts.

The main problem of such an approach might be in cases where a process
frequently enters and exits the kernel between deferred-IPIs, or even worse -
the IPI is sent while the remote CPU is inside the kernel. In such cases, you
pay the extra cost of synchronization and cache traffic, and might not even
get the benefit of reducing the number of IPIs.
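
The "remote CPU is inside the kernel" case can be sketched as below, again as
userspace C with C11 atomics. The single state word, the IN_KERNEL bit and the
function names are assumptions made for illustration; this is not presented as
the series' actual implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define IN_KERNEL (1u << 31)   /* hypothetical "CPU is in the kernel" bit */
#define WORK_MASK (~IN_KERNEL) /* remaining bits hold pending work types */

/* State word of the one modelled CPU; it starts out in the kernel. */
static atomic_uint cpu_state = IN_KERNEL;

/*
 * Sender side. Returns true if the work was deferred, false if the
 * target is in the kernel and the caller must send a real IPI.
 * Either way the sender has touched the target's state cacheline.
 */
bool try_defer(unsigned int work)
{
    unsigned int old = atomic_load(&cpu_state);

    do {
        if (old & IN_KERNEL)
            return false;
    } while (!atomic_compare_exchange_weak(&cpu_state, &old, old | work));

    return true;
}

/* Target side, kernel entry: mark in-kernel and collect pending work. */
unsigned int kernel_enter(void)
{
    unsigned int old = atomic_exchange(&cpu_state, IN_KERNEL);

    return old & WORK_MASK;
}

/* Target side, return to userspace: work may be deferred again. */
void kernel_exit(void)
{
    atomic_store(&cpu_state, 0);
}
```

When try_defer() returns false the sender has paid for the atomic access and
still has to send the IPI, and every kernel_enter() pays for an atomic
exchange whether or not anything was deferred. Those are the costs that grow
when a task enters and exits the kernel frequently.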

In a sense, it's a more extreme case of the overhead that x86’s lazy-TLB
mechanism introduces while tracking whether a process is running or not. But
lazy-TLB would change is_lazy much less frequently than context tracking,
which means that deferring the IPIs as done in this patch-set has a
greater potential to hurt performance than lazy-TLB.

tl;dr - it would be beneficial to show some performance numbers for both a
“good” case where a process spends most of the time in userspace, and a “bad”
one where a process enters and exits the kernel very frequently. Reducing
the number of IPIs is good but I don’t think it is a goal on its own.

There already is a significant overhead incurred on kernel entry for
nohz_full CPUs due to all of the context_tracking faff; now I *am* making it
worse with that extra atomic, but I get the feeling it's not going to stay
:D

nohz_full CPUs that do context transitions very frequently are
unfortunately in the realm of "you shouldn't do that". Due to what's out
there I have to care about *occasional* transitions, but some folks
consider even that to be broken usage, so I don't believe getting numbers
for that would be very relevant.

[ BTW: I did not go over the patches in detail. Obviously, there are
  various delicate points that need to be checked, such as avoiding the
  deferral of IPIs if page-tables are freed. ]
