Thread (33 messages) 33 messages, 6 authors, 2023-07-06

Re: [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition

From: Valentin Schneider <vschneid@redhat.com>
Date: 2023-07-06 11:30:56
Also in: bpf, kvm, linux-doc, linux-mm, lkml

On 05/07/23 18:48, Nadav Amit wrote:
quoted
On Jul 5, 2023, at 11:12 AM, Valentin Schneider [off-list ref] wrote:

Deferral approach
=================

Storing each and every callback, like a secondary call_single_queue turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.

Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
I have some experience with similar an optimization. Overall, it can make
sense and as you show, it can reduce the number of interrupts.

The main problem of such an approach might be in cases where a process
frequently enters and exits the kernel between deferred-IPIs, or even worse -
the IPI is sent while the remote CPU is inside the kernel. In such cases, you
pay the extra cost of synchronization and cache traffic, and might not even
get the benefit of reducing the number of IPIs.

In a sense, it's a more extreme case of the overhead that x86’s lazy-TLB
mechanism introduces while tracking whether a process is running or not. But
lazy-TLB would change is_lazy much less frequently than context tracking,
which means that the deferring the IPIs as done in this patch-set has a
greater potential to hurt performance than lazy-TLB.

tl;dr - it would be beneficial to show some performance number for both a
“good” case where a process spends most of the time in userspace, and “bad”
one where a process enters and exits the kernel very frequently. Reducing
the number of IPIs is good but I don’t think it is a goal by its own.
There already is a significant overhead incurred on kernel entry for
nohz_full CPUs due to all of context_tracking faff; now I *am* making it
worse with that extra atomic, but I get the feeling it's not going to stay
:D

nohz_full CPUs that do context transitions very frequently are
unfortunately in the realm of "you shouldn't do that". Due to what's out
there I have to care about *occasional* transitions, but some folks
consider even that to be broken usage, so I don't believe getting numbers
for that to be much relevant.
[ BTW: I did not go over the patches in detail. Obviously, there are
  various delicate points that need to be checked, as avoiding the
  deferring of IPIs if page-tables are freed. ]
  
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help