Thread (33 messages) 33 messages, 6 authors, 2023-07-06

Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure

From: Valentin Schneider <vschneid@redhat.com>
Date: 2023-07-06 11:32:28
Also in: bpf, kvm, linux-doc, linux-mm, lkml

On 06/07/23 00:39, Peter Zijlstra wrote:
On Wed, Jul 05, 2023 at 07:12:53PM +0100, Valentin Schneider wrote:
quoted
Note: A previous approach by PeterZ [1] used an extra bit in
context_tracking.state to flag the presence of deferred callbacks to
execute, and the actual callbacks were stored in a separate atomic
variable.

This meant that the atomic read of context_tracking.state was sufficient to
determine whether there are any deferred callbacks to execute.
Unfortunately, it presents a race window. Consider the work setting
function as:

  preempt_disable();
  seq = atomic_read(&ct->seq);
  if (__context_tracking_seq_in_user(seq)) {
       /* ctrl-dep */
       atomic_or(work, &ct->work);
       ret = atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
  }
  preempt_enable();

  return ret;

Then the following can happen:

  CPUx                                             CPUy
                                                  CT_SEQ_WORK \in context_tracking.state
    atomic_or(WORK_N, &ct->work);
                                                   ct_kernel_enter()
                                                     ct_state_inc();
    atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);

The cmpxchg() would fail, ultimately causing an IPI for WORK_N to be
sent. Unfortunately, the work bit would remain set, and it can't be sanely
cleared in case another CPU set it concurrently - this would ultimately
lead to a double execution of the callback, one as a deferred callback and
one in the IPI. As not all IPI callbacks are idempotent, this is
undesirable.
So adding another atomic is arguably worse.

The thing is, if the NOHZ_FULL CPU is actually doing context transitions
(SYSCALLs etc..) then everything is fundamentally racy, there is no
winning that game, we could find the remote CPU is in-kernel, send an
IPI, the remote CPU does return-to-user and receives the IPI.

And then the USER is upset... because he got an IPI.
Yeah, that part is inevitably racy.

The thing I was especially worried about was the potential double
executions (once in IPI, again in deferred work). It's not /too/ bad as the
only two deferred callbacks I'm introducing here are costly-but-stateless,
but IMO is a bad foundation.

But it seems like we can reuse the existing atomic and squeeze some bits in
there, so let's see how that goes :-)
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help