Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure
From: Valentin Schneider <vschneid@redhat.com>
Date: 2023-07-06 11:32:28
Also in:
bpf, kvm, linux-doc, linux-mm, lkml
On 06/07/23 00:39, Peter Zijlstra wrote:
On Wed, Jul 05, 2023 at 07:12:53PM +0100, Valentin Schneider wrote:quoted
Note: A previous approach by PeterZ [1] used an extra bit in context_tracking.state to flag the presence of deferred callbacks to execute, and the actual callbacks were stored in a separate atomic variable. This meant that the atomic read of context_tracking.state was sufficient to determine whether there are any deferred callbacks to execute. Unfortunately, it presents a race window. Consider the work setting function as: preempt_disable(); seq = atomic_read(&ct->seq); if (__context_tracking_seq_in_user(seq)) { /* ctrl-dep */ atomic_or(work, &ct->work); ret = atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK); } preempt_enable(); return ret; Then the following can happen: CPUx CPUy CT_SEQ_WORK \in context_tracking.state atomic_or(WORK_N, &ct->work); ct_kernel_enter() ct_state_inc(); atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK); The cmpxchg() would fail, ultimately causing an IPI for WORK_N to be sent. Unfortunately, the work bit would remain set, and it can't be sanely cleared in case another CPU set it concurrently - this would ultimately lead to a double execution of the callback, one as a deferred callback and one in the IPI. As not all IPI callbacks are idempotent, this is undesirable.So adding another atomic is arguably worse. The thing is, if the NOHZ_FULL CPU is actually doing context transitions (SYSCALLs etc..) then everything is fundamentally racy, there is no winning that game, we could find the remote CPU is in-kernel, send an IPI, the remote CPU does return-to-user and receives the IPI. And then the USER is upset... because he got an IPI.
Yeah, that part is inevitably racy. The thing I was especially worried about was the potential double executions (once in IPI, again in deferred work). It's not /too/ bad as the only two deferred callbacks I'm introducing here are costly-but-stateless, but IMO is a bad foundation. But it seems like we can reuse the existing atomic and squeeze some bits in there, so let's see how that goes :-)