Thread (38 messages), 7 authors, 2024-08-22

Re: [PATCH RFC v3 13/13] uprobes: add speculative lockless VMA to inode resolution

From: Andrii Nakryiko <hidden>
Date: 2024-08-15 20:17:18
Also in: bpf, linux-mm, lkml

On Thu, Aug 15, 2024 at 11:58 AM Jann Horn [off-list ref] wrote:
+brauner for "struct file" lifetime

On Thu, Aug 15, 2024 at 7:45 PM Suren Baghdasaryan [off-list ref] wrote:
On Thu, Aug 15, 2024 at 9:47 AM Andrii Nakryiko
[off-list ref] wrote:
On Thu, Aug 15, 2024 at 6:44 AM Mateusz Guzik [off-list ref] wrote:
On Tue, Aug 13, 2024 at 08:36:03AM -0700, Suren Baghdasaryan wrote:
On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik [off-list ref] wrote:
On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote:
Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
vma->vm_file->f_inode locklessly under just rcu_read_lock() protection,
attempting the uprobe lookup speculatively.
Stupid question: Is this uprobe stuff actually such a hot codepath
that it makes sense to optimize it to be faster than the page fault
path?
Not a stupid question, but yes, generally speaking uprobe performance
is critical for a bunch of tracing use cases. And having independent
threads implicitly contending with each other just because of uprobe's
internal implementation detail (while conceptually there should be no
dependencies for triggering uprobe from multiple parallel threads) is
a big surprise to users and affects production use cases beyond just
uprobe-handling BPF logic overhead ("useful overhead") they assume.
(Sidenote: I find it kinda interesting that this is sort of going back
in the direction of the old Speculative Page Faults design.)
We rely on newly added mmap_lock_speculation_{start,end}() helpers to
validate that mm_struct stays intact for the entire duration of this
speculation. If not, we fall back to mmap_lock-protected lookup.

This avoids contention on mmap_lock in the vast majority of cases,
nicely improving uprobe/uretprobe scalability.
[...]
Note: up_write(&vma->vm_lock->lock) in the vma_start_write() is not
enough because it's one-way permeable (it's a "RELEASE operation") and
later vma->vm_file store (or any other VMA modification) can move
before our vma->vm_lock_seq store.

This makes vma_start_write() heavier but again, it's write-locking, so
should not be considered a fast path.
With this change we can use the code suggested by Andrii in
https://lore.kernel.org/all/CAEf4BzZeLg0WsYw2M7KFy0+APrPaPVBY7FbawB9vjcA2+6k69Q@mail.gmail.com/
with an additional smp_rmb():

rcu_read_lock()
vma = find_vma(...)
if (!vma) /* bail */
And maybe add some comments like:

/*
 * Load the current VMA lock sequence - we will detect if anyone concurrently
 * locks the VMA after this point.
 * Pairs with smp_wmb() in vma_start_write().
 */
vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);
/*
 * Now we just have to detect if the VMA is already locked with its current
 * sequence count.
 *
 * The following load is ordered against the vm_lock_seq load above (using
 * smp_load_acquire() for the load above), and pairs with implicit memory
 * ordering between the mm_lock_seq write in mmap_write_unlock() and the
 * vm_lock_seq write in the next vma_start_write() after that (which can only
 * occur after an mmap_write_lock()).
 */
mm_lock_seq = smp_load_acquire(&vma->mm->mm_lock_seq);
/* I think vm_lock has to be acquired first to avoid the race */
if (mm_lock_seq == vm_lock_seq)
        /* bail, vma is write-locked */
... perform uprobe lookup logic based on vma->vm_file->f_inode ...
/*
 * Order the speculative accesses above against the following vm_lock_seq
 * recheck.
 */
smp_rmb();
if (vma->vm_lock_seq != vm_lock_seq)
thanks, will incorporate these comments into the next revision
(As I said on the other thread: Since this now relies on
vma->vm_lock_seq not wrapping back to the same value for correctness,
I'd like to see vma->vm_lock_seq being at least an "unsigned long", or
even better, an atomic64_t... though I realize we don't currently do
that for seqlocks either.)
        /* bail, VMA might have changed */

The smp_rmb() is needed so that the vma->vm_lock_seq recheck load is not
reordered ahead of the speculative accesses it is meant to validate.

I'm CC'ing Jann since he understands memory barriers way better than
me and will keep me honest.
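
For readers following along, the read-side sequence discussed above can be
modelled in userspace with C11 atomics. This is only a sketch under stated
assumptions, not the patch itself: the toy_* names and the struct layouts are
hypothetical stand-ins for mm_struct and vm_area_struct, the "inode" field
stands in for vma->vm_file->f_inode, and the C11 release/acquire fences only
approximate the kernel's smp_wmb()/smp_rmb(). RCU and the real find_vma()
lookup are left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for mm_struct: only the sequence counter. */
struct toy_mm {
	atomic_int mm_lock_seq;
};

/* Hypothetical stand-in for vm_area_struct. */
struct toy_vma {
	struct toy_mm *mm;
	atomic_int vm_lock_seq;
	_Atomic(long) inode;	/* models vma->vm_file->f_inode */
};

/* Writer: mark the VMA write-locked before modifying it. */
static void toy_vma_start_write(struct toy_vma *vma)
{
	int seq = atomic_load_explicit(&vma->mm->mm_lock_seq,
				       memory_order_relaxed);

	atomic_store_explicit(&vma->vm_lock_seq, seq, memory_order_relaxed);
	/* Approximates smp_wmb(): later VMA stores stay after the store above. */
	atomic_thread_fence(memory_order_release);
}

/* Writer: bumping mm_lock_seq implicitly unlocks every VMA of this mm. */
static void toy_mmap_write_unlock(struct toy_mm *mm)
{
	atomic_fetch_add_explicit(&mm->mm_lock_seq, 1, memory_order_release);
}

/* Reader: returns true and fills *inode only if speculation succeeded. */
static bool toy_speculative_inode(struct toy_vma *vma, long *inode)
{
	int vm_seq, mm_seq;
	long val;

	assert(vma != NULL && inode != NULL);

	/* vm_lock_seq is loaded first, then mm_lock_seq, as in the thread. */
	vm_seq = atomic_load_explicit(&vma->vm_lock_seq, memory_order_acquire);
	mm_seq = atomic_load_explicit(&vma->mm->mm_lock_seq,
				      memory_order_acquire);
	if (mm_seq == vm_seq)
		return false;	/* bail, VMA is write-locked */

	/* Speculative access, validated by the recheck below. */
	val = atomic_load_explicit(&vma->inode, memory_order_relaxed);

	/* Approximates smp_rmb(): the recheck cannot move above the access. */
	atomic_thread_fence(memory_order_acquire);
	if (atomic_load_explicit(&vma->vm_lock_seq,
				 memory_order_relaxed) != vm_seq)
		return false;	/* bail, VMA might have changed */

	*inode = val;
	return true;
}
```

A caller would treat a false return as the signal to fall back to the
mmap_lock-protected lookup. Note that the sketch inherits the wrap-around
concern raised above, since the counters here are plain int-sized.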