Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management

[PATCH 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore · Andrii Nakryiko <andrii@kernel.org> · 2024-06-25
[PATCH 01/12] uprobes: update outdated comment · Andrii Nakryiko <andrii@kernel.org> · 2024-06-25
[PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe() · Andrii Nakryiko <andrii@kernel.org> · 2024-06-25
Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe() · Masami Hiramatsu (Google) <mhiramat@kernel.org> · 2024-06-25
Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe() · Oleg Nesterov <oleg@redhat.com> · 2024-06-25
Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe() · Andrii Nakryiko <hidden> · 2024-06-25
Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe() · Oleg Nesterov <oleg@redhat.com> · 2024-06-25
Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe() · Andrii Nakryiko <hidden> · 2024-06-26
Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe() · Oleg Nesterov <oleg@redhat.com> · 2024-06-25
[PATCH 03/12] uprobes: simplify error handling for alloc_uprobe() · Andrii Nakryiko <andrii@kernel.org> · 2024-06-25
[PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management · Andrii Nakryiko <andrii@kernel.org> · 2024-06-25
Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management · Oleg Nesterov <oleg@redhat.com> · 2024-06-25
Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management · Andrii Nakryiko <hidden> · 2024-06-25
Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management · kernel test robot <hidden> · 2024-06-26
Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management · Andrii Nakryiko <hidden> · 2024-06-26
Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management · Masami Hiramatsu (Google) <mhiramat@kernel.org> · 2024-06-27
Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management · Andrii Nakryiko <hidden> · 2024-06-27
Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management · Andrii Nakryiko <hidden> · 2024-07-01
[PATCH 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer · Andrii Nakryiko <andrii@kernel.org> · 2024-06-25
Re: [PATCH 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer · Masami Hiramatsu (Google) <mhiramat@kernel.org> · 2024-06-27
[PATCH 06/12] uprobes: add batch uprobe register/unregister APIs · Andrii Nakryiko <andrii@kernel.org> · 2024-06-25
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs · Jiri Olsa <hidden> · 2024-06-26
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs · Andrii Nakryiko <hidden> · 2024-06-26
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs · Masami Hiramatsu (Google) <mhiramat@kernel.org> · 2024-06-27
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs · Andrii Nakryiko <hidden> · 2024-06-27
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs · Masami Hiramatsu (Google) <mhiramat@kernel.org> · 2024-06-28
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs · Andrii Nakryiko <hidden> · 2024-06-28
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs · Masami Hiramatsu (Google) <mhiramat@kernel.org> · 2024-06-29
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs · Andrii Nakryiko <hidden> · 2024-07-01
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs · Andrii Nakryiko <hidden> · 2024-07-01
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs · Masami Hiramatsu (Google) <mhiramat@kernel.org> · 2024-07-02
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs · Andrii Nakryiko <hidden> · 2024-07-02
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs · Masami Hiramatsu (Google) <mhiramat@kernel.org> · 2024-07-02
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs · Steven Rostedt <rostedt@goodmis.org> · 2024-07-02
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs · Andrii Nakryiko <hidden> · 2024-07-02
Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs · Masami Hiramatsu (Google) <mhiramat@kernel.org> · 2024-07-02
[PATCH 07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register() · Andrii Nakryiko <andrii@kernel.org> · 2024-06-25
[PATCH 08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps · Andrii Nakryiko <andrii@kernel.org> · 2024-06-25
[PATCH 09/12] uprobes: batch uprobes_treelock during registration · Andrii Nakryiko <andrii@kernel.org> · 2024-06-25
[PATCH 10/12] uprobes: improve lock batching for uprobe_unregister_batch · Andrii Nakryiko <andrii@kernel.org> · 2024-06-25
[PATCH 11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes · Andrii Nakryiko <andrii@kernel.org> · 2024-06-25
[PATCH 12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore · Andrii Nakryiko <andrii@kernel.org> · 2024-06-25

From: Andrii Nakryiko <hidden>
Date: 2024-06-27 16:43:39
Also in: bpf

On Wed, Jun 26, 2024 at 7:30 PM Masami Hiramatsu [off-list ref] wrote:

On Mon, 24 Jun 2024 17:21:36 -0700
Andrii Nakryiko [off-list ref] wrote:

quoted

Anyways, under exclusive writer lock, we double-check that refcount
didn't change and is still zero. If it is, we proceed with destruction,
because at that point we have a guarantee that find_active_uprobe()
can't successfully look up this uprobe instance, as it's going to be
removed in destructor under writer lock. If, on the other hand,
find_active_uprobe() managed to bump refcount from zero to one in
between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
write_lock(&uprobes_treelock), we'll deterministically detect this with
extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we
pretend like atomic_dec_and_test() never returned true. There is no
resource freeing or any other irreversible action taken up till this
point, so we just exit early.

One tricky part in the above is actually two CPUs racing and dropping
refcnt to zero, and then attempting to free resources. This can happen
as follows:
  - CPU #0 drops refcnt from 1 to 0, and proceeds to grab uprobes_treelock;
  - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
    which point it decides that it needs to free uprobe as well.

At this point both CPU #0 and CPU #1 will believe they need to destroy
uprobe, which is obviously wrong. To prevent this situations, we augment
refcount with epoch counter, which is always incremented by 1 on either
get or put operation. This allows those two CPUs above to disambiguate
who should actually free uprobe (it's the CPU #1, because it has
up-to-date epoch). See comments in the code and note the specific values
of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that
a single atomi64_t is actually a two sort-of-independent 32-bit counters
that are incremented/decremented with a single atomic_add_and_return()
operation. Note also a small and extremely rare (and thus having no
effect on performance) need to clear the highest bit every 2 billion
get/put operations to prevent high 32-bit counter from "bleeding over"
into lower 32-bit counter.

I have a question here.
Is there any chance to the CPU#1 to put the uprobe before CPU#0 gets
the uprobes_treelock, and free uprobe before CPU#0 validate uprobe->ref
again? e.g.

CPU#0                                                   CPU#1

put_uprobe() {
        atomic64_add_return()
                                                        __get_uprobe();
                                                        put_uprobe() {
                                                                kfree(uprobe)
                                                        }
        write_lock(&uprobes_treelock);
        atomic64_read(&uprobe->ref);
}

I think it is very rare case, but I could not find any code to prevent
this scenario.

Yes, I think you are right, great catch!

I concentrated on preventing double kfree() in this situation, and
somehow convinced myself that eager kfree() is fine. But I think I'll
need to delay freeing, probably with RCU. The problem is that we can't
use rcu_read_lock()/rcu_read_unlock() because we take locks, so it has
to be a sleepable variant of RCU. I'm thinking of using
rcu_read_lock_trace(), the same variant of RCU we use for sleepable
BPF programs (including sleepable uprobes). srcu might be too heavy
for this.

I'll try a few variants over the next few days and see how the
performance looks.

Thank you,


--
Masami Hiramatsu (Google) [off-list ref]

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help