Re: [PATCH v3 04/14] perf/hw_breakpoint: Optimize list of per-task breakpoints

From: Marco Elver <elver@google.com>
Date: 2022-07-20 15:39:47
Also in: linux-perf-users, linux-sh, lkml

On Wed, 20 Jul 2022 at 17:29, Ian Rogers [off-list ref] wrote:

On Mon, Jul 4, 2022 at 8:06 AM Marco Elver [off-list ref] wrote:

On a machine with 256 CPUs, running the recently added perf breakpoint
benchmark results in:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 236.418 [sec]
 |
 |   123134.794271 usecs/op
 |  7880626.833333 usecs/op/cpu

The benchmark tests inherited breakpoint perf events across many
threads.

Looking at a perf profile, we can see that the majority of the time is
spent in various hw_breakpoint.c functions, which execute within the
'nr_bp_mutex' critical sections, which in turn results in contention on
that mutex:

    37.27%  [kernel]       [k] osq_lock
    34.92%  [kernel]       [k] mutex_spin_on_owner
    12.15%  [kernel]       [k] toggle_bp_slot
    11.90%  [kernel]       [k] __reserve_bp_slot

The culprit here is task_bp_pinned(), which has a runtime complexity of
O(#tasks) due to storing all task breakpoints in the same list and
iterating through that list looking for a matching task. Clearly, this
does not scale to thousands of tasks.

Instead, make use of the "rhashtable" variant "rhltable" which stores
multiple items with the same key in a list. This results in average
runtime complexity of O(1) for task_bp_pinned().
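
To illustrate the difference in lookup cost, here is a minimal userspace
sketch. It is not the kernel's rhltable API: all names (struct bp,
struct bp_table, count_flat(), count_hashed(), table_insert(), NBUCKETS)
are hypothetical, the table has a fixed size, and entries with the same
key are simply chained in their bucket and filtered by key, whereas
rhltable resizes itself and hangs same-key entries off a single hash
entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 1024	/* fixed size; an assumption of this sketch */

struct bp {
	uintptr_t task;		/* stand-in for the target task pointer */
	int weight;		/* slots this breakpoint consumes */
	struct bp *next;	/* link in the flat list or a bucket chain */
};

struct bp_table {
	struct bp *buckets[NBUCKETS];
};

/* Single shared list: every lookup walks all breakpoints of all tasks. */
static int count_flat(const struct bp *head, uintptr_t task)
{
	int count = 0;

	for (const struct bp *b = head; b; b = b->next)
		if (b->task == task)
			count += b->weight;
	return count;
}

/* Multiplicative hash of the key, reduced to a bucket index. */
static size_t bucket_of(uintptr_t task)
{
	uint64_t h = (uint64_t)task * 0x9E3779B97F4A7C15ull;

	return (size_t)(h >> 32) % NBUCKETS;
}

/* Push the entry onto the front of its bucket's chain. */
static void table_insert(struct bp_table *t, struct bp *b)
{
	size_t i = bucket_of(b->task);

	b->next = t->buckets[i];
	t->buckets[i] = b;
}

/* Hashed lookup: only the chain of one bucket is walked. */
static int count_hashed(const struct bp_table *t, uintptr_t task)
{
	int count = 0;

	for (const struct bp *b = t->buckets[bucket_of(task)]; b; b = b->next)
		if (b->task == task)
			count += b->weight;
	return count;
}
```

With a reasonable key distribution, count_hashed() touches only the
entries that landed in one bucket, so its average cost stays roughly
constant as the number of tasks grows, while count_flat() grows linearly
with the total number of breakpoints.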

With the optimization, the benchmark shows:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.208 [sec]
 |
 |      108.422396 usecs/op
 |     6939.033333 usecs/op/cpu

On this particular setup that's a speedup of ~1135x.
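
For reference, the figure follows from the per-operation times reported
in the two benchmark runs above:

```latex
\frac{123134.794271\ \text{usecs/op}}{108.422396\ \text{usecs/op}} \approx 1135.7
```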

While one option would be to make task_struct a breakpoint list node,
this would only further bloat task_struct for infrequently used data.
Furthermore, after all optimizations in this series, there's no evidence
it would result in better performance: later optimizations make the time
spent looking up entries in the hash table negligible (we'll reach the
theoretical ideal performance i.e. no constraints).

Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
---
v2:
* Commit message tweaks.
---
 include/linux/perf_event.h    |  3 +-
 kernel/events/hw_breakpoint.c | 56 ++++++++++++++++++++++-------------
 2 files changed, 37 insertions(+), 22 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 01231f1d976c..e27360436dc6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h

@@ -36,6 +36,7 @@ struct perf_guest_info_callbacks {
 };

 #ifdef CONFIG_HAVE_HW_BREAKPOINT
+#include <linux/rhashtable-types.h>
 #include <asm/hw_breakpoint.h>
 #endif

@@ -178,7 +179,7 @@ struct hw_perf_event {
                         * creation and event initalization.
                         */
                        struct arch_hw_breakpoint       info;
-                       struct list_head                bp_list;
+                       struct rhlist_head              bp_list;

nit: perhaps it would be more intention-revealing to rename this
to bp_hashtable?

The naming convention for uses of rhlist_head appears to be either
'list' or 'node' (also inside lib/rhashtable.c). I think this makes
sense because internally this struct is used to just append to the
bucket's list.

Acked-by: Ian Rogers <irogers@google.com>

Thanks!
-- Marco
