Re: [PATCH v3 04/14] perf/hw_breakpoint: Optimize list of per-task breakpoints
From: Marco Elver <elver@google.com>
Date: 2022-07-20 15:39:47
Also in:
linux-perf-users, linux-sh, lkml
On Wed, 20 Jul 2022 at 17:29, Ian Rogers [off-list ref] wrote:
On Mon, Jul 4, 2022 at 8:06 AM Marco Elver [off-list ref] wrote:
On a machine with 256 CPUs, running the recently added perf breakpoint
benchmark results in:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 236.418 [sec]
 |
 |   123134.794271 usecs/op
 |  7880626.833333 usecs/op/cpu

The benchmark tests inherited breakpoint perf events across many
threads.

Looking at a perf profile, we can see that the majority of the time is
spent in various hw_breakpoint.c functions, which execute within the
'nr_bp_mutex' critical sections which then results in contention on that
mutex as well:

    37.27%  [kernel]       [k] osq_lock
    34.92%  [kernel]       [k] mutex_spin_on_owner
    12.15%  [kernel]       [k] toggle_bp_slot
    11.90%  [kernel]       [k] __reserve_bp_slot

The culprit here is task_bp_pinned(), which has a runtime complexity of
O(#tasks) due to storing all task breakpoints in the same list and
iterating through that list looking for a matching task. Clearly, this
does not scale to thousands of tasks.

Instead, make use of the "rhashtable" variant "rhltable" which stores
multiple items with the same key in a list. This results in average
runtime complexity of O(1) for task_bp_pinned().

With the optimization, the benchmark shows:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.208 [sec]
 |
 |      108.422396 usecs/op
 |     6939.033333 usecs/op/cpu

On this particular setup that's a speedup of ~1135x.

While one option would be to make task_struct a breakpoint list node,
this would only further bloat task_struct for infrequently used data.
Furthermore, after all optimizations in this series, there's no evidence
it would result in better performance: later optimizations make the time
spent looking up entries in the hash table negligible (we'll reach the
theoretical ideal performance i.e. no constraints).
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
---
v2:
* Commit message tweaks.
---
 include/linux/perf_event.h    |  3 +-
 kernel/events/hw_breakpoint.c | 56 ++++++++++++++++++++++-------------
 2 files changed, 37 insertions(+), 22 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 01231f1d976c..e27360436dc6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -36,6 +36,7 @@ struct perf_guest_info_callbacks {
 };
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
+#include <linux/rhashtable-types.h>
 #include <asm/hw_breakpoint.h>
 #endif
@@ -178,7 +179,7 @@ struct hw_perf_event {
  * creation and event initalization.
  */
 struct arch_hw_breakpoint info;
- struct list_head bp_list;
+ struct rhlist_head bp_list;

nit: perhaps it would be more intention revealing here to rename this to
bp_hashtable?
The naming convention for uses of rhlist_head appears to be either 'list' or 'node' (also inside lib/rhashtable.c). I think this makes sense because internally this struct is used to just append to the bucket's list.
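To illustrate why the member still behaves like a list node, here is a minimal user-space sketch of the idea: a hash table keyed by task in which all breakpoints sharing a key hang off one per-key list. It is not the kernel's rhltable API. Every name in it (struct bp, struct key_entry, struct bp_table, bp_insert(), task_pinned(), bp_table_free()) is hypothetical, the table has a fixed bucket count instead of resizing, and there is no locking or RCU. The lookup is only loosely modelled on what task_bp_pinned() needs: find the key, then walk just that key's list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 64 /* fixed size for the sketch; rhltable resizes itself */

/* Hypothetical stand-in for a breakpoint event: it is only a list node. */
struct bp {
	struct bp *next_same_key; /* next breakpoint with the same task key */
	int weight;               /* slots this breakpoint occupies */
};

/* One entry per distinct key (task), heading the list of its breakpoints. */
struct key_entry {
	const void *task;
	struct bp *bps;
	struct key_entry *next_in_bucket;
};

struct bp_table {
	struct key_entry *buckets[NBUCKETS];
};

static size_t hash_task(const void *task)
{
	uintptr_t v = (uintptr_t)task;

	/* Multiplicative hash on the pointer value, reduced to a bucket. */
	return (size_t)((v >> 4) * 2654435761u) % NBUCKETS;
}

static struct key_entry *find_key(const struct bp_table *t, const void *task)
{
	struct key_entry *e = t->buckets[hash_task(task)];

	while (e && e->task != task)
		e = e->next_in_bucket;
	return e;
}

/* Add bp under task; returns 0 on success, -1 if allocation fails. */
static int bp_insert(struct bp_table *t, const void *task, struct bp *bp)
{
	struct key_entry *e;

	assert(t && bp);
	e = find_key(t, task);
	if (!e) {
		size_t h = hash_task(task);

		e = malloc(sizeof(*e));
		if (!e)
			return -1;
		e->task = task;
		e->bps = NULL;
		e->next_in_bucket = t->buckets[h];
		t->buckets[h] = e;
	}
	/* The breakpoint itself is just pushed onto the key's list. */
	bp->next_same_key = e->bps;
	e->bps = bp;
	return 0;
}

/* Sum the weights of one task's breakpoints; other tasks are never visited. */
static int task_pinned(const struct bp_table *t, const void *task)
{
	const struct key_entry *e = find_key(t, task);
	int count = 0;

	for (const struct bp *bp = e ? e->bps : NULL; bp; bp = bp->next_same_key)
		count += bp->weight;
	return count;
}

/* Release the key entries; the bp nodes are owned by the caller. */
static void bp_table_free(struct bp_table *t)
{
	for (size_t i = 0; i < NBUCKETS; i++) {
		struct key_entry *e = t->buckets[i];

		while (e) {
			struct key_entry *next = e->next_in_bucket;

			free(e);
			e = next;
		}
		t->buckets[i] = NULL;
	}
}
```

With a zero-initialized struct bp_table, inserting two breakpoints for one task and one for another makes task_pinned() return the per-task sums without touching the other task's entries. That is the property the commit message relies on for the average O(1) lookup, assuming tasks spread evenly over the buckets.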
Acked-by: Ian Rogers <irogers@google.com>
Thanks!
-- Marco